Tabular Data Processing

Utilities for processing tabular data in Pandas dataframes.

The following examples show how entries in the widely used Gene Ontology Annotations (GOA) database, which is distributed in the GAF format, can be loaded with pandas and then normalized with the Bioregistry. The full file can be loaded with the get_goa_example() function.

get_goa_example() -> DataFrame

Get the GOA file.

normalize_prefixes(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None

Normalize prefixes in a given column.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing prefixes

  • target_column – The column in which to put the normalized prefixes. If not given, overwrites the given column in place.

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 1: DB
#  i.e., `UniProtKB` becomes `uniprot`
brpd.normalize_prefixes(df, column=0)

normalize_curies(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None

Normalize CURIEs in a given column.

Parameters:
  • df – A dataframe

  • column – The column of CURIEs to normalize

  • target_column – The column to put the normalized CURIEs in. If not given, overwrites the given column in place.

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - fix normalization of capitalization of prefix,
#  i.e., `GO:0003993` becomes `go:0003993`
brpd.normalize_curies(df, column=4)

# column 6: DB:Reference (|DB:Reference) - fix synonym of prefix
#  i.e., `PMID:2676709` becomes `pubmed:2676709`
brpd.normalize_curies(df, column=5)

# column 8: With (or) From
#  i.e., `GO:0000346` becomes `go:0000346`
brpd.normalize_curies(df, column=7)

# column 13: Taxon(|taxon) - fix synonym of prefix
#  i.e., `taxon:9606` becomes `ncbitaxon:9606`
brpd.normalize_curies(df, column=12)
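
To make the rule behind these calls concrete, here is a minimal plain-Python sketch of what normalizing a single CURIE amounts to: split at the first colon, map the prefix to its canonical form, and rejoin. The `SYNONYMS` table and the `normalize_curie_sketch` helper are hypothetical stand-ins built only from the examples in the comments above; the real lookup is performed by the Bioregistry and covers far more prefixes. Returning `None` for unrecognized input is a choice made for this sketch.

```python
from typing import Optional

# Hypothetical miniature synonym table, built only from the examples in
# the comments above. Keys are lowercased so matching is case-insensitive.
SYNONYMS = {
    "go": "go",
    "pmid": "pubmed",
    "pubmed": "pubmed",
    "taxon": "ncbitaxon",
    "ncbitaxon": "ncbitaxon",
}


def normalize_curie_sketch(curie: str) -> Optional[str]:
    """Return the CURIE with a canonical prefix, or None if it is unknown."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return None  # no colon, so this is not a CURIE
    canonical = SYNONYMS.get(prefix.lower())
    if canonical is None:
        return None  # prefix is not in the (tiny) table
    return f"{canonical}:{identifier}"


column = ["GO:0003993", "PMID:2676709", "taxon:9606", "not a curie"]
normalized = [normalize_curie_sketch(value) for value in column]
```

Applied to a whole column, this is the element-wise transformation that the dataframe function performs in place (or into target_column when one is given).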

validate_prefixes(df: DataFrame, column: int | str, *, target_column: str | None = None) -> Series

Validate prefixes in a given column.

Parameters:
  • df – A DataFrame

  • column – The column of prefixes to validate

  • target_column – The optional column to put the results of validation

Returns:

A pandas series corresponding to the validity of each row

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 1: DB
#  i.e., `UniProtKB` entries are not standard, and are therefore false
idx = brpd.validate_prefixes(df, column=0)

# Slice the dataframe based on valid and invalid prefixes
valid_prefix_df = df[idx]
invalid_prefix_df = df[~idx]

validate_curies(df: DataFrame, column: int | str, *, target_column: str | None = None) -> Series

Validate CURIEs in a given column.

Parameters:
  • df – A DataFrame

  • column – The column of CURIEs to validate

  • target_column – The optional column to put the results of validation.

Returns:

A pandas series corresponding to the validity of each row

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - check whether each CURIE uses the standard prefix,
#  i.e., `GO:0003993` is not standard and is therefore false
idx = brpd.validate_curies(df, column=4)

# Slice the dataframe
valid_go_df = df[idx]
invalid_go_df = df[~idx]

validate_identifiers(df: DataFrame, column: int | str, *, prefix: str | None = None, prefix_column: str | None = None, target_column: str | None = None, use_tqdm: bool = False) -> Series

Validate local unique identifiers in a given column.

Some data sources split the prefix and identifier into separate columns. In that case, use the prefix_column argument instead of the prefix argument, as in the GO Annotation Database example below.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing identifiers

  • prefix – Specify the prefix if all identifiers in the given column are from the same namespace

  • prefix_column – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row’s identifier.

  • target_column – If given, stores the results of validation in this column

  • use_tqdm – Should a progress bar be shown?

Returns:

A pandas series corresponding to the validity of each row

Raises:
  • PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given

  • ValueError – If prefix_column is given and it contains no valid prefixes

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for validation
idx = brpd.validate_identifiers(df, column=1, prefix_column=0)

# Split the dataframe based on valid and invalid identifiers
valid_df = df[idx]
invalid_df = df[~idx]
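
The two ways of supplying prefixes, and the rule that exactly one of them must be used, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the `PATTERNS` table holds illustrative regular expressions, `PrefixLocationErrorSketch` is a local stand-in for the documented error, and `prefixes` plays the role of the prefix column.

```python
import re
from typing import List, Optional

# Illustrative patterns only; these are assumptions for this sketch and
# not necessarily the patterns stored in the Bioregistry.
PATTERNS = {
    "go": re.compile(r"^\d{7}$"),
    "ncbitaxon": re.compile(r"^\d+$"),
}


class PrefixLocationErrorSketch(ValueError):
    """Local stand-in for the error raised when prefix arguments are misused."""


def validate_identifiers_sketch(
    identifiers: List[str],
    *,
    prefix: Optional[str] = None,
    prefixes: Optional[List[str]] = None,
) -> List[bool]:
    """Check each identifier against the pattern registered for its prefix."""
    # Exactly one of `prefix` and `prefixes` must be given.
    if (prefix is None) == (prefixes is None):
        raise PrefixLocationErrorSketch("give exactly one of prefix or prefixes")
    if prefixes is None:
        # A single namespace: reuse the same prefix for every row.
        prefixes = [prefix] * len(identifiers)
    results = []
    for row_prefix, identifier in zip(prefixes, identifiers):
        pattern = PATTERNS.get(row_prefix)
        results.append(pattern is not None and pattern.match(identifier) is not None)
    return results


same_namespace = validate_identifiers_sketch(["0003993", "3993"], prefix="go")
per_row = validate_identifiers_sketch(["0003993", "9606"], prefixes=["go", "ncbitaxon"])
```

The resulting list of booleans corresponds to the series returned by the dataframe function, which can then be used to slice rows as shown above.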

identifiers_to_curies(df: DataFrame, column: int | str, *, prefix: str | None = None, prefix_column: None | int | str = None, target_column: str | None = None, use_tqdm: bool = False, normalize_prefixes_: bool = True) -> None

Convert a column of local unique identifiers to CURIEs.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing identifiers

  • prefix – Specify the prefix if all identifiers in the given column are from the same namespace

  • prefix_column – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row’s identifier.

  • target_column – If given, stores CURIEs in this column

  • use_tqdm – Should a progress bar be shown?

  • normalize_prefixes_ – Should the prefix column get auto-normalized if prefix_column is not None?

Raises:
  • PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given

  • ValueError – If the given prefix is not normalizable

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
brpd.identifiers_to_curies(df, column=1, prefix_column=0)

identifiers_to_iris(df: DataFrame, column: int | str, *, prefix: str, prefix_column: str | None = None, target_column: str | None = None, use_tqdm: bool = False) -> None

Convert a column of local unique identifiers to IRIs.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing identifiers

  • prefix – Specify the prefix if all identifiers in the given column are from the same namespace

  • prefix_column – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row’s identifier.

  • target_column – If given, stores IRIs in this column

  • use_tqdm – Should a progress bar be shown?

Raises:
  • PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given

  • ValueError – If the given prefix is not normalizable

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
brpd.identifiers_to_iris(df, column=1, prefix_column=0)

curies_to_iris(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None

Convert a column of CURIEs to IRIs.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing CURIEs

  • target_column – If given, stores the IRIs in this column. Otherwise, overwrites the given column in place.

See also

iris_to_curies()
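
Conceptually, expanding a CURIE to an IRI means replacing its prefix with the URI prefix registered for that namespace. The following plain-Python sketch illustrates the idea; the `URI_PREFIXES` table and the `curie_to_iri_sketch` helper are hypothetical, and the GO URI prefix is an assumption chosen for illustration rather than a statement of what the Bioregistry returns.

```python
from typing import Optional

# Hypothetical URI-prefix table (an assumption for illustration).
# A real registry holds many more entries.
URI_PREFIXES = {
    "go": "http://amigo.geneontology.org/amigo/term/GO:",
}


def curie_to_iri_sketch(curie: str) -> Optional[str]:
    """Expand a CURIE to an IRI, or return None if the prefix is unknown."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return None  # no colon, so this is not a CURIE
    uri_prefix = URI_PREFIXES.get(prefix.lower())
    if uri_prefix is None:
        return None
    return uri_prefix + identifier


iris = [curie_to_iri_sketch(curie) for curie in ["GO:0003993", "unknown:1"]]
```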

curies_to_identifiers(df: DataFrame, column: int | str, *, target_column: str | None = None, prefix_column_name: str | None = None) -> None

Split a CURIE column into a prefix and local identifier column.

The local identifier stays in the same column unless target_column is given. If prefix_column_name isn’t given, the name of the prefix column is derived from the target column (if column labels are available); otherwise, the prefix column is appended to the end of the dataframe.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing CURIEs

  • target_column – If given, stores identifiers in this column. Otherwise, stores them in the given column.

  • prefix_column_name – If given, stores prefixes in this column. Else, derives the column name from the target column name.

Raises:

ValueError – If no prefix_column_name is given and the auto-generated name conflicts with a column already in the dataframe.

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - split CURIEs into a prefix column and a local identifier column,
#  i.e., the local identifier `0003993` of `GO:0003993` is separated from its prefix
brpd.curies_to_identifiers(df, column=4)
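
The splitting behavior described above can be sketched in plain Python on a dictionary of columns. Everything here is hypothetical: `split_curie_column_sketch` is an illustrative helper, and the `"<target>_prefix"` naming rule is an assumption made for this sketch, since the exact name the library derives is not stated here. The sketch also leaves the prefix as written rather than normalizing it.

```python
from typing import Dict, List, Optional


def split_curie_column_sketch(
    table: Dict[str, List[str]],
    column: str,
    *,
    target_column: Optional[str] = None,
    prefix_column_name: Optional[str] = None,
) -> None:
    """Split a column of CURIEs into prefix and identifier columns in place."""
    target = target_column or column
    if prefix_column_name is None:
        # Assumed naming rule for this sketch; the library's rule may differ.
        prefix_column_name = f"{target}_prefix"
        if prefix_column_name in table:
            raise ValueError(f"column already exists: {prefix_column_name}")
    prefixes, identifiers = [], []
    for curie in table[column]:
        prefix, _, identifier = curie.partition(":")
        prefixes.append(prefix)
        identifiers.append(identifier)
    table[prefix_column_name] = prefixes
    table[target] = identifiers


table = {"go_id": ["GO:0003993", "GO:0000346"]}
split_curie_column_sketch(table, "go_id")
```

Raising on a name collision only when the name was auto-generated mirrors the documented ValueError: an explicitly given prefix_column_name is taken as intentional.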

iris_to_curies(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None

Convert a column of IRIs to CURIEs.

Parameters:
  • df – A dataframe

  • column – A column in the dataframe containing IRIs

  • target_column – If given, stores the CURIEs in this column. Otherwise, overwrites the given column in place.

See also

curies_to_iris()
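
Compressing an IRI back to a CURIE is commonly done by matching the IRI against known URI prefixes and keeping the remainder as the local identifier. The sketch below shows that idea in plain Python, trying longer URI prefixes first so that the most specific one wins. The `URI_PREFIXES` table and the `iri_to_curie_sketch` helper are hypothetical, and longest-prefix matching is an assumption about a reasonable strategy, not a description of the library's internals.

```python
from typing import Optional

# Hypothetical reverse lookup from URI prefix to CURIE prefix
# (an assumption for illustration).
URI_PREFIXES = {
    "http://amigo.geneontology.org/amigo/term/GO:": "go",
    "http://purl.obolibrary.org/obo/GO_": "go",
}


def iri_to_curie_sketch(iri: str) -> Optional[str]:
    """Compress an IRI to a CURIE, or return None if no URI prefix matches."""
    # Try longer URI prefixes first so the most specific one wins.
    for uri_prefix in sorted(URI_PREFIXES, key=len, reverse=True):
        if iri.startswith(uri_prefix):
            identifier = iri[len(uri_prefix):]
            return f"{URI_PREFIXES[uri_prefix]}:{identifier}"
    return None


curies = [
    iri_to_curie_sketch(iri)
    for iri in [
        "http://amigo.geneontology.org/amigo/term/GO:0003993",
        "http://purl.obolibrary.org/obo/GO_0003993",
        "https://example.org/unknown/1",
    ]
]
```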