Tabular Data Processing

Utilities for processing tabular data in Pandas dataframes.

The following examples show how entries in the widely used Gene Ontology Annotation (GOA) database, distributed in the GAF format, can be loaded with pandas and then normalized with the Bioregistry. The full file can be loaded with the get_goa_example() function.

get_goa_example()[source]

Get the GOA file.

Return type:

DataFrame

normalize_prefixes(df, column, *, target_column=None)[source]

Normalize prefixes in a given column.

Parameters:

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing prefixes
target_column (Optional[str]) – The column to put the normalized prefixes in. If not given, overwrites the given column in place.

Return type:

None

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 1: DB
#  i.e., `UniProtKB` becomes `uniprot`
brpd.normalize_prefixes(df, column=0)

normalize_curies(df, column, *, target_column=None)[source]

Normalize CURIEs in a given column.

Parameters:

df (DataFrame) – A dataframe
column (Union[int, str]) – The column of CURIEs to normalize
target_column (Optional[str]) – The column to put the normalized CURIEs in. If not given, overwrites the given column in place.

Return type:

None

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - fix normalization of capitalization of prefix,
#  i.e., `GO:0003993` becomes `go:0003993`
brpd.normalize_curies(df, column=4)

# column 6: DB:Reference (|DB:Reference) - fix synonym of prefix
#  i.e., `PMID:2676709` becomes `pubmed:2676709`
brpd.normalize_curies(df, column=5)

# column 8: With (or) From
#  i.e., `GO:0000346` becomes `go:0000346`
brpd.normalize_curies(df, column=7)

# column 13: Taxon(|taxon) - fix synonym of prefix
#  i.e., `taxon:9606` becomes `ncbitaxon:9606`
brpd.normalize_curies(df, column=12)
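To make the three rewrites in the comments above concrete, here is a minimal pure-Python sketch of what normalizing a single CURIE involves. It is not the Bioregistry implementation: `TOY_SYNONYMS` and `normalize_curie_sketch` are hypothetical names, the synonym table is hand-written from the examples above, and returning `None` for an unknown prefix is an assumption of the sketch.

```python
from typing import Dict, Optional

# Hypothetical, hand-written synonym table built from the examples above.
# The real Bioregistry covers far more prefixes and synonyms.
TOY_SYNONYMS: Dict[str, str] = {
    "go": "go",
    "pmid": "pubmed",
    "pubmed": "pubmed",
    "taxon": "ncbitaxon",
    "ncbitaxon": "ncbitaxon",
}


def normalize_curie_sketch(curie: str) -> Optional[str]:
    """Return the CURIE with a canonical prefix, or None if it cannot be normalized."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return None  # no colon, so the string is not a CURIE
    canonical = TOY_SYNONYMS.get(prefix.lower())
    if canonical is None:
        return None  # assumption: unknown prefixes yield None
    return f"{canonical}:{identifier}"


for raw in ["GO:0003993", "PMID:2676709", "taxon:9606"]:
    print(raw, "->", normalize_curie_sketch(raw))
```

The lookup is case-insensitive on the prefix only; the local identifier after the colon is passed through unchanged.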

validate_prefixes(df, column, *, target_column=None)[source]

Validate prefixes in a given column.

Parameters:

df (DataFrame) – A DataFrame
column (Union[int, str]) – The column of prefixes to validate
target_column (Optional[str]) – The optional column to put the results of validation in.

Returns:

A pandas series corresponding to the validity of each row

Return type:

Series

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 1: DB
#  i.e., `UniProtKB` entries are not standard, and are therefore false
idx = brpd.validate_prefixes(df, column=0)

# Slice the dataframe based on valid and invalid prefixes
valid_prefix_df = df[idx]
invalid_prefix_df = df[~idx]

validate_curies(df, column, *, target_column=None)[source]

Validate CURIEs in a given column.

Parameters:

df (DataFrame) – A DataFrame
column (Union[int, str]) – The column of CURIEs to validate
target_column (Optional[str]) – The optional column to put the results of validation in.

Returns:

A pandas series corresponding to the validity of each row

Return type:

Series

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - check whether the prefix uses the standard capitalization,
#  i.e., `GO:0003993` is not standard and is therefore false
idx = brpd.validate_curies(df, column=4)

# Slice the dataframe
valid_go_df = df[idx]
invalid_go_df = df[~idx]

validate_identifiers(df, column, *, prefix=None, prefix_column=None, target_column=None, use_tqdm=False)[source]

Validate local unique identifiers in a given column.

Some data sources store the prefix and the identifier in separate columns. In that case, use the prefix_column argument instead of the prefix argument, as in the GO Annotation Database example below.

Parameters:

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (Optional[str]) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Optional[str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier.
target_column (Optional[str]) – If given, stores the results of validation in this column
use_tqdm (bool) – Should a progress bar be shown?

Returns:

A pandas series corresponding to the validity of each row

Raises:

PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If prefix_column is given and it contains no valid prefixes

Return type:

Series

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for validation
idx = brpd.validate_identifiers(df, column=1, prefix_column=0)

# Split the dataframe based on valid and invalid identifiers
valid_df = df[idx]
invalid_df = df[~idx]
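The "exactly one of prefix and prefix_column" rule and the per-row check can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: rows are plain dictionaries rather than a DataFrame, `TOY_PATTERNS` is a hand-written pattern table, and `PrefixLocationSketchError` and `validate_identifiers_sketch` are hypothetical stand-ins.

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical pattern table for illustration only.
TOY_PATTERNS: Dict[str, Pattern[str]] = {
    "go": re.compile(r"^\d{7}$"),
    "pubmed": re.compile(r"^\d+$"),
}


class PrefixLocationSketchError(ValueError):
    """Stand-in for the PrefixLocationError named above."""


def validate_identifiers_sketch(
    rows: List[Dict[str, str]],
    column: str,
    *,
    prefix: Optional[str] = None,
    prefix_column: Optional[str] = None,
) -> List[bool]:
    """Return one boolean per row saying whether the identifier matches its prefix's pattern."""
    # Exactly one of the two arguments must be given.
    if (prefix is None) == (prefix_column is None):
        raise PrefixLocationSketchError("give exactly one of prefix and prefix_column")
    results = []
    for row in rows:
        row_prefix = prefix if prefix is not None else row[prefix_column]
        pattern = TOY_PATTERNS.get(row_prefix)
        # Unknown prefixes count as invalid in this sketch (an assumption).
        results.append(pattern is not None and pattern.fullmatch(row[column]) is not None)
    return results


rows = [
    {"db": "go", "id": "0003993"},
    {"db": "go", "id": "3993"},
    {"db": "pubmed", "id": "2676709"},
]
print(validate_identifiers_sketch(rows, "id", prefix_column="db"))
print(validate_identifiers_sketch(rows[:2], "id", prefix="go"))
```

The returned list plays the role of the boolean series used above to slice the dataframe into valid and invalid rows.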

identifiers_to_curies(df, column, *, prefix=None, prefix_column=None, target_column=None, use_tqdm=False, normalize_prefixes_=True)[source]

Convert a column of local unique identifiers to CURIEs.

Parameters:

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (Optional[str]) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Union[None, int, str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier.
target_column (Optional[str]) – If given, stores CURIEs in this column.
use_tqdm (bool) – Should a progress bar be shown?
normalize_prefixes_ (bool) – Should the prefix column get auto-normalized if prefix_column is not None?

Raises:

PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable

Return type:

None

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
brpd.identifiers_to_curies(df, column=1, prefix_column=0)
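Conceptually, the call above joins each row's normalized prefix with its local identifier. A minimal sketch on plain dictionaries, assuming a hand-written synonym table and made-up sample rows (`TOY_PREFIX_SYNONYMS` and `identifiers_to_curies_sketch` are hypothetical names, and writing `None` for an unknown prefix is an assumption of the sketch):

```python
from typing import Dict, List, Optional

# Hypothetical synonym table; only the UniProtKB mapping comes from the text above.
TOY_PREFIX_SYNONYMS: Dict[str, str] = {"uniprotkb": "uniprot", "uniprot": "uniprot"}


def identifiers_to_curies_sketch(
    rows: List[Dict[str, Optional[str]]],
    column: str,
    prefix_column: str,
    target_column: str,
) -> None:
    """Write a CURIE into target_column for each row, modifying the rows in place."""
    for row in rows:
        canonical = TOY_PREFIX_SYNONYMS.get(str(row[prefix_column]).lower())
        # Assumption: rows whose prefix cannot be normalized get None.
        row[target_column] = None if canonical is None else f"{canonical}:{row[column]}"


# Made-up sample rows, not taken from the GOA file.
rows: List[Dict[str, Optional[str]]] = [
    {"DB": "UniProtKB", "DB_Object_ID": "P12345"},
    {"DB": "NotAPrefix", "DB_Object_ID": "123"},
]
identifiers_to_curies_sketch(rows, "DB_Object_ID", "DB", "curie")
print(rows)
```

Like the function documented above, the sketch returns nothing and changes its input in place.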

identifiers_to_iris(df, column, *, prefix, prefix_column=None, target_column=None, use_tqdm=False)[source]

Convert a column of local unique identifiers to IRIs.

Parameters:

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (str) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Optional[str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier.
target_column (Optional[str]) – If given, stores IRIs in this column
use_tqdm (bool) – Should a progress bar be shown?

Raises:

PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable

Return type:

None

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
brpd.identifiers_to_iris(df, column=1, prefix_column=0)

curies_to_iris(df, column, *, target_column=None)[source]

Convert a column of CURIEs to IRIs.

Parameters:

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing CURIEs
target_column (Optional[str]) – If given, stores the IRIs in this column. Otherwise, overwrites the given column in place.

Return type:

None
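As a rough illustration of what converting a CURIE to an IRI involves, the sketch below looks up a URI prefix for the CURIE's prefix and appends the local identifier. The mapping `TOY_URI_PREFIXES` and the function `curie_to_iri_sketch` are hypothetical. The OBO PURL prefixes are used here only as an assumption, and the IRIs that the Bioregistry itself produces may differ.

```python
from typing import Dict, Optional

# Hypothetical prefix-to-URI-prefix map using OBO PURLs for illustration.
TOY_URI_PREFIXES: Dict[str, str] = {
    "go": "http://purl.obolibrary.org/obo/GO_",
    "ncbitaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def curie_to_iri_sketch(curie: str) -> Optional[str]:
    """Expand a normalized CURIE into an IRI, or return None if the prefix is unknown."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return None  # no colon, so the string is not a CURIE
    uri_prefix = TOY_URI_PREFIXES.get(prefix)
    if uri_prefix is None:
        return None  # assumption: unknown prefixes yield None
    return uri_prefix + identifier


for curie in ["go:0003993", "ncbitaxon:9606", "unknown:1"]:
    print(curie, "->", curie_to_iri_sketch(curie))
```

The sketch expects already normalized CURIEs, so a column would typically go through normalize_curies first.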