Tabular Data Processing

Utilities for processing tabular data in pandas DataFrames.

The following examples show how entries in the widely used Gene Ontology Annotations database, which is distributed in the GAF format, can be loaded with pandas and then normalized with the Bioregistry. The full file can be loaded with the get_goa_example() function.

get_goa_example()[source]

Get the GOA file.

Return type: DataFrame
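
As a rough illustration of what loading a GAF file with pandas involves, the sketch below parses a tiny in-memory sample. The two rows are made up for this example (they are not real GOA records and carry only the first 13 columns), and the read_csv options are an assumption about a reasonable way to parse the format, not necessarily what get_goa_example() does internally.

```python
import io

import pandas as pd

# Made-up, simplified GAF-like text for illustration only: header lines start
# with "!" and records are tab-separated without a row of column names.
GAF_TEXT = (
    "!gaf-version: 2.2\n"
    "UniProtKB\tP12345\tGENE1\t\tGO:0003993\tPMID:2676709\tIDA\t\tF\t\t\tprotein\ttaxon:9606\n"
    "UniProtKB\tP67890\tGENE2\t\tGO:0000346\tPMID:2676709\tIDA\t\tP\t\t\tprotein\ttaxon:9606\n"
)

df = pd.read_csv(
    io.StringIO(GAF_TEXT),
    sep="\t",       # GAF records are tab-separated
    comment="!",    # skip the "!"-prefixed header lines
    header=None,    # no column names, so columns are numbered from 0
    dtype=str,      # keep identifiers as text
)

print(df.shape)
print(df[[0, 4, 5, 12]])
```

Because the columns are numbered from 0, "column 1: DB" in the examples on this page corresponds to column=0, "column 5: GO ID" to column=4, and so on.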

normalize_prefixes(df, column, *, target_column=None)[source]

Normalize prefixes in a given column.

Parameters

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing prefixes
target_column (Optional[str]) – The target column to put the normalized prefixes in. If not given, overwrites the given column in place.

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 1: DB
#  e.g., `UniProtKB` becomes `uniprot`
brpd.normalize_prefixes(df, column=0)

Return type: None
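
The difference between overwriting in place and writing to a target_column can be sketched in plain pandas. Everything below is a stand-in written for this illustration: the CANONICAL and SYNONYMS tables and the *_sketch functions are hypothetical and are not the Bioregistry's data or implementation. Treating an integer column as a position is likewise an assumption of the sketch.

```python
from typing import Optional, Union

import pandas as pd

# Hypothetical toy tables for this sketch only; the Bioregistry maintains the
# real set of prefixes and synonyms.
CANONICAL = {"uniprot", "pubmed", "ncbitaxon", "go"}
SYNONYMS = {"uniprotkb": "uniprot", "pmid": "pubmed", "taxon": "ncbitaxon"}


def normalize_prefix_sketch(prefix: str) -> Optional[str]:
    """Lowercase the prefix, resolve synonyms, and return None if unknown."""
    key = prefix.lower()
    key = SYNONYMS.get(key, key)
    return key if key in CANONICAL else None


def normalize_prefixes_sketch(
    df: pd.DataFrame,
    column: Union[int, str],
    *,
    target_column: Optional[str] = None,
) -> None:
    """Normalize a column of prefixes, modifying the dataframe itself."""
    # Assumption: an integer is a column position, a string is a column label.
    if isinstance(column, int):
        column = df.columns[column]
    # Without a target column, the source column is overwritten.
    if target_column is None:
        target_column = column
    df[target_column] = df[column].map(normalize_prefix_sketch)


# Write the results to a new column and keep the original values
df = pd.DataFrame({"db": ["UniProtKB", "PMID", "not-a-prefix"]})
normalize_prefixes_sketch(df, column=0, target_column="db_norm")
print(df)

# Overwrite the original column in place
in_place = pd.DataFrame({"db": ["UniProtKB", "taxon"]})
normalize_prefixes_sketch(in_place, column="db")
print(in_place)
```

Like the documented function, the sketch returns None and communicates its result only through the modified dataframe.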

normalize_curies(df, column, *, target_column=None)[source]

Normalize CURIEs in a given column.

Parameters

df (DataFrame) – A dataframe
column (Union[int, str]) – The column of CURIEs to normalize
target_column (Optional[str]) – The column to put the normalized CURIEs in. If not given, overwrites the given column in place.

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - normalize capitalization of prefix,
#  e.g., `GO:0003993` becomes `go:0003993`
brpd.normalize_curies(df, column=4)

# column 6: DB:Reference (|DB:Reference) - fix synonym of prefix
#  e.g., `PMID:2676709` becomes `pubmed:2676709`
brpd.normalize_curies(df, column=5)

# column 8: With (or) From
#  e.g., `GO:0000346` becomes `go:0000346`
brpd.normalize_curies(df, column=7)

# column 13: Taxon(|taxon) - fix synonym of prefix
#  e.g., `taxon:9606` becomes `ncbitaxon:9606`
brpd.normalize_curies(df, column=12)

Return type: None

validate_prefixes(df, column, *, target_column=None)[source]

Validate prefixes in a given column.

Parameters

df (DataFrame) – A DataFrame
column (Union[int, str]) – The column of prefixes to validate
target_column (Optional[str]) – An optional column in which to store the validation results

Return type

Series

Returns

A pandas series corresponding to the validity of each row

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 1: DB
#  e.g., `UniProtKB` entries are not standard and are therefore false
idx = brpd.validate_prefixes(df, column=0)

# Slice the dataframe based on valid and invalid prefixes
valid_prefix_df = df[idx]
invalid_prefix_df = df[~idx]

validate_curies(df, column, *, target_column=None)[source]

Validate CURIEs in a given column.

Parameters

df (DataFrame) – A DataFrame
column (Union[int, str]) – The column of CURIEs to validate
target_column (Optional[str]) – An optional column in which to store the validation results

Return type

Series

Returns

A pandas series corresponding to the validity of each row

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# column 5: GO ID - check capitalization of prefix,
#  e.g., `GO:0003993` is not standard and is therefore false
idx = brpd.validate_curies(df, column=4)

# Slice the dataframe
valid_go_df = df[idx]
invalid_go_df = df[~idx]

validate_identifiers(df, column, *, prefix=None, prefix_column=None, target_column=None, use_tqdm=False)[source]

Validate local unique identifiers in a given column.

Some data sources split the prefix and identifier into separate columns. In that case, use the prefix_column argument instead of the prefix argument, as in the example below with the GO Annotation Database.

Parameters

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (Optional[str]) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Optional[str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row’s identifier
target_column (Optional[str]) – If given, stores the results of validation in this column
use_tqdm (bool) – Should a progress bar be shown?

Return type

Series

Returns

A pandas series corresponding to the validity of each row

Raises

PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If prefix_column is given and it contains no valid prefixes

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for validation
idx = brpd.validate_identifiers(df, column=1, prefix_column=0)

# Split the dataframe based on valid and invalid identifiers
valid_df = df[idx]
invalid_df = df[~idx]
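
The two ways of supplying a prefix, and the resulting per-row validity series, can be sketched without the Bioregistry as follows. The PATTERNS table and validate_identifiers_sketch are hypothetical names introduced for this illustration, the regular expressions are assumptions rather than registry data, and a plain ValueError stands in for PrefixLocationError.

```python
import re
from typing import Optional

import pandas as pd

# Hypothetical identifier patterns for this sketch only
PATTERNS = {
    "go": re.compile(r"\d{7}"),
    "pubmed": re.compile(r"\d+"),
}


def validate_identifiers_sketch(
    df: pd.DataFrame,
    column: str,
    *,
    prefix: Optional[str] = None,
    prefix_column: Optional[str] = None,
) -> pd.Series:
    """Return a series that is True where the identifier matches its pattern."""
    # Exactly one of `prefix` and `prefix_column` must be given.
    # (ValueError is a stand-in for PrefixLocationError in this sketch.)
    if (prefix is None) == (prefix_column is None):
        raise ValueError("give exactly one of prefix and prefix_column")

    if prefix is not None:
        # All rows share a single prefix, so a single pattern applies.
        pattern = PATTERNS[prefix]
        return df[column].map(lambda identifier: bool(pattern.fullmatch(identifier)))

    # Each row names its own prefix; rows with unknown prefixes are invalid.
    def _check(row: pd.Series) -> bool:
        pattern = PATTERNS.get(row[prefix_column])
        return pattern is not None and bool(pattern.fullmatch(row[column]))

    return df.apply(_check, axis=1)


df = pd.DataFrame(
    {
        "prefix": ["go", "go", "pubmed"],
        "identifier": ["0003993", "3993", "2676709"],
    }
)
idx = validate_identifiers_sketch(df, "identifier", prefix_column="prefix")

# The series can be used directly as a mask, and negated with ~
valid_df = df[idx]
invalid_df = df[~idx]
print(valid_df)
print(invalid_df)
```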

identifiers_to_curies(df, column, *, prefix=None, prefix_column=None, target_column=None, use_tqdm=False, normalize_prefixes_=True)[source]

Convert a column of local unique identifiers to CURIEs.

Parameters

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (Optional[str]) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Union[None, int, str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row’s identifier
target_column (Optional[str]) – If given, stores CURIEs in this column
use_tqdm (bool) – Should a progress bar be shown?
normalize_prefixes_ (bool) – Should the prefix column be automatically normalized if prefix_column is not None?

Raises

PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
brpd.identifiers_to_curies(df, column=1, prefix_column=0)

Return type: None
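
Conceptually, converting identifiers to CURIEs joins a prefix and a local unique identifier with a colon. The plain-pandas sketch below shows both the per-row prefix case and the single shared prefix case. The column names are hypothetical, and the sketch deliberately skips the prefix normalization and validation that the function above performs.

```python
import pandas as pd

# Case 1: a separate column holds the prefix for each row's identifier
df = pd.DataFrame(
    {
        "db": ["go", "pubmed"],
        "local_id": ["0003993", "2676709"],
    }
)
df["curie"] = df["db"].str.cat(df["local_id"], sep=":")
print(df)

# Case 2: every identifier in the column comes from the same namespace
go_ids = pd.DataFrame({"local_id": ["0003993", "0000346"]})
go_ids["curie"] = "go:" + go_ids["local_id"]
print(go_ids)
```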

identifiers_to_iris(df, column, *, prefix, prefix_column=None, target_column=None, use_tqdm=False)[source]

Convert a column of local unique identifiers to IRIs.

Parameters

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (str) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Optional[str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row’s identifier
target_column (Optional[str]) – If given, stores IRIs in this column
use_tqdm (bool) – Should a progress bar be shown?

Raises

PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable

import bioregistry.pandas as brpd
import pandas as pd

df = brpd.get_goa_example()

# Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
brpd.identifiers_to_iris(df, column=1, prefix_column=0)

Return type: None

curies_to_iris(df, column, *, target_column=None)[source]

Convert a column of CURIEs to IRIs.

Parameters

df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing CURIEs
target_column (Optional[str]) – If given, stores the IRIs in this column. Otherwise, overwrites the given column in place.
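
A minimal sketch of the underlying idea, written in plain pandas: split each CURIE at its first colon, look up a URI prefix, and append the identifier. The URI_PREFIXES table and curie_to_iri_sketch are hypothetical names for this illustration, and the URI prefixes are illustrative choices that may differ from what the Bioregistry returns.

```python
from typing import Optional

import pandas as pd

# Hypothetical URI prefixes for this sketch only
URI_PREFIXES = {
    "go": "http://purl.obolibrary.org/obo/GO_",
    "pubmed": "https://pubmed.ncbi.nlm.nih.gov/",
}


def curie_to_iri_sketch(curie: str) -> Optional[str]:
    """Expand a CURIE to an IRI, or return None if it cannot be expanded."""
    prefix, separator, identifier = curie.partition(":")
    if not separator or prefix not in URI_PREFIXES:
        return None
    return URI_PREFIXES[prefix] + identifier


df = pd.DataFrame({"curie": ["go:0003993", "pubmed:2676709", "unknown:1"]})

# Store the IRIs in a separate column, leaving the CURIEs untouched
df["iri"] = df["curie"].map(curie_to_iri_sketch)
print(df)
```

Assigning the result back to the "curie" column instead would mirror the documented default of overwriting the given column in place.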