Tabular Data Processing
Utilities for processing tabular data in Pandas dataframes.
The following examples show how entries in the widely used Gene Ontology Annotations database, distributed in the GAF format, can be loaded with pandas and then normalized with the Bioregistry. The database can be loaded in full with the get_goa_example() function.
- normalize_prefixes(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None
Normalize prefixes in a given column.
- Parameters:
df – A dataframe
column – A column in the dataframe containing prefixes
target_column – The target column in which to put the normalized prefixes. If not given, overwrites the given column in place.
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 1: DB
    # i.e., `UniProtKB` becomes `uniprot`
    brpd.normalize_prefixes(df, column=0)
- normalize_curies(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None
Normalize CURIEs in a given column.
- Parameters:
df – A dataframe
column – The column of CURIEs to normalize
target_column – The column in which to put the normalized CURIEs. If not given, overwrites the given column in place.
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 5: GO ID - fix normalization of capitalization of prefix,
    # i.e., `GO:0003993` becomes `go:0003993`
    brpd.normalize_curies(df, column=4)

    # column 6: DB:Reference (|DB:Reference) - fix synonym of prefix
    # i.e., `PMID:2676709` becomes `pubmed:2676709`
    brpd.normalize_curies(df, column=5)

    # column 8: With (or) From
    # i.e., `GO:0000346` becomes `go:0000346`
    brpd.normalize_curies(df, column=7)

    # column 13: Taxon(|taxon) - fix synonym of prefix
    # i.e., `taxon:9606` becomes `ncbitaxon:9606`
    brpd.normalize_curies(df, column=12)
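To make the per-value behavior concrete, the following is a minimal sketch of what normalizing a single CURIE could look like. It does not use the Bioregistry: the `PREFIX_SYNONYMS` table and the `normalize_curie_sketch` helper are hypothetical, and returning `None` for an unknown prefix is an assumption rather than documented library behavior.

```python
# Hypothetical synonym table for illustration only; it covers just the
# prefixes mentioned in the examples above.
PREFIX_SYNONYMS = {
    "go": "go",
    "pmid": "pubmed",
    "pubmed": "pubmed",
    "taxon": "ncbitaxon",
    "ncbitaxon": "ncbitaxon",
}


def normalize_curie_sketch(curie):
    """Return the CURIE with a canonical prefix, or None if it cannot be normalized."""
    # Split on the first colon only, since identifiers may contain colons.
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return None  # no colon, so this is not a CURIE
    # Lookup is case-insensitive (an assumption of this sketch).
    canonical = PREFIX_SYNONYMS.get(prefix.lower())
    if canonical is None:
        return None  # assumption: unknown prefixes yield None
    return f"{canonical}:{identifier}"


column = ["GO:0003993", "PMID:2676709", "taxon:9606"]
normalized = [normalize_curie_sketch(curie) for curie in column]
print(normalized)  # ['go:0003993', 'pubmed:2676709', 'ncbitaxon:9606']
```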
- validate_prefixes(df: DataFrame, column: int | str, *, target_column: str | None = None) -> Series
Validate prefixes in a given column.
- Parameters:
df – A DataFrame
column – The column of prefixes to validate
target_column – The optional column to put the results of validation
- Returns:
A pandas series corresponding to the validity of each row
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 1: DB
    # i.e., `UniProtKB` entries are not standard, and are therefore false
    idx = brpd.validate_prefixes(df, column=0)

    # Slice the dataframe based on valid and invalid prefixes
    valid_prefix_df = df[idx]
    invalid_prefix_df = df[~idx]
- validate_curies(df: DataFrame, column: int | str, *, target_column: str | None = None) -> Series
Validate CURIEs in a given column.
- Parameters:
df – A DataFrame
column – The column of CURIEs to validate
target_column – The optional column to put the results of validation.
- Returns:
A pandas series corresponding to the validity of each row
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 5: GO ID - check capitalization of prefix,
    # i.e., `GO:0003993` is not standard and is therefore false
    idx = brpd.validate_curies(df, column=4)

    # Slice the dataframe based on valid and invalid CURIEs
    valid_go_df = df[idx]
    invalid_go_df = df[~idx]
- validate_identifiers(df: DataFrame, column: int | str, *, prefix: str | None = None, prefix_column: str | None = None, target_column: str | None = None, use_tqdm: bool = False) -> Series
Validate local unique identifiers in a given column.
Some data sources split the prefix and identifier into separate columns, so you can use the prefix_column argument instead of the prefix argument, as in the following example with the GO Annotation Database.
- Parameters:
df – A dataframe
column – A column in the dataframe containing identifiers
prefix – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier.
target_column – If given, stores the results of validation in this column
use_tqdm – Should a progress bar be shown?
- Returns:
A pandas series corresponding to the validity of each row
- Raises:
PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If prefix_column is given and it contains no valid prefixes
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # Use a combination of column 1 (DB) and column 2 (DB Object ID) for validation
    idx = brpd.validate_identifiers(df, column=1, prefix_column=0)

    # Split the dataframe based on valid and invalid identifiers
    valid_df = df[idx]
    invalid_df = df[~idx]
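The rule that exactly one of prefix and prefix_column must be given can be illustrated with a small sketch on plain lists. Everything here is hypothetical: `PrefixLocationErrorSketch` stands in for the library's exception, `ID_PATTERNS` holds illustrative regular expressions, and the `prefixes` argument plays the role of a prefix column. It is not the library's implementation.

```python
import re


class PrefixLocationErrorSketch(ValueError):
    """Stand-in for the error raised when the prefix arguments are ambiguous."""


# Illustrative identifier patterns keyed by prefix (assumed for this sketch).
ID_PATTERNS = {
    "go": re.compile(r"\d{7}"),
    "ncbitaxon": re.compile(r"\d+"),
}


def validate_identifiers_sketch(identifiers, *, prefix=None, prefixes=None):
    """Return one boolean per identifier, True when it matches its prefix's pattern."""
    # Exactly one of the two arguments must be given: reject both and neither.
    if (prefix is None) == (prefixes is None):
        raise PrefixLocationErrorSketch("give exactly one of prefix or prefixes")
    if prefix is not None:
        # A single prefix applies to every row.
        prefixes = [prefix] * len(identifiers)
    results = []
    for row_prefix, identifier in zip(prefixes, identifiers):
        pattern = ID_PATTERNS.get(row_prefix)
        is_valid = pattern is not None and pattern.fullmatch(identifier) is not None
        results.append(is_valid)
    return results


same_namespace = validate_identifiers_sketch(["0003993", "abc"], prefix="go")
per_row = validate_identifiers_sketch(["0003993", "9606"], prefixes=["go", "ncbitaxon"])
```

The returned list of booleans plays the same role as the series returned above: it can be used as a mask to separate valid rows from invalid ones.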
- identifiers_to_curies(df: DataFrame, column: int | str, *, prefix: str | None = None, prefix_column: None | int | str = None, target_column: str | None = None, use_tqdm: bool = False, normalize_prefixes_: bool = True) -> None
Convert a column of local unique identifiers to CURIEs.
- Parameters:
df – A dataframe
column – A column in the dataframe containing identifiers
prefix – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier.
target_column – If given, stores CURIEs in this column
use_tqdm – Should a progress bar be shown?
normalize_prefixes_ – Should the prefix column be normalized automatically if prefix_column is not None?
- Raises:
PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
    brpd.identifiers_to_curies(df, column=1, prefix_column=0)
- identifiers_to_iris(df: DataFrame, column: int | str, *, prefix: str, prefix_column: str | None = None, target_column: str | None = None, use_tqdm: bool = False) -> None
Convert a column of local unique identifiers to IRIs.
- Parameters:
df – A dataframe
column – A column in the dataframe containing identifiers
prefix – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier.
target_column – If given, stores IRIs in this column
use_tqdm – Should a progress bar be shown?
- Raises:
PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
    brpd.identifiers_to_iris(df, column=1, prefix_column=0)
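As a rough illustration of how a prefix and a local identifier can be combined into an IRI, the sketch below substitutes the identifier into a per-prefix format string. The `URI_FORMATS` table, the `$1` placeholder convention, and the helper name are assumptions made for this example. Raising ValueError for an unknown prefix only mirrors the Raises entry above and is not taken from the library's code.

```python
# Illustrative format strings; "$1" marks where the local identifier goes.
URI_FORMATS = {
    "go": "http://amigo.geneontology.org/amigo/term/GO:$1",
    "example": "https://example.org/record/$1",  # hypothetical namespace
}


def identifier_to_iri_sketch(prefix, identifier):
    """Build an IRI from a prefix and a local identifier."""
    try:
        uri_format = URI_FORMATS[prefix]
    except KeyError:
        # Unknown prefixes cannot be expanded in this sketch.
        raise ValueError(f"unknown prefix: {prefix!r}") from None
    return uri_format.replace("$1", identifier)


prefix_column = ["go", "example"]
identifier_column = ["0003993", "abc123"]
iris = [
    identifier_to_iri_sketch(prefix, identifier)
    for prefix, identifier in zip(prefix_column, identifier_column)
]
```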
- curies_to_iris(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None
Convert a column of CURIEs to IRIs.
- Parameters:
df – A dataframe
column – A column in the dataframe containing CURIEs
target_column – If given, stores the IRIs in this column. Otherwise, overwrites the given column in place.
See also: iris_to_curies()
- curies_to_identifiers(df: DataFrame, column: int | str, *, target_column: str | None = None, prefix_column_name: str | None = None) -> None
Split a CURIE column into a prefix and local identifier column.
By default, the local identifier stays in the same column unless target_column is given. If prefix_column_name isn't given, it is derived from the target column (if column labels are available) or the prefix column is appended to the end of the dataframe (if not).
- Parameters:
df – A dataframe
column – A column in the dataframe containing CURIEs
target_column – If given, stores identifiers in this column. Else, stores in the given column
prefix_column_name – If given, stores prefixes in this column. Else, derives the column name from the target column name.
- Raises:
ValueError – If no prefix_column_name is given and the auto-generated name conflicts with a column already in the dataframe.
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 5: GO ID - split CURIEs into a prefix and a local identifier,
    # i.e., `GO:0003993` is split into its prefix and the local identifier `0003993`
    brpd.curies_to_identifiers(df, column=4)
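The split into a prefix column and an identifier column, including the name-conflict check described under Raises, can be sketched on a list of dictionaries. The helper name and the `<column>_prefix` naming scheme are hypothetical. The documentation above only states that the name is derived from the target column, not how.

```python
def split_curie_column_sketch(rows, column, prefix_column_name=None):
    """Return new rows where `column` holds the identifier and a new column holds the prefix."""
    if prefix_column_name is None:
        # Hypothetical naming scheme for the derived prefix column.
        prefix_column_name = f"{column}_prefix"
        if any(prefix_column_name in row for row in rows):
            raise ValueError(f"column already exists: {prefix_column_name!r}")
    result = []
    for row in rows:
        # Split on the first colon only.
        prefix, _, identifier = row[column].partition(":")
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[column] = identifier
        new_row[prefix_column_name] = prefix
        result.append(new_row)
    return result


rows = [{"go_id": "go:0003993"}, {"go_id": "go:0000346"}]
split_rows = split_curie_column_sketch(rows, "go_id")
```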
- iris_to_curies(df: DataFrame, column: int | str, *, target_column: str | None = None) -> None
Convert a column of IRIs to CURIEs.
- Parameters:
df – A dataframe
column – A column in the dataframe containing IRIs
target_column – If given, stores the CURIEs in this column. Otherwise, overwrites the given column in place.
See also: curies_to_iris()
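Since converting IRIs to CURIEs and back are inverse operations, a small round-trip sketch may help. It compresses an IRI by matching the longest known URI prefix and expands a CURIE by concatenation. The `URI_PREFIXES` map and both helper names are assumptions for illustration, and returning `None` for unmatched values is a choice of this sketch, not documented behavior.

```python
# Illustrative prefix map (assumed for this sketch).
URI_PREFIXES = {
    "go": "http://purl.obolibrary.org/obo/GO_",
    "ncbitaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def iri_to_curie_sketch(iri):
    """Compress an IRI to a CURIE using the longest matching URI prefix."""
    # Try longer URI prefixes first so the most specific namespace wins.
    candidates = sorted(URI_PREFIXES.items(), key=lambda item: len(item[1]), reverse=True)
    for prefix, uri_prefix in candidates:
        if iri.startswith(uri_prefix):
            return f"{prefix}:{iri[len(uri_prefix):]}"
    return None  # assumption: unmatched IRIs yield None


def curie_to_iri_sketch(curie):
    """Expand a CURIE to an IRI, or return None if the prefix is unknown."""
    prefix, _, identifier = curie.partition(":")
    uri_prefix = URI_PREFIXES.get(prefix)
    if uri_prefix is None:
        return None
    return uri_prefix + identifier


iri = "http://purl.obolibrary.org/obo/GO_0003993"
curie = iri_to_curie_sketch(iri)
round_trip = curie_to_iri_sketch(curie)
```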