Tabular Data Processing
Utilities for processing tabular data in Pandas dataframes.
The following examples show how entries in the widely used Gene Ontology Annotations database, distributed in the GAF format, can be loaded with pandas and then normalized with the Bioregistry. The database can be loaded in full with the get_goa_example() function.
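As a rough sketch of the loading step with plain pandas, assuming a GAF file is tab-separated, has no header row, and marks comment lines with a leading `!`: the two sample rows below are made up for illustration and are not taken from the real database.

```python
import io

import pandas as pd

# Two made-up, GAF-like rows (tab-separated, no header).
# The first line mimics a GAF comment line, which starts with "!".
gaf_text = (
    "!gaf-version: 2.2\n"
    "UniProtKB\tP00001\tABC1\t\tGO:0003993\tPMID:2676709\n"
    "UniProtKB\tP00002\tABC2\t\tGO:0000346\tPMID:1234567\n"
)

df = pd.read_csv(
    io.StringIO(gaf_text),
    sep="\t",
    comment="!",   # skip comment lines
    header=None,   # columns get integer labels 0, 1, 2, ...
    dtype=str,     # keep identifiers such as "0003993" as text
)
```

Because no header is read, the columns are addressed by integer position, which matches the `column=0` style used in the examples in this section. Note that `comment="!"` also truncates any data line at a `!` character, so this sketch assumes no field contains one.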
- normalize_prefixes(df, column, *, target_column=None)[source]
Normalize prefixes in a given column.
- Parameters:
- Return type:
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 1: DB
    # i.e., `UniProtKB` becomes `uniprot`
    brpd.normalize_prefixes(df, column=0)
- normalize_curies(df, column, *, target_column=None)[source]
Normalize CURIEs in a given column.
- Parameters:
- Return type:
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 5: GO ID - fix normalization of capitalization of prefix,
    # i.e., `GO:0003993` becomes `go:0003993`
    brpd.normalize_curies(df, column=4)

    # column 6: DB:Reference (|DB:Reference) - fix synonym of prefix
    # i.e., `PMID:2676709` becomes `pubmed:2676709`
    brpd.normalize_curies(df, column=5)

    # column 8: With (or) From
    # i.e., `GO:0000346` becomes `go:0000346`
    brpd.normalize_curies(df, column=7)

    # column 13: Taxon(|taxon) - fix synonym of prefix
    # i.e., `taxon:9606` becomes `ncbitaxon:9606`
    brpd.normalize_curies(df, column=12)
- validate_prefixes(df, column, *, target_column=None)[source]
Validate prefixes in a given column.
- Parameters:
- Returns:
A pandas series corresponding to the validity of each row
- Return type:
Series
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 1: DB
    # i.e., `UniProtKB` entries are not standard, and are therefore false
    idx = brpd.validate_prefixes(df, column=0)

    # Slice the dataframe based on valid and invalid prefixes
    valid_prefix_df = df[idx]
    invalid_prefix_df = df[~idx]
- validate_curies(df, column, *, target_column=None)[source]
Validate CURIEs in a given column.
- Parameters:
- Returns:
A pandas series corresponding to the validity of each row
- Return type:
Series
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 5: GO ID - check capitalization of prefix,
    # i.e., `GO:0003993` is not standard and is therefore false
    idx = brpd.validate_curies(df, column=4)

    # Slice the dataframe based on valid and invalid CURIEs
    valid_go_df = df[idx]
    invalid_go_df = df[~idx]
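Since the validation functions return a boolean series, the slicing step is ordinary pandas boolean indexing. A minimal sketch, in which a hand-written mask stands in for the series a validation call would return (no Bioregistry call is made here, and the toy values are illustrative only):

```python
import pandas as pd

# Toy single-column dataframe with an integer column label, as in a headerless GAF file
df = pd.DataFrame({0: ["go:0003993", "GO:0003993", "go:0000346"]})

# Hand-written stand-in for the boolean series returned by a validation function
idx = pd.Series([True, False, True], index=df.index)

valid_df = df[idx]     # rows where the mask is True
invalid_df = df[~idx]  # rows where the mask is False
```

The mask must share the dataframe's index for the selection to line up row by row, which is why it is built with `index=df.index`.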
- validate_identifiers(df, column, *, prefix=None, prefix_column=None, target_column=None, use_tqdm=False)[source]
Validate local unique identifiers in a given column.
Some data sources split the prefix and identifier into separate columns, so you can use the prefix_column argument instead of the prefix argument, as in the following example with the GO Annotation Database:
- Parameters:
df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (Optional[str]) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Optional[str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier
target_column (Optional[str]) – If given, stores the results of validation in this column
use_tqdm (bool) – Should a progress bar be shown?
- Returns:
A pandas series corresponding to the validity of each row
- Raises:
PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If prefix_column is given and it contains no valid prefixes
- Return type:
Series
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # Use a combination of column 1 (DB) and column 2 (DB Object ID) for validation
    idx = brpd.validate_identifiers(df, column=1, prefix_column=0)

    # Split the dataframe based on valid and invalid identifiers
    valid_df = df[idx]
    invalid_df = df[~idx]
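The "exactly one of prefix and prefix_column" rule described under Raises can be pictured with a short sketch. This is not the library's code: `check_prefix_location` is a hypothetical helper, and the exception class below is a local stand-in defined only so the snippet runs on its own.

```python
from typing import Optional, Union


class PrefixLocationError(ValueError):
    """Local stand-in for illustration; not the library's own class."""


def check_prefix_location(
    prefix: Optional[str] = None,
    prefix_column: Union[None, int, str] = None,
) -> str:
    """Return which argument was supplied, or raise if neither or both were."""
    # Compare against None rather than truthiness: column label 0 is a valid value.
    given = [prefix is not None, prefix_column is not None]
    if sum(given) != 1:
        raise PrefixLocationError("give exactly one of prefix and prefix_column")
    return "prefix" if prefix is not None else "prefix_column"
```

Checking `is not None` matters here because the examples in this section pass `prefix_column=0`, and a truthiness test would wrongly treat that as missing.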
- identifiers_to_curies(df, column, *, prefix=None, prefix_column=None, target_column=None, use_tqdm=False, normalize_prefixes_=True)[source]
Convert a column of local unique identifiers to CURIEs.
- Parameters:
df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (Optional[str]) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Union[None, int, str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier
target_column (Optional[str]) – If given, stores CURIEs in this column
use_tqdm (bool) – Should a progress bar be shown?
normalize_prefixes_ – Should the prefix column get auto-normalized if prefix_column is not None?
- Raises:
PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable
- Return type:
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
    brpd.identifiers_to_curies(df, column=1, prefix_column=0)
- identifiers_to_iris(df, column, *, prefix, prefix_column=None, target_column=None, use_tqdm=False)[source]
Convert a column of local unique identifiers to IRIs.
- Parameters:
df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing identifiers
prefix (str) – Specify the prefix if all identifiers in the given column are from the same namespace
prefix_column (Optional[str]) – Specify the prefix_column if there is an additional column whose rows contain the prefix for each row's respective identifier
target_column (Optional[str]) – If given, stores IRIs in this column
use_tqdm (bool) – Should a progress bar be shown?
- Raises:
PrefixLocationError – If not exactly one of the prefix and prefix_column arguments is given
ValueError – If the given prefix is not normalizable
- Return type:
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # Use a combination of column 1 (DB) and column 2 (DB Object ID) for conversion
    brpd.identifiers_to_iris(df, column=1, prefix_column=0)
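Conceptually, turning a local unique identifier into an IRI means substituting it into a URI template associated with its prefix. The sketch below illustrates that idea only. The lookup table, the `$1` placeholder convention, and the `to_iri` helper are all assumptions made for this example and are not the library's API.

```python
# Hypothetical one-entry lookup table for illustration; the format string
# and the "$1" placeholder convention are assumptions, not library data.
URI_FORMATS = {
    "go": "http://amigo.geneontology.org/amigo/term/GO:$1",
}


def to_iri(prefix: str, identifier: str) -> str:
    """Substitute the identifier into the URI format stored for the prefix."""
    return URI_FORMATS[prefix].replace("$1", identifier)


iri = to_iri("go", "0003993")
```

A prefix that is missing from the table raises a `KeyError` in this sketch. The documented functions instead raise a `ValueError` when the given prefix is not normalizable.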
- curies_to_iris(df, column, *, target_column=None)[source]
Convert a column of CURIEs to IRIs.
- Parameters:
- Return type:
See also
- curies_to_identifiers(df, column, *, target_column=None, prefix_column_name=None)[source]
Split a CURIE column into a prefix and local identifier column.
By default, the local identifier stays in the same column unless target_column is given. If prefix_column_name isn't given, it is derived from the target column name (if column labels are available); otherwise, the prefix column is appended to the end of the dataframe.
- Parameters:
df (DataFrame) – A dataframe
column (Union[int, str]) – A column in the dataframe containing CURIEs
target_column (Optional[str]) – If given, stores identifiers in this column. Else, stores in the given column
prefix_column_name (Optional[str]) – If given, stores prefixes in this column. Else, derives the column name from the target column name.
- Raises:
ValueError – If no prefix_column_name is given and the auto-generated name conflicts with a column already in the dataframe.
- Return type:
    import bioregistry.pandas as brpd
    import pandas as pd

    df = brpd.get_goa_example()

    # column 5: GO ID - split CURIEs into a prefix column and a local identifier column
    # i.e., `GO:0003993` is split into its prefix and the local identifier `0003993`
    brpd.curies_to_identifiers(df, column=4)
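The split that this function performs can be pictured with a few lines of plain Python. This is only a sketch of the idea, not the library's implementation: `split_curie` is a hypothetical helper, and unlike the Bioregistry it makes no attempt to normalize the prefix.

```python
from typing import Tuple


def split_curie(curie: str) -> Tuple[str, str]:
    """Split on the first colon only, since local identifiers may contain colons."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


curies = ["GO:0003993", "PMID:2676709"]

# Unzip the (prefix, identifier) pairs into two parallel tuples,
# analogous to the prefix column and the identifier column.
prefixes, identifiers = zip(*(split_curie(c) for c in curies))
```

Using `partition` rather than `split(":")` keeps any later colons inside the identifier part instead of producing extra fields.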