Inspect & standardize identifiers
To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.
Bionty enables this by mapping metadata against versioned ontologies using inspect().
For terms that are not directly mappable, Bionty offers synonym mapping via map_synonyms(), auto-completed lookup via lookup(), and free-text search via search() (also see Search & lookup terms).
import bionty as bt
import pandas as pd
Inspect and map synonyms of gene identifiers
To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
| ensembl_gene_id | gene symbol | hgnc id |
|---|---|---|
| ENSG00000148584 | A1CF | HGNC:24086 |
| ENSG00000121410 | A1BG | HGNC:5 |
| ENSG00000188389 | FANCD1 | HGNC:1101 |
| ENSGcorrupted | corrupted | corrupted |
First we can check whether any of our values are mappable against the ontology reference.
Tip: available fields are accessible via gene_bionty.fields
gene_bionty = bt.Gene()
gene_bionty
Gene
Species: human
Source: ensembl, release-109
📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object
gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
'not_mapped': ['ENSGcorrupted']}
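Conceptually, the inspection above is a membership check against one column of the reference table. The sketch below is not Bionty's implementation; `inspect_ids` and the hard-coded `reference` set are hypothetical names used only to illustrate the idea.

```python
def inspect_ids(values, reference):
    """Split values into those found in the reference set and those that are not."""
    mapped = [v for v in values if v in reference]
    not_mapped = [v for v in values if v not in reference]
    total = len(values)
    # Share of mapped identifiers, similar in spirit to the log above.
    pct_mapped = round(100 * len(mapped) / total, 1) if total else 0.0
    return {"mapped": mapped, "not_mapped": not_mapped, "pct_mapped": pct_mapped}


# Hypothetical reference: in practice this would be a column of the ontology table.
reference = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}
ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "ENSGcorrupted"]

result = inspect_ids(ids, reference)
print(result["not_mapped"])  # ['ENSGcorrupted']
print(result["pct_mapped"])  # 75.0
```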
The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.
gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)
🔶 The identifiers contain synonyms!
   To increase mappability, standardize them via '.map_synonyms()'
✅ 2 terms (50.0%) are mapped.
🔶 2 terms (50.0%) are not mapped.
{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}
Two of the four gene symbols map directly. Bionty also warns that some of the symbols are synonyms that can be mapped to standardized symbols.
Mapping synonyms returns a list of standardized terms:
mapped_symbol_synonyms = gene_bionty.map_synonyms(df_orig["gene symbol"])
mapped_symbol_synonyms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']
Optionally, return only a mapper of {synonym: standardized name}:
gene_bionty.map_synonyms(df_orig["gene symbol"], return_mapper=True)
{'FANCD1': 'BRCA2'}
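To make the two return shapes concrete, here is a minimal sketch of how such a mapper could be derived from a reference table. It assumes synonyms are stored as a pipe-delimited string per standardized name; `build_synonym_mapper` and the toy `reference` dict are hypothetical and do not reflect Bionty's internals.

```python
def build_synonym_mapper(reference, values):
    """Return {synonym: standardized name} for values that are known synonyms."""
    # Invert the reference so that every synonym points to its standardized name.
    synonym_to_name = {}
    for name, synonyms in reference.items():
        for synonym in synonyms.split("|"):
            if synonym:
                synonym_to_name[synonym] = name
    # Values that already are standardized names are left out of the mapper.
    return {
        v: synonym_to_name[v]
        for v in values
        if v not in reference and v in synonym_to_name
    }


# Hypothetical toy reference; the synonym strings are illustrative only.
reference = {"BRCA2": "FANCD1|FAD1", "A1CF": "", "A1BG": ""}
symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

mapper = build_synonym_mapper(reference, symbols)
# Applying the mapper keeps unknown values unchanged, like the list output above.
standardized = [mapper.get(s, s) for s in symbols]
print(mapper)        # {'FANCD1': 'BRCA2'}
print(standardized)  # ['A1CF', 'A1BG', 'BRCA2', 'corrupted']
```

Returning only the mapper is convenient when the result is passed to a rename operation, while the full list is convenient when it replaces a column or index directly.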
We can use the standardized symbols as the new index:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated
| | ensembl_gene_id | gene symbol | hgnc id |
|---|---|---|---|
| A1CF | ENSG00000148584 | A1CF | HGNC:24086 |
| A1BG | ENSG00000121410 | A1BG | HGNC:5 |
| BRCA2 | ENSG00000188389 | FANCD1 | HGNC:1101 |
| corrupted | ENSGcorrupted | corrupted | corrupted |
You may return a DataFrame with a boolean column indicating if the identifiers are mappable:
gene_bionty.inspect(df_curated.index, gene_bionty.symbol, return_df=True)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
| | __mapped__ |
|---|---|
| A1CF | True |
| A1BG | True |
| BRCA2 | True |
| corrupted | False |
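The boolean column is convenient as a mask. The following sketch rebuilds the table above by hand (it does not call Bionty) and uses plain pandas boolean indexing to separate mappable identifiers from those that need review; the variable names are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the DataFrame returned with return_df=True above.
mapped_df = pd.DataFrame(
    {"__mapped__": [True, True, True, False]},
    index=["A1CF", "A1BG", "BRCA2", "corrupted"],
)

# Keep identifiers that are mappable and list the rest for manual review.
valid_ids = mapped_df.loc[mapped_df["__mapped__"]].index.tolist()
to_review = mapped_df.loc[~mapped_df["__mapped__"]].index.tolist()

print(valid_ids)  # ['A1CF', 'A1BG', 'BRCA2']
print(to_review)  # ['corrupted']
```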
Standardize and look up unmapped CellMarker identifiers
Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.
This section demonstrates how to look up unmatched terms and curate them using CellMarker.
First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.
markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127a",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)
Let's instantiate the CellMarker ontology with the default database and version.
cellmarker_bionty = bt.CellMarker()
cellmarker_bionty
CellMarker
Species: human
Source: cellmarker, 2.0
📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of terms
🎯 CellMarker.search(): free text search of terms
🧐 CellMarker.inspect(): check if identifiers are mappable
👽 CellMarker.map_synonyms(): map synonyms to standardized names
🔗 CellMarker.ontology: Pronto.Ontology object
Now let's check which cell markers from the file can be found in the reference:
cellmarker_bionty.inspect(markers.index, cellmarker_bionty.name)
🔶 Detected inconsistent casing of mapped terms!
   For best practice, standardize casing via '.map_synonyms()'
🔶 The identifiers contain synonyms!
   To increase mappability, standardize them via '.map_synonyms()'
✅ 8 terms (57.1%) are mapped.
🔶 6 terms (42.9%) are not mapped.
{'mapped': ['CCR7', 'CD14', 'CD8', 'CD45RA', 'CD4', 'CD3', 'CD66b', 'Siglec8'],
'not_mapped': ['KI67', 'CD127a', 'PD1', 'Invalid-1', 'Invalid-2', 'Time']}
The log suggests mapping synonyms:
synonyms_mapper = cellmarker_bionty.map_synonyms(markers.index, return_mapper=True)
The mapper has three entries: two synonyms that were previously unmapped (KI67, PD1) and one casing fix (Siglec8):
synonyms_mapper
{'KI67': 'Ki-67', 'PD1': 'PD-1', 'Siglec8': 'SIGLEC8'}
Let's replace the synonyms with standardized names in the markers DataFrame:
markers.rename(index=synonyms_mapper, inplace=True)
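The casing warning in the log corresponds to entries such as Siglec8, which differs from the reference name only by letter case. As a rough illustration of that idea (not Bionty's code), the hypothetical helper `casing_mapper` below finds such values by comparing lower-cased strings. It assumes reference names stay unique once lower-cased.

```python
def casing_mapper(values, reference_names):
    """Map values that differ from a reference name only by letter case."""
    # Assumption: no two reference names collide when lower-cased.
    by_lower = {name.lower(): name for name in reference_names}
    return {
        v: by_lower[v.lower()]
        for v in values
        if v not in reference_names and v.lower() in by_lower
    }


# Hypothetical subset of reference names, for illustration only.
reference_names = {"SIGLEC8", "CD66b", "CCR7"}
fixes = casing_mapper(["Siglec8", "CD66b", "Time"], reference_names)
print(fixes)  # {'Siglec8': 'SIGLEC8'}
```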
Inspecting again shows that 4 terms are still not found in the reference. Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which CellMarker will not curate.
cellmarker_bionty.inspect(markers.index, cellmarker_bionty.name)
✅ 10 terms (71.4%) are mapped.
🔶 4 terms (28.6%) are not mapped.
{'mapped': ['Ki-67',
'CCR7',
'CD14',
'CD8',
'CD45RA',
'CD4',
'CD3',
'PD-1',
'CD66b',
'SIGLEC8'],
'not_mapped': ['CD127a', 'Invalid-1', 'Invalid-2', 'Time']}
CD127a is not found, so let's check the lookup with auto-completion:
lookup = cellmarker_bionty.lookup()
lookup.cd127
CellMarker(name='CD127', synonyms='IL-7R|IL7r|Il7r|IL7R', gene_symbol='IL7R', ncbi_gene_id='3575', uniprotkb_id='P16871')
Indeed, the correct name is CD127; CD127a was a typo.
Now let's fix the markers so all of them can be linked:
Tip
Using .lookup() instead of typing a string helps avoid typos!
curated_df = markers.rename(index={"CD127a": lookup.cd127.name})
Optionally, run a fuzzy match:
cellmarker_bionty.search("CD127a").head()
| name | synonyms | gene_symbol | ncbi_gene_id | uniprotkb_id | __ratio__ |
|---|---|---|---|---|---|
| CD127 | IL-7R|IL7r|Il7r|IL7R | IL7R | 3575 | P16871 | 90.909091 |
| CD120a | TNFRSF1A|TNFR1 | TNFRSF1A | 7132 | P19438 | 83.333333 |
| LAMP1 | CD107a|Lamp1 | LAMP1 | 3916 | A0A024RDY3 | 83.333333 |
| CD121a | | None | None | None | 83.333333 |
| CD167a | | None | None | None | 83.333333 |
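To give a feel for how a ratio like the one in the table can arise, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in technique, not necessarily the scorer Bionty uses, and `fuzzy_search` plus the candidate list are hypothetical. It only compares against names, whereas the LAMP1 row above suggests the real search also considers synonyms.

```python
from difflib import SequenceMatcher


def fuzzy_search(query, names, top=3):
    """Rank names by similarity to the query, scored from 0 to 100."""
    scored = [
        (name, 100 * SequenceMatcher(None, query, name).ratio()) for name in names
    ]
    # Highest similarity first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Hypothetical candidate names taken from the outputs shown in this section.
names = ["CD127", "CD120a", "CD121a", "CD167a", "CCR7"]
hits = fuzzy_search("CD127a", names)
print(hits[0][0])  # CD127
```

A fuzzy match is a suggestion rather than a decision: the best-scoring candidate still needs a manual check before it is used for renaming.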
Now we can run inspect again: all cell markers are linked, and only the three non-marker channels remain unmapped.
cellmarker_bionty.inspect(curated_df.index, cellmarker_bionty.name)
✅ 11 terms (78.6%) are mapped.
🔶 3 terms (21.4%) are not mapped.
{'mapped': ['Ki-67',
'CCR7',
'CD14',
'CD8',
'CD45RA',
'CD4',
'CD3',
'CD127',
'PD-1',
'CD66b',
'SIGLEC8'],
'not_mapped': ['Invalid-1', 'Invalid-2', 'Time']}
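Since the remaining unmapped entries are non-marker channels, one option is to keep them apart from the curated markers rather than trying to map them. The sketch below shows this with plain lists; `split_channels` is a hypothetical helper, and the shortened `index` is for illustration only.

```python
def split_channels(index, not_mapped):
    """Separate curated marker names from channels that stay uncurated."""
    unmapped = set(not_mapped)
    markers_only = [name for name in index if name not in unmapped]
    other_channels = [name for name in index if name in unmapped]
    return markers_only, other_channels


# Shortened example index; not_mapped mirrors the inspect result above.
index = ["Ki-67", "CCR7", "CD127", "Invalid-1", "Invalid-2", "Time"]
markers_only, other_channels = split_channels(
    index, ["Invalid-1", "Invalid-2", "Time"]
)
print(markers_only)    # ['Ki-67', 'CCR7', 'CD127']
print(other_channels)  # ['Invalid-1', 'Invalid-2', 'Time']
```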