
Standardize and append a shard of data

Here, we’ll learn

  • how to standardize a less well curated dataset

  • how to append it as a new shard of data to the growing versioned dataset

import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = "hint"
ln.track()
💡 lamindb instance: testuser1/test-scrna
💡 notebook imports: lamindb==0.63.2 lnschema_bionty==0.35.3
💡 saved: Transform(uid='ManDYgmftZ8Cz8', name='Standardize and append a shard of data', short_name='scrna2', version='0', type=notebook, updated_at=2023-12-05 17:29:20 UTC, created_by_id=1)
💡 saved: Run(uid='s1GaxKOImqmlu2gb5ir7', run_at=2023-12-05 17:29:20 UTC, transform_id=2, created_by_id=1)

Standardize a data shard

Let’s now consider a dataset with less well-curated features:

adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, and can globally instruct bionty to assume human:

lb.settings.organism = "human"

Standardize & validate genes

This data shard is indexed by gene symbols, which we’ll want to map to Ensembl IDs:

adata.var.head()
            n_counts  highly_variable
index
HES4     1153.387451             True
TNFRSF4   304.358154             True
SSU72    2530.272705            False
PARK7    7451.664062            False
RBP7      272.811035             True

Let’s inspect the identifiers:

lb.Gene.inspect(adata.var.index, lb.Gene.symbol)
✅ 695 terms (90.80%) are validated for symbol
❗ 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...
   detected 54 terms with synonyms: ATPIF1, C1orf228, CCBL2, AC079767.4, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, SEPT7, WBSCR22, RSBN1L-AS1, CCDC132, ...
→  standardize terms via .standardize()
   detected 5 Gene terms in Bionty for symbol: 'GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2'
→  add records from Bionty to your Gene registry via .from_values()
   couldn't validate 11 terms: 'RP11-489E7.4', 'RP3-467N11.1', 'CTD-3138B18.5', 'TMBIM4-1', 'RP11-291B21.2', 'RP11-277L2.3', 'AC084018.1', 'RP11-156E8.1', 'RP11-782C8.1', 'RP11-620J15.3', 'RP11-390E23.6'
→  if you are sure, create new records via ln.Gene() and save to your registry
<lamin_utils._inspect.InspectResult at 0x7f3984ec3700>
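
The report above sorts each identifier into one of four groups: already validated, resolvable through a synonym, known to the public ontology but not yet in the local registry, or unknown. The following is a minimal sketch of that bucketing in plain Python. It is not how lamindb implements .inspect(); the three toy lookup tables and the function name bucket_terms are hypothetical stand-ins for the registry and the public ontology.

```python
from collections import defaultdict

# Hypothetical toy lookups; a real registry holds tens of thousands of genes.
REGISTRY_SYMBOLS = {"HES4", "TNFRSF4", "JCHAIN"}
SYNONYM_TO_SYMBOL = {"IGJ": "JCHAIN"}  # outdated symbol -> current symbol
PUBLIC_ONLY_SYMBOLS = {"GPX1", "SOD2"}  # in the public ontology, not registered yet


def bucket_terms(terms):
    """Group terms by how (or whether) they can be resolved."""
    buckets = defaultdict(list)
    for term in terms:
        if term in REGISTRY_SYMBOLS:
            buckets["validated"].append(term)
        elif term in SYNONYM_TO_SYMBOL:
            buckets["synonym"].append(term)
        elif term in PUBLIC_ONLY_SYMBOLS:
            buckets["public_only"].append(term)
        else:
            buckets["unknown"].append(term)
    return dict(buckets)


terms = ["HES4", "IGJ", "GPX1", "RP11-782C8.1", "TNFRSF4"]
buckets = bucket_terms(terms)

# Report the validated fraction, then the remaining groups.
n_validated = len(buckets.get("validated", []))
print(f"{n_validated} terms ({n_validated / len(terms):.2%}) are validated")
for group in ("synonym", "public_only", "unknown"):
    print(group, buckets.get(group, []))
```

Each group calls for a different follow-up, which is what the arrows in the report suggest: standardize the synonyms, import the public-only terms, and decide case by case on the unknown ones.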

Let’s first map the synonyms to their standardized gene symbols, and then validate again:

adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)
💡 standardized 749/765 terms
✅ 749 terms (97.90%) are validated for symbol
❗ 16 terms (2.10%) are not validated for symbol: RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, GPX1, RP3-467N11.1, SOD2, RP11-390E23.6, RP11-489E7.4, RP11-291B21.2, RP11-620J15.3, TMBIM4-1, AC084018.1, RN7SL1, SNORD3B-2, CTD-3138B18.5, IGLL5
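
Conceptually, the two calls above first rewrite known synonyms and then test membership in the registry, yielding one boolean per gene. Here is a rough sketch of that idea in plain Python, assuming a toy synonym table and registry; the names SYNONYM_TO_SYMBOL, REGISTRY_SYMBOLS, standardize and validate are illustrative and only mimic the behavior of the lamindb methods.

```python
# Hypothetical toy data: a synonym table and the set of registered symbols.
SYNONYM_TO_SYMBOL = {"ATPIF1": "ATP5IF1", "IGJ": "JCHAIN", "FYB": "FYB1"}
REGISTRY_SYMBOLS = {"ATP5IF1", "JCHAIN", "FYB1", "HES4"}


def standardize(terms):
    """Replace known synonyms by their current symbol; leave other terms as is."""
    return [SYNONYM_TO_SYMBOL.get(term, term) for term in terms]


def validate(terms):
    """Return one boolean per term: True if the term is a registered symbol."""
    return [term in REGISTRY_SYMBOLS for term in terms]


symbols = ["HES4", "ATPIF1", "IGJ", "RP11-782C8.1"]
standardized = standardize(symbols)
mask = validate(standardized)

print(standardized)
print(mask)
```

Terms without a synonym entry pass through standardization unchanged, so they still fail validation afterwards, just like the 16 remaining symbols in the output above.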

We only want to register data with validated genes, so we subset the columns with the boolean mask returned by .validate():

adata_validated = adata[:, validated].copy()
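
The subsetting above keeps exactly those columns whose mask entry is True and keeps the gene names aligned with the matrix. A small sketch of the same operation on plain lists, assuming a made-up 2 × 3 count matrix (the variable names are illustrative only):

```python
from itertools import compress

var_names = ["HES4", "ATP5IF1", "RP11-782C8.1"]
mask = [True, True, False]  # one entry per gene, as returned by a validation step
X = [  # rows are cells, columns are genes (made-up counts)
    [1.0, 0.0, 3.0],
    [0.0, 2.0, 5.0],
]

# Apply the same mask to the gene names and to every row of the matrix.
kept_names = list(compress(var_names, mask))
X_kept = [list(compress(row, mask)) for row in X]

print(kept_names)
print(X_kept)
```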

Now that all symbols are validated, let’s convert them to Ensembl IDs via .standardize(). Note that this mapping can be ambiguous because one symbol may correspond to more than one Ensembl ID; the first match is kept because the keep argument of .standardize() defaults to "first":

adata_validated.var["ensembl_gene_id"] = lb.Gene.standardize(
    adata_validated.var.index,
    field=lb.Gene.symbol,
    return_field=lb.Gene.ensembl_gene_id,
)
adata_validated.var.index.name = "symbol"
adata_validated.var = adata_validated.var.reset_index().set_index("ensembl_gene_id")
adata_validated.var.head()
💡 standardized 749/749 terms
                  symbol     n_counts  highly_variable
ensembl_gene_id
ENSG00000188290     HES4  1153.387451             True
ENSG00000186827  TNFRSF4   304.358154             True
ENSG00000160075    SSU72  2530.272705            False
ENSG00000116288    PARK7  7451.664062            False
ENSG00000162444     RBP7   272.811035             True
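
To make the "first match is kept" behavior concrete, here is a minimal sketch of resolving a one-to-many symbol table. It is an illustration under stated assumptions, not lamindb's code: build_mapper is a hypothetical helper, the symbol GENE_X and its two IDs are made-up placeholders, and the "last" option exists only in this sketch for contrast.

```python
# (symbol, ensembl_gene_id) rows; GENE_X and its IDs are made-up placeholders
# standing in for a symbol that maps to more than one Ensembl ID.
rows = [
    ("HES4", "ENSG00000188290"),
    ("TNFRSF4", "ENSG00000186827"),
    ("GENE_X", "ENSG_PLACEHOLDER_1"),
    ("GENE_X", "ENSG_PLACEHOLDER_2"),
]


def build_mapper(rows, keep="first"):
    """Collapse a one-to-many table into a dict, keeping the first or last match."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    mapper = {}
    for symbol, ensembl_id in rows:
        if keep == "first":
            mapper.setdefault(symbol, ensembl_id)  # ignore later duplicates
        else:
            mapper[symbol] = ensembl_id  # later duplicates overwrite earlier ones
    return mapper


first = build_mapper(rows, keep="first")
last = build_mapper(rows, keep="last")
print(first["GENE_X"], last["GENE_X"])
```

Whichever rule is used, the choice is silent, so it is worth checking ambiguous symbols by hand when the exact Ensembl ID matters.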

We also keep .raw: we subset it to the same validated genes and give it the same Ensembl ID index as .var:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

Standardize & validate cell types

Inspection shows none of the terms are validated:

inspector = lb.CellType.inspect(adata_validated.obs.cell_type)
❗ received 9 unique terms, 61 empty/duplicated terms are ignored
❗ 9 terms (100.00%) are not validated for name: Dendritic cells, CD19+ B, CD4+/CD45RO+ Memory, CD8+ Cytotoxic T, CD4+/CD25 T Reg, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD34+
   couldn't validate 9 terms: 'CD56+ NK', 'Dendritic cells', 'CD4+/CD25 T Reg', 'CD19+ B', 'CD34+', 'CD14+ Monocytes', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T'
→  if you are sure, create new records via ln.CellType() and save to your registry

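Let’s search the public ontology for each cell type name, and add the name found in the AnnData object as a synonym to the top match. Note that the top search hit is not guaranteed to be biologically correct: in the output of this cell, for example, 'CD4+/CD25 T Reg' is added as a synonym of a CD8-positive regulatory T cell record, so the resulting mapping should be reviewed before relying on it.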

bionty = lb.CellType.bionty()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()  # save the record
    # add the original name as a synonym, so that next time, we can just run .standardize()
    record.add_synonym(name)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000451'
πŸ’‘ also saving parents of CellType(uid='9JGbXeUA', name='dendritic cell', ontology_id='CL:0000451', description='A Cell Of Hematopoietic Origin, Typically Resident In Particular Tissues, Specialized In The Uptake, Processing, And Transport Of Antigens To Lymph Nodes For The Purpose Of Stimulating An Immune Response Via T Cell Activation. These Cells Are Lineage Negative (Cd3-Negative, Cd19-Negative, Cd34-Negative, And Cd56-Negative).', updated_at=2023-12-05 17:29:27 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 2 CellType records from Bionty matching ontology_id: 'CL:0000842', 'CL:0000145'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
πŸ’‘ you can switch this off via: lb.settings.auto_save_parents = False
πŸ’‘ also saving parents of CellType(uid='3iDuKkvT', name='mononuclear cell', ontology_id='CL:0000842', synonyms='mononuclear leukocyte', description='A Leukocyte With A Single Non-Segmented Nucleus In The Mature Form.', updated_at=2023-12-05 17:29:28 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 2 CellType records from Bionty matching ontology_id: 'CL:0000738', 'CL:0000226'
πŸ’‘ also saving parents of CellType(uid='MkrH0gsX', name='leukocyte', ontology_id='CL:0000738', synonyms='white blood cell|leucocyte', description='An Achromatic Cell Of The Myeloid Or Lymphoid Lineages Capable Of Ameboid Movement, Found In Blood Or Other Tissue.', updated_at=2023-12-05 17:29:28 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 3 CellType records from Bionty matching ontology_id: 'CL:0000219', 'CL:0002242', 'CL:0000988'
πŸ’‘ also saving parents of CellType(uid='ek6M40lJ', name='motile cell', ontology_id='CL:0000219', description='A Cell That Moves By Its Own Activities.', updated_at=2023-12-05 17:29:29 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000003'
πŸ’‘ also saving parents of CellType(uid='VT73gpK2', name='native cell', ontology_id='CL:0000003', description='A Cell That Is Found In A Natural Setting, Which Includes Multicellular Organism Cells 'In Vivo' (I.E. Part Of An Organism), And Unicellular Organisms 'In Environment' (I.E. Part Of A Natural Environment).', updated_at=2023-12-05 17:29:29 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000000'
πŸ’‘ also saving parents of CellType(uid='P0lR3Ehm', name='nucleate cell', ontology_id='CL:0002242', description='A Cell Containing At Least One Nucleus.', updated_at=2023-12-05 17:29:29 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='Q0aQr5JB', name='hematopoietic cell', ontology_id='CL:0000988', synonyms='haematopoietic cell|hemopoietic cell|haemopoietic cell', description='A Cell Of A Hematopoietic Lineage.', updated_at=2023-12-05 17:29:29 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0002371'
πŸ’‘ also saving parents of CellType(uid='QMAH6IlS', name='somatic cell', ontology_id='CL:0002371', description='A Cell Of An Organism That Does Not Pass On Its Genetic Material To The Organism'S Offspring (I.E. A Non-Germ Line Cell).', updated_at=2023-12-05 17:29:31 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='6ADi80vK', name='single nucleate cell', ontology_id='CL:0000226', description='A Cell With A Single Nucleus.', updated_at=2023-12-05 17:29:28 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='Iv0Y02ZA', name='professional antigen presenting cell', ontology_id='CL:0000145', description='A Cell Capable Of Processing And Presenting Lipid And Protein Antigens To T Cells In Order To Initiate An Immune Response.', updated_at=2023-12-05 17:29:28 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='9JGbXeUA', name='dendritic cell', ontology_id='CL:0000451', synonyms='Dendritic cells', description='A Cell Of Hematopoietic Origin, Typically Resident In Particular Tissues, Specialized In The Uptake, Processing, And Transport Of Antigens To Lymph Nodes For The Purpose Of Stimulating An Immune Response Via T Cell Activation. These Cells Are Lineage Negative (Cd3-Negative, Cd19-Negative, Cd34-Negative, And Cd56-Negative).', updated_at=2023-12-05 17:29:31 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0001201'
πŸ’‘ also saving parents of CellType(uid='CIS4VJI0', name='B cell, CD19-positive', ontology_id='CL:0001201', synonyms='B-cell, CD19-positive|B lymphocyte, CD19-positive|CD19-positive B cell|B-lymphocyte, CD19-positive|CD19+ B cell', description='A B Cell That Is Cd19-Positive.', updated_at=2023-12-05 17:29:32 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 2 CellType records from Bionty matching ontology_id: 'CL:0001200', 'CL:0000236'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
πŸ’‘ you can switch this off via: lb.settings.auto_save_parents = False
πŸ’‘ also saving parents of CellType(uid='YsKb7oIy', name='lymphocyte of B lineage, CD19-positive', ontology_id='CL:0001200', description='A Lymphocyte Of B Lineage That Is Cd19-Positive.', updated_at=2023-12-05 17:29:32 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000945'
πŸ’‘ also saving parents of CellType(uid='Z0yFV7vU', name='lymphocyte of B lineage', ontology_id='CL:0000945', description='A Lymphocyte Of B Lineage With The Commitment To Express An Immunoglobulin Complex.', updated_at=2023-12-05 17:29:33 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='cx8VcggA', name='B cell', ontology_id='CL:0000236', synonyms='B-lymphocyte|B lymphocyte|B-cell', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', updated_at=2023-12-05 17:29:32 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='CIS4VJI0', name='B cell, CD19-positive', ontology_id='CL:0001201', synonyms='CD19-positive B cell|CD19+ B|B lymphocyte, CD19-positive|B-cell, CD19-positive|CD19+ B cell|B-lymphocyte, CD19-positive', description='A B Cell That Is Cd19-Positive.', updated_at=2023-12-05 17:29:33 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
πŸ’‘ also saving parents of CellType(uid='6VQXlWS7', name='effector memory CD4-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:0001087', synonyms='CD4-positive TEMRA|CD4+ TEMRA', description='A Cd4-Positive, Alpha Beta Memory T Cell With The Phenotype Cd45Ra-Positive, Cd45Ro-Negative, And Ccr7-Negative.', updated_at=2023-12-05 17:29:34 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 2 CellType records from Bionty matching ontology_id: 'CL:4030002', 'CL:0000897'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
πŸ’‘ you can switch this off via: lb.settings.auto_save_parents = False
πŸ’‘ also saving parents of CellType(uid='ylUbqlrS', name='effector memory CD45RA-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:4030002', synonyms='TEMRA cell|terminally differentiated effector memory CD45RA+ T cells|terminally differentiated effector memory cells re-expressing CD45RA', description='An Alpha-Beta Memory T Cell With The Phenotype Cd45Ra-Positive.', updated_at=2023-12-05 17:29:34 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000791'
πŸ’‘ also saving parents of CellType(uid='WKpZjuYS', name='mature alpha-beta T cell', ontology_id='CL:0000791', synonyms='mature alpha-beta T-cell|mature alpha-beta T lymphocyte|mature alpha-beta T-lymphocyte', description='A Alpha-Beta T Cell That Has A Mature Phenotype.', updated_at=2023-12-05 17:29:35 UTC, bionty_source_id=21, created_by_id=1)
βœ… loaded 1 CellType record matching ontology_id: 'CL:0000789'
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0002419'
πŸ’‘ also saving parents of CellType(uid='2C5PhwrW', name='mature T cell', ontology_id='CL:0002419', synonyms='mature T-cell|CD3e-positive T cell', description='A T Cell That Expresses A T Cell Receptor Complex And Has Completed T Cell Selection.', updated_at=2023-12-05 17:29:36 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000084'
πŸ’‘ also saving parents of CellType(uid='BxNjby0x', name='T cell', ontology_id='CL:0000084', synonyms='T-cell|T-lymphocyte|T lymphocyte', description='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', updated_at=2023-12-05 17:29:36 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='s6Ag7R5U', name='CD4-positive, alpha-beta memory T cell', ontology_id='CL:0000897', synonyms='CD4-positive, alpha-beta memory T lymphocyte|CD4-positive, alpha-beta memory T-lymphocyte|CD4-positive, alpha-beta memory T-cell', description='A Cd4-Positive, Alpha-Beta T Cell That Has Differentiated Into A Memory T Cell.', updated_at=2023-12-05 17:29:34 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 2 CellType records from Bionty matching ontology_id: 'CL:0000813', 'CL:0000624'
πŸ’‘ also saving parents of CellType(uid='Re00kg0W', name='memory T cell', ontology_id='CL:0000813', synonyms='memory T-cell|memory T lymphocyte|memory T-lymphocyte', description='A Long-Lived, Antigen-Experienced T Cell That Has Acquired A Memory Phenotype Including Distinct Surface Markers And The Ability To Differentiate Into An Effector T Cell Upon Antigen Reexposure.', updated_at=2023-12-05 17:29:37 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='05vQoepH', name='CD4-positive, alpha-beta T cell', ontology_id='CL:0000624', synonyms='CD4-positive, alpha-beta T lymphocyte|CD4-positive, alpha-beta T-cell|CD4-positive, alpha-beta T-lymphocyte', description='A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor.', updated_at=2023-12-05 17:29:37 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='6VQXlWS7', name='effector memory CD4-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:0001087', synonyms='CD4+ TEMRA|CD4-positive TEMRA|CD4+/CD45RO+ Memory', description='A Cd4-Positive, Alpha Beta Memory T Cell With The Phenotype Cd45Ra-Positive, Cd45Ro-Negative, And Ccr7-Negative.', updated_at=2023-12-05 17:29:37 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
πŸ’‘ also saving parents of CellType(uid='OxsmyL44', name='cytotoxic T cell', ontology_id='CL:0000910', synonyms='cytotoxic T lymphocyte|cytotoxic T-lymphocyte|cytotoxic T-cell', description='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', updated_at=2023-12-05 17:29:38 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000911'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
πŸ’‘ you can switch this off via: lb.settings.auto_save_parents = False
πŸ’‘ also saving parents of CellType(uid='yvHkIrVI', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', updated_at=2023-12-05 17:29:38 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='OxsmyL44', name='cytotoxic T cell', ontology_id='CL:0000910', synonyms='cytotoxic T-lymphocyte|CD8+ Cytotoxic T|cytotoxic T-cell|cytotoxic T lymphocyte', description='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', updated_at=2023-12-05 17:29:38 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
πŸ’‘ also saving parents of CellType(uid='ORD0dMdt', name='CD8-positive, CD25-positive, alpha-beta regulatory T cell', ontology_id='CL:0000919', synonyms='CD8+CD25+ T cell|CD8+CD25+ T-cell|CD8+CD25+ T(reg)|CD8-positive, CD25-positive, alpha-beta regulatory T-lymphocyte|CD8+CD25+ T lymphocyte|CD8-positive, CD25-positive, alpha-beta regulatory T-cell|CD8+CD25+ T-lymphocyte|CD8+CD25+ Treg|CD8-positive, CD25-positive Treg|CD8-positive, CD25-positive, alpha-beta regulatory T lymphocyte', description='A Cd8-Positive Alpha Beta-Positive T Cell With The Phenotype Foxp3-Positive And Having Suppressor Function.', updated_at=2023-12-05 17:29:39 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000795'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
πŸ’‘ you can switch this off via: lb.settings.auto_save_parents = False
πŸ’‘ also saving parents of CellType(uid='oTsFrhYW', name='CD8-positive, alpha-beta regulatory T cell', ontology_id='CL:0000795', synonyms='CD8-positive, alpha-beta regulatory T-cell|CD8-positive Treg|CD8-positive T(reg)|CD8+ regulatory T cell|CD8-positive, alpha-beta regulatory T lymphocyte|CD8-positive, alpha-beta Treg|CD8-positive, alpha-beta regulatory T-lymphocyte|CD8+ T(reg)|CD8+ Treg', description='A Cd8-Positive, Alpha-Beta T Cell That Regulates Overall Immune Responses As Well As The Responses Of Other T Cell Subsets Through Direct Cell-Cell Contact And Cytokine Release.', updated_at=2023-12-05 17:29:39 UTC, bionty_source_id=21, created_by_id=1)
βœ… loaded 1 CellType record matching ontology_id: 'CL:0000815'
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000625'
πŸ’‘ also saving parents of CellType(uid='VnKkQsME', name='CD8-positive, alpha-beta T cell', ontology_id='CL:0000625', synonyms='CD8-positive, alpha-beta T-cell|CD8-positive, alpha-beta T lymphocyte|CD8-positive, alpha-beta T-lymphocyte', description='A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor.', updated_at=2023-12-05 17:29:40 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='ORD0dMdt', name='CD8-positive, CD25-positive, alpha-beta regulatory T cell', ontology_id='CL:0000919', synonyms='CD8-positive, CD25-positive, alpha-beta regulatory T lymphocyte|CD8+CD25+ T-cell|CD8+CD25+ Treg|CD8-positive, CD25-positive Treg|CD4+/CD25 T Reg|CD8-positive, CD25-positive, alpha-beta regulatory T-lymphocyte|CD8+CD25+ T cell|CD8+CD25+ T(reg)|CD8+CD25+ T-lymphocyte|CD8+CD25+ T lymphocyte|CD8-positive, CD25-positive, alpha-beta regulatory T-cell', description='A Cd8-Positive Alpha Beta-Positive T Cell With The Phenotype Foxp3-Positive And Having Suppressor Function.', updated_at=2023-12-05 17:29:40 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
πŸ’‘ also saving parents of CellType(uid='O0AQiAuv', name='CD14-positive, CD16-negative classical monocyte', ontology_id='CL:0002057', synonyms='CD16-negative monocyte|CD16- monocyte', description='A Classical Monocyte That Is Cd14-Positive, Cd16-Negative, Cd64-Positive, Cd163-Positive.', updated_at=2023-12-05 17:29:41 UTC, bionty_source_id=21, created_by_id=1)
βœ… loaded 1 CellType record matching ontology_id: 'CL:0000860'
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0001054'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
πŸ’‘ you can switch this off via: lb.settings.auto_save_parents = False
πŸ’‘ also saving parents of CellType(uid='E5LdQF00', name='CD14-positive monocyte', ontology_id='CL:0001054', description='A Monocyte That Expresses Cd14 And Is Negative For The Lineage Markers Cd3, Cd19, And Cd20.', updated_at=2023-12-05 17:29:42 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000576'
πŸ’‘ also saving parents of CellType(uid='YzV7Qgmj', name='monocyte', ontology_id='CL:0000576', description='Myeloid Mononuclear Recirculating Leukocyte That Can Act As A Precursor Of Tissue Macrophages, Osteoclasts And Some Populations Of Tissue Dendritic Cells.', updated_at=2023-12-05 17:29:42 UTC, bionty_source_id=21, created_by_id=1)
βœ… loaded 2 CellType records matching ontology_id: 'CL:0000842', 'CL:0011026'
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000766'
πŸ’‘ also saving parents of CellType(uid='40onq0tm', name='myeloid leukocyte', ontology_id='CL:0000766', description='A Cell Of The Monocyte, Granulocyte, Or Mast Cell Lineage.', updated_at=2023-12-05 17:29:43 UTC, bionty_source_id=21, created_by_id=1)
βœ… loaded 1 CellType record matching ontology_id: 'CL:0000738'
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0000763'
πŸ’‘ also saving parents of CellType(uid='g1zY6vUW', name='myeloid cell', ontology_id='CL:0000763', description='A Cell Of The Monocyte, Granulocyte, Mast Cell, Megakaryocyte, Or Erythroid Lineage.', updated_at=2023-12-05 17:29:43 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='O0AQiAuv', name='CD14-positive, CD16-negative classical monocyte', ontology_id='CL:0002057', synonyms='CD16- monocyte|CD16-negative monocyte|CD14+ Monocytes', description='A Classical Monocyte That Is Cd14-Positive, Cd16-Negative, Cd64-Positive, Cd163-Positive.', updated_at=2023-12-05 17:29:43 UTC, bionty_source_id=21, created_by_id=1)
βœ… created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
πŸ’‘ also saving parents of CellType(uid='a6U0yA2P', name='CD38-positive naive B cell', ontology_id='CL:0002101', synonyms='CD38-positive naive B-cell|CD38+ naive B cell|CD38+ naive B lymphocyte|CD38+ naive B-cell|CD38-positive naive B lymphocyte|CD38-positive naive B-lymphocyte|CD38+ naive B-lymphocyte', description='A Cd38-Positive Naive B Cell Is A Mature B Cell That Has The Phenotype Cd38-Positive, Surface Igd-Positive, Surface Igm-Positive, And Cd27-Negative, And That Has Not Yet Been Activated By Antigen In The Periphery.', updated_at=2023-12-05 17:29:44 UTC, bionty_source_id=21, created_by_id=1)
πŸ’‘ also saving parents of CellType(uid='a6U0yA2P', name='CD38-positive naive B cell', ontology_id='CL:0002101', synonyms='CD38-positive naive B lymphocyte|CD38+ naive B lymphocyte|CD38+ naive B cell|CD8+/CD45RA+ Naive Cytotoxic|CD38-positive naive B-cell|CD38-positive naive B-lymphocyte|CD38+ naive B-cell|CD38+ naive B-lymphocyte', description='A Cd38-Positive Naive B Cell Is A Mature B Cell That Has The Phenotype Cd38-Positive, Surface Igd-Positive, Surface Igm-Positive, And Cd27-Negative, And That Has Not Yet Been Activated By Antigen In The Periphery.', updated_at=2023-12-05 17:29:44 UTC, bionty_source_id=21, created_by_id=1)
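
The loop above amounts to "rank ontology names by similarity to the query, take the top hit, and remember the original name as a synonym". The sketch below illustrates that pattern using difflib string similarity from the Python standard library. This is an assumption made for illustration and not the search algorithm bionty uses; the ONTOLOGY excerpt and the search helper are hypothetical, with the names and IDs taken from the output above.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of an ontology: name -> ontology id.
ONTOLOGY = {
    "dendritic cell": "CL:0000451",
    "B cell": "CL:0000236",
    "monocyte": "CL:0000576",
    "cytotoxic T cell": "CL:0000910",
}


def search(query, names):
    """Rank names by case-insensitive string similarity to the query, best first."""

    def score(name):
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()

    return sorted(names, key=score, reverse=True)


name_mapper = {}  # original name -> standardized name
synonyms = {}  # standardized name -> original names recorded as synonyms
for name in ["Dendritic cells", "CD14+ Monocytes"]:
    top_match = search(name, ONTOLOGY)[0]
    name_mapper[name] = top_match
    synonyms.setdefault(top_match, []).append(name)

print(name_mapper)
```

A purely string-based ranking always returns some top hit, even when no candidate is a good biological match, which is why a manual review of name_mapper is advisable before the synonyms are persisted.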

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)
✅ 9 terms (100.00%) are validated for name

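We don’t want to store any of the other metadata columns. Note that DataFrame.drop() returns a new DataFrame rather than modifying obs in place, and that the loop targets adata rather than adata_validated. As written, it therefore leaves adata_validated.obs unchanged, which is why the registration output further down still reports n_genes, percent_mito, and louvain as unvalidated:

for column in ["n_genes", "percent_mito", "louvain"]:
    # no-op as written: the returned DataFrame is discarded; to actually drop the column,
    # assign the result back, e.g. adata_validated.obs = adata_validated.obs.drop(columns=column)
    adata.obs.drop(column, axis=1)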

Register

experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()
file = ln.File.from_anndata(
    adata_validated,
    description="10x reference adata",
    field=lb.Gene.ensembl_gene_id,
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/Y4C5W8Jiy9AVqGbTlDKx.h5ad')
💡 parsing feature names of X stored in slot 'var'
✅    749 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='stU5edT4Vh1SM2UPifDd', n=749, type='number', registry='bionty.Gene', hash='ZL6ScVsUK3gvyaiIVPVr', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    1 term (25.00%) is validated for name
❗    3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(uid='DZRZcwARjPpSxjJ1M0cr', n=1, registry='core.Feature', hash='f9thqMMHAUUsyaECGAif', created_by_id=1)

As we do not want to manage the remaining unvalidated terms in registries, we can save and annotate the file:

file.save()
file.labels.add(adata_validated.obs.cell_type, features.cell_type)
file.labels.add(organism.human, feature=features.organism)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)
file.describe()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'Y4C5W8Jiy9AVqGbTlDKx' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Y4C5W8Jiy9AVqGbTlDKx.h5ad'
✅ loaded: FeatureSet(uid='t86sjYHMAndJUzZR56Ms', n=1, registry='core.Feature', hash='PXslZunc1o6wt3QwvJ9C', updated_at=2023-12-05 17:29:13 UTC, created_by_id=1)
✅ linked new feature 'organism' together with new feature set FeatureSet(uid='t86sjYHMAndJUzZR56Ms', n=1, registry='core.Feature', hash='PXslZunc1o6wt3QwvJ9C', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
💡 no file links to it anymore, deleting feature set FeatureSet(uid='t86sjYHMAndJUzZR56Ms', n=1, registry='core.Feature', hash='PXslZunc1o6wt3QwvJ9C', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
✅ linked new feature 'assay' together with new feature set FeatureSet(uid='nUIxYX2OqUI3VmwxBSK0', n=2, registry='core.Feature', hash='1vr5aZWUqpHOJJcKfB-4', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
File(uid='Y4C5W8Jiy9AVqGbTlDKx', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=853388, hash='eKH1ljAEh7Kd81-o2H4A7w', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-05 17:29:45 UTC)

Provenance:
  🗃️ storage: Storage(uid='XR9DwKJf', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-12-05 17:28:50 UTC, created_by_id=1)
  💫 transform: Transform(uid='ManDYgmftZ8Cz8', name='Standardize and append a shard of data', short_name='scrna2', version='0', type=notebook, updated_at=2023-12-05 17:29:20 UTC, created_by_id=1)
  👣 run: Run(uid='s1GaxKOImqmlu2gb5ir7', run_at=2023-12-05 17:29:20 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-05 17:28:50 UTC)
Features:
  var: FeatureSet(uid='stU5edT4Vh1SM2UPifDd', n=749, type='number', registry='bionty.Gene', hash='ZL6ScVsUK3gvyaiIVPVr', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
    'IL18', 'NPM3', 'S100A9', 'S100A8', 'CNN2', 'ARHGAP45', 'RNF34', 'GPX4', 'S100A6', 'ADISSP', 'S100A4', 'FAM174C', 'SIT1', 'CCDC107', 'RSL1D1', 'TLN1', 'HES4', 'TNFRSF17', 'PCNA', 'RAB13', ...
  obs: FeatureSet(uid='DZRZcwARjPpSxjJ1M0cr', n=1, registry='core.Feature', hash='f9thqMMHAUUsyaECGAif', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive, alpha-beta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte'
  external: FeatureSet(uid='nUIxYX2OqUI3VmwxBSK0', n=2, registry='core.Feature', hash='1vr5aZWUqpHOJJcKfB-4', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive, alpha-beta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file.view_flow()
_images/9cc6177b4974b6ea2ec8c1527bfb97e66a3443809a198ff1b837b4e1b957dee6.svg

Append the shard to the dataset

Query the previous dataset:

dataset_v1 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="1").one()

Create a new version of the dataset by sharding it across the new file and the file underlying version 1 of the dataset:

dataset_v2 = ln.Dataset(
    [file, dataset_v1.file],
    is_new_version_of=dataset_v1,
)
dataset_v2.save()
dataset_v2.labels.add_from(file)
dataset_v2.labels.add_from(dataset_v1)
✅ loaded: FeatureSet(uid='b59F7mH9j2Yk56ov67X5', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-05 17:29:11 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='KvzEE67whbLB9QLaX1gA', n=4, registry='core.Feature', hash='m7On8gCIe69zqDojEoKv', updated_at=2023-12-05 17:29:12 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='nUIxYX2OqUI3VmwxBSK0', n=2, registry='core.Feature', hash='1vr5aZWUqpHOJJcKfB-4', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
💡 adding dataset [1] as input for run 2, adding parent transform 1
💡 adding file [1] as input for run 2, adding parent transform 1
💡 transferring cell_type
💡 transferring assay
💡 transferring organism
💡 transferring cell_type
💡 transferring assay
💡 transferring tissue
💡 transferring donor
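
The versioning step above can be pictured as creating a new dataset record that lists both shards, points back at its predecessor, and collects the union of their labels. The following sketch models this with plain dataclasses. It is a simplified illustration and not lamindb's data model: FileRecord, DatasetVersion, new_version and add_labels_from are hypothetical names, and the shard keys and label values are examples.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileRecord:
    key: str
    labels: dict = field(default_factory=dict)  # feature name -> set of label names


@dataclass
class DatasetVersion:
    name: str
    version: str
    files: list
    previous: Optional["DatasetVersion"] = None
    labels: dict = field(default_factory=dict)

    def add_labels_from(self, source):
        """Merge the labels of a file or dataset into this version (set union)."""
        for feature, values in source.labels.items():
            self.labels.setdefault(feature, set()).update(values)


def new_version(previous, new_file):
    """Create the next version: the new shard plus all files of the previous version."""
    return DatasetVersion(
        name=previous.name,
        version=str(int(previous.version) + 1),
        files=[new_file, *previous.files],
        previous=previous,
    )


shard_1 = FileRecord("shard-1.h5ad", {"tissue": {"blood", "spleen"}, "assay": {"10x 3' v3"}})
v1 = DatasetVersion("My versioned scRNA-seq dataset", "1", [shard_1])
v1.add_labels_from(shard_1)

shard_2 = FileRecord(
    "shard-2.h5ad",
    {"assay": {"single-cell RNA sequencing"}, "organism": {"human"}},
)
v2 = new_version(v1, shard_2)
v2.add_labels_from(shard_2)
v2.add_labels_from(v1)

print(v2.version, sorted(v2.labels))
```

Because the previous version is referenced rather than copied, version 1 stays queryable and unchanged while version 2 describes the combined data.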

Version 2 of the dataset now carries the labels of both shards and covers 40 cell types, 4 assays, 17 tissues, and 12 donors:

dataset_v2.describe()
Dataset(uid='2w70s0sFY5apzHvx0vEr', name='My versioned scRNA-seq dataset', version='2', hash='BOAf0T5UbN_iOe3fQDyq', visibility=1, updated_at=2023-12-05 17:29:46 UTC)

Provenance:
  💫 transform: Transform(uid='ManDYgmftZ8Cz8', name='Standardize and append a shard of data', short_name='scrna2', version='0', type=notebook, updated_at=2023-12-05 17:29:20 UTC, created_by_id=1)
  👣 run: Run(uid='s1GaxKOImqmlu2gb5ir7', run_at=2023-12-05 17:29:20 UTC, transform_id=2, created_by_id=1)
  🔖 initial_version: Dataset(uid='dvjFX4JWDFgBPpxs9N2H', name='My versioned scRNA-seq dataset', version='1', hash='9sXda5E7BYiVoDOQkTC0KB', visibility=1, updated_at=2023-12-05 17:29:14 UTC, transform_id=1, run_id=1, file_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-05 17:28:50 UTC)
Features:
  var: FeatureSet(uid='b59F7mH9j2Yk56ov67X5', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-12-05 17:29:11 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='KvzEE67whbLB9QLaX1gA', n=4, registry='core.Feature', hash='m7On8gCIe69zqDojEoKv', updated_at=2023-12-05 17:29:12 UTC, created_by_id=1)
    🔗 cell_type (40, bionty.CellType): 'dendritic cell', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD4-positive, alpha-beta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'classical monocyte', 'T follicular helper cell', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='nUIxYX2OqUI3VmwxBSK0', n=2, registry='core.Feature', hash='1vr5aZWUqpHOJJcKfB-4', updated_at=2023-12-05 17:29:45 UTC, created_by_id=1)
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (40, bionty.CellType): 'dendritic cell', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD4-positive, alpha-beta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'classical monocyte', 'T follicular helper cell', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

View the flow:

dataset_v2.view_flow()
_images/3f92718bf52d221ffa34bb2f2cd72f61f3bfd9f74236847476277999e29292e3.svg