
Analysis flow#

Here, we’ll track typical data transformations that occur during analysis, such as subsetting.

For a more general introduction, read this first: Project flow.

Setup#

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-05 17:28:32 UTC)
✅ saved: Storage(uid='8kc6ZuBo', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-12-05 17:28:32 UTC, created_by_id=1)
💡 loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub

import lamindb as ln
import lnschema_bionty as lb
from lamin_utils import logger

lb.settings.organism = "human"  # globally set organism
lb.settings.auto_save_parents = False
💡 lamindb instance: testuser1/analysis-usecase

Register an initial dataset#

Here we register an initial file with a pipeline script.

# register_example_file.py


def register_example_file():
    # create a pipeline transform to track the registration of the file
    transform = ln.Transform(
        name="register example file", type="pipeline", version="0.0.1"
    )
    ln.track(transform)

    # an example dataset that has a few cell type, tissue and disease annotations
    adata = ln.dev.datasets.anndata_with_obs()

    # validate and register features
    genes = lb.Gene.from_values(adata.var_names, lb.Gene.ensembl_gene_id)
    ln.save(genes)
    obs_features = ln.Feature.from_df(adata.obs)
    ln.save(obs_features)

    # validate and register labels
    cell_types = lb.CellType.from_values(adata.obs["cell_type"])
    ln.save(cell_types)
    tissues = lb.Tissue.from_values(adata.obs["tissue"])
    ln.save(tissues)
    diseases = lb.Disease.from_values(adata.obs["disease"])
    ln.save(diseases)

    # register file and annotate with features & labels
    file = ln.File.from_anndata(
        adata, description="anndata with obs", field=lb.Gene.ensembl_gene_id
    )
    file.save()
    features = ln.Feature.lookup()
    file.labels.add(cell_types, features.cell_type)
    file.labels.add(tissues, features.tissue)
    file.labels.add(diseases, features.disease)


register_example_file()
💡 saved: Transform(uid='R1gs2it1SKTX7Z', name='register example file', version='0.0.1', type='pipeline', updated_at=2023-12-05 17:28:33 UTC, created_by_id=1)
💡 saved: Run(uid='n8QN2OHrHOyaYB3OYjkf', run_at=2023-12-05 17:28:33 UTC, transform_id=1, created_by_id=1)
did not create CellType record for 1 non-validated name: 'my new cell type'
... storing 'cell_type' as categorical
... storing 'cell_type_id' as categorical
... storing 'tissue' as categorical
... storing 'disease' as categorical
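
The log line about 'my new cell type' suggests that only names found in the reference ontology become records. Below is a minimal standard-library sketch of that kind of split, not LaminDB's implementation: `KNOWN_CELL_TYPES` and `split_validated` are hypothetical names, and the vocabulary is a tiny stand-in for the real CellType ontology.

```python
# Hypothetical stand-in for a reference vocabulary (assumption for illustration);
# the real CellType ontology is far larger.
KNOWN_CELL_TYPES = {"T cell", "hematopoietic stem cell", "hepatocyte"}


def split_validated(values, vocabulary):
    """Return (validated, non_validated) unique names in first-seen order."""
    validated, non_validated = [], []
    for name in dict.fromkeys(values):  # de-duplicate while keeping order
        (validated if name in vocabulary else non_validated).append(name)
    return validated, non_validated


observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]
validated, non_validated = split_validated(observed, KNOWN_CELL_TYPES)
print("validated:", validated)
print("non-validated:", non_validated)
```

Only the validated names would then be candidates for saving as label records; the non-validated name stays in the data but has no record attached.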

Pull the registered dataset, apply a transformation, and register the result#

Set the current notebook as the new transform:

ln.track()
💡 notebook imports: lamin_utils==0.12.0 lamindb==0.63.2 lnschema_bionty==0.35.3
💡 saved: Transform(uid='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-12-05 17:28:40 UTC, created_by_id=1)
💡 saved: Run(uid='nYTyEm1xLBAc94UBy6IB', run_at=2023-12-05 17:28:40 UTC, transform_id=2, created_by_id=1)
file = ln.File.filter(description="anndata with obs").one()
file.describe()
File(uid='F2UOVLdIPq301YCVoVsJ', suffix='.h5ad', accessor='AnnData', description='anndata with obs', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-05 17:28:39 UTC)

Provenance:
  🗃️ storage: Storage(uid='8kc6ZuBo', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-12-05 17:28:32 UTC, created_by_id=1)
  🧩 transform: Transform(uid='R1gs2it1SKTX7Z', name='register example file', version='0.0.1', type='pipeline', updated_at=2023-12-05 17:28:33 UTC, created_by_id=1)
  👣 run: Run(uid='n8QN2OHrHOyaYB3OYjkf', run_at=2023-12-05 17:28:33 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-05 17:28:32 UTC)
Features:
  var: FeatureSet(uid='DvS20OY34TniUqwUnJu8', n=99, type='number', registry='bionty.Gene', hash='fHbDaAAmJse48vnUQh9C', updated_at=2023-12-05 17:28:39 UTC, created_by_id=1)
    'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'C1orf112', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52', ...
  obs: FeatureSet(uid='yoMWsH9gZREKxDWkYzkL', n=4, registry='core.Feature', hash='9IiwMXi_VHcAU64aDasV', updated_at=2023-12-05 17:28:39 UTC, created_by_id=1)
    🔗 cell_type (3, bionty.CellType): 'T cell', 'hematopoietic stem cell', 'hepatocyte'
    cell_type_id (category)
    🔗 tissue (4, bionty.Tissue): 'kidney', 'liver', 'heart', 'brain'
    🔗 disease (4, bionty.Disease): 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
Labels:
  🏷️ tissues (4, bionty.Tissue): 'kidney', 'liver', 'heart', 'brain'
  🏷️ cell_types (3, bionty.CellType): 'T cell', 'hematopoietic stem cell', 'hepatocyte'
  🏷️ diseases (4, bionty.Disease): 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'

Get a backed AnnData object#

adata = file.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object F2UOVLdIPq301YCVoVsJ.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases#

cell_types = file.cell_types.all().lookup(return_field="name")
diseases = file.diseases.all().lookup(return_field="name")
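
The lookup objects expose each record under an attribute-friendly key and, with return_field="name", return the original name. A rough sketch of that idea using only the standard library follows. `make_lookup` and its normalization rule are assumptions for illustration and may differ from what LaminDB does internally.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an object whose attributes are normalized keys mapping to the original names."""

    def to_key(name):
        # Lowercase, then collapse runs of non-alphanumeric characters into "_".
        return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

    return SimpleNamespace(**{to_key(name): name for name in names})


# Names taken from the labels shown in this guide.
cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
diseases = make_lookup(["chronic kidney disease", "liver lymphoma"])
print(cell_types.t_cell)
print(diseases.liver_lymphoma)
```

Attribute access of this kind enables tab completion in a notebook and avoids typos in free-text names.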

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
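
The subsetting logic is a conjunction of two membership tests, one per annotation column. The sketch below reproduces it with the standard library only, so it runs without AnnData or pandas. The rows in `obs` are made up for illustration and are not the dataset used in this guide.

```python
from collections import Counter

# Toy observation table, one dict per cell (hypothetical rows).
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# A row is kept only if BOTH conditions hold, like `isin(...) & isin(...)`.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) combinations, analogous to `value_counts()`.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for combination, n in counts.items():
    print(combination, n)
```

Because the two conditions are combined with a logical AND, a T cell annotated with a disease outside the list is dropped even though its cell type matches.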

Register the subsetted AnnData:

file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    description="anndata with obs subset",
    field=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
file_subset.save()
features = ln.Feature.lookup()

file_subset.labels.add(adata_subset.obs.cell_type, features.cell_type)
file_subset.labels.add(adata_subset.obs.disease, features.disease)
file_subset.labels.add(adata_subset.obs.tissue, features.tissue)

Examine data flow#

Query a subsetted .h5ad file containing “hematopoietic stem cell” and “T cell”:

cell_types = lb.CellType.lookup()
my_subset = ln.File.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
File(uid='J8mAWLYjU162QsDb78sk', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-05 17:28:40 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
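
The keyword arguments in the query above use double-underscore suffixes (`description__endswith`, `cell_types__in`) that resemble Django-style field lookups. As a rough illustration of how such lookups can be read, here is a toy matcher over plain dicts. `matches` and the `files` list are hypothetical, and real queries run against the database rather than in Python. The sketch assumes that `__in` on a multi-valued field matches when any one of its values is listed.

```python
def matches(record, **lookups):
    """Check a dict record against `field__op=value` style lookups."""
    for key, expected in lookups.items():
        field, _, op = key.partition("__")
        value = record[field]
        if op == "":  # plain equality, e.g. suffix=".h5ad"
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # Assumption: multi-valued fields match if ANY value is listed.
            values = value if isinstance(value, (list, set, tuple)) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


# Toy records mirroring the two files registered in this guide.
files = [
    {"description": "anndata with obs", "suffix": ".h5ad",
     "cell_types": ["T cell", "hematopoietic stem cell", "hepatocyte"]},
    {"description": "anndata with obs subset", "suffix": ".h5ad",
     "cell_types": ["T cell", "hematopoietic stem cell"]},
]

hits = [
    f for f in files
    if matches(
        f,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
print(hits[0]["description"])
```

Under this reading, the `cell_types__in` condition alone would match both files. It is the `description__endswith="subset"` condition that narrows the result to the subset.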

Common questions that might arise are:

  • What is the history of this file?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this file?

  • By whom?

  • And which file is its parent?

Let’s answer these using LaminDB:

print("--> What is the history of this file?\n")
file_subset.view_flow()

print("\n\n--> Which features and labels are associated with it?\n")
logger.print(file_subset.features)
logger.print(file_subset.labels)

print("\n\n--> Which notebook analyzed and registered this file\n")
logger.print(file_subset.transform)

print("\n\n--> By whom\n")
logger.print(file_subset.created_by)

print("\n\n--> And which file is its parent\n")
display(file_subset.run.input_files.df())
--> What is the history of this file?

[Figure: data flow graph rendered by file_subset.view_flow()]
--> Which features and labels are associated with it?

Features:
  var: FeatureSet(uid='DvS20OY34TniUqwUnJu8', n=99, type='number', registry='bionty.Gene', hash='fHbDaAAmJse48vnUQh9C', updated_at=2023-12-05 17:28:39 UTC, created_by_id=1)
    'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'C1orf112', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52', ...
  obs: FeatureSet(uid='yoMWsH9gZREKxDWkYzkL', n=4, registry='core.Feature', hash='9IiwMXi_VHcAU64aDasV', updated_at=2023-12-05 17:28:39 UTC, created_by_id=1)
    🔗 cell_type (2, bionty.CellType): 'T cell', 'hematopoietic stem cell'
    cell_type_id (category)
    🔗 tissue (2, bionty.Tissue): 'kidney', 'liver'
    🔗 disease (2, bionty.Disease): 'chronic kidney disease', 'liver lymphoma'
Labels:
  🏷️ tissues (2, bionty.Tissue): 'kidney', 'liver'
  🏷️ cell_types (2, bionty.CellType): 'T cell', 'hematopoietic stem cell'
  🏷️ diseases (2, bionty.Disease): 'chronic kidney disease', 'liver lymphoma'
--> Which notebook analyzed and registered this file

Transform(uid='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-12-05 17:28:40 UTC, created_by_id=1)
--> By whom

User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-05 17:28:32 UTC)
--> And which file is its parent

id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id
1 | F2UOVLdIPq301YCVoVsJ | 1 | None | .h5ad | AnnData | anndata with obs | None | 46992 | IJORtcQUSS11QBqD-nTD0A | md5 | 1 | 1 | None | 1 | True | 2023-12-05 17:28:39.975936+00:00 | 1
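
The parent query works because each file points to the run that created it, and each run records its input files. The toy model below sketches that relationship with plain dataclasses. `ToyRun` and `ToyFile` are hypothetical names, and the structure is a simplification rather than LaminDB's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRun:
    transform: str                               # name of the pipeline or notebook
    inputs: list = field(default_factory=list)   # files read during the run
    outputs: list = field(default_factory=list)  # files created by the run


@dataclass
class ToyFile:
    description: str
    run: Optional[ToyRun] = None  # the run that created this file

    def parents(self):
        """Files that served as inputs to the run that created this file."""
        return [] if self.run is None else list(self.run.inputs)


# The pipeline run registers the original file without any inputs.
pipeline_run = ToyRun(transform="register example file")
original = ToyFile("anndata with obs", run=pipeline_run)
pipeline_run.outputs.append(original)

# The notebook run reads the original file and registers the subset.
notebook_run = ToyRun(transform="Analysis flow", inputs=[original])
subset = ToyFile("anndata with obs subset", run=notebook_run)
notebook_run.outputs.append(subset)

print([f.description for f in subset.parents()])
print(subset.run.transform)
```

Following `run` and then its inputs repeatedly would walk the full lineage back to the first registered file, which is the information the data flow graph visualizes.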
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase