Introduction: de-risking “reference risk” before annotation transfer
If you’ve ever rushed a single‑cell RNA‑seq analysis into production, you know the uneasy feeling of relying on a public cell atlas you didn’t build. It’s tempting to grab a well-known reference and fire up Seurat or scvi‑tools for label transfer. But if the atlas is mismatched to your tissue, biased toward a few donors, or annotated with idiosyncratic labels, you can bake systematic errors into every downstream result: QC dashboards, biomarker discovery, clinical stratification, even production models. One poor reference can quietly ripple through an entire pipeline.
This post is a hands-on playbook for evaluating public cell atlases before you trust them in production. We’ll stay pragmatic, focusing on four things that actually determine transfer success at scale: annotation quality, batch diversity, disease relevance, and transferability across platforms and cohorts. Along the way, we’ll point to concrete metrics like LISI and kBET, common tools like Seurat/Azimuth, Symphony, scVI/scANVI, and practical sanity checks you can automate. The goal is simple: reduce “reference risk” before you push labels to production systems.
What makes a production-grade cell atlas?
A cell atlas is more than a big AnnData or Seurat object. In production, you want three properties to line up. First, coverage: does the atlas actually include the tissue compartments, developmental stages, and technologies your pipeline will encounter? Second, correctness: are cell types annotated with consistent evidence and mapped to a shared language like the Cell Ontology (CL), so your labels mean the same thing across datasets? Third, compatibility: can the reference accept new data—possibly from other labs and platforms—without collapsing biological structure or over-correcting technical variation?
General-purpose resources like the Human Cell Atlas (HCA) and CELLxGENE Census give you breadth and programmatic access for building and testing references. HCA networks publish integrated atlases by organ; Census provides versioned aggregates with standardized metadata you can slice via Python or R. For narrower use cases, domain references such as the integrated Human Lung Cell Atlas (HLCA) or multi‑tissue atlases like Tabula Sapiens are strong starting points. They’re widely used, transparent about data provenance, and maintained, which makes them easier to defend in regulated contexts.
How to judge annotation quality and ontology alignment
Start by treating the atlas like any labeled dataset you’d use for machine learning. You want to know where the labels came from, how consistent they are, and whether they map to a stable taxonomy.
Look for annotation provenance. Were labels assigned by expert manual curation, supervised learning on canonical markers, or automated consensus tools? Resources often blend these approaches. For example, many pipelines now cross‑check manual assignments with automated methods like CellTypist or semi‑supervised models such as scANVI; consensus frameworks like PopV even combine multiple classifiers to stabilize results across batches and tissues. The point isn’t to pick a single “winner.” It’s to ensure the labels aren’t a black box and that disagreements are documented with marker evidence or confusion matrices.
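If you want to reproduce that kind of cross-check yourself, a minimal sketch looks like the following, assuming CellTypist is installed, the query is normalized to 10,000 counts per cell and log1p-transformed (which CellTypist expects), and the atlas labels live in `adata.obs["cell_type"]`; `Immune_All_Low.pkl` is one of CellTypist's published immune models.

```python
import celltypist
import pandas as pd

# CellTypist expects log1p data normalized to 1e4 counts per cell
preds = celltypist.annotate(adata, model="Immune_All_Low.pkl", majority_voting=True)

# Cross-tabulate automated calls against the atlas's own labels;
# large off-diagonal mass is where you want documented marker evidence
auto = preds.predicted_labels["majority_voting"]
print(pd.crosstab(adata.obs["cell_type"], auto))
```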
Then verify ontology mapping. Production labels should map to CL identifiers, not just free‑text names. Ontology alignment gives you unambiguous, versionable labels across teams, tissues, and releases. If the atlas ships CL terms in its metadata—or provides a clean mapping script—you’re in a good place. If not, budget the time to align label strings to CL and resolve edge cases like “doublets,” “low quality,” or study‑specific subtypes that don’t cleanly map to canonical terms. The Cell Ontology documentation and OBO Foundry principles provide the ground truth and definitions you’ll need to standardize these labels.
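If the atlas ships only free-text names, a first-pass alignment can be scripted. Here is a minimal sketch using the obonet package to index CL term names; the label strings are illustrative, and a real pipeline also needs synonym handling plus manual review of whatever fails to map.

```python
import obonet

# Load a slim build of the Cell Ontology and index term names -> CL IDs
graph = obonet.read_obo("http://purl.obolibrary.org/obo/cl/cl-basic.obo")
name_to_id = {
    data["name"].lower(): term_id
    for term_id, data in graph.nodes(data=True)
    if "name" in data
}

def to_cl(label: str) -> str | None:
    """First-pass mapping of a free-text label to a CL identifier."""
    return name_to_id.get(label.strip().lower())

# Labels like "doublets" won't map and should be routed to manual review
for label in ["natural killer cell", "classical monocyte", "doublets"]:
    print(f"{label!r} -> {to_cl(label)}")
```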
Finally, spot-check label granularity and marker coherence. In a healthy reference, fine‑grained subtypes should show strong, stable marker programs within donor and platform strata. If markers dissolve as soon as you facet by donor or chemistry, the annotation may be over‑segmented or driven by batch artifacts. As a rule of thumb, prefer atlases that publish marker tables with effect sizes and that show stability across independent cohorts.
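A quick way to automate that spot-check is to score a published marker panel within each donor/assay stratum and watch whether the effect size holds up. A sketch, assuming `adata` carries `donor_id`, `assay`, and `cell_type` columns; the monocyte panel is a hypothetical stand-in for the atlas's marker table.

```python
import pandas as pd
import scanpy as sc

# Hypothetical marker panel; substitute the atlas's published marker table
markers = {"classical monocyte": ["CD14", "LYZ", "S100A8", "S100A9"]}

rows = []
groups = adata.obs.groupby(["donor_id", "assay"], observed=True).groups
for (donor, assay), idx in groups.items():
    sub = adata[idx].copy()
    for ctype, genes in markers.items():
        present = [g for g in genes if g in sub.var_names]
        if not present or (sub.obs["cell_type"] == ctype).sum() < 20:
            continue  # skip strata with missing genes or too few cells
        sc.tl.score_genes(sub, gene_list=present, score_name="marker_score")
        in_type = sub.obs["cell_type"] == ctype
        rows.append({
            "donor": donor, "assay": assay, "cell_type": ctype,
            # crude effect size: mean score inside the subtype vs. everywhere else
            "delta": sub.obs.loc[in_type, "marker_score"].mean()
                     - sub.obs.loc[~in_type, "marker_score"].mean(),
        })

# Deltas that collapse toward zero in some strata suggest batch-driven annotation
print(pd.DataFrame(rows).sort_values("delta").head(10))
```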
Measuring batch and donor diversity with quantitative metrics
Batch diversity matters because your production queries won’t match the reference’s lab, chemistry, or donor mix. A reference that “only works” for one platform won’t hold up in production. Quantify this by reading the atlas metadata and computing objective batch‑mixing and structure‑preservation metrics.
Two practical metrics are kBET and LISI. kBET is a k‑nearest neighbor batch‑effect test that compares local batch composition to the global mix; lower rejection indicates better batch mixing. LISI—the Local Inverse Simpson’s Index—comes in two flavors: iLISI quantifies batch mixing and cLISI quantifies biological structure (cell type separation). Together, they surface methods that over‑correct (high iLISI but poor cLISI) versus those that preserve biology while reducing batch effects. These metrics are widely used in benchmarking studies and are available in several toolkits, so you can integrate them into CI checks for reference updates.
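As a sketch of how this plugs into a CI check, here is roughly what the scib implementations look like. Assumptions: scib is installed, the embedding lives in `.obsm["X_pca"]`, batches are keyed by `dataset_id`, and exact argument names can drift between scib releases; kBET additionally requires the R package via rpy2, so it is omitted here.

```python
import scib

# Scaled iLISI: 1 = batches fully mixed locally, 0 = no mixing
ilisi = scib.metrics.ilisi_graph(
    adata, batch_key="dataset_id", type_="embed", use_rep="X_pca"
)
# Scaled cLISI: 1 = cell types stay locally pure, 0 = biology blended away
clisi = scib.metrics.clisi_graph(
    adata, label_key="cell_type", type_="embed", use_rep="X_pca"
)
print(f"iLISI (batch mixing): {ilisi:.3f}  cLISI (bio conservation): {clisi:.3f}")
```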
Don’t stop at a single embedding. Compute metrics on embeddings produced by different integration strategies you actually use in production—e.g., Harmony, scVI, RPCA‑based Seurat mapping. If your downstream label transfer uses scANVI, evaluate on the scVI/scANVI latent space. If you map with Symphony, check cLISI/iLISI on Symphony’s reference space as well. Methods differ in how they trade off batch mixing and cell‑type resolution, and you want an embedding that matches your label transfer path in production. Symphony’s compressed reference approach, for example, is intentionally optimized for fast, reproducible mapping onto large, pre‑annotated atlases.
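For the scVI/scANVI path, generating the matching latent space is only a few lines. A sketch, assuming scvi-tools is installed and `adata.X` still holds raw counts (scVI models counts, not log-normalized data):

```python
import scvi

# Fit a small scVI model with the same batch covariate you'll use in production
scvi.model.SCVI.setup_anndata(adata, batch_key="dataset_id")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=50)

# Evaluate iLISI/cLISI on this latent space too, not just on PCA
adata.obsm["X_scVI"] = model.get_latent_representation()
```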
Validating disease relevance and clinical context
Most production workflows care about disease states, not just healthy tissue. Before transfer, verify that the reference captures the disease biology you expect to see—or that it is explicitly healthy‑only so you can control for that. Integrated resources like HLCA include data from healthy lung and multiple diseases, with harmonized annotations across cohorts; Census exposes disease metadata programmatically so you can build disease‑specific or mixed references. If your application targets a specific pathology, test transfer on public disease datasets first, then hold out donors to simulate generalization to new cohorts. This will save you from learning, too late, that your “universal” reference is actually biased toward a single disease stage or therapy.
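Census makes that disease inventory cheap to check before you download any expression data. A sketch of a metadata-only query; the tissue, version, and columns are assumptions to adapt:

```python
import cellxgene_census

# Metadata-only read: how many donors back each disease state in lung?
with cellxgene_census.open_soma(census_version="2024-07-01") as census:
    obs = (
        census["census_data"]["homo_sapiens"]
        .obs.read(
            value_filter="tissue_general == 'lung' and is_primary_data == True",
            column_names=["disease", "donor_id", "dataset_id"],
        )
        .concat()
        .to_pandas()
    )

print(obs.groupby("disease")["donor_id"].nunique().sort_values(ascending=False))
```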
A quick, high‑value check is to compare predicted labels with orthogonal modalities when available. In PBMC or tumor‑microenvironment data with matched CITE‑seq, verify that transferred T/NK/B/myeloid subtypes align with their canonical protein markers. Where spatial data are available, look for anatomical plausibility. If immune subsets appear in impossible niches after transfer, you likely have batch‑driven artifacts or ontology mismatches.
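A sketch of that protein cross-check, assuming a hypothetical `adt` DataFrame of CLR-normalized protein counts aligned to `adata.obs_names` and transferred labels in `adata.obs["predicted_id"]`; the label-to-protein pairs are illustrative:

```python
import pandas as pd

# Illustrative canonical pairs; adapt to your panel and label set
expected = {"CD4-positive T cell": "CD3", "B cell": "CD19", "natural killer cell": "CD56"}

# Mean protein expression per transferred label
profile = adt.groupby(adata.obs["predicted_id"].to_numpy()).mean()

for label, protein in expected.items():
    if label in profile.index and protein in profile.columns:
        # The canonical protein should rank near the top within its own label
        rank = int(profile.loc[label].rank(ascending=False)[protein])
        print(f"{label}: {protein} ranks {rank} of {profile.shape[1]} proteins")
```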
Proving transferability with small, controlled experiments
Before putting an atlas behind your production API, run two lightweight but telling experiments: within‑platform transfer and cross‑platform transfer.
Within‑platform transfer is your sanity check. Pick a held‑out donor from the same platform and lab, map it to the atlas, and compute simple classification metrics against author labels: per‑class F1, confusion matrices, and calibration curves. This validates that label names and boundaries make sense in the same technical regime.
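The metrics themselves are one-liners with scikit-learn. A sketch, assuming the held-out donor's AnnData (here called `adata_heldout`) carries `author_label` and `predicted_id` columns:

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

y_true = adata_heldout.obs["author_label"].astype(str)
y_pred = adata_heldout.obs["predicted_id"].astype(str)

# Per-class precision/recall/F1; zero_division=0 keeps absent classes from erroring
print(classification_report(y_true, y_pred, zero_division=0))

# Confusion matrix with a fixed label order for readable row/column names
labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
```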
Cross‑platform transfer is the real test. Hold out data from another chemistry or lab, map to the atlas, then compare metrics. If performance collapses, examine iLISI/cLISI and kBET in the atlas embedding to see if the reference is under‑diversified or your mapping is over‑correcting. If you use Seurat/Azimuth to transfer labels, check prediction score distributions and impose thresholds; if you use scANVI, treat low‑confidence outputs as “Unknown” and route them to review. Symphony is useful here because it separates building a compressed reference from the fast mapping step, letting you prototype alternative references quickly and compare label F1 across tissues.
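For the scANVI path, the "Unknown" routing is a few lines on top of soft predictions. A sketch, assuming a trained model already registered with the query (e.g., via scArches-style `load_query_data`); the 0.8 threshold is illustrative and should be tuned on held-out data:

```python
# Soft predictions: one probability column per reference cell type
probs = model.predict(adata_query, soft=True)

conf = probs.max(axis=1)
pred = probs.idxmax(axis=1)

# Route low-confidence calls to "Unknown" instead of forcing a hard label
adata_query.obs["predicted_id"] = pred.where(conf >= 0.8, other="Unknown").values
adata_query.obs["pred_conf"] = conf.values
print(adata_query.obs["predicted_id"].value_counts())
```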
A final guardrail is novelty detection. Methods like scArches/treeArches can flag query cells that don’t match known reference states, which is critical for catching real biology—new cell states or treatment effects—rather than forcing them into the nearest known label. If your atlas or mapping strategy can’t say “unknown,” you risk systematic mislabeling in disease or perturbed contexts.
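Even without scArches, a crude novelty guardrail is easy to bolt on: flag query cells whose latent-space neighborhoods sit farther from the reference than the reference sits from itself. This is a simple kNN-distance heuristic, not the scArches method, and it assumes `ref_latent` and `query_latent` are embeddings in the same space:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Reference self-distances (drop the zero distance to self with [:, 1:])
nn = NearestNeighbors(n_neighbors=16).fit(ref_latent)
ref_dist = nn.kneighbors(ref_latent)[0][:, 1:].mean(axis=1)

# Query-to-reference distances on the same graph
qry_dist = nn.kneighbors(query_latent, n_neighbors=15)[0].mean(axis=1)

# Flag query cells beyond the 99th percentile of reference density
threshold = np.quantile(ref_dist, 0.99)
is_novel = qry_dist > threshold
print(f"{is_novel.mean():.1%} of query cells flagged as potentially novel")
```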
Minimal, reproducible tests you can automate
To keep the process grounded, here are two tiny examples you can adapt for your CI or notebook playbooks. They won’t replace your full analysis, but they give you immediate signal on batch diversity, ontology alignment, and transfer stability.
Example 1: pull a slice from CELLxGENE Census and measure basic batch diversity in a candidate reference
```python
# Python 3.10+
import cellxgene_census
import scanpy as sc

# 1) open a versioned Census and pull a small lung reference slice
# (double quotes inside the filter keep the apostrophes in assay names intact)
lung_filter = (
    "tissue_general == 'lung' and is_primary_data == True"
    " and assay in [\"10x 3' v3\", \"10x 5' v1\", \"10x 5' v2\"]"
)
with cellxgene_census.open_soma(census_version="2024-07-01") as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=lung_filter,
        X_name="raw",
    )

# 2) take quick stock of donor, assay, and disease diversity
meta = adata.obs[["dataset_id", "donor_id", "assay", "disease", "cell_type"]].copy()
summary = {
    "n_cells": adata.n_obs,
    "n_donors": meta["donor_id"].nunique(),
    "n_datasets": meta["dataset_id"].nunique(),
    "assays": meta["assay"].value_counts().to_dict(),
    "diseases": meta["disease"].value_counts().to_dict(),
    "top_cell_types": meta["cell_type"].value_counts().head(10).to_dict(),
}
print(summary)

# 3) prepare an embedding you'll also use for transfer (e.g., scVI later);
# seurat_v3 HVG selection expects raw counts, so select genes before normalizing
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000, subset=True)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=30)
sc.tl.umap(adata)
# You can now compute iLISI/cLISI and kBET in your preferred metrics package on this same space
```
This snippet takes a versioned slice, summarizes donor/assay/disease diversity, and prepares a consistent embedding. Because Census is versioned and exposes stable APIs, you can pin the reference slice to a date and reproduce the same evaluation later when you refresh the atlas build.
Example 2: run a tiny Seurat/Azimuth label transfer and sanity‑check predictions
```r
# R (Seurat v5+)
library(Seurat)

# Assume 'ref' is a normalized Seurat object with annotations in ref$celltype,
# and 'qry' is a normalized, unlabeled query object
anchors <- FindTransferAnchors(reference = ref, query = qry, dims = 1:30)
pred <- TransferData(anchorset = anchors, refdata = ref$celltype, dims = 1:30)
qry$predicted_id <- pred$predicted.id
qry$pred_score <- pred$prediction.score.max

# Simple confusion matrix vs. author labels, if that column exists
if ("author_label" %in% colnames(qry[[]])) {
  print(table(pred = qry$predicted_id, truth = qry$author_label))
}

# Flag low-confidence predictions for manual review or alternate mapping
qry$flag_low_conf <- qry$pred_score < 0.7
```
This gives you a fast “does this make sense?” readout. Keep an eye on low‑confidence clusters, where over‑correction or ontology drift tends to hide. Seurat’s documentation distinguishes mapping/transfer from joint integration, which helps you reason about what is, and is not, being changed in the query.
Putting it all together: a simple acceptance checklist
In practice, the strongest indication that an atlas is production‑ready is when all four axes point in the same direction.
Annotation quality looks solid when labels come with provenance, CL mappings, and marker evidence, ideally cross‑checked with automated methods. Batch diversity looks good when iLISI is high and kBET rejection is low without wiping out cLISI; ideally, this holds across the same embedding you’ll use for transfer. Disease relevance looks credible when the atlas represents the spectrum of states you’ll see—or clearly states that it is healthy‑only and meant to be paired with disease‑specific references. Finally, transferability looks robust when within‑platform and cross‑platform pilots show stable F1 and sensible confidence distributions, and when novelty detection can route unknowns instead of forcing hard assignments.
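If you want this as an automated gate rather than a judgment call, it compresses into a tiny function. Every threshold below is illustrative and should be calibrated per tissue and assay:

```python
def atlas_acceptance(m: dict) -> dict:
    """Toy acceptance gate over the four axes; tune thresholds to your context."""
    checks = {
        "batch_mixing_ok": m["ilisi"] >= 0.4,          # scaled iLISI
        "biology_preserved": m["clisi"] >= 0.9,        # scaled cLISI
        "kbet_ok": m["kbet_rejection"] <= 0.3,
        "within_platform_ok": m["f1_within"] >= 0.85,
        "cross_platform_ok": m["f1_cross"] >= 0.70,
        "unknown_rate_sane": m["unknown_rate"] <= 0.10,
    }
    checks["accept"] = all(checks.values())
    return checks

print(atlas_acceptance({
    "ilisi": 0.52, "clisi": 0.93, "kbet_rejection": 0.21,
    "f1_within": 0.91, "f1_cross": 0.74, "unknown_rate": 0.06,
}))
```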
If any one axis is weak, you still have options. You can constrain the label set to coarser, more reliable types, rebuild a tissue‑specific reference from Census or HCA slices, or switch mapping engines—Seurat/Azimuth, Symphony, scANVI—depending on your latency, scale, and OOD‑detection needs. These are engineering choices, not ideological ones; the right choice is the one that preserves biological signal while meeting your production SLOs.
Summary / Takeaways
Public cell atlases are incredible accelerants for single‑cell pipelines, but they aren’t one‑size‑fits‑all. Before you transfer labels into production, treat the reference like any other external dependency: read the provenance, pin a version, quantify batch and structure with objective metrics, and run small transfer pilots that mirror your real workloads. Prefer atlases that publish transparent annotations, map to the Cell Ontology, and span multiple donors, labs, and chemistries. Use iLISI/cLISI and kBET to catch over‑correction or under‑mixing. Test both within‑ and cross‑platform transfers, and keep a path for “unknown” states via scArches or conservative score thresholds.
The payoff is simple: cleaner labels, fewer firefights, and a pipeline you can explain and defend. Before you push the next annotation job, what single experiment will you run this week to quantify your reference risk?
Further Reading
- Human Cell Atlas — Data portal and organ atlases. (https://data.humancellatlas.org/)
- CELLxGENE Census — Versioned API and docs. (https://chanzuckerberg.github.io/cellxgene-census/python-api.html)
- Harmony and LISI — Integration method and LISI metric. (https://pmc.ncbi.nlm.nih.gov/articles/PMC6884693/)
- kBET — Batch‑effect test and R package. (https://www.nature.com/articles/s41592-018-0254-1)
- Symphony — Fast reference mapping for label transfer. (https://www.nature.com/articles/s41467-021-25957-x)