Introduction
Biology has finally reached its “ImageNet moment,” but with a twist. Instead of cats and dogs, our training data are living cells, measured across tissues, mapped into 3D space, and profiled with multiple omics at once. Over the past two years, integrated atlases have matured from niche resources into routinely queryable corpora spanning well over a hundred million cells. At the same time, foundation models built on single‑cell data have moved from preprints to peer‑reviewed results and early benchmarks. Taken together, 2026 is shaping up to be the year when multimodal cell atlases become the default pretraining ground for biology foundation models (BFMs), and where meaningful performance gains will come from how we integrate spatial and multi‑omic context, not just from stacking more cells.
If you’re a data scientist eyeing real biological impact, the strategic question is simple: where will the next 12–24 months of model improvements actually come from? The short answer is better atlases, richer modalities, and smarter training pipelines that treat “a cell” not as a flat vector, but as a situated, interacting unit in tissues and time. The rest of this post unpacks that shift and shows how to plug in.
The atlas era has quietly become real
In November 2024, the Human Cell Atlas (HCA) consortium published a sweeping collection of papers across Nature Portfolio journals, moving the project from scattered cell counts toward integrated, cross‑tissue maps. The message was clear: building an atlas is no longer about data accumulation alone; it’s about harmonized standards, global coverage, and AI‑ready integration. This is exactly the substrate BFMs need.
Meanwhile, the CZ CELLxGENE ecosystem has turned the world’s largest single‑cell corpus into something you can actually query and stream. As of October 1, 2024, the platform reported 169.3 million cells (93.6 million unique) across 449 tissues, with “Census” providing Python/R APIs for out‑of‑core access powered by TileDB‑SOMA. This is not just convenience—Census’s iterable loaders and standardized schema let you assemble atlas‑slices for training without bespoke wrangling.
Beyond dissociated RNA, atlases are becoming inherently multimodal. HuBMAP’s Human Reference Atlas (HRA) has advanced a 3D Common Coordinate Framework (CCF) linking anatomical structures, cell types, and biomarkers, giving us a way to register diverse assays into a shared spatial scaffold. That scaffolding is what lets models learn from spatial neighborhoods and tissue gradients, not merely per‑cell gene counts.
Neuroscience shows what’s possible when scale meets multimodality. The BRAIN Initiative networks (BICCN/BICAN) integrated single‑cell transcriptomics with epigenomics and single‑cell‑level spatial data to chart a whole‑brain cell atlas in mouse and deepen human brain maps—demonstrating how spatial context and multi‑omics together sharpen cell‑type definitions and circuit understanding.
From “more cells” to “richer context”
In single‑cell work, scRNA‑seq gave us a lingua franca—gene‑by‑cell matrices—but also a blind spot: location and regulation. Spatial transcriptomics layers in the “where,” which often determines the “what.” A T cell in a tumor rim behaves differently from one in lymph node cortex because its neighbors and microenvironment differ. Multi‑omic assays like joint RNA+ATAC capture parts of the “why,” tying expression to open chromatin and regulatory programs. Models that pretrain across these aligned views have a path to learn cell identity as a function of tissue context and regulatory state.
You can already see this arc in model papers. Geneformer and scGPT showed that transformer pretraining on tens of millions of transcriptomes unlocks strong representations for cell annotation, perturbation prediction, and gene‑network inference. The next climb will come from exposing those same architectures to paired spatial coordinates, chromatin marks, surface proteins, and curated cell ontologies—so their embeddings encode not just co‑expression, but neighborhood structure and regulatory logic.
Standards are the unsung enablers here. Corpus‑scale access through Census and TileDB‑SOMA means you can sample cells consistently across studies and export straight into AnnData or Seurat for downstream loaders. On the ontology side, resources such as the HRA CCF and cell‑type frameworks help reduce label chaos and bring model outputs into alignment with how biologists actually reason about tissue structure. Put simply: when atlases speak the same language, BFMs learn it faster.
Why 2026 looks like a tipping point
Three trends are converging. First, integrated atlases are crossing the usability threshold: you can stream millions of harmonized cells, slice by tissue and disease, and keep memory use sane on commodity hardware. That moves BFMs from “hero experiments” to everyday training runs.
Second, spatial and multi‑omic coverage is no longer a neuroscience‑only luxury. The HRA CCF and related tools are normalizing registration and visualization across many organs, while atlas collections add more spatially resolved datasets every quarter. As this coverage widens, pretraining can routinely incorporate tissue neighborhoods and landmark structures.
Third, the BFM stack is maturing. Early models gave proof that self‑supervised objectives on rank‑normalized gene tokens are viable. Now, the field is building more principled evaluations, better tokenization schemes for sparse counts, and cross‑modal objectives that tie RNA, ATAC, protein, and space together. The leap from good embeddings to robust, generalizable zero‑shot performance will likely come from how we exploit atlas integration rather than just scaling parameter counts.
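To make “rank‑normalized gene tokens” concrete, here is a minimal, Geneformer‑style sketch of rank‑value tokenization for a single cell. It is an illustration under simplified assumptions (plain library‑size normalization, a hypothetical gene_ids vocabulary mapping), not any published model’s exact recipe.

# Minimal sketch of rank-value tokenization for one cell (illustrative only).
# `counts` is a dense vector of raw counts; `gene_ids` maps column positions
# to vocabulary indices (both hypothetical names for this example).
import numpy as np

def rank_tokens(counts: np.ndarray, gene_ids: np.ndarray, max_len: int = 2048) -> np.ndarray:
    """Return gene vocabulary indices ordered from highest to lowest expression."""
    expressed = np.nonzero(counts)[0]                   # ignore zeros in sparse profiles
    norm = counts[expressed] / counts[expressed].sum()  # library-size normalization
    order = np.argsort(-norm)                           # highest expression first
    return gene_ids[expressed][order][:max_len]         # truncate to model context length

# Example: a toy 6-gene cell
counts = np.array([0, 5, 0, 2, 9, 1], dtype=float)
gene_ids = np.arange(6)
print(rank_tokens(counts, gene_ids))  # -> [4 1 3 5]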
Together, these forces set the stage for 2026 to be the year when BFMs trained on multimodal atlases become the default backbone for downstream biology tasks—from cell annotation and deconvolution to in silico perturbations and mechanism‑guided target discovery.
What improvements should data scientists expect?
Expect gains wherever context matters. In cell‑type annotation, cross‑tissue references tend to fail at fine granularity because the same label behaves differently in different anatomical niches. Pretraining on spatially anchored atlases should reduce these failures, improving label transfer in new tissues and species. In perturbation and drug response modeling, pairing RNA with ATAC and protein will help distinguish correlation from regulation, making counterfactual predictions less brittle. And in deconvolution of bulk data or pathology+omics fusion, atlas‑trained embeddings should yield more faithful tissue compositions because they “know” how gene programs vary across microenvironments.
Will there be disappointments? Yes. Benchmarks are already showing that “more cells” alone doesn’t guarantee zero‑shot wins, and that evaluation protocols must penalize leakage and overfitting to atlas idiosyncrasies. But those critiques are also map‑making tools: they push us to curate better negative controls, build tissue‑aware splits, and prioritize transfer across donors, labs, and platforms. In short, success will look less like chasing leaderboard decimals and more like consistent gains across anatomically and technologically diverse scenarios.
Turning atlases into training data: a practical on‑ramp
You don’t need to forklift terabytes to start. The Census API lets you stream exactly the cells you need, filtered by organism, tissue, assay, or ontology‑aligned labels, and send batches straight into PyTorch. Under the hood, TileDB‑SOMA handles sparse arrays and out‑of‑core iteration so your model can scale without cumbersome pre‑sharding.
Here’s a minimal example that slices primary human lung immune cells and exports an AnnData object you can feed into your dataloader later:
# pip install cellxgene-census anndata
import cellxgene_census as czen

with czen.open_soma(census_version="latest") as cz:
    adata = czen.get_anndata(
        cz,
        organism="Homo sapiens",
        obs_value_filter="tissue_general == 'lung' and is_primary_data == True and cell_type == 'immune cell'",
    )

# adata.X is sparse; write to disk or wrap in your DataLoader
adata.write_h5ad("lung_immune_census.h5ad")
For streaming directly into a training loop, you can lean on the Census experimental PyTorch helpers, which yield batches without materializing the full slice. The key is to keep tokenization and normalization consistent with your pretraining objective (for instance, rank‑based tokens or log1p CPM; a small normalization sketch follows after the streaming example). The sketch below uses the experimental ML API, which is still evolving, so check the current Census docs for exact names and signatures:
# pip install "cellxgene-census[experimental]" torch
import cellxgene_census as czen
import cellxgene_census.experimental.ml as census_ml
import tiledbsoma as soma

with czen.open_soma(census_version="latest") as cz:
    experiment = cz["census_data"]["homo_sapiens"]
    # Iterable datapipe that streams expression batches out-of-core
    datapipe = census_ml.ExperimentDataPipe(
        experiment,
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general in ['liver', 'kidney'] and is_primary_data == True"
        ),
        obs_column_names=["cell_type"],
        batch_size=1024,
        shuffle=True,
    )
    loader = census_ml.experiment_dataloader(datapipe)
    for X, obs in loader:
        # X: [cells, genes] float32 counts; plug into a masked-gene or contrastive objective
        pass
This pattern abstracts away the painful parts—dataset assembly, batching, and sparsity—so you can focus on objectives, architectures, and evaluation. The same idea applies when you start incorporating spatial coordinates or multi‑omic matrices: index into the atlas by tissue region or anatomical landmark, then align modalities via shared barcodes, spots, or registration transforms provided by the atlas’s coordinate framework.
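As a concrete example of keeping normalization consistent, here is a minimal log1p CPM transform you can apply to the dense [cells, genes] batches yielded by the loader above. It is a sketch; the exact transform should match whatever your pretraining objective assumes.

import torch

def log1p_cpm(X: torch.Tensor) -> torch.Tensor:
    """log1p counts-per-million for a dense [cells, genes] batch of raw counts."""
    lib = X.sum(dim=1, keepdim=True).clamp(min=1.0)  # per-cell library size
    return torch.log1p(X / lib * 1e6)                # scale to CPM, then log1p

# Apply the same transform everywhere the model sees expression values,
# e.g. inside the training loop: xb = log1p_cpm(X)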
Strategy: designing BFMs that actually use atlas context
First, make space part of the token stream. Whether you encode tissue compartments as learnable tokens, bucket spatial coordinates, or inject graph edges from k‑nearest neighbors in physical space, give your model a way to represent neighborhoods. Spatial inductive bias is cheap and often decisive.
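As a sketch of the graph‑edge route, the snippet below builds a k‑nearest‑neighbor edge list from physical coordinates; coords is a hypothetical [n_cells, 2] array of spot or cell centroids, and the resulting edges can feed a graph layer or define which neighbors get pooled into context tokens.

import numpy as np
from scipy.spatial import cKDTree

def spatial_knn_edges(coords: np.ndarray, k: int = 6) -> np.ndarray:
    """Return a [2, n_cells * k] edge index of each cell's k nearest spatial neighbors."""
    tree = cKDTree(coords)
    # query k+1 neighbors because the nearest neighbor of each point is itself
    _, idx = tree.query(coords, k=k + 1)
    src = np.repeat(np.arange(coords.shape[0]), k)
    dst = idx[:, 1:].reshape(-1)
    return np.stack([src, dst])

# Example: 100 random cells on a 1 mm x 1 mm section
coords = np.random.rand(100, 2) * 1000.0
edges = spatial_knn_edges(coords, k=6)  # feed to a GNN layer or use to pool neighbor embeddings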
Second, tie expression to regulation. When joint RNA+ATAC or CITE‑seq is available, consider multi‑task heads that predict chromatin accessibility or surface proteins from RNA context (and vice versa). Even a weak auxiliary objective pushes embeddings toward causal structure instead of mere co‑expression.
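Here is one way such auxiliary heads could look, assuming a shared encoder that already produces per‑cell embeddings; the module, dimension names, and loss weighting are hypothetical placeholders rather than a prescribed design.

import torch
import torch.nn as nn

class MultiOmicHeads(nn.Module):
    """Auxiliary heads that predict chromatin accessibility and surface proteins
    from a shared RNA-derived cell embedding (illustrative sketch)."""
    def __init__(self, d_model: int, n_peaks: int, n_proteins: int):
        super().__init__()
        self.atac_head = nn.Linear(d_model, n_peaks)        # binary peak accessibility
        self.protein_head = nn.Linear(d_model, n_proteins)  # CITE-seq surface proteins

    def forward(self, cell_emb, peaks=None, proteins=None):
        loss = 0.0
        if peaks is not None:  # only cells with paired ATAC contribute
            loss = loss + nn.functional.binary_cross_entropy_with_logits(
                self.atac_head(cell_emb), peaks)
        if proteins is not None:  # only cells with paired CITE-seq contribute
            loss = loss + nn.functional.mse_loss(self.protein_head(cell_emb), proteins)
        return loss  # add (with a small weight) to the main pretraining loss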
Third, align with ontologies early. Map observations to atlas‑standard fields—cell_type, tissue_general, anatomical structure IDs—before training. Use these labels for curriculum sampling and hard‑negative mining across tissues. You’ll reduce shortcut learning and improve transfer to unseen donors and labs because your batches reflect biological heterogeneity, not dataset quirks.
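A simple way to get curriculum‑style balance is to sample cells inversely to tissue frequency. The sketch below assumes a hypothetical tissue_labels array holding one atlas‑standard tissue_general value per training cell, and applies to map‑style datasets rather than the streaming datapipe shown earlier.

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# tissue_labels: one atlas-standard tissue_general label per training cell (hypothetical)
tissue_labels = np.array(["lung", "lung", "liver", "kidney", "lung", "liver"])

# Weight each cell inversely to its tissue's frequency so rare tissues are not drowned out
_, inverse, counts = np.unique(tissue_labels, return_inverse=True, return_counts=True)
weights = 1.0 / counts[inverse]

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(weights),
    replacement=True,
)
# Pass `sampler=sampler` to your DataLoader when using a map-style dataset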
Finally, evaluate like a biologist. Split by donor and lab, hold out tissue regions, and test zero‑shot on spatial sections from anatomically adjacent but distinct areas. If your model claims to “know liver,” it should generalize from periportal to pericentral zones without collapsing into generic hepatocyte clusters. Neuroscience atlases already do this kind of geography‑aware validation; non‑brain tissues are catching up fast.
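Grouped splitters make donor‑ and lab‑aware evaluation hard to get wrong. The sketch below assumes a hypothetical obs DataFrame carrying atlas‑standard donor_id and dataset_id columns.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# obs: per-cell metadata with atlas-standard columns (hypothetical example frame)
obs = pd.DataFrame({
    "donor_id":   ["D1", "D1", "D2", "D3", "D3", "D4"],
    "dataset_id": ["labA", "labA", "labA", "labB", "labB", "labC"],
})

# Hold out entire donors so no donor appears in both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(obs, groups=obs["donor_id"]))

# For a stricter transfer test, group by lab/dataset instead:
lab_train, lab_test = next(splitter.split(obs, groups=obs["dataset_id"]))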
A short story: the atlas‑native perturbation model
Imagine training a model to predict the effect of a kinase inhibitor across the liver’s metabolic zones. You stream hepatocytes and endothelial cells from portal to central veins, include paired ATAC to capture regulatory motifs, and add spatial bins that preserve the oxygen gradient. The pretraining task masks genes and asks the model to reconstruct them while also predicting a small set of zone‑specific markers and accessible chromatin peaks. Fine‑tuning happens on a modest perturb‑seq set from one region.
Here’s what changes. Instead of memorizing “drug X down‑regulates gene Y,” the model learns that in pericentral hepatocytes the same pathway sits on a different regulatory backbone, so the response is blunted or inverted. When you deploy to unseen donors or a tissue slice from a different lab, the predictions hold up because the embedding encodes the right causal context. That is what a 2026‑grade BFM trained on a multimodal atlas should do.
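To make the masking objective in this story concrete, here is a minimal masked‑gene reconstruction loss; the encoder, decoder, masking rate, and zero‑out corruption are simplified stand‑ins for whatever your architecture actually uses.

import torch
import torch.nn as nn

def masked_gene_loss(encoder: nn.Module, decoder: nn.Module,
                     X: torch.Tensor, mask_rate: float = 0.15) -> torch.Tensor:
    """Mask a fraction of genes per cell, encode the corrupted profile,
    and score reconstruction only on the masked positions (illustrative sketch)."""
    mask = torch.rand_like(X) < mask_rate  # genes to hide in each cell
    corrupted = X.masked_fill(mask, 0.0)   # simple zero-out corruption
    recon = decoder(encoder(corrupted))    # [cells, genes] predictions
    return nn.functional.mse_loss(recon[mask], X[mask])

# Zone markers and ATAC peaks would be added as auxiliary heads on the same
# encoder output, as sketched in the strategy section above.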
Summary / Takeaways
The past two years have delivered the ingredients biology foundation models were missing: massive, harmonized atlases accessible by API; maturing spatial and multi‑omic coverage grounded in common coordinate frameworks; and early but convincing evidence that transformer pretraining on single‑cell data yields broadly useful representations. In 2026, the biggest model gains are likely to come from how richly we encode context, not just how many cells we ingest. If you’re building or deploying BFMs, now is the time to wire your pipelines to atlas standards, add spatial and regulatory views to your objectives, and evaluate with anatomically aware splits. Do that, and your models won’t just recognize cells—they’ll understand where those cells live and why they behave the way they do.
Further Reading
- The Human Cell Atlas from a cell census to a unified foundation model (Nature, 2024). https://www.nature.com/articles/s41586-024-08338-4
- CZ CELLxGENE Discover: scalable exploration, analysis and modeling with Census and TileDB‑SOMA (Nucleic Acids Research, 2025). https://academic.oup.com/nar/article/53/D1/D886/7912032
- Human BioMolecular Atlas Program: 3D Human Reference Atlas construction and usage (Nature Methods, 2025). https://pubmed.ncbi.nlm.nih.gov/40082611/
- A high‑resolution transcriptomic and spatial atlas of cell types in the whole mouse brain (Nature, 2023). https://www.nature.com/articles/s41586-023-06812-z
- scGPT: toward building a foundation model for single‑cell multi‑omics using generative AI (Nature Methods, 2024). https://www.nature.com/articles/s41592-024-02201-0