
Foundation Models for Single-cell Omics: Nicheformer


Jonathan Alles

EVOBYTE Digital Biology


Introduction

If you’ve ever tried to infer a cell’s neighborhood from a dissociated single‑cell RNA‑seq (scRNA‑seq) profile, you know the feeling of flying blind. The transcriptome tells you a lot about the cell, yet it strips away the one thing tissues care about most: where that cell lives and who its neighbors are. Spatial transcriptomics restores the map, but often at the cost of gene coverage or sample scale. Until recently, you had to choose.

Enter Nicheformer, a foundation model that learns from both worlds—dissociated scRNA‑seq and spatial omics—so it can carry spatial context into settings where coordinates are missing. In late 2025, Nicheformer was introduced as a transformer‑based model trained on a massive cross‑species corpus, and it set a new bar for transferring spatially aware labels and neighborhood composition into unseen data. Rather than treating spatial and single‑cell data as separate silos, Nicheformer tries to make them fluent in each other’s language.

This post gives a plain‑English tour of foundation models for single‑cell biology, what Nicheformer adds, and how you might use it today. We’ll keep the focus practical, with brief code snippets you can adapt.

Why foundation models matter for single‑cell and spatial omics

“Foundation model” is a term borrowed from AI: a large model pretrained on diverse data, then adapted to many tasks with light supervision. In biology, the analogy goes like this. If language models learn grammar from text, cell models learn “cell grammar” from gene expression. Genes are like tokens, cells are like sentences, and a pretrained encoder can embed both in a shared space useful for tasks such as annotation, integration, perturbation prediction, or pathway discovery.

This idea is not hypothetical. scGPT, for example, scaled generative pretraining over tens of millions of cells and showed that transfer learning can boost a range of downstream analyses, from batch correction to multi‑omic integration. Its core contribution was to treat gene expression as a learnable sequence, then reuse those embeddings broadly.

However, early enthusiasm met a reality check. A rigorous zero‑shot evaluation in 2025 found that popular single‑cell foundation models did not consistently outperform strong classical baselines when used without fine‑tuning. The message was not “don’t use foundation models,” but rather “be clear about when pretraining helps and when it doesn’t.” In discovery settings where labels are scarce, naïvely dropping in a pretrained embedding may underdeliver unless the model’s objective aligns with the task.

Spatial omics added another twist. Most single‑cell foundation models were trained only on dissociated profiles. That limits their ability to represent microenvironments like tumor–immune niches or tissue subregions, because those patterns are inherently spatial. A few efforts, such as CellPLM, started to encode cell–cell relationships during pretraining and to leverage spatial data explicitly, but broad, multimodal corpora remained rare.

Against this backdrop, Nicheformer’s promise is straightforward: learn jointly from dissociated and spatial data at scale, then use that representation to predict spatially informed labels—even when coordinates are unavailable.

What makes Nicheformer different

At its heart, Nicheformer is a transformer encoder trained on SpatialCorpus‑110M, a curated collection of more than 110 million cells spanning both human and mouse, across dozens of tissues, and including tens of millions of spatially resolved profiles. Instead of forcing an integrated latent space up front, the corpus preserves biological and technical variability while harmonizing identifiers (for example, orthologous genes) and metadata. The model uses rank‑based gene tokenization—ordering genes by expression relative to modality‑specific means—which helps stabilize embeddings across assays. It also appends contextual tokens for species, modality, and technology so the network can learn how these factors shape expression.
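To make the tokenization idea concrete, here is a minimal Python sketch of rank-based tokenization with contextual tokens. The function name, normalization constants, token ids, and the placement of the context tokens are illustrative assumptions for this sketch, not Nicheformer's actual implementation or vocabulary.

import numpy as np

def rank_tokenize(expression, modality_means, context_tokens, max_len=1500):
    """Sketch: order genes by expression relative to modality-specific means,
    then combine with contextual tokens (species, modality, technology).

    expression:      1D array of counts for one cell (genes in a fixed order)
    modality_means:  1D array of mean expression per gene for this assay
    context_tokens:  integer ids for species / modality / technology (placeholders)
    """
    # Scale each gene by its modality-specific mean so assays become comparable
    scaled = expression / (modality_means + 1e-9)
    # Rank genes from most to least expressed; gene indices become the tokens
    ranked_genes = np.argsort(-scaled)
    # Keep only expressed genes and truncate to the model's context length
    ranked_genes = ranked_genes[scaled[ranked_genes] > 0]
    gene_budget = max_len - len(context_tokens)
    # Placement of context tokens at the front is an assumption of this sketch
    return np.concatenate([context_tokens, ranked_genes[:gene_budget]])

# Toy usage: 6 genes, one cell, two context tokens (ids are placeholders)
cell = np.array([0, 5, 2, 0, 9, 1])
means = np.array([1.0, 2.0, 1.0, 3.0, 4.0, 0.5])
tokens = rank_tokenize(cell, means, context_tokens=np.array([10001, 10002]))
print(tokens)

The design choice worth noting is that ranking genes against modality-specific means, rather than feeding raw counts, is what lets token sequences stay comparable across assays with very different capture efficiencies.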

Architecturally, Nicheformer keeps things practical: a 12‑layer transformer with multi‑head self‑attention, 1,500‑token inputs, and 512‑dimensional embeddings—roughly 49 million parameters. The point isn’t sheer size; it’s the pretraining diet. Because spatial and dissociated measurements live side by side in the corpus, the model can learn to encode patterns that predict neighborhood composition, region labels, or niche identities from expression alone. In benchmarks crafted for spatial tasks, Nicheformer outperforms baselines including scGPT, Geneformer, UCE, and spatially aware approaches like CellPLM. Linear probing on frozen embeddings already performs well; fine‑tuning adds another bump when labels permit.

Two ideas stand out in the results. First, training only on dissociated data—even with more cells—misses spatial complexity. The model benefits from explicit exposure to spatial assays during pretraining. Second, the learned representation transfers: you can project spatially aware labels onto scRNA‑seq atlases and recover regional structure and neighborhood trends that would otherwise require imaging. That unlocks in‑silico enrichment of existing atlases with “where‑like” information.

Of course, there are caveats. Nicheformer does not ingest raw coordinates during pretraining, by design; it learns spatial proxies from expression. Performance depends on tissue coverage and transcriptional diversity, so niche types with sparse training examples are harder. And as with other foundation models, independent, task‑matched benchmarks still matter—especially if you plan to use the embeddings zero‑shot. Taken together, the conclusions are optimistic but measured: spatial context leaves a signature in gene expression, and a joint pretraining regime can capture enough of that signal to help downstream analysis.

Using Nicheformer in practice

Let’s say you maintain a large scRNA‑seq atlas of human lung and want to annotate spatially enriched structures—immune niches, epithelial subregions, or perivascular neighborhoods—without running new imaging experiments. The practical path looks like this:

You start by embedding your dataset with the pretrained model. The official repository provides code and weights, and you can operate in a familiar AnnData (Scanpy) workflow. After computing embeddings, you train a slim classifier—often just a linear head—on a labeled spatial dataset from a related tissue or platform. That head learns the mapping from Nicheformer space to your target labels. Then you apply it to your atlas to impute spatial labels and to estimate neighborhood composition.
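Here is a minimal sketch of that workflow with a frozen encoder and a scikit-learn linear head. The file names, the 'niche' label column, and the embed_cells placeholder are assumptions; swap in the embedding code and weights from the official repository.

import anndata as ad
from sklearn.linear_model import LogisticRegression

def embed_cells(adata):
    # Placeholder: replace with the encoder call from the official Nicheformer
    # repository (load pretrained weights, tokenize each cell, run the frozen
    # transformer, return an (n_cells, 512) embedding matrix).
    raise NotImplementedError("plug in the pretrained Nicheformer encoder here")

# 1. Embed a labeled spatial reference and the unlabeled dissociated atlas
spatial = ad.read_h5ad("spatial_reference.h5ad")   # spatial data with a 'niche' column in .obs
atlas = ad.read_h5ad("lung_scrnaseq_atlas.h5ad")   # dissociated scRNA-seq, no coordinates

spatial.obsm["X_nicheformer"] = embed_cells(spatial)
atlas.obsm["X_nicheformer"] = embed_cells(atlas)

# 2. Linear probing: fit a slim classifier (a linear head) on the frozen embeddings
head = LogisticRegression(max_iter=2000)
head.fit(spatial.obsm["X_nicheformer"], spatial.obs["niche"])

# 3. Impute spatially informed labels and per-class probabilities onto the atlas
atlas.obs["predicted_niche"] = head.predict(atlas.obsm["X_nicheformer"])
atlas.obsm["niche_probabilities"] = head.predict_proba(atlas.obsm["X_nicheformer"])

If your target is neighborhood composition rather than a categorical niche label, the same pattern can apply with a regression head predicting composition fractions from the same reference.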

When should you fine‑tune? If you have enough matched labels in your domain—say, dozens of spatial sections from the same organ—fine‑tuning the encoder for a few epochs can improve subtle distinctions that linear probing might blur. If labels are scarce or heterogeneous, stick with frozen embeddings and light supervision first, then revisit fine‑tuning after validating on a small holdout set. These heuristics mirror the model’s reported strengths and the broader experience with foundation models in biology.
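If you do move on to fine-tuning, a minimal PyTorch sketch of the loop might look like the following. The encoder object, the train_loader of tokenized cells with niche labels, and the class count are assumed to come from your own setup and the official repository; they are placeholders here, not part of a documented API.

import torch
import torch.nn as nn

# Assumptions for this sketch: `encoder` is the pretrained Nicheformer encoder
# returning (batch, 512) cell embeddings, and `train_loader` yields
# (tokens, labels) batches from your labeled spatial sections.
n_niche_classes = 8                                  # illustrative value
head = nn.Linear(512, n_niche_classes)

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},    # gentle updates to pretrained weights
    {"params": head.parameters(), "lr": 1e-3},       # faster updates for the new head
])
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                               # a few epochs, as suggested above
    for tokens, labels in train_loader:
        logits = head(encoder(tokens))
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Check a small holdout set here before committing to further epochs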

Where Nicheformer fits alongside scGPT, Geneformer, and CellPLM

Nicheformer isn’t the only path to general‑purpose cell embeddings, but it fills a gap. Models like scGPT and UCE excel at learning from massive dissociated datasets and can shine when you fine‑tune them for tasks such as annotation, integration, or perturbation response. Still, their pretraining objectives—often masked token modeling or generative prediction—do not explicitly target spatial microenvironments. That’s why a model trained jointly on spatial and dissociated data has an edge on spatial label transfer.

CellPLM took a related step by encoding cell–cell relations during pretraining and by leveraging spatially resolved data to model tissue structure directly. It’s a valuable direction, especially if your downstream tasks depend on neighborhoods more than on single‑cell states. Nicheformer’s contribution is to scale this idea across a far larger multimodal corpus and to demonstrate consistent gains on spatially defined benchmarks, even with simple linear heads. If you think of the landscape as complementary tools, you’d reach for scGPT when you need broad single‑cell transfer and perturbation‑aware tasks, for CellPLM when cell–cell graph structure is central, and for Nicheformer when the question is, “What spatial story does this transcriptome tell?”

As the field matures, two themes are emerging. First, evaluation must match deployment. Zero‑shot performance is not guaranteed, and benchmarks that mirror your use case—spatial label transfer, neighborhood prediction, cross‑species mapping—matter more than aggregate scores. Second, bigger isn’t always better if the pretraining diet and objective don’t align with your biology. The best results come from models that see the right modalities during pretraining and that you adapt with just enough supervision to stay honest.

Summary / Takeaways

The big idea behind Nicheformer is simple but powerful: spatial context leaves a detectable imprint on gene expression, and a model that learns jointly from spatial and dissociated data can read that imprint. In practice, this means you can enrich scRNA‑seq atlases with spatially aware labels, estimate neighborhood composition, and reason about tissue subregions using nothing more than expression matrices and a lightweight head on top of a pretrained encoder. The published benchmarks suggest that this approach outperforms models trained solely on dissociated data and retains strong performance even when you use linear probing.

At the same time, the usual cautions apply. Foundation models are not magic; they’re tools that shine when their pretraining mirrors your downstream question. If you deploy embeddings zero‑shot, validate against a task‑matched benchmark and keep a baseline in the loop. If you fine‑tune, do it sparingly and measure generalization. And if your biology hinges on microenvironments, reach for a model that actually saw spatial data during pretraining.

A final thought for practitioners: think of Nicheformer as a bridge. It doesn’t replace spatial experiments, but it can guide them—helping you prioritize which regions to image, which niches to sample deeper, and which tissues to revisit across species. That feedback loop—pretrain broadly, adapt lightly, validate carefully—will define the next wave of single‑cell foundation models.

