Foundation Models for Cell Biology: An Overview

By EVOBYTE, your partner in bioinformatics

Introduction

Imagine asking a model not to finish your sentence, but to finish a cell. Instead of predicting the next word, it predicts how a gene expression profile should look, which transcription factors might drive a state change, or which perturbation could nudge a diseased cell toward a healthier phenotype. That’s the promise of foundation models for cell biology: train once on massive, diverse single‑cell datasets, then adapt quickly to many downstream tasks with little extra data. In other words, a GPT‑like engine for cells rather than text. The Stanford definition captures it well: a foundation model is trained on broad data at scale and adapted to a wide range of tasks, with emergent capabilities that weren’t explicitly programmed.

This shift matters because biology is increasingly data‑rich and question‑dense. We profile millions of cells across tissues, species, and conditions, yet most labs still rebuild models for each new problem. Foundation models flip that workflow. They aim to learn general cellular principles once—co‑expression structure, pathways, regulatory context—and then reuse that knowledge everywhere from cell type annotation to perturbation response prediction.

What is a foundation model in biology, exactly?

If you’re comfortable with large language models (LLMs), the analogy comes naturally. In language, tokens are words; in single‑cell transcriptomics, tokens are genes. A “document” is a cell, represented by the genes it expresses and at what levels. During pretraining, a biological foundation model learns to reconstruct missing information about a cell from the rest of its profile. Rather than memorizing the exact counts, it learns context: which genes tend to move together, which modules define a state, and how those patterns shift across tissues and conditions. That context becomes a reusable embedding of each cell and gene.

Two exemplars illustrate the idea: Geneformer and scGPT. Geneformer is a transformer model pretrained on tens of millions of single‑cell transcriptomes. It uses a rank‑based encoding that prioritizes genes most informative for cell state, and it was shown to boost accuracy across diverse downstream tasks after light fine‑tuning. scGPT, covered in the next section, takes a generative route to the same end.

You’ll also see the term transfer learning attached to these models, because that’s the trick: pretrain on broad, unlabeled data, then transfer the model to a new dataset or question with minimal supervision. That’s how we go from a general “cell GPT” to a classifier for rare cell states, a scoring function for regulatory programs, or an in silico screening tool for candidate interventions.

From tokens to cells: how cellular foundation models actually learn

To make the jump from words to genes, we start with single‑cell RNA‑seq (scRNA‑seq). Each cell can be represented as a high‑dimensional vector of gene counts. Because absolute counts are noisy and platform‑dependent, Geneformer introduces a rank value encoding: within each cell, genes are ordered by expression after normalizing against corpus‑level statistics. This simple change emphasizes discriminative genes, de‑emphasizes ubiquitous housekeeping genes, and lets transformer attention focus on context rather than measurement scale. The model is then trained with a masked objective—hide a fraction of the “tokens” (genes) and predict which gene should occupy each masked position given the rest of the cell’s context. Over millions of cells, it learns the latent structure that organizes cellular states.
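To make the mechanics concrete, here is a minimal sketch of rank-value encoding in Python. It illustrates the idea rather than reproducing Geneformer's actual tokenizer: `counts` is a dense cells-by-genes matrix and `gene_medians` stands in for the corpus-level statistics used for normalization. The resulting ranked gene lists are what the transformer sees; masking then amounts to hiding part of each list and asking the model to recover the hidden gene identities from the surrounding context.

```python
import numpy as np

# Minimal sketch of rank-value encoding, in the spirit of Geneformer's tokenizer.
# This is an illustration, not the official implementation: `counts` is a dense
# cells x genes matrix and `gene_medians` are hypothetical corpus-level medians
# used to down-weight genes that are high everywhere.

def rank_value_encode(counts: np.ndarray, gene_medians: np.ndarray, max_len: int = 2048):
    """Turn each cell's expression vector into a ranked list of gene indices."""
    tokens = []
    for cell in counts:
        # Normalize each gene by its corpus-level median (skip zero medians).
        normalized = np.divide(cell, gene_medians,
                               out=np.zeros_like(cell), where=gene_medians > 0)
        expressed = np.nonzero(normalized)[0]                  # keep detected genes only
        order = expressed[np.argsort(-normalized[expressed])]  # highest normalized value first
        tokens.append(order[:max_len])                         # truncate to the model's input length
    return tokens

# Toy example: 2 cells x 5 genes
counts = np.array([[0, 3, 10, 0, 1],
                   [5, 0,  2, 8, 0]], dtype=float)
gene_medians = np.array([2.0, 1.0, 10.0, 4.0, 1.0])
print(rank_value_encode(counts, gene_medians))
```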

scGPT follows a kindred path but leans into generative training, optimizing the model to produce realistic cellular profiles and to capture relationships that transfer to new tasks. In practice, both approaches output embeddings for cells and genes: compact numeric vectors that encode the model’s internal understanding of biology. Those embeddings become features for your everyday jobs—clustering, annotation, trajectory inference, batch correction—or inputs to light task‑specific heads.
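In day-to-day analysis, that usually means swapping the embedding into an otherwise familiar pipeline. The sketch below assumes a placeholder `embed_cells` function, backed here by PCA only so the example runs end to end; in practice you would call your model of choice, since Geneformer and scGPT expose different APIs. Everything downstream is ordinary scanpy.

```python
import numpy as np
import scanpy as sc

def embed_cells(adata):
    """Placeholder: replace with your foundation model's embedding call
    (Geneformer, scGPT, ...). Here we use PCA so the sketch runs end to end."""
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=50)
    return adata.obsm["X_pca"]

adata = sc.read_h5ad("my_dataset.h5ad")          # hypothetical path to your count matrix

cell_emb = embed_cells(adata)                    # shape: (n_cells, d)
adata.obsm["X_fm"] = np.asarray(cell_emb)

sc.pp.neighbors(adata, use_rep="X_fm")           # kNN graph built on the embedding
sc.tl.leiden(adata, key_added="fm_clusters")     # clustering for annotation
sc.tl.umap(adata)                                # 2-D view of the embedding space
sc.pl.umap(adata, color=["fm_clusters"])
```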

Are “digital cells” just simulations by a new name? No—and that’s the point

It’s tempting to equate foundation models with the older tradition of cell simulation, but they solve different problems and make different trade‑offs.

Mechanistic simulations explicitly model the biophysics and biochemistry of a cell. The classic example is the 2012 whole‑cell model of Mycoplasma genitalium, which integrated 28 process‑level submodels—metabolism, transcription, translation, DNA replication—into a time‑resolved simulator that predicts phenotype from genotype. These models produce interpretable causal trajectories and can test mechanistic hypotheses, but they are painstaking to build, require many hand‑tuned parameters, and often target one organism or condition at a time.

Foundation models, in contrast, are statistical learners. They do not simulate every reaction; they learn the geometry of cell state directly from data. That geometry is powerful for rapid annotation, harmonizing heterogeneous datasets, or scoring likely regulators of a transition, and it generalizes across tissues and platforms after pretraining. However, because the knowledge is implicit, interpretability and out‑of‑distribution robustness need deliberate attention. In short: simulations aim for mechanistic fidelity at the cost of effort and scope; foundation models aim for breadth and reuse at the cost of explicit causality.

The two approaches are complementary. You can use a foundation model to propose candidate regulators or plausible state trajectories, then test them in a mechanistic simulator. Conversely, you can distill trajectories from a trusted simulator into synthetic training data to improve a foundation model’s priors. The more we blend data‑driven and physics‑driven views, the closer we get to credible “digital cells” that both predict well and explain why.

What’s trending now in cellular foundation models

Three currents are shaping the field as we head into 2026.

First, scale and coverage are rising fast. Early Geneformer work pretrained on roughly 30 million cells; newer releases report training on about 104 million, expanding the model’s vocabulary and input length. This matters because rare states and tissue‑specific programs become visible only when you aggregate enough data.

Second, objectives are diversifying. Beyond masked prediction or next‑token modeling, labs are baking in biology‑aware tasks—contrastive objectives for pathway structure, multi‑task heads for regulatory targets, or perturbation‑aware losses that encourage causal sensitivity. This trend mirrors multimodal LLMs: the more signals you share during pretraining, the richer the downstream representations. scGPT’s published results on multi‑omic integration and perturbation response prediction illustrate the payoff of this direction.
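As one concrete example of what such an auxiliary objective can look like, here is a generic InfoNCE-style contrastive loss over cell embeddings. It is a textbook formulation for illustration, not the objective of any specific published model; `z1` and `z2` are assumed to be embeddings of two augmented views of the same cells, for instance two random gene-dropout masks.

```python
import torch
import torch.nn.functional as F

# Generic InfoNCE contrastive loss over cell embeddings (illustrative only).
# z1 and z2: embeddings of two views of the same batch of cells, shape (n, d).

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature       # similarity of every pair of cells
    targets = torch.arange(z1.size(0))     # matching views sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 cells, 32-dimensional embeddings
z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
print(info_nce(z1, z2).item())
```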

Third, evaluation is getting sharper. Independent benchmarks have started probing zero‑shot behavior—how models perform without any fine‑tuning. A 2025 study found that Geneformer and scGPT can underperform simpler methods in some zero‑shot settings, especially for batch integration, even though they improve substantially with fine‑tuning. The message isn’t “don’t use them”; it’s “match the setup to the task and validate carefully.”

From a user’s perspective, these trends translate to practical guidance. If you have labels and a clear target, lightweight fine‑tuning on top of a foundation model is often worth it. If you need fully zero‑shot performance, test baseline pipelines like HVG‑based PCA/UMAP or scVI alongside the foundation model and pick what wins on your data. And if interpretability is paramount, plan to pair embeddings with downstream analyses that surface gene‑level saliency, attention maps, or perturbation simulations rather than relying on black‑box scores alone.
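A quick way to run that comparison is a label-recovery check on embeddings you already have. The sketch below assumes you can supply `fm_emb`, `pca_emb`, and `cell_type_labels` for your own dataset; the random toy arrays are only there so the code executes as written.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Head-to-head check: does the foundation-model embedding beat a simple
# baseline (e.g. HVG + PCA) at recovering known labels on *your* data?

def knn_label_recovery(embedding: np.ndarray, labels: np.ndarray, k: int = 15) -> float:
    """Mean 5-fold accuracy of a kNN classifier on the given embedding."""
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, embedding, labels, cv=5).mean()

# Toy stand-ins so the sketch runs; replace with your real embeddings and labels.
rng = np.random.default_rng(0)
cell_type_labels = rng.integers(0, 3, size=300)
fm_emb = rng.normal(size=(300, 64)) + cell_type_labels[:, None]   # pretend signal
pca_emb = rng.normal(size=(300, 50))

print("foundation model:", knn_label_recovery(fm_emb, cell_type_labels))
print("PCA baseline:   ", knn_label_recovery(pca_emb, cell_type_labels))
```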

A short story from the lab bench

A cardiovascular team I worked with had a familiar problem: a rare patient cohort, limited RNA‑seq, and a hunch that a few transcription factors were steering diseased cardiomyocytes away from normal contractile function. In a classical pipeline, they would have fit a new model end‑to‑end—weeks of tuning for a dataset too small to justify those knobs. Instead, they reached for a pretrained cell foundation model, extracted embeddings, and fine‑tuned a small classification head to separate control versus disease. The head achieved stable performance with only dozens of labeled cells, but the real insight came from in‑silico perturbation: nudging the embedding toward the healthy state highlighted specific regulators that then validated in iPSC‑derived cardiomyocytes. That workflow—reuse, adapt, explain—felt a lot like how engineering teams use GPT today: the heavy lifting lives in the foundation; your job is to steer it.
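For readers who want to see what "a small classification head on frozen embeddings" amounts to in code, here is a schematic version with placeholder tensors. It is not the team's actual pipeline, and the in silico perturbation step is left out because published approaches are model-specific.

```python
import torch
import torch.nn as nn

# Schematic: train a small linear head on frozen, precomputed embeddings to
# separate control from disease cells. All data here are placeholders.

emb_dim, n_cells = 256, 120
embeddings = torch.randn(n_cells, emb_dim)   # frozen foundation-model embeddings
labels = torch.randint(0, 2, (n_cells,))     # 0 = control, 1 = disease

head = nn.Linear(emb_dim, 2)                 # the only trainable part
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (head(embeddings).argmax(dim=1) == labels).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```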

The published Geneformer study reported a similar pattern: broad pretraining, light task‑specific tuning, and targeted in silico experiments to pick credible therapeutic targets in cardiomyopathy. It captures why these models are exciting for network biology, where labels are scarce but structure is abundant.

Where “digital cells” are heading next

As single‑cell atlases expand, foundation models will increasingly serve as the default “operating system” for cell‑state analysis. Expect larger vocabularies that unify genes across species, longer input lengths that preserve more of each profile, and tighter coupling to spatial and temporal modalities so embeddings reflect where and when cells live, not just what they express. On the tooling side, we’ll see easier routes to deploy models inside analysis notebooks, with standardized tokenizers, model cards, and guardrails for data provenance and consent.

Equally important, norms around evaluation will mature. Zero‑shot leaderboards will sit alongside fine‑tuned ones, and we’ll report calibration and uncertainty, not just accuracy. Finally, the wall between simulation and representation will thin. Labs will feed mechanistic simulators with priors from embeddings, and they’ll also distill simulator outputs back into pretraining corpora. The goal isn’t to crown one paradigm. It’s to combine them into trustworthy digital cell models that help us ask better questions and design better experiments.

Summary / Takeaways

Foundation models for cell biology import the GPT playbook into single‑cell data. Pretrain once on massive, heterogeneous corpora; transfer that knowledge to many tasks with little labeled data. Geneformer and scGPT are leading examples, translating masked or generative objectives into cell and gene embeddings that carry useful biological context. They are not replacements for mechanistic simulations; they are complementary tools that trade explicit causality for breadth and speed. Used thoughtfully—with careful validation, light fine‑tuning, and, when needed, mechanistic cross‑checks—these models can make “digital cells” more than a slogan.

If you’re curious where to begin, start small. Extract embeddings from a dataset you already know well, benchmark them against your current pipeline, and see where the model’s inductive biases help or hurt. Then decide how much fine‑tuning and interpretation you need to trust the result. What question in your lab could move faster if the model already “knew” cell biology?

Further Reading

  • On the Opportunities and Risks of Foundation Models (Stanford CRFM report) — https://fsi.stanford.edu/publication/opportunities-and-risks-foundation-models
  • Transfer learning enables predictions in network biology (Geneformer, Nature 2023) — https://pmc.ncbi.nlm.nih.gov/articles/PMC10949956/
  • scGPT: toward building a foundation model for single‑cell multi‑omics using generative AI (Nature Methods 2024) — https://doi.org/10.1038/s41592-024-02201-0
  • A Whole‑Cell Computational Model Predicts Phenotype from Genotype (Cell 2012) — https://pmc.ncbi.nlm.nih.gov/articles/PMC3413483/
  • Zero‑shot evaluation reveals limitations of single‑cell foundation models (Genome Biology 2025) — https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03574-x
