By EVOBYTE, your partner in bioinformatics
Introduction: why single‑cell needs a new kind of model
Single‑cell RNA sequencing (scRNA‑seq) turned biology into a data‑rich language of cells, but the vocabulary is huge and noisy. Batch effects blur signals, labels are scarce, and new tissues and technologies arrive every month. That’s why “foundation models” are gaining attention. Like large language models in NLP, these transformer‑based systems learn general representations from massive corpora and then adapt to specific tasks. In single‑cell biology, the promise is simple: pretrain once on millions of cells, transfer everywhere. The best‑known example so far is scGPT.
What is a foundation model for single‑cell biology?
A foundation model (FM) is a large neural network trained on diverse unlabeled data so it can be fine‑tuned or prompted for many downstream tasks. In omics, cells become “documents,” genes become “tokens,” and expression magnitudes become “values” the model must encode. Training often borrows masked language modeling (MLM): hide a subset of gene signals and ask the model to recover them from context. When this works, the model captures co‑expression structure, pathway context, and technical confounders, producing embeddings that can power tasks like cell type annotation, integration across batches, perturbation response prediction, and even gene regulatory network (GRN) inference. scGPT crystallizes this recipe at single‑cell scale.
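The masked language modeling idea above can be sketched in a few lines. This is a minimal numpy illustration, not scGPT's actual pipeline: the gene names, bin edges, and the -1/-100 sentinel ids are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression vector for one cell (hypothetical genes and counts).
genes = np.array(["CD3D", "CD19", "NKG7", "MS4A1", "LYZ"])
expr = np.array([5.0, 0.0, 2.0, 1.0, 7.0])

# Discretize continuous expression into value bins (4 bins here),
# so the model predicts a token rather than a raw count.
bins = np.digitize(expr, bins=[0.5, 2.5, 5.5])  # values in 0..3

# Masked-value objective: hide ~40% of genes; the model must recover
# the hidden bins from the unmasked co-expression context.
mask = rng.random(len(genes)) < 0.4
inputs = np.where(mask, -1, bins)      # -1 plays the role of [MASK]
targets = np.where(mask, bins, -100)   # -100 = position ignored by the loss
```

A real model would feed `inputs` through a transformer and score its predictions only at the masked positions, which is what forces it to learn co-expression structure.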
scGPT in a nutshell: training recipe and what it learns
scGPT is a generative pretrained transformer trained on more than 33 million single‑cell profiles spanning diverse tissues. Instead of words, it ingests gene tokens and uses value‑aware encodings to represent expression bins, teaching the model both which genes are “present” and how strongly they’re expressed. After pretraining, scGPT is adapted via transfer learning to specific tasks, where it has reported strong results on cell annotation, multi‑batch and multi‑omic integration, perturbation prediction, and GRN discovery. The core insight is that large‑scale pretraining distills reusable biological structure that fine‑tuning can quickly specialize for a lab’s dataset.
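The "value-aware encoding" idea can be pictured as summing two embeddings per input position: one for the gene's identity and one for its binned expression level. The sketch below uses random numpy tables and invented token ids purely for illustration; scGPT's real architecture and input pipeline are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative embedding tables (random weights stand in for learned ones).
VOCAB_GENES, VALUE_BINS, DIM = 1000, 8, 64
gene_table = rng.standard_normal((VOCAB_GENES, DIM))
value_table = rng.standard_normal((VALUE_BINS, DIM))

gene_ids = np.array([12, 305, 77])   # hypothetical gene token ids
value_bins = np.array([0, 3, 7])     # their binned expression levels

# Each position encodes both WHICH gene is present and HOW STRONGLY
# it is expressed; a transformer would then mix these positions.
x = gene_table[gene_ids] + value_table[value_bins]  # shape (3, DIM)
cell_repr = x.mean(axis=0)                          # pooled cell embedding
```

The key design choice is that identity and magnitude live in the same vector space, so attention can reason over "gene at level" pairs rather than gene presence alone.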
Performance in practice: fine‑tuning shines, zero‑shot needs caution
It’s tempting to drop a pretrained FM into analysis “zero‑shot,” expecting instant wins. Recent systematic evaluations advise caution. Independent studies found that, without any task‑specific adaptation, embeddings from current single‑cell FMs such as scGPT and Geneformer can underperform simpler baselines for clustering or batch correction. The takeaway isn’t that FMs fail; it’s that their value appears most reliably when you fine‑tune or align them to the dataset at hand. In discovery settings with unknown labels, you should verify performance and avoid assuming that bigger pretraining automatically yields better zero‑shot results. As the field matures—with models pushing into spatial and multimodal data—benchmarks are catching up, but today’s best practice is still “pretrain broadly, adapt locally.”
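One cheap way to "verify performance" before trusting zero-shot embeddings is to check whether nearest neighbors in embedding space share known labels. The k-NN agreement function below is a toy sanity check on synthetic data, not a substitute for proper benchmarks such as ARI or batch-mixing metrics.

```python
import numpy as np

def knn_label_agreement(emb, labels, k=5):
    """Fraction of cells whose k nearest neighbors carry the same label.
    Near 1.0 means the embedding separates the labeled groups well."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-matches
    nn_idx = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per cell
    return (labels[nn_idx] == labels[:, None]).mean()

# Toy data: two well-separated "cell types" (hypothetical embeddings).
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (20, 8)),
                 rng.normal(3, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)

score = knn_label_agreement(emb, labels)
```

Running the same check on a simple PCA baseline alongside the FM embedding makes the zero-shot comparison concrete: if the baseline scores as well or better, fine-tuning is warranted before drawing conclusions.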
Summary / Takeaways
Foundation models are arriving in single‑cell biology, and scGPT is a leading early blueprint. Pretraining at the scale of tens of millions of cells lets the model internalize broad gene‑gene and cell‑state structure, which you can then transfer to everyday tasks. However, the strongest results typically come after light task‑specific fine‑tuning, while zero‑shot use still shows limits on clustering and batch correction in independent tests. If you treat scGPT as a starting point rather than a drop‑in replacement, you get the best of both worlds: reusable biological priors plus the adaptability your datasets demand. Looking ahead, multimodal and spatial extensions suggest a future where one FM knits together RNA, protein, and neighborhood context into a single, fast, and more faithful representation of tissue biology.
Further Reading
- scGPT: toward building a foundation model for single‑cell multi‑omics using generative AI (Nature Methods, 2024)
- Zero‑shot evaluation reveals limitations of single‑cell foundation models (Genome Biology, 2025)
- scGPT official code and tutorials (GitHub + Docs)
- Assessing the limits of zero‑shot foundation models in single‑cell biology (Microsoft Research explainer)
- Nicheformer: a foundation model for single‑cell and spatial omics (Nature Methods, 2025)
