
From scGPT to scConcept: Models for Single-Cell Biology


Jonathan Alles

EVOBYTE Digital Biology


Introduction

Single‑cell datasets now span tens to hundreds of millions of profiles, and the community is racing to turn that ocean of counts into compact, reusable “foundation models” (FMs) for biology. These models promise a kind of virtual cell: an embedding that “understands” cell identity, responds coherently to perturbations, and transfers across assays with minimal tuning. Yet, as excitement builds, a quieter but important shift is happening in how we train them. Instead of reconstructing every count, a new wave of contrastive methods focuses on what matters most for downstream science: stable, technology‑agnostic cell representations. In this post, we unpack the landscape, introduce scConcept, and explain how contrastive learning—what we’ll shorthand as “sc‑contrast”—differs from the reconstruction‑style models that came first. We’ll also look honestly at current limitations and highlight where these models are already delivering value.

Note on terminology: by "sc‑contrast," we refer to contrastive pretraining frameworks for single‑cell data, such as scConcept and related approaches in this family, rather than to a single package literally named "scContrast."

Foundation models for single‑cell biology: where we are now

Foundation models pretrain on massive corpora of single‑cell RNA‑seq (scRNA‑seq) to learn general representations that transfer to tasks like cell‑type annotation, atlas mapping, and perturbation prediction. Early models borrowed self‑supervised recipes from language modeling. Some treat a cell as a “bag of genes” and mask or bin expression values, asking the model to reconstruct what’s missing. Others learn from rank information—predicting gene order rather than raw counts—to sidestep platform‑specific quirks. A recent synthesis grouped single‑cell FMs by their learning targets: ordering (rank‑based methods such as iSEEEK, tGPT, Geneformer), value categorization (binning counts), and value projection (continuous value prediction). This taxonomy clarifies why two models trained on the same atlas can behave so differently when moved to a new assay.

Reality checks have been healthy for the field. Independent zero‑shot evaluations—where you apply a pretrained model without fine‑tuning—have shown that scGPT and Geneformer do not consistently beat simpler baselines, and that robustness to batch effects remains an open problem. These findings don’t diminish the promise of FMs; they simply remind us that better objectives and evaluation protocols are needed before we declare victory on “virtual cells.”

Meanwhile, new entrants are scaling ambitiously. For example, scLong pushes self‑attention over the full set of ~28,000 human genes and fuses Gene Ontology knowledge via graph convolution to model long‑range dependencies that earlier architectures often ignored. This kind of architectural rethink—together with smarter objectives—is where much of the forward motion is happening.

From reconstruction to contrast: how scConcept and sc‑contrast change the recipe

Reconstruction‑style training asks a model to rebuild gene counts from corrupted inputs. That’s a sensible proxy task, but it can entangle nuisance variation—library size, assay‑specific artifacts, and panel choice—with the biological signal you actually want to keep. Contrastive pretraining flips the incentive. Instead of making the model a great reconstructor, it makes the model a great identifier of “the same cell” across multiple views.
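To make the contrastive incentive concrete, here is a minimal sketch of an InfoNCE-style loss over two views of a batch of cells. This is illustrative only—it is not scConcept's published loss—but it shows the core mechanic: matched rows (two views of the same cell) are positives, all other pairs in the batch are negatives.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.1):
    """Contrastive (InfoNCE-style) loss over two views of a batch of cells.

    view_a, view_b: (n_cells, dim) embeddings. Row i of each matrix is a
    different "view" of the same cell, so (i, i) pairs are positives and
    every other pair in the batch acts as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (n, n) similarity matrix
    # Cross-entropy with the diagonal (matched views) as the target class
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls the two views of each cell together while pushing apart different cells—no reconstruction of counts is ever required.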

scConcept embodies this shift. It generates multiple views of a cell (for instance, different normalizations, stochastic gene or panel subsampling, and other light perturbations) and trains a transformer to pull those views together in embedding space while pushing apart different cells. By optimizing a cell‑level identification objective, scConcept learns invariances that reconstruction doesn’t naturally enforce: embeddings become less sensitive to count distributions and more resilient across technologies and gene panels. In large‑scale experiments over a 30‑million‑cell corpus, the authors report improvements across cell‑type annotation, integration, atlas mapping of new technologies, spatial transfer and imputation, and even gene‑panel optimization. The core idea is simple: teach the model to recognize a cell regardless of viewpoint, and you recover a representation that travels.
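The "multiple views" step can be sketched as follows. The specific augmentations here—random gene dropping to mimic a smaller panel, and depth rescaling to mimic library-size variation—are hypothetical stand-ins in the spirit of the recipe described above, not scConcept's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_view(counts, gene_keep_frac=0.7, depth_jitter=0.5):
    """Generate one stochastic 'view' of a cell's raw count vector.

    Illustrative augmentations: simulate a reduced gene panel by zeroing
    a random subset of genes, then simulate library-size variation by
    rescaling depth and re-sampling Poisson counts.
    """
    view = counts.astype(float).copy()
    # Panel subsampling: keep each gene with probability gene_keep_frac
    mask = rng.random(view.shape[0]) < gene_keep_frac
    view *= mask
    # Depth variation: rescale total depth and re-sample counts
    scale = rng.uniform(1 - depth_jitter, 1 + depth_jitter)
    return rng.poisson(view * scale)

cell = rng.poisson(2.0, size=2000)        # toy count vector for one cell
view_1, view_2 = make_view(cell), make_view(cell)
```

Feeding `view_1` and `view_2` to a contrastive loss as a positive pair is what teaches the encoder to ignore exactly these nuisance factors.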

scConcept vs. scGPT/Geneformer: what’s actually different

The differences are partly philosophical, but they have very practical consequences.

First, objective. scGPT and related models largely rely on masked/reconstruction objectives over binned counts or masked tokens. Geneformer, by contrast, is rank‑based: it encodes the relative order of genes within a cell. Both choices help, but they can still tie the embedding to platform‑specific distributions or to quirks of rank encoding. scConcept drops the reconstruction step and directly optimizes the property you want in downstream tasks: technology‑agnostic cell identity. That means invariance is baked into training, not bolted on with post‑hoc corrections.
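To see why rank encoding ties the representation to relative rather than absolute expression, here is a toy Geneformer-style tokenizer (a simplified sketch, not Geneformer's actual implementation): a cell becomes the sequence of its gene indices sorted by expression, highest first.

```python
import numpy as np

def rank_tokens(expression, n_top=2048):
    """Rank-based encoding of one cell: the indices of its expressed
    genes sorted by expression, highest first, truncated to n_top.
    Absolute magnitudes are discarded; only relative order survives.
    """
    order = np.argsort(-expression, kind="stable")
    expressed = order[expression[order] > 0]   # drop unexpressed genes
    return expressed[:n_top]
```

Note that scaling every count by a constant leaves the token sequence unchanged—which is the robustness rank methods buy, and also exactly the information about absolute changes they give up.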

Second, tokenization and invariances. Bin‑based tokenization can lose gene‑level resolution and introduce biases toward abundant cell types; rank encodings reduce some of that but still don’t explicitly enforce cross‑technology stability. Contrastive learning can target those invariances by construction, aligning views that mimic cross‑platform shifts or gene‑panel changes. In practice, this is why sc‑contrast methods often report stronger out‑of‑distribution behavior in tasks like cross‑atlas mapping or spatial transfer.

Third, architectural headroom. Models like scLong show that attention over the full gene universe is now feasible, and that injecting external knowledge graphs can help. It’s reasonable to expect contrastive objectives to benefit from that same scale, because the positive/negative structure in contrastive learning thrives when the encoder can model long‑range gene dependencies.

Where these models are working

Despite the caveats from zero‑shot benchmarks, single‑cell FMs are already showing up in production research workflows.

One visible example is perturbation modeling. The scGenePT model, released through the Virtual Cells platform, extends scGPT by injecting language‑derived gene embeddings from curated knowledge sources. It’s explicitly positioned for predicting gene and drug perturbation responses, with public documentation, demo datasets, and a permissive license that encourages integration into internal pipelines. This is a concrete case of an FM moving from paper to a usable tool, complete with intended‑use guidance and risk notes.

Atlas‑scale mapping and label transfer is another. Groups are using pretrained backbones to embed new experiments into reference spaces and to standardize annotations across projects. scConcept’s reports highlight mapping new technologies onto existing atlases and optimizing targeted panels—two day‑to‑day problems in translational labs managing mixed chemistries and limited sequencing budgets.

At the other end of the spectrum, models like CellFM show that scale itself can be a capability. Trained on 100 million cells, it frames single‑cell FMs in three training paradigms and reports improved recall of rare populations and stronger gene‑signature prediction—signals that matter when a 0.1% subpopulation is the whole story in a disease cohort. Whether you adopt CellFM outright or not, its results underscore why atlas‑level pretraining is economically sensible now that millions of cells per year are routine.

Finally, architecture innovation is broadening the modality horizon. scLong’s full‑genome attention and ontology‑aware training are early signs that single‑cell FMs will lean more on external biological knowledge and long‑context reasoning, not just clever tokenization of counts. That direction is promising for downstream tasks like regulatory inference, where long‑range gene‑gene dependencies matter.

Current limitations

If you plan to productionize these models, it pays to be clear‑eyed about their limits.

Generalization without fine‑tuning is not solved. Zero‑shot evaluations found that scGPT and Geneformer often underperform simpler baselines, especially under strong batch effects. This means claims of “virtual cells” should be qualified by task, tissue, and technology—and you should budget for small amounts of adaptation to your data.

Evaluation and availability still lag ambition. A Nature Methods perspective noted that even when teams set out to evaluate “nearly a dozen” models, many lacked usable code or weights. As a result, community benchmarks are narrow, and performance claims can be hard to reproduce across tissues and assays. This is improving, but you should favor models with transparent releases, clear intended‑use statements, and documented risks.

Representation choices carry trade‑offs. Bin‑based tokenization is simple and scalable but can wash out gene‑level nuance; rank‑based methods capture relative structure but may drop information tied to absolute changes. Contrastive objectives help by enforcing invariances, but they depend on the quality of “views” you generate during training; views that are too easy or too unrealistic both hurt transfer. This is where domain‑informed augmentations—panel subsampling distributions that reflect real capture biases, for example—make a big difference.
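A domain-informed augmentation of the kind mentioned above can be as simple as weighting the panel-subsampling distribution by empirical per-gene detection rates instead of dropping genes uniformly. The function below is an illustrative sketch under that assumption, not a published method.

```python
import numpy as np

rng = np.random.default_rng(7)

def biased_panel_view(counts, detection_rate, panel_size=500):
    """Panel-subsampling view weighted by per-gene detection rates.

    Rather than retaining genes uniformly at random, sample the kept
    'panel' with probability proportional to each gene's empirical
    detection rate, mimicking real capture biases (illustrative only).
    """
    p = detection_rate / detection_rate.sum()
    panel = rng.choice(counts.size, size=panel_size, replace=False, p=p)
    view = np.zeros_like(counts)
    view[panel] = counts[panel]
    return view
```

Views built this way look like plausible targeted panels rather than arbitrary dropout, which tends to make the learned invariance transfer better to real panel designs.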

Compute and context remain practical constraints. Full‑gene attention and graph‑augmented training, as in scLong, raise memory footprints. If you’re running on mid‑range GPUs, prefer models that expose frozen‑encoder inference paths, mixed‑precision support, and lightweight adapters for fine‑tuning. Expect steady progress here as open weights and optimized kernels spread.
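The cheapest adaptation path in that frozen-encoder regime is a small adapter trained on cached embeddings. As a minimal sketch (assuming you have already exported `(n_cells, dim)` embeddings from a frozen backbone), a closed-form ridge classifier works as a stand-in for heavier fine-tuning:

```python
import numpy as np

def fit_linear_adapter(embeddings, labels, l2=1.0):
    """Fit a lightweight linear adapter on top of frozen embeddings.

    embeddings: (n_cells, dim) output of a frozen pretrained encoder
    labels:     (n_cells,) integer cell-type labels
    Returns a (dim, n_classes) weight matrix via ridge regression on
    one-hot targets -- a cheap alternative to fine-tuning the backbone.
    """
    n, d = embeddings.shape
    n_classes = labels.max() + 1
    Y = np.eye(n_classes)[labels]                  # one-hot targets
    W = np.linalg.solve(embeddings.T @ embeddings + l2 * np.eye(d),
                        embeddings.T @ Y)
    return W

def predict(embeddings, W):
    return (embeddings @ W).argmax(axis=1)
```

Because the backbone never updates, this runs on a laptop once embeddings are cached, and it doubles as a quick probe of how much signal the pretrained representation already carries for your labels.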

Summary / Takeaways

Foundation models for single‑cell biology are maturing from clever reconstructions of counts into purpose‑built cell encoders. Contrastive pretraining—exemplified by scConcept—trains the model to recognize a cell regardless of common nuisance variation, which is exactly what robust downstream pipelines need. At the same time, independent evaluations remind us that zero‑shot performance is not a given, and that careful fine‑tuning and rigorous benchmarks are essential.

If you’re choosing a backbone today, start with a clear target: do you need technology‑agnostic embeddings for atlas mapping, or do you need perturbation prediction? Pick models whose objectives align with that need, and prefer releases with code, weights, and intended‑use notes. Then validate on your data, in your chemistry, with your definitions of success. You’ll get the most value by combining the right objective (contrastive when you need invariance) with a modest, well‑designed adaptation pass.

What question do you want your “virtual cell” to answer first? Once you name it, the right pretraining recipe often becomes obvious.
