
Latent Spaces and Embeddings in Single-Cell Biology


Jonathan Alles

EVOBYTE Digital Biology


Introduction

If you’ve stared at a colorful t-SNE or UMAP plot and wondered what those coordinates actually mean, you’re in the right place. In single-cell analysis, terms like latent space and embedding appear early and often, yet they’re easy to conflate. This short primer demystifies the vocabulary, shows where these concepts show up in a typical single-cell RNA‑seq workflow, and clarifies how they differ. Along the way, we’ll anchor the ideas with concrete examples and a few lines of code you can adapt to your own data.

What is a latent space?

A latent space is a compact coordinate system learned from data that captures underlying, often unobserved, factors of variation. Think of it as the backstage where the true structure of your dataset—cell cycle, lineage, activation state, batch effects—can be separated and represented with far fewer numbers than the original gene expression matrix. Each axis in this space corresponds to a latent variable, which you don’t measure directly but infer from patterns in the data. In practice, we pick a dimensionality that is much smaller than the number of genes, and we fit a model that maps every cell to a point in this lower‑dimensional space. Models as simple as principal component analysis (PCA) or as flexible as variational autoencoders (VAEs) learn such spaces by optimizing a criterion that balances data fidelity with parsimony. The result is a set of coordinates that preserve the most informative structure for downstream tasks like visualization, clustering, trajectory inference, or differential testing.

In single‑cell biology, a good latent space makes biological neighbors sit close together and technical confounders sit apart. When the map “feels right,” T cells cluster near other T cells, progenitors arrange along smooth paths toward differentiated states, and cells collected on different days overlap once batch is accounted for.
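As a minimal, self-contained sketch of what "mapping every cell to a point in a lower-dimensional space" looks like, the toy example below (plain NumPy with synthetic data, not a real pipeline) simulates cells driven by two hidden programs and recovers a two-dimensional latent code with PCA computed via the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "expression matrix": 100 cells x 1000 genes, generated from
# two hidden programs plus noise (a stand-in for real biology)
programs = rng.normal(size=(100, 2))      # latent factors per cell
loadings = rng.normal(size=(2, 1000))     # how each factor hits each gene
X = programs @ loadings + 0.1 * rng.normal(size=(100, 1000))

# PCA via SVD: center the matrix, decompose, keep the top k components
k = 2
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
latent = U[:, :k] * S[:k]                 # each cell -> a 2-number code

print(latent.shape)  # (100, 2): far fewer numbers than 1000 genes per cell
```

The point is the compression: each cell's thousand measurements are summarized by two coordinates that still capture the dominant structure.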

What is an embedding?

An embedding is the numerical representation you actually use: the coordinates assigned to each cell when you place it into a chosen space. If latent space is the stage, an embedding is where you seat each actor. The term shows up in two common ways. First, as a general technique: representing complex objects (cells, genes, images, words) as vectors so that similarity corresponds to geometric closeness. Second, as the product of a mapping: the “embedding” returned by PCA, t‑SNE, UMAP, or a neural network encoder. The idea is always the same—turn rich, high‑dimensional measurements into vectors that are easy to compare with distances like Euclidean or cosine similarity.

It’s helpful to keep the relationship straight. Latent space refers to the learned coordinate system and its semantics; an embedding refers to the coordinates of your data points in some space, latent or otherwise. You can have a two‑dimensional embedding purely for visualization (for instance, a 2D UMAP layout) that isn’t the model’s true latent space, and you can also have a higher‑dimensional latent space used for modeling, inference, or integration, from which a separate 2D embedding is derived for plotting.
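To make "similarity corresponds to geometric closeness" concrete, here is a tiny NumPy sketch with made-up three-dimensional embedding vectors for three hypothetical cells (the names and values are illustrative, not from any real dataset):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for three cells
t_cell_a = np.array([0.9, 0.1, 0.0])
t_cell_b = np.array([0.8, 0.2, 0.1])
b_cell = np.array([0.1, 0.9, 0.2])

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Similar cells sit close: high cosine similarity, small Euclidean distance
print(cosine_similarity(t_cell_a, t_cell_b))   # close to 1
print(cosine_similarity(t_cell_a, b_cell))     # much lower
print(np.linalg.norm(t_cell_a - t_cell_b))     # small distance
print(np.linalg.norm(t_cell_a - b_cell))       # larger distance
```

Whichever metric you pick, the comparison is now plain geometry on vectors rather than anything specific to genes or counts.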

From matrices to manifolds: PCA, t‑SNE, and UMAP in single‑cell

Most single‑cell pipelines begin with PCA, which finds orthogonal directions—principal components—that explain the largest variance in the gene‑by‑cell matrix after normalization and scaling. PCA is linear and global, so its latent axes are weighted combinations of genes you can inspect, and its coordinates form a practical intermediate representation for neighbor graphs, clustering, and velocity kernels. Tools like Scanpy use PCA as the default stepping stone before constructing a k‑nearest‑neighbors graph and a low‑dimensional visualization. The approach has held up across datasets thanks to its speed, interpretability, and compatibility with downstream graph methods.
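Because PCA's axes are weighted combinations of genes, you can ask which genes drive a component. The NumPy sketch below uses synthetic data (in Scanpy, the analogous loadings are stored in `adata.varm["PCs"]`): it plants a shared program on five genes and reads it back from the PC1 loadings:

```python
import numpy as np

rng = np.random.default_rng(1)
gene_names = np.array([f"gene{i}" for i in range(50)])

# Toy matrix: 30 cells x 50 genes; genes 0-4 share a strong program
signal = rng.normal(size=(30, 1))
X = 0.1 * rng.normal(size=(30, 50))       # background noise
X[:, :5] += signal                        # co-varying block of 5 genes

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_loadings = Vt[0]                      # weight of each gene on PC1

# The genes with the largest |loading| drive the first latent axis
top = gene_names[np.argsort(-np.abs(pc1_loadings))[:5]]
print(top)
```

Inspecting loadings this way is what makes PCA's latent axes interpretable in a way that t-SNE and UMAP coordinates are not.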

For visualization, many analysts then switch to nonlinear methods that emphasize local neighborhoods. t‑SNE popularized this idea by optimizing a map where nearby points in high dimensions remain nearby in two dimensions, reducing the tendency of dense clusters to collapse onto each other. This makes fine‑grained subpopulations pop visually, though absolute distances and global geometry may be distorted. You’ll see crisp islands of similar cells, which is ideal for inspecting heterogeneity but less reliable for reading long‑range relationships off the plot.

UMAP arrives with a similar goal but different math. It builds a fuzzy topological graph of local neighborhoods and then optimizes a low‑dimensional layout to preserve that structure. In practice, it often captures both local detail and more of the global arrangement than t‑SNE, while scaling well to very large datasets and allowing any target dimensionality. That’s why UMAP has become the day‑to‑day default for scRNA‑seq visualization and even for building intermediate spaces used by clustering or pseudotime tools.

What about batch effects and multi‑dataset integration? Here, specialized methods operate directly on a latent representation to mix like with like across experiments. Harmony, for example, iteratively adjusts cell positions in a shared embedding so cells group by biology rather than batch. The output is still an embedding, but one that is deliberately “de‑confounded,” making downstream clustering and annotation more robust across donors, technologies, and conditions.
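Harmony's actual algorithm iterates soft clustering with cluster-specific corrections; as a deliberately crude sketch of the underlying idea, the NumPy snippet below removes a constant batch offset by moving each batch onto the global centroid of a shared embedding:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two batches of the same cell population in a 10-dim embedding,
# separated only by a technical shift
batch = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 10))
X[batch == 1] += 3.0                      # batch effect: a constant offset

# Crude correction: translate each batch onto the global centroid
corrected = X.copy()
global_mean = X.mean(axis=0)
for b in (0, 1):
    mask = batch == b
    corrected[mask] += global_mean - X[mask].mean(axis=0)

# After correction, the per-batch means coincide
gap = np.abs(corrected[batch == 0].mean(0) - corrected[batch == 1].mean(0))
print(gap.max())
```

Real batch effects are rarely a single constant shift, which is why Harmony's iterative, cluster-aware scheme is needed in practice; the sketch only illustrates the goal of mixing like with like.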

Probabilistic latent spaces with scVI

Linear methods and neighbor embeddings are powerful, but they don’t explicitly model the count nature, sparsity, and technical variability of single‑cell data. Probabilistic models do. The scVI framework uses a variational autoencoder to learn a latent space that explains observed UMI counts as draws from distributions that capture library size and overdispersion. In plainer words, scVI tries to disentangle biological signal from technical noise while compressing each cell into a small vector. Once trained, the latent representation can drive integration, differential expression, and visualization, and it scales to atlases with millions of cells. This is a latent space in the strictest sense: a generative model defines what each axis means statistically, even if axes are not directly interpretable as single biological programs.

A helpful mental model is to treat PCA as a fast microscope lens, UMAP as your framing and depth of field for an appealing shot, and scVI as a physics‑aware camera that knows about lighting and sensor noise. You might inspect quickly with PCA+UMAP, but you’ll reach for a probabilistic latent space when you need principled batch correction, uncertainty estimates, or integrated analyses across studies.

A quick single‑cell example you can try

Let’s make the ideas concrete by building a latent space and a 2D embedding from a small dataset. The first snippet uses Scanpy to compute PCA and UMAP, which you can then use for clustering or marker discovery.

import scanpy as sc

adata = sc.datasets.pbmc3k()             # toy dataset
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.scale(adata, max_value=10)

sc.tl.pca(adata, n_comps=50)             # latent coords (linear)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.louvain(adata)                     # cluster on the neighbor graph
sc.tl.umap(adata)                        # 2D embedding for visualization
sc.pl.umap(adata, color=["louvain"], show=False)

Notice the distinction in practice. PCA gives you a 50‑dimensional latent representation that encodes major sources of variation. UMAP provides a two‑dimensional embedding optimized for visual continuity of neighborhoods. The pretty plot comes from the embedding; the neighbor graph and clustering often rely on the latent coordinates (PCA) that sit underneath. Scanpy wires these choices together so you rarely think about them explicitly, yet the distinction matters when you interpret results or tune parameters.
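The same point in miniature: given stand-in arrays for the two representations (random placeholders below, standing in for `adata.obsm["X_pca"]` and `adata.obsm["X_umap"]`), neighbor lookups for clustering should run on the latent coordinates, not on the 2D plot:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for the two representations of the same 200 cells
latent = rng.normal(size=(200, 50))       # e.g. adata.obsm["X_pca"]
embed2d = rng.normal(size=(200, 2))       # e.g. adata.obsm["X_umap"]

def knn_indices(coords, k=15):
    # Pairwise Euclidean distances, then the k closest cells per cell
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude each cell itself
    return np.argsort(d, axis=1)[:, :k]

# Clustering and trajectory tools should use neighbors from the latent space
nn_latent = knn_indices(latent)
# Neighbors read off the 2D layout can differ and distort distances
nn_plot = knn_indices(embed2d)
print(nn_latent.shape)  # (200, 15)
```

With real data, the two neighbor sets overlap but are not identical, which is exactly why parameter choices like `n_pcs` matter even though only the UMAP is plotted.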

Now repeat the exercise with scVI to learn a probabilistic latent space that accounts for count noise and batch. After training, you can visualize the scVI latent space with UMAP, mixing generative modeling with neighborhood‑preserving layout.

import scvi

# assume 'adata' holds raw counts (scVI models counts, not the scaled
# matrix above) and has adata.obs['batch'] if multiple batches
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")  # register covariates
model = scvi.model.SCVI(adata, n_latent=20)              # VAE latent space
model.train()                                            # learn the space

adata.obsm["X_scvi"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scvi")
sc.tl.umap(adata)                                   # visualize scVI latent
sc.pl.umap(adata, color=["batch", "louvain"], show=False)

Here, X_scvi is the learned latent space; UMAP is just a convenient lens for viewing it. If batches overlap nicely on the plot while cell types remain distinct, you’ve built a biologically meaningful, batch‑corrected representation suitable for downstream analyses like differential expression in the scVI framework.

Clearing up the difference in terms

Because we toss these words around in the same sentences, it’s easy to blur their meanings. The crisp way to remember the difference is to focus on the role each concept plays. A latent space is the lower‑dimensional coordinate system a model learns to explain your data; it’s defined by assumptions and an objective, whether linear variance maximization in PCA or a generative likelihood in scVI. An embedding is the representation of your specific cells inside some space; it’s the numeric coordinates you compute for each cell. You can embed data into a latent space, but you can also embed it into a purely visual space (for example, a two‑dimensional UMAP) that isn’t the model’s latent space. In day‑to‑day analysis, we typically learn a latent space for structure and then compute an embedding for visualization, even though the vocabulary often collapses the two.

A quick single‑cell story helps cement the idea. Imagine profiling blood cells from five donors. A PCA latent space reveals axes aligned with immune lineages and a batch axis tied to library size. Harmony nudges cells across donors into a shared embedding that preserves biology but removes batch. A VAE like scVI learns a latent space that probabilistically separates lineage from technical noise. Finally, a 2D UMAP presents the neighborhood structure so you can annotate clusters and highlight rare subpopulations. Each step uses an embedding, but only some of those embeddings live in a model’s latent space.

Practical guidance for single‑cell projects

As you choose methods, let your questions drive the representation. If your goal is exploratory visualization and quick clustering, a PCA latent space with a UMAP embedding remains a reliable baseline, especially when coupled with a balanced neighbor graph. When datasets come from multiple studies or chemistries, plan on an integration step that explicitly fixes batch in the latent space rather than hoping a 2D plot will hide it. And when you need principled uncertainty, generative imputation, or large‑scale integration, move to probabilistic latent spaces such as scVI, which model the count process and can correct for confounders as part of training. Best‑practice tutorials that walk through these decisions are worth following the first few times; they save you from treating pretty pictures as proof of biology.

Summary / Takeaways

Latent spaces and embeddings are the quiet workhorses beneath almost every single‑cell figure you see. A latent space is the compressed coordinate system that captures the structure in your data; an embedding is the concrete set of coordinates assigned to your cells. PCA, t‑SNE, UMAP, Harmony, and scVI all produce embeddings, but they differ in goals, assumptions, and how much biological or technical signal they preserve. If you remember to separate the idea of the space from the points inside it, you’ll make better choices, interpret your plots more confidently, and build analyses that generalize beyond a single dataset.

If you want to go deeper, try the code, flip between PCA and scVI latent spaces, and watch how the UMAP view changes. What stays stable across methods is often your strongest biological signal.

Further Reading