Spatial transcriptomics: practical intro for data scientists

Jonathan Alles

EVOBYTE Digital Biology


Introduction

If single-cell RNA-seq is a census of cell identities, spatial transcriptomics (ST) is the city map. It tells you not only who lives in the tissue, but where they live, who their neighbors are, and which streets connect them. That extra context changes the kinds of questions you can ask. Suddenly, gradients, boundaries, and microenvironments come into focus. You can follow immune cells marching into a tumor, or watch cortical layers resolve in a brain section, not as abstract clusters but as physical structures.

In this primer, we’ll walk through an end‑to‑end mental model for ST analysis. We’ll start with quality control and preprocessing, because spatial data have quirks that differ from single‑cell. We’ll define spatially variable genes (SVGs), show how they relate to classical differential gene expression (DGE), and outline core methods to find SVGs. Then we’ll introduce cellular neighborhoods—recurrent spatial configurations of cells—and show how to compute and use them to extract biology. Along the way, we’ll keep the language practical, with short code snippets you can adapt in Python or R, and grounded definitions of key acronyms so you can orient your own projects quickly. For readers new to ST tooling, we’ll point to widely used open‑source frameworks at the end.

From tissue to matrix: QC, filtering, and preprocessing in spatial transcriptomics

The first mile of an ST project looks familiar if you’ve done single‑cell work, but the road surface is different. You still ingest a count matrix and metadata, yet the metadata now carry spatial coordinates, imaging information, and platform‑specific artifacts. Your initial goal is simple: retain informative observations (spots or cells) that truly belong to tissue and preserve the spatial signal they carry.

Start with basic spot- or cell‑level metrics. For sequencing-based platforms like 10x Visium or Slide‑seq, examine total UMIs, the number of detected genes, and the fraction of mitochondrial reads per spot. These highlight stressed tissue regions, low‑RNA areas, and technical outliers. But don’t over‑port defaults from scRNA‑seq: a low‑UMI spot at the edge of a section may still be biologically meaningful if it sits on a boundary; conversely, very high library sizes can reflect thickness variation rather than rich biology. Image context helps—overlay your QC metrics on the H&E or fluorescence image so thresholds respect histology rather than arbitrary cutoffs.
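The three per-spot metrics above are simple to compute by hand, which helps when you want to understand what a library function reports. The following is a minimal sketch on a made-up count matrix; the gene names, counts, and the "MT-" prefix convention for mitochondrial genes are illustrative assumptions, and in a real pipeline a helper such as Scanpy's calculate_qc_metrics would normally do this for you.

```python
# Toy count matrix: rows are spots, columns are genes (hypothetical values).
genes = ["MT-CO1", "MT-ND1", "GAPDH", "CD3E", "EPCAM"]
counts = [
    [5, 3, 40, 0, 12],   # healthy-looking spot
    [30, 25, 10, 0, 0],  # mitochondrial reads dominate
    [0, 0, 1, 0, 0],     # nearly empty spot, possibly background
]

def spot_qc(row, gene_names, mito_prefix="MT-"):
    """Return total UMIs, detected genes, and mitochondrial fraction for one spot."""
    total = sum(row)
    detected = sum(1 for c in row if c > 0)
    mito = sum(c for c, g in zip(row, gene_names)
               if g.upper().startswith(mito_prefix))
    frac_mito = mito / total if total > 0 else 0.0
    return {"total_umis": total, "n_genes": detected, "frac_mito": frac_mito}

qc = [spot_qc(row, genes) for row in counts]
for spot_index, metrics in enumerate(qc):
    print(spot_index, metrics)
```

Rather than thresholding these numbers directly, plot them at each spot's coordinates over the tissue image, as described above, and decide on cutoffs with the histology in view.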

Next, confirm tissue segmentation and alignment. Misaligned tissue masks or rotated overlays quietly leak bias into downstream steps. A quick visual check—spots colored by total UMIs on top of the tissue image—often catches mis-registered sections or stray background spots. If you see expression bleeding from tissue into background (a common 10x issue), consider decontamination strategies that model “spot swapping” so you don’t mistake ambient signal for real gradients later. Preserve raw counts, but keep a cleaned layer for analysis.

Normalization deserves a spatial footnote. Library size adjustment or SCTransform can still be useful, yet blind normalization can flatten real spatial abundance differences tied to histology or cell density. When your immediate goal is spatial domain discovery, it’s reasonable to delay aggressive normalization and rely on graph‑based methods that are robust to moderate library size variation, then normalize more carefully for cross‑sample comparisons, pathway analysis, or deconvolution. The key is intent: match normalization to the question, not habit.
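To see how library size normalization can flatten a real density difference, consider two spots with the same gene composition but a tenfold difference in total counts. The sketch below uses hypothetical counts and a standard "scale to a fixed total, then log1p" scheme; the target of 10,000 is just a common convention, not a recommendation.

```python
import math

def normalize_log1p(row, target_sum=10_000):
    """Scale a spot's counts to a fixed total, then apply log(1 + x)."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]
    return [math.log1p(c * target_sum / total) for c in row]

dense_spot = [200, 100, 100]   # many cells under the spot
sparse_spot = [20, 10, 10]     # same composition, ten times fewer counts

norm_dense = normalize_log1p(dense_spot)
norm_sparse = normalize_log1p(sparse_spot)

print("raw totals:", sum(dense_spot), sum(sparse_spot))
print("normalized dense: ", [round(v, 3) for v in norm_dense])
print("normalized sparse:", [round(v, 3) for v in norm_sparse])
```

After normalization the two spots are indistinguishable. That is exactly what you want when comparing composition, and exactly what you do not want if the density difference itself reflects histology.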

Batch effects also show up in ST, especially across sections, donors, or platforms. Integration tools adapted from single‑cell can help, but be cautious: strong integration can inadvertently wash out spatial gradients. Favor approaches that respect both expression similarity and physical proximity, and always sanity‑check that known structures (for example, cortical layers) remain crisp after correction.

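Finally, define a spatial graph. This is the backbone for nearly every downstream task. For Visium, adjacency on the hex grid is a natural start; for single‑cell imaging methods, a radius‑based or k‑nearest neighbors (KNN) graph built from cell centroids, or a Delaunay triangulation, tends to capture local context. The two are not interchangeable: a radius graph fixes the physical scale and lets the number of neighbors vary with cell density, while a KNN graph fixes the number of neighbors and lets the physical scale vary. Once you’ve built the graph, you’re ready to ask spatial questions.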
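The difference between a radius graph and a KNN graph is easiest to see on a tiny example. This is a brute-force sketch with made-up coordinates, meant for illustration only; the function names are hypothetical, and for real datasets you would use a spatial index or a library routine instead of an all-pairs loop.

```python
import math

def radius_graph(coords, radius):
    """Connect every pair of points whose Euclidean distance is <= radius."""
    nbrs = {i: [] for i in range(len(coords))}
    for i, p in enumerate(coords):
        for j, q in enumerate(coords):
            if i != j and math.dist(p, q) <= radius:
                nbrs[i].append(j)
    return nbrs

def knn_graph(coords, k):
    """Connect every point to its k nearest neighbors (directed, not symmetrized)."""
    nbrs = {}
    for i, p in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(p, coords[j]))
        nbrs[i] = others[:k]
    return nbrs

# Three cells close together and one isolated cell far away.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
by_radius = radius_graph(coords, radius=1.5)
by_knn = knn_graph(coords, k=2)
print("radius:", by_radius)
print("knn:   ", by_knn)
```

Note what happens to the isolated cell: the radius graph leaves it without neighbors, while the KNN graph links it to cells many units away. Which behavior is right depends on whether you care more about physical scale or about every cell having a defined context.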

What are spatially variable genes—and how do we find them?

Spatially variable genes (SVGs) are features whose expression varies across the tissue in a way that reflects spatial structure rather than random scatter. They’re the genes that trace laminar patterns in cortex, mark tumor–stroma borders, or paint localized immune niches. The statistical idea is spatial autocorrelation: nearby observations have more similar expression than distant ones.

There are two complementary families of methods to detect SVGs. The first uses classical spatial statistics computed on the graph you built. Moran’s I and Geary’s C are workhorses here. Given a spatial weight matrix W over spots or cells, Moran’s I estimates global autocorrelation per gene and provides a test against a null of spatial randomness. Tools like Squidpy implement these efficiently and at scale for AnnData workflows, and Seurat offers a Moran’s I option in FindSpatiallyVariableFeatures for R users. These scores let you rank genes by spatial structure and filter at an FDR threshold suited to your resolution and sample size.
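To make the statistic concrete, here is a small from-scratch sketch of global Moran's I with binary weights, I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, together with a simple permutation test. It is an illustration under simplifying assumptions (a chain of ten spots, unweighted edges, a one-sided test), not a reproduction of how Squidpy or Seurat compute their scores or p-values.

```python
import random

def morans_i(values, neighbors):
    """Global Moran's I with binary weights given as an adjacency dict."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    s0 = sum(len(js) for js in neighbors.values())  # sum of all weights
    if denom == 0 or s0 == 0:
        return 0.0
    num = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    return (n / s0) * num / denom

def permutation_pvalue(values, neighbors, n_perm=999, seed=0):
    """One-sided p-value: how often does a random relabeling reach the observed I?"""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# A chain of 10 spots, each linked to its left and right neighbor.
n = 10
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

gradient = [float(i) for i in range(n)]         # smooth spatial trend
alternating = [float(i % 2) for i in range(n)]  # checkerboard-like pattern

i_grad, p_grad = permutation_pvalue(gradient, chain)
i_alt = morans_i(alternating, chain)
print("gradient:    I =", round(i_grad, 3), " p =", p_grad)
print("alternating: I =", round(i_alt, 3))
```

The gradient scores strongly positive and the alternating pattern strongly negative, which is the intuition to carry into real data: positive values mean neighbors look alike, values near zero mean no spatial structure, and negative values mean neighbors look systematically different.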

The second family models spatial covariance more explicitly. SpatialDE introduced a Gaussian process formulation that tests, per gene, whether a spatial covariance kernel explains additional variance over a non‑spatial model. It also classifies patterns (for example, periodic, linear gradients, or general spatial structure). SPARK builds on this idea with a count‑based spatial model tested across several kernels, and SPARK‑X replaces the parametric model with a non‑parametric covariance test that scales to very large datasets while keeping error rates under control. In practice, Moran’s I is fast and intuitive, while model‑based methods can be more sensitive to complex patterns and varying length scales. Many analysts run both and intersect results to balance speed and specificity.

If your data are noisy or very sparse, consider correlation‑based module finders like Hotspot. Instead of testing genes in isolation, Hotspot searches for groups of genes that co‑vary locally in space, which often stabilizes weak signals into coherent modules you can interpret as pathways or cell‑state programs. This module view complements single‑gene SVG lists and can surface spatial biology that individual‑gene tests miss.

Here is a minimal Python example using Squidpy to compute Moran’s I scores for SVG discovery. It assumes an AnnData object named adata with spatial coordinates stored in adata.obsm["spatial"]; the first call builds the spatial graph, so skip it if you already have one.

import squidpy as sq

# Build a spatial graph if you haven't already
sq.gr.spatial_neighbors(adata, coord_type="generic", delaunay=True)

# Score genes by spatial autocorrelation (Moran's I)
sq.gr.spatial_autocorr(adata, mode="moran")
svgs = (adata.uns["moranI"]
        .sort_values("pval_norm_fdr_bh")
        .head(20))
print(svgs[["I", "pval_norm_fdr_bh"]])

SVGs versus classical DGE: how the questions differ

Differential gene expression (DGE) asks whether two predefined groups—say, tumor versus stroma, or cluster A versus cluster B—differ in their mean expression. It’s agnostic to the physical arrangement of those groups. You can shuffle cells in space and get the same DGE result.

SVG detection flips the perspective. It asks whether expression correlates with spatial proximity regardless of any predefined labels. A gene can be spatially variable even if no two clusters differ in its mean—think of a smooth gradient that runs across several clusters, or a localized island confined to a small niche. Conversely, a gene can be strongly differential between two clusters yet not spatially variable if those clusters are intermingled throughout the section. Practically, DGE highlights identity and condition effects, while SVGs reveal architecture, boundaries, and microenvironments.
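The contrast can be checked numerically. In the sketch below, the same gene values and the same group labels are laid out in two ways along a chain of spots: once with the groups segregated, once intermingled. All values are made up, and the simple mean difference stands in for a proper DGE test.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary weights given as an adjacency dict."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    s0 = sum(len(js) for js in neighbors.values())
    if denom == 0 or s0 == 0:
        return 0.0
    num = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    return (n / s0) * num / denom

def mean_difference(values, labels, a="A", b="B"):
    """Difference in mean expression between two labeled groups."""
    va = [v for v, lab in zip(values, labels) if lab == a]
    vb = [v for v, lab in zip(values, labels) if lab == b]
    return sum(va) / len(va) - sum(vb) / len(vb)

n = 12
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
expression_by_group = {"A": 5.0, "B": 1.0}  # hypothetical gene, higher in group A

segregated_labels = ["A"] * 6 + ["B"] * 6   # two contiguous domains
mixed_labels = ["A", "B"] * 6               # groups intermingled

segregated = [expression_by_group[lab] for lab in segregated_labels]
mixed = [expression_by_group[lab] for lab in mixed_labels]

diff_seg = mean_difference(segregated, segregated_labels)
diff_mix = mean_difference(mixed, mixed_labels)
i_seg = morans_i(segregated, chain)
i_mix = morans_i(mixed, chain)
print("mean difference:", diff_seg, diff_mix)
print("Moran's I:      ", round(i_seg, 3), round(i_mix, 3))
```

The group difference is identical in both layouts, while Moran's I is strongly positive only when the groups form contiguous domains. That is the sense in which DGE is blind to arrangement and SVG detection is not.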

When should you use which? Use DGE to characterize annotated compartments or cell types, and to compare conditions. Use SVGs to discover tissue domains and to prioritize markers that respect spatial context. The most compelling stories usually combine them: identify SVGs, segment spatial domains, then test DGE within or across domains to pinpoint pathways that drive those structures.

Cellular neighborhoods: what they are and why they matter

A cellular neighborhood is a recurring spatial configuration of cells—think of it as a local microenvironment fingerprint. Each neighborhood is defined by the composition and arrangement of nearby cells within a small radius, repeated across the tissue. In cancer, for example, you might find neighborhoods where regulatory T cells and macrophages co‑locate at the invasive front, or lymphoid‑like aggregates deep in the stroma. These patterns carry biological and clinical meaning because interactions are local: signaling, nutrient exchange, and immune recognition happen over tens of microns, not across the whole slide.

Pioneering multiplex imaging studies formalized this idea by computing composition vectors around each cell and clustering those vectors to discover conserved neighborhoods across patients. That same concept transfers naturally to spatial transcriptomics when you have single‑cell resolution or robust deconvolution at spot level. Neighborhoods become a bridge from genes to structure: SVGs flag where interesting biology might be happening; neighborhoods tell you who is there, together, when it happens.

How to compute cellular neighborhoods and use them for analysis

The recipe is straightforward once you’ve built a spatial graph. For each focal cell (or spot), gather its local neighbors within a chosen radius or KNN. Summarize the microenvironment as a vector: counts or proportions of cell types if you have annotations, or module scores if you only have expression. Repeat across all foci, then cluster these vectors to identify recurrent neighborhood archetypes. The result is a label per focus—its neighborhood class—and a set of neighborhood centroids with interpretable compositions.
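The recipe above can be sketched in a few lines of plain Python. Everything here is illustrative: the coordinates and cell type labels are invented, the clustering is a bare-bones k-means with caller-supplied starting centers, and the function names are hypothetical. In practice you would use a tested clustering implementation and far more cells.

```python
import math
from collections import Counter

def composition_vectors(coords, labels, radius):
    """Fraction of each cell type within `radius` of every cell (focal cell included)."""
    types = sorted(set(labels))
    vectors = []
    for p in coords:
        near = [lab for q, lab in zip(coords, labels) if math.dist(p, q) <= radius]
        counts = Counter(near)
        vectors.append([counts[t] / len(near) for t in types])
    return types, vectors

def kmeans(vectors, init_centers, n_iter=20):
    """Plain Lloyd's algorithm with caller-supplied initial centers."""
    centers = [list(c) for c in init_centers]
    assign = [0] * len(vectors)
    for _ in range(n_iter):
        assign = [min(range(len(centers)), key=lambda k: math.dist(v, centers[k]))
                  for v in vectors]
        for k in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Two patches: a mixed T cell / tumor region and a fibroblast-only region.
coords = [(x, 0) for x in range(5)] + [(x, 0) for x in range(20, 25)]
labels = ["T", "Tumor", "T", "Tumor", "T"] + ["Fibro"] * 5

types, vectors = composition_vectors(coords, labels, radius=1.5)
assign, centers = kmeans(vectors, init_centers=[vectors[0], vectors[-1]])
print("cell types:", types)
print("neighborhood labels:", assign)
```

Each center is itself a composition vector, so you can read it directly as "this neighborhood is roughly half T cells and half tumor cells", which is what makes the resulting classes interpretable.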

Two practical tips make this robust. First, pick the neighborhood scale deliberately. A small radius captures immediate contact and cell–cell signaling; a larger radius captures tissue zones like germinal centers or tumor borders. Try a few scales and check stability. Second, avoid circularity when annotating cell types used to define neighborhoods. If you inferred types via deconvolution on the same data, validate with orthogonal markers or histology where possible.

Once you assign neighborhoods, several analyses open up. You can map their spatial distribution to see where each microenvironment appears. You can test differential abundance of neighborhoods across conditions or patient groups. You can perform “niche‑DE”: within a given cell type, compare gene expression between neighborhood classes to ask how context rewires programs—immune activation at the border versus quiescence in the interior, for example. And you can quantify neighborhood adjacency, asking which microenvironments tend to sit next door, hinting at transitions and interfaces.
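Two of these analyses, niche‑DE and neighborhood adjacency, reduce to simple grouping operations once every cell carries a cell type and a neighborhood label. The sketch below uses invented labels, expression values, and edges, and it reports only group means and raw edge counts. A real analysis would add a statistical test, account for replicates, and compare counts against a permutation null (Squidpy's nhood_enrichment takes a similar permutation-based approach for cluster labels).

```python
from collections import Counter, defaultdict

def niche_means(expr, cell_types, neighborhoods, focal_type):
    """Mean expression of one gene in `focal_type` cells, split by neighborhood class."""
    groups = defaultdict(list)
    for x, t, nb in zip(expr, cell_types, neighborhoods):
        if t == focal_type:
            groups[nb].append(x)
    return {nb: sum(v) / len(v) for nb, v in groups.items()}

def neighborhood_adjacency(edges, neighborhoods):
    """Count spatial-graph edges between each unordered pair of neighborhood classes."""
    counts = Counter()
    for i, j in edges:
        pair = tuple(sorted((neighborhoods[i], neighborhoods[j])))
        counts[pair] += 1
    return counts

# Six cells along a line (all values hypothetical).
cell_types = ["T", "T", "Tumor", "T", "T", "Tumor"]
neighborhoods = ["border", "border", "border", "interior", "interior", "interior"]
activation_gene = [8.0, 6.0, 1.0, 2.0, 2.0, 1.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

t_cell_means = niche_means(activation_gene, cell_types, neighborhoods, focal_type="T")
adjacency = neighborhood_adjacency(edges, neighborhoods)
print("T cell means by neighborhood:", t_cell_means)
print("edges between neighborhoods: ", dict(adjacency))
```

Holding the cell type fixed is the important design choice in niche‑DE: it separates "these cells are different" from "these cells behave differently because of where they sit".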

Here is a minimal R example that first ranks SVGs with Seurat’s Moran’s I implementation and then runs a simple composition‑based neighborhood clustering. It assumes you have a Seurat object named st with cell type labels stored as identities and spatial coordinates for a single image.

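library(Seurat)

# Rank genes by spatial autocorrelation (Moran's I) on the SCT assay
DefaultAssay(st) <- "SCT"
st <- FindSpatiallyVariableFeatures(
  st,
  assay = "SCT",
  selection.method = "moransi",
  features = head(VariableFeatures(st), 2000)
)

# Build a simple neighborhood composition: fraction of cell types in a radius
coords <- GetTissueCoordinates(st)
xy <- as.matrix(coords[, 1:2]) # assumes the first two columns hold the coordinates
celltypes <- factor(Idents(st)) # assumes the same cell order as coords
radius <- 50 # in coordinate units (often pixels, not microns); adjust to your scale

# One row per cell, one column per cell type
comp <- t(vapply(seq_len(nrow(xy)), function(i) {
  d <- sqrt(rowSums(sweep(xy, 2, xy[i, ])^2)) # distance from cell i to all cells
  nbrs <- which(d <= radius) # includes the focal cell itself
  as.numeric(prop.table(table(celltypes[nbrs])))
}, numeric(nlevels(celltypes))))
colnames(comp) <- levels(celltypes)

# Cluster neighborhoods (8 is an arbitrary starting point; try several values)
set.seed(1)
nb_clusters <- kmeans(comp, centers = 8, nstart = 20)$cluster
st$Neighborhood <- factor(nb_clusters)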

Putting it together: a high‑level workflow that scales

A pragmatic ST workflow starts with the graph and never loses sight of the image. After QC and preprocessing, compute SVGs with Moran’s I to get a fast, global picture of spatial structure, then refine with a GP‑based method when patterns look complex or multi‑scale. Use top SVGs to seed domain discovery by clustering on expression plus spatial adjacency, and validate that domains align with histology. If you’re working with spot‑based data, add a deconvolution step to estimate cell‑type proportions per spot; that unlocks neighborhood analysis even without single‑cell resolution.
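One simple way to fold spatial adjacency into clustering is to blend each spot's expression with the average of its graph neighbors before clustering, so that isolated dropouts inside a domain are pulled toward their surroundings. This is only one of several possible strategies and is shown here as an assumption-laden sketch: the function name, the blending weight alpha, and the toy values are all hypothetical.

```python
def smooth_over_neighbors(values, neighbors, alpha=0.5):
    """Blend each spot's value with the mean of its spatial neighbors.

    alpha = 0 returns the input unchanged; alpha = 1 uses the neighbor mean only.
    """
    smoothed = []
    for i, v in enumerate(values):
        nbrs = neighbors.get(i, [])
        if nbrs:
            nbr_mean = sum(values[j] for j in nbrs) / len(nbrs)
            smoothed.append((1 - alpha) * v + alpha * nbr_mean)
        else:
            smoothed.append(v)  # isolated spots are left as they are
    return smoothed

# A chain of 5 spots inside one domain, with a dropout in the middle.
n = 5
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
expression = [4.0, 4.0, 0.0, 4.0, 4.0]

smoothed = smooth_over_neighbors(expression, chain, alpha=0.5)
print("raw:     ", expression)
print("smoothed:", smoothed)
```

The trade-off is the usual one: more smoothing gives cleaner domains but blurs sharp boundaries, so check the result against histology, as recommended above.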

Neighborhoods then become the canvas for mechanism. Examine which cell types co‑occur in each archetype, map where those archetypes live, and test which pathways are upregulated in cells when they sit inside one neighborhood versus another. These steps reveal context dependence that bulk DGE would miss: dendritic cells near TLS‑like neighborhoods may show activation signatures absent elsewhere; malignant cells at the leading edge may upregulate EMT genes only in neighborhoods rich in specific fibroblast states. These are the hypotheses that link spatial patterns to function and, eventually, to interventions.

As your projects grow, lean on mature libraries that unify image features, graphs, and gene counts. Python users can use Squidpy alongside Scanpy to compute spatial autocorrelation, neighborhood enrichment, and image‑derived features in one pipeline. R users can mix Seurat’s spatial utilities with dedicated spatial packages; specialized toolboxes like Giotto offer end‑to‑end workflows that integrate SVG detection, neighborhood analysis, and interactive visualization. Keep your objects light by storing raw counts, a cleaned analysis layer, and the spatial graph; everything else is reproducible from these ingredients.

Summary / Takeaways

Spatial transcriptomics turns expression matrices into maps. The practical pieces are not exotic, but they demand a spatial mindset.

Treat QC as a spatial problem. Check count and mitochondrial patterns on top of the tissue image, fix obvious misalignment, and be wary of over‑normalization that erases genuine gradients.
Use SVGs to find structure. Start with Moran’s I or Geary’s C for speed and interpretability, and bring in model‑based approaches like SpatialDE or SPARK‑X when patterns are subtle or multi‑scale. These methods answer a different question than classical DGE—and that difference is the point.
Make cellular neighborhoods your unit of context. By clustering local composition vectors, you can discover recurring microenvironments and ask how cell programs change from one neighborhood to another. This turns co‑localization into testable biology with clear visual maps and hypotheses.

If you’re choosing a first project, pick a tissue with known structure—a mouse brain slice or well‑annotated tumor section—so your spatial intuition can grow alongside your models. Then, once you trust your pipeline, bring it to noisier, more heterogeneous samples. The questions only get more interesting as the maps get messier.