Imaging-Based Spatial Transcriptomics: Preprocessing and QC

Jonathan Alles

EVOBYTE Digital Biology

You can find the first intro part of the series here.

Introduction

If single‑cell RNA‑seq tells us “who’s in the room,” spatial transcriptomics tells us “where everyone is standing.” In Part 2 of our Introduction to Spatial Transcriptomics Data Analysis series, we focus on imaging‑based spatial transcriptomics preprocessing and quality control. We’ll unpack what is actually measured on the slide, why resolution determines whether you’re observing single cells or mixtures, and how tissue morphology and sample quality can make or break your experiment. We’ll close with a practical example: recognizing a necrotic core and what it means for data quality.

What imaging‑based spatial transcriptomics really measures

Imaging‑based spatial transcriptomics (often abbreviated ST) detects RNA molecules directly in tissue using microscopy. Methods such as MERFISH and seqFISH+ rely on fluorescence in situ hybridization (FISH, a form of ISH) with sequential barcoding and imaging to assign a digital code to each transcript. The result is a cloud of precisely located RNA “spots” at subcellular resolution, which can then be assigned to segmented cells. In contrast, sequencing‑based capture‑array approaches like Visium map gene expression to predefined capture areas on a slide. Those areas, confusingly also called spots, are 55 µm in diameter on standard Visium slides and typically collect RNA released from multiple neighboring cells during permeabilization.
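To make the idea of a "digital code" per transcript concrete, here is a minimal sketch of barcode decoding with single‑bit error correction. The three‑gene, 8‑bit codebook and the function name decode_spot are hypothetical and chosen only for illustration; real codebooks are longer, and production decoders work on image intensities rather than clean bits.

```python
import numpy as np

# Hypothetical codebook: 8-bit barcodes with pairwise Hamming distance of 4,
# so any single-bit read error still maps to exactly one gene
CODEBOOK = {
    "GENE_A": np.array([1, 1, 1, 1, 0, 0, 0, 0]),
    "GENE_B": np.array([1, 1, 0, 0, 1, 1, 0, 0]),
    "GENE_C": np.array([0, 0, 1, 1, 1, 1, 0, 0]),
}

def decode_spot(bits, codebook=CODEBOOK, max_errors=1):
    """Return the gene whose barcode is within max_errors bit flips, else None."""
    bits = np.asarray(bits)
    best_gene, best_dist = None, max_errors + 1
    for gene, code in codebook.items():
        dist = int(np.sum(bits != code))  # Hamming distance to this barcode
        if dist < best_dist:
            best_gene, best_dist = gene, dist
    return best_gene

exact = decode_spot([1, 1, 1, 1, 0, 0, 0, 0])      # perfect read of GENE_A
corrected = decode_spot([1, 1, 1, 0, 0, 0, 0, 0])  # one dropped bit, still GENE_A
rejected = decode_spot([1, 1, 0, 0, 0, 0, 1, 1])   # too far from every barcode
```

Reads that sit too far from every barcode are discarded rather than guessed, which is why noisy regions lose transcripts instead of gaining wrong ones.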

Why does resolution matter? Because it decides your unit of analysis. Subcellular ISH resolves individual transcripts and, after segmentation, gives you per‑cell counts with spatial coordinates. Array capture yields expression profiles per spot that may reflect one to many cells, depending on tissue architecture and spot size. In dense tissues such as epithelium or lymphoid organs, a single spot often blends several cell types. In sparse tissues or with very small capture bins (as in high‑definition variants), you may approximate single‑cell resolution. Understanding these constraints upfront will shape every downstream choice, from normalization and clustering to deconvolution and neighborhood analysis.
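A back‑of‑the‑envelope calculation shows how quickly a capture spot turns into a mixture. The sketch below simply divides spot area by cell area; the function name, the cell diameters, and the packing_fraction parameter are illustrative assumptions, and the 55 µm default is the standard Visium spot diameter.

```python
import math

def estimate_cells_per_spot(spot_diameter_um=55.0, cell_diameter_um=10.0,
                            packing_fraction=1.0):
    """Rough upper bound on cells under one spot, treating cells as discs.

    packing_fraction (0..1) is a hypothetical knob for how much of the spot
    area is actually covered by cells.
    """
    spot_area = math.pi * (spot_diameter_um / 2) ** 2
    cell_area = math.pi * (cell_diameter_um / 2) ** 2
    return packing_fraction * spot_area / cell_area

# Assumed cell sizes: small, densely packed cells vs. larger cells
dense = estimate_cells_per_spot(cell_diameter_um=10.0)
sparse = estimate_cells_per_spot(cell_diameter_um=20.0, packing_fraction=0.5)
```

The estimate scales with the square of the diameter ratio, so halving the cell size roughly quadruples the number of cells contributing to each spot.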

From images to counts: a lean preprocessing pipeline

Although pipelines differ by platform, the logic is similar. First, you align and stitch microscopy fields of view, correct for stage drift, and subtract background. Then you decode barcodes to identify transcripts and register those transcript coordinates to the tissue image. For ISH, cell segmentation follows, typically using nuclear channels, membrane markers, or machine‑learning‑based models. Each transcript is then assigned to a cell or left unassigned if it falls outside any mask. For array capture, you instead register the tissue image to the slide layout so each capture spot can be linked to its tissue region. Either way, the output is a count matrix plus spatial coordinates for cells or spots, and a reference image such as H&E or immunofluorescence for visual context.
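The transcript‑to‑cell assignment step can be sketched in a few lines. This is a simplified illustration, assuming transcripts come as (x, y) pixel coordinates and segmentation is an integer label mask where 0 means background; the function name assign_transcripts and the toy data are hypothetical.

```python
import numpy as np

def assign_transcripts(xy, genes, label_mask, gene_names):
    """Build a cell-by-gene count matrix from transcript positions.

    xy: (n, 2) array of (x, y) pixel coordinates
    genes: length-n sequence of gene names
    label_mask: 2D int array, 0 = background, k > 0 = cell k
    """
    xy = np.asarray(xy, dtype=float)
    cols = np.floor(xy[:, 0]).astype(int)
    rows = np.floor(xy[:, 1]).astype(int)
    in_bounds = ((rows >= 0) & (rows < label_mask.shape[0]) &
                 (cols >= 0) & (cols < label_mask.shape[1]))
    # Transcripts outside the image or on background keep cell id 0 (unassigned)
    cell_ids = np.zeros(len(xy), dtype=int)
    cell_ids[in_bounds] = label_mask[rows[in_bounds], cols[in_bounds]]

    gene_index = {g: j for j, g in enumerate(gene_names)}
    counts = np.zeros((int(label_mask.max()), len(gene_names)), dtype=int)
    for cell, gene in zip(cell_ids, genes):
        if cell > 0:
            counts[cell - 1, gene_index[gene]] += 1
    return counts, cell_ids

# Toy example: two rectangular "cells" and five transcripts
mask = np.zeros((4, 6), dtype=int)
mask[0:2, 0:2] = 1
mask[2:4, 3:6] = 2
xy = [[0.5, 0.5], [1.2, 1.8], [4.0, 3.0], [2.5, 0.5], [9.0, 9.0]]
genes = ["A", "B", "A", "B", "A"]
counts, cell_ids = assign_transcripts(xy, genes, mask, gene_names=["A", "B"])
```

Keeping cell_ids alongside the matrix makes it easy to report the unassigned fraction, which is itself a useful segmentation QC metric.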

Two small but critical choices pay off later. First, preserve the link between the raw imaging metadata and the count matrix; when QC flags a problem area, you’ll want to jump back to the original image tiles to see what went wrong. Second, keep image‑space transformations (affine or non‑linear warps) consistent across channels and steps; even tiny misalignments can cause false negatives during transcript assignment or bleed‑over between neighboring cells.
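One way to keep transformations consistent is to store a single registration matrix and apply it to every coordinate set. The sketch below assumes a 3×3 homogeneous affine matrix; the matrix values and the helper name apply_affine are made up for illustration.

```python
import numpy as np

def apply_affine(points, matrix):
    """Apply a 3x3 homogeneous affine matrix to an (n, 2) array of (x, y) points."""
    points = np.asarray(points, dtype=float)
    homogeneous = np.column_stack([points, np.ones(len(points))])
    return (homogeneous @ matrix.T)[:, :2]

# Hypothetical registration result: scale by 2, then shift by (10, -5) pixels
REGISTRATION = np.array([
    [2.0, 0.0, 10.0],
    [0.0, 2.0, -5.0],
    [0.0, 0.0, 1.0],
])

transcripts = np.array([[1.0, 1.0], [3.0, 4.0]])
nuclei = np.array([[1.0, 1.5]])

# The same matrix is used for every channel so relative positions are preserved
transcripts_reg = apply_affine(transcripts, REGISTRATION)
nuclei_reg = apply_affine(nuclei, REGISTRATION)

# Inverting the matrix maps registered coordinates back to the raw image tiles
transcripts_back = apply_affine(transcripts_reg, np.linalg.inv(REGISTRATION))
```

Storing the matrix with the count matrix also gives you the inverse mapping, so a flagged region can be traced back to the original tiles.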

Here’s a compact example that loads Visium‑style data, computes basic QC metrics, and filters obvious outliers using Scanpy. The same logic applies to cell‑resolved ISH data if adata.obs represents cells rather than spots.

import scanpy as sc

# Load a Visium dataset from a Space Ranger output directory
# (read_visium also stores the tissue image in adata.uns['spatial'])
adata = sc.read_visium("path/to/visium_dir")
adata.var_names_make_unique()

# Compute QC: genes per spot, counts per spot, mitochondrial fraction
# (upper-casing covers both human 'MT-' and mouse 'mt-' gene prefixes)
adata.var['mt'] = adata.var_names.str.upper().str.startswith(('MT-', 'MT_'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Simple, interpretable filters (tune per tissue/platform)
min_genes = 200
max_mt = 15  # percentage
keep = (adata.obs['n_genes_by_counts'] >= min_genes) & (adata.obs['pct_counts_mt'] <= max_mt)
adata = adata[keep, :].copy()  # copy so later steps modify real data, not a view

# Keep the raw counts, then apply library-size normalization for visualization
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
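The fixed cutoffs above are a starting point. A common alternative, shown here as a swapped‑in technique rather than part of the workflow above, is to flag spots that deviate from the median by several median absolute deviations (MADs). The function name mad_outliers and the default of 3 MADs are assumptions to tune per dataset.

```python
import numpy as np

def mad_outliers(values, n_mads=3.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    deviation = np.abs(values - median)
    mad = np.median(deviation)
    if mad == 0:
        # No spread at all: nothing can be called an outlier
        return np.zeros(values.shape, dtype=bool)
    return deviation > n_mads * mad

# Toy per-spot total counts with one clearly depleted spot
total_counts = [100, 110, 90, 105, 95, 5]
flagged = mad_outliers(total_counts)
```

On real data this is usually applied to log‑transformed counts, for example mad_outliers(np.log1p(adata.obs['total_counts'])), and the flagged spots should still be inspected on the tissue image before removal.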

Quality control that actually predicts success

Good QC starts before the slide is ever imaged or sequenced. For fresh‑frozen tissue, aim for high RNA integrity (commonly RIN ≥ 7) and consistent section thickness. For FFPE, assess DV200 (the percentage of RNA fragments longer than 200 nt) and confirm that fixation and processing followed a consistent, documented protocol. Tissue optimization for permeabilization is worth the day it costs; under‑permeabilization yields weak signal, while over‑permeabilization blurs spatial boundaries and leaks RNA beyond true tissue borders.
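DV200 is simple enough to compute by hand from a fragment‑size trace. The sketch below assumes you have exported paired arrays of fragment sizes and signal intensities from the instrument; the function name dv200 and the toy trace are hypothetical, and instrument software may apply its own baseline correction.

```python
import numpy as np

def dv200(fragment_sizes_nt, signal):
    """Percentage of total signal coming from fragments longer than 200 nt."""
    sizes = np.asarray(fragment_sizes_nt, dtype=float)
    signal = np.asarray(signal, dtype=float)
    total = signal.sum()
    if total <= 0:
        raise ValueError("signal must contain positive intensity")
    return 100.0 * signal[sizes > 200].sum() / total

# Toy trace: half of the signal sits above 200 nt
sizes = [50, 100, 200, 400, 1000]
intensity = [10, 20, 20, 30, 20]
score = dv200(sizes, intensity)
```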

Once data are generated, interpret QC in the context of resolution. In imaging‑based ISH, look for decoding accuracy and background levels, uniform registration across fields, and reasonable transcript densities per cell. In array capture, focus on basic metrics per spot such as total counts, genes detected, and mitochondrial RNA fraction. These are not mere checkboxes; they reflect biology and slide handling. Elevated mitochondrial fractions along cut edges often indicate mechanical damage, whereas a gradient of counts from center to margin may hint at suboptimal permeabilization. Most importantly, map every QC metric back onto the tissue image. Spatial outliers are either biology you should keep or artifacts you must exclude; the image tells you which.
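A center‑to‑margin gradient can be screened for with a single number before you look at the map. This is a rough sketch that correlates each spot's distance from the tissue centroid with its log counts; the function name radial_count_trend is hypothetical, and Pearson correlation is an assumed, deliberately simple choice.

```python
import numpy as np

def radial_count_trend(coords, total_counts):
    """Pearson correlation between distance from the centroid and log1p counts.

    Strongly negative values mean counts fall off toward the tissue margin.
    """
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    log_counts = np.log1p(np.asarray(total_counts, dtype=float))
    return float(np.corrcoef(dist, log_counts)[0, 1])

# Synthetic 11 x 11 grid of spots whose counts decay away from the center
axis = np.arange(-5, 6)
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack([xx.ravel(), yy.ravel()])
radius = np.linalg.norm(grid, axis=1)
trend = radial_count_trend(grid, 1000 * np.exp(-radius / 3))
```

With Scanpy objects the call would look like radial_count_trend(adata.obsm['spatial'], adata.obs['total_counts']). A strong trend is only a prompt to inspect the image, since real biology such as a dense tumor center can produce the same pattern.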

A quick overlay in Squidpy helps you see what the numbers mean on tissue:

import squidpy as sq
import matplotlib.pyplot as plt

# spatial_scatter draws on the tissue image stored in adata.uns['spatial']
# by read_visium; it expects that image (or a NumPy array), not an ImageContainer
library_id = next(iter(adata.uns['spatial']))
sq.pl.spatial_scatter(
    adata,
    color=['n_genes_by_counts', 'pct_counts_mt'],
    library_id=library_id,
    wspace=0.4
)
plt.show()

Tissue morphology and sample quality: read the slide before the data

Spatial transcriptomics is ultimately a histology‑anchored assay. The H&E or IF image is not decoration; it’s your first QC gate. Well‑preserved morphology with crisp nuclei, intact membranes, and consistent staining usually tracks with high‑quality counts. Conversely, folds, tears, chatter marks from sectioning, and detachment artifacts create spatial patterns of low counts that no algorithm can fix. Pay attention to tissue type as well. Adipose‑rich or calcified regions often perform poorly due to low RNA yield or poor adhesion, while highly vascular areas can show unexpected background if blood cells dominate the capture.

A frequent pitfall is assuming all regions in a section are equally informative. They aren’t. Before you launch clustering or neighborhood analyses, annotate the slide for regions to include or exclude. Masking obviously damaged areas, over‑stained regions, or out‑of‑focus tiles can raise the quality of your downstream findings more than any sophisticated normalization. In cell‑resolved ISH, verify that segmentation aligns with biology; if nuclei are clumped or cytoplasm is faint, transcripts may be misassigned, inflating cell sizes or creating artificial doublets.
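Inflated cell sizes are easy to screen for directly from the segmentation mask. The sketch below assumes an integer label mask with 0 as background and flags cells whose pixel area exceeds a multiple of the median; the function name and the max_ratio threshold are illustrative assumptions.

```python
import numpy as np

def flag_oversized_cells(label_mask, max_ratio=3.0):
    """Return ids of cells whose area exceeds max_ratio times the median area."""
    areas = np.bincount(label_mask.ravel())[1:]  # index 0 is background
    present = areas > 0
    if not present.any():
        return np.array([], dtype=int), areas
    median_area = np.median(areas[present])
    cell_ids = np.arange(1, len(areas) + 1)
    return cell_ids[present & (areas > max_ratio * median_area)], areas

# Toy mask: three normal cells and one merged blob
seg = np.zeros((10, 10), dtype=int)
seg[0:2, 0:2] = 1
seg[0:2, 3:5] = 2
seg[3:5, 0:3] = 3
seg[5:10, 5:10] = 4
oversized, areas = flag_oversized_cells(seg)
```

Flagged cells are candidates for merged neighbors or artificial doublets and are worth checking against the nuclear channel before any are removed.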

A practical example: the necrotic core

Imagine analyzing a tumor section that includes a central necrotic core. Histologically, you’ll see loss of nuclear detail, ghost outlines of cells, and often a granular background. Biochemically, dead or dying cells degrade RNA, so the core yields few unique molecular identifiers and a low number of detected genes. On array capture slides, spots sitting over this zone look pale in expression maps and may show atypically high background or elevated mitochondrial fractions from fragmented mitochondria. In imaging‑based ISH, decoding density collapses and error‑correction may discard most signals as noise.

Here is where integrating morphology with metrics pays off. First, use the H&E to delineate the necrotic area. Next, visualize counts and gene numbers over the slide; expect a stark depression that aligns with the morphological boundary. Finally, exclude or annotate this region before differential expression or spatial domain discovery. Otherwise, algorithms might call the necrotic core a “new cluster” and contaminate neighborhood statistics. If your biological question involves hypoxia or cell death at the invasive front, keep the perinecrotic rim but document its definition. The key is to let the tissue image guide the mask, then let the metrics confirm it.
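The mask‑then‑confirm logic can be sketched as follows, assuming the necrotic outline was traced on the H&E and exported as polygon vertices in the same pixel coordinates as the spots. The even‑odd ray‑casting test is a generic point‑in‑polygon method; the function name and all coordinates are hypothetical.

```python
import numpy as np

def points_in_polygon(points, polygon):
    """Even-odd ray casting: True for points that fall inside the polygon."""
    points = np.asarray(points, dtype=float)
    polygon = np.asarray(polygon, dtype=float)
    x, y = points[:, 0], points[:, 1]
    inside = np.zeros(len(points), dtype=bool)
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        straddles = (y1 > y) != (y2 > y)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
        # Toggle on each crossing to the right of the point
        inside ^= straddles & (x < x_cross)
    return inside

# Hypothetical necrotic outline and four spot positions with total counts
necrotic_outline = [(2, 2), (6, 2), (6, 6), (2, 6)]
spot_xy = [(4, 4), (1, 1), (7, 4), (3, 5)]
spot_counts = np.array([50, 900, 1000, 40])

in_core = points_in_polygon(spot_xy, necrotic_outline)
median_inside = float(np.median(spot_counts[in_core]))
median_outside = float(np.median(spot_counts[~in_core]))
```

If the counts inside the outline are not clearly lower than outside, revisit the annotation rather than the threshold. Storing the flag as an annotation column instead of deleting spots keeps the perinecrotic rim available for later questions.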

Summary / Takeaways

Imaging‑based spatial transcriptomics measures RNA molecules in place, but what you ultimately analyze depends on resolution. ISH methods resolve transcripts at subcellular scales and, after segmentation, yield per‑cell maps; capture arrays summarize expression across micrometer‑scale spots that can mix multiple cells. Preprocessing is about faithful registration, decoding, and assignment. Quality control is about linking counts back to the slide and asking whether the spatial pattern matches biology or artifact. Above all, morphology is your compass. If the tissue shows damage, folds, or necrosis, the data will echo it. Start with the image, validate with metrics, and mask deliberately. Your downstream biology will be clearer, and your conclusions more robust.