Integrate scRNA-seq and Visium in scverse

Why this matters

Integrating single-cell RNA-seq with Visium spatial transcriptomics lets you combine the strengths of both assays: single-cell data gives high-resolution cell identities, while Visium tells you where those signals sit in tissue. In practice, this helps you map likely cell states onto spatial spots, inspect tissue structure, and build more biologically interpretable spatial analyses.

This tutorial shows a practical scverse workflow for reference mapping and label transfer: use an annotated scRNA-seq dataset as the reference, project Visium spots into the same latent space with Scanpy, then visualize predicted cell identities on tissue with Squidpy.

Steps and expected outcome

We will:

Load an annotated scRNA-seq reference and a Visium dataset.
Preprocess each dataset separately.
Restrict both datasets to shared genes.
Build a latent representation from the scRNA-seq reference.
Map Visium spots into that reference with scanpy.tl.ingest.
Compute soft cell-type scores from nearest neighbors.
Plot predicted labels and scores in spatial coordinates.

Expected outcome: each Visium spot will receive a predicted dominant cell type, a confidence-like score, and optional soft scores for multiple cell types that you can inspect on the tissue image.

Requirements

Basic Python and AnnData familiarity
A scRNA-seq reference already annotated with a column such as obs["cell_type"]
A Visium output directory from Space Ranger, or an .h5ad version of it
Python with:
- scanpy
- squidpy
- anndata
- numpy
- pandas
- scikit-learn
- matplotlib
CPU is enough for this workflow; no GPU is required

pip install scanpy squidpy anndata numpy pandas scikit-learn matplotlib

Step 1: Import packages and define paths

We start with imports and point to the two inputs:

an annotated scRNA-seq reference
a Visium folder

import scanpy as sc
import squidpy as sq
import anndata as ad
import numpy as np
import pandas as pd

from sklearn.neighbors import NearestNeighbors

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=120, facecolor="white")

SCRNA_PATH = "data/scrna_reference.h5ad"
VISIUM_PATH = "data/visium_sample/"
OUT_PATH = "results/visium_mapped.h5ad"

Step 2: Load the scRNA-seq and Visium data

For current scverse workflows, Squidpy is the convenient reader for Visium data. The single-cell reference is usually already stored as an .h5ad file.

adata_sc = sc.read_h5ad(SCRNA_PATH)
adata_vis = sq.read.visium(VISIUM_PATH)

adata_sc.var_names_make_unique()
adata_vis.var_names_make_unique()

print(adata_sc)
print(adata_vis)

Make sure your single-cell reference contains cell-type labels:

if "cell_type" not in adata_sc.obs:
    raise ValueError("adata_sc.obs must contain a 'cell_type' column.")

adata_sc.obs["cell_type"] = adata_sc.obs["cell_type"].astype("category")

Step 3: Standardize gene names

This step reduces avoidable mismatches between the two datasets. If your reference uses Ensembl IDs and Visium uses gene symbols, you will need a separate mapping step. If both already use gene symbols, simple normalization is often enough.

adata_sc.var_names = adata_sc.var_names.str.upper()
adata_vis.var_names = adata_vis.var_names.str.upper()

adata_sc.var_names_make_unique()
adata_vis.var_names_make_unique()

Step 4: Preprocess the scRNA-seq reference

The reference should be clean and annotated because all downstream mapping depends on it.

def preprocess_scrna(adata):
    adata = adata.copy()

    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=10)

    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

    if "pct_counts_mt" in adata.obs:
        adata = adata[adata.obs["pct_counts_mt"] < 15].copy()

    adata.layers["counts"] = adata.X.copy()

    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    return adata

adata_sc = preprocess_scrna(adata_sc)

If your reference was already fully preprocessed and clustered, you can skip the QC portion and keep the normalization and log transform consistent with the Visium data.

Step 5: Preprocess the Visium data

Visium spots are not single cells, but we still do standard count normalization and log transformation so the data can be compared in the same feature space.

def preprocess_visium(adata):
    adata = adata.copy()

    if "in_tissue" in adata.obs:
        adata = adata[adata.obs["in_tissue"].astype(int) == 1].copy()

    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

    adata.layers["counts"] = adata.X.copy()

    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    return adata

adata_vis = preprocess_visium(adata_vis)

A quick QC check is useful before mapping:

sc.pl.violin(
    adata_vis,
    ["total_counts", "n_genes_by_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)

Step 6: Keep only shared genes

ingest requires both datasets to use the same variables. In most real projects, this is the step where many integration issues are fixed.

shared_genes = adata_sc.var_names.intersection(adata_vis.var_names)

print(f"Shared genes: {len(shared_genes)}")

if len(shared_genes) < 1000:
    raise ValueError(
        "Too few shared genes. Check whether both datasets use the same gene naming convention."
    )

adata_sc = adata_sc[:, shared_genes].copy()
adata_vis = adata_vis[:, shared_genes].copy()

Step 7: Select informative genes from the reference

A common strategy is to define highly variable genes on the scRNA-seq reference, then subset both datasets to that shared set.

sc.pp.highly_variable_genes(adata_sc, n_top_genes=3000, flavor="seurat")
hvg = adata_sc.var_names[adata_sc.var["highly_variable"]]

print(f"Reference HVGs: {len(hvg)}")

adata_sc = adata_sc[:, hvg].copy()
adata_vis = adata_vis[:, hvg].copy()

Step 8: Build the reference latent space with Scanpy

Now we create the latent representation on the single-cell reference. Visium spots will be projected into this space.

sc.pp.pca(adata_sc, n_comps=50)
sc.pp.neighbors(adata_sc, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata_sc)

Plot the reference UMAP to confirm the annotation looks sensible:

sc.pl.umap(adata_sc, color="cell_type", legend_loc="on data")

Step 9: Map Visium spots with `ingest`

This is the core integration step. The Visium query is projected into the reference PCA/neighbor graph, and labels are transferred from the reference.

sc.tl.ingest(adata_vis, adata_sc, obs="cell_type")

# Rename the transferred label to make the result explicit
adata_vis.obs["pred_cell_type"] = adata_vis.obs["cell_type"].astype("category")

At this point, the Visium object should contain transferred labels and reference-based embeddings such as X_pca and X_umap.

You can inspect the mapped query in the reference manifold:

sc.pl.embedding(adata_vis, basis="umap", color="pred_cell_type")

Step 10: Compute soft cell-type scores per spot

A single Visium spot often contains more than one cell type, so a hard label is only a rough summary. A practical improvement is to compute weighted nearest-neighbor scores from the reference.

def knn_celltype_scores(adata_ref, adata_query, label_key="cell_type", n_neighbors=30):
    ref_emb = adata_ref.obsm["X_pca"]
    qry_emb = adata_query.obsm["X_pca"]

    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="euclidean")
    nn.fit(ref_emb)

    dists, idx = nn.kneighbors(qry_emb)

    ref_labels = adata_ref.obs[label_key].astype(str).to_numpy()
    classes = np.array(sorted(pd.unique(ref_labels)))

    # Distance-based weights
    weights = 1.0 / (dists + 1e-8)
    weights = weights / weights.sum(axis=1, keepdims=True)

    score_mat = np.zeros((adata_query.n_obs, len(classes)), dtype=float)

    for i in range(adata_query.n_obs):
        neigh_labels = ref_labels[idx[i]]
        for j, cls in enumerate(classes):
            score_mat[i, j] = weights[i, neigh_labels == cls].sum()

    scores = pd.DataFrame(score_mat, index=adata_query.obs_names, columns=classes)
    return scores

scores = knn_celltype_scores(
    adata_ref=adata_sc,
    adata_query=adata_vis,
    label_key="cell_type",
    n_neighbors=30,
)

adata_vis.obs = adata_vis.obs.join(scores)
adata_vis.obs["pred_cell_type_soft"] = scores.idxmax(axis=1).astype("category")
adata_vis.obs["pred_confidence"] = scores.max(axis=1)

Now each spot has:

pred_cell_type: transferred label from ingest
pred_cell_type_soft: dominant label from neighbor voting
one numeric score column per cell type
pred_confidence: max soft score

Step 11: Visualize labels on the tissue image

For spatial visualization, use Squidpy.

sq.pl.spatial_scatter(
    adata_vis,
    color="pred_cell_type_soft",
    img=True,
    size=1.3,
)

Plot confidence to identify uncertain regions:

sq.pl.spatial_scatter(
    adata_vis,
    color="pred_confidence",
    cmap="viridis",
    img=True,
    size=1.3,
)

You can also inspect selected cell-type scores directly on tissue. Replace the names below with labels from your own reference.

example_celltypes = [c for c in ["ASTROCYTE", "OLIGODENDROCYTE", "EXCITATORY_NEURON"] if c in adata_vis.obs.columns]

if example_celltypes:
    sq.pl.spatial_scatter(
        adata_vis,
        color=example_celltypes,
        cmap="magma",
        img=True,
        size=1.3,
        ncols=2,
    )

Step 12: Compare mapped labels with marker genes

A good sanity check is to compare transferred labels with known marker genes. Replace the genes with markers appropriate for your tissue.

marker_genes = [g for g in ["GFAP", "MOBP", "SLC17A7", "GAD1"] if g in adata_vis.var_names]

if marker_genes:
    sq.pl.spatial_scatter(
        adata_vis,
        color=marker_genes,
        img=True,
        size=1.3,
        cmap="inferno",
        ncols=2,
    )

If predicted astrocyte-rich regions also show high GFAP, or inhibitory regions align with GAD1, your mapping is much easier to trust.

Step 13: Save the integrated result

Always save the mapped object so you can reuse it downstream.

adata_vis.write_h5ad(OUT_PATH)
print(f"Saved: {OUT_PATH}")

Practical notes

What this workflow does well

This recipe is a good choice when you want to:

transfer a known cell-type annotation from scRNA-seq to Visium
quickly inspect where reference-defined cell states occur
build a lightweight and reproducible baseline before moving to heavier models

What this workflow does not do

This is not a full probabilistic deconvolution workflow. A Visium spot can contain multiple cell types, and ingest mainly gives you a reference-based projection plus a dominant label. The added kNN scores help, but they are still a heuristic.

If you need more rigorous cell-type abundance estimation per spot, use a dedicated deconvolution model later.

Recap

In this tutorial, you:

loaded an annotated scRNA-seq reference and a Visium dataset
preprocessed them separately
aligned them on shared genes
built a reference latent space with Scanpy
mapped Visium spots using ingest
added soft cell-type scores with nearest neighbors
visualized the results in tissue coordinates with Squidpy

This gives you a clean starting point for spatial cell-state annotation in the scverse ecosystem.

FAQ

1. Why do my Visium spots get only one label if spots contain multiple cells?

Because ingest performs reference mapping, not full spot deconvolution. The hard label is best interpreted as the dominant nearest transcriptional state. Use the soft kNN scores from this tutorial for a less binary view, and switch to a dedicated deconvolution model if you need cell-type mixtures.

2. Why do I get very poor mapping quality?

The most common causes are:

mismatched gene identifiers between datasets
weak or noisy scRNA-seq annotations
too few shared genes
reference tissue not matching the Visium tissue closely enough

Start by checking gene overlap, marker genes, and whether the reference labels are biologically plausible.

3. Why does the spatial plot fail or show no tissue image?

Usually the Visium object is missing image metadata or was not loaded from a standard folder layout. Confirm that the Visium directory includes the expected spatial files. If needed, you can still plot spots without the background image by setting img=False in sq.pl.spatial_scatter().