scVI for scRNA‑seq Dataset Integration: A Practical, Probabilistic Workflow

Intro: why integrate scRNA‑seq datasets

Combining datasets across labs, donors, or technologies boosts statistical power and generalizability, but batch effects can swamp the biology. scVI (single-cell Variational Inference) addresses this with a deep generative model that learns a shared latent space while modeling raw counts and technical covariates, which makes integration and downstream analyses (clustering, differential expression) more reliable. It is also the foundation for scANVI (semi-supervised) and works with reference mapping via scArches. (embopress.org)

What makes scVI different

Probabilistic by design: Negative Binomial likelihood for counts and uncertainty-aware latent embeddings; supports posterior-based differential expression. (embopress.org)
Flexible covariates: register a primary batch variable plus additional technical covariates (e.g., donor, chemistry) in setup_anndata; the model trains on CPU or GPU with minibatches, so it scales to atlas-sized data. (docs.scvi-tools.org)
Extensible family: scANVI adds labels to improve separation and label transfer; scArches enables fast mapping of new datasets into an existing reference without retraining from scratch. (embopress.org)
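
To make the count likelihood concrete, here is a minimal sketch of the Negative Binomial log-probability in the mean/inverse-dispersion form commonly used for count models. The function name nb_log_prob and the example values are illustrative assumptions, not scvi-tools API; in the real model the mean comes from the decoder per cell and gene, and everything runs on tensors.

```python
import math


def nb_log_prob(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a Negative Binomial with mean mu
    and inverse dispersion theta.

    The variance is mu + mu**2 / theta, so a small theta means strong
    overdispersion and a very large theta approaches a Poisson.
    """
    log_theta_mu = math.log(theta + mu)
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * (math.log(theta) - log_theta_mu)
        + x * (math.log(mu) - log_theta_mu)
    )


# Illustrative values (assumptions): mean expression 2.0, inverse dispersion 0.5.
mu, theta = 2.0, 0.5
probs = [math.exp(nb_log_prob(x, mu, theta)) for x in range(2000)]

total = sum(probs)                                   # should be close to 1
mean = sum(x * p for x, p in enumerate(probs))       # should be close to mu
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
```

With these values the distribution has mean 2 but variance 10, which is the kind of overdispersion a Poisson cannot capture and one reason the model is fit on raw counts.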

Quick scVI workflow (short code example)

Below is a minimal Python example that integrates multiple batches and produces an embedding for neighbors/UMAP and clustering.

import scanpy as sc
import scvi

adata = sc.read_h5ad("your_data.h5ad")           # adata.layers["counts"] should hold raw counts

scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",                 # primary batch; pass extra ones via categorical_covariate_keys=["donor", ...]
)

model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=100, early_stopping=True)

adata.obsm["X_scVI"] = model.get_latent_representation()

sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=1.0)

Tips:
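– Select roughly 1,000–10,000 highly variable genes before training, and keep raw (unnormalized) counts in a layer: the count likelihood expects integer counts, not log-normalized values.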
– If you have partial labels and want better separation/label transfer, fine‑tune with scANVI. (docs.scvi-tools.org)
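
The two tips above can be sketched as follows. This is a hedged outline rather than a definitive recipe: the column names "batch" and "cell_type", the "Unknown" placeholder, the helper names, and the epoch counts are all assumptions chosen for illustration. HVG selection has to run before setup_anndata and training; the scANVI step starts from an already trained scVI model.

```python
def mask_labels(labels, known_types, unlabeled="Unknown"):
    """Replace labels outside `known_types` (including missing values)
    with the unlabeled category that scANVI will treat as unannotated."""
    known = set(known_types)
    return [lab if lab in known else unlabeled for lab in labels]


def select_hvgs(adata, n_top_genes=2000, batch_key="batch"):
    """Subset adata to batch-aware highly variable genes computed on raw counts."""
    import scanpy as sc  # lazy import; flavor="seurat_v3" also needs scikit-misc

    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=n_top_genes,
        flavor="seurat_v3",   # this flavor expects raw counts
        layer="counts",
        batch_key=batch_key,
        subset=True,
    )
    return adata


def finetune_scanvi(adata, scvi_model, labels_key="cell_type", unlabeled="Unknown"):
    """Initialize scANVI from a trained scVI model, fine-tune it, and store
    the latent space and predicted labels on adata. Returns the scANVI model."""
    import scvi  # lazy import so mask_labels works without scvi-tools installed

    scanvi = scvi.model.SCANVI.from_scvi_model(
        scvi_model,
        unlabeled_category=unlabeled,
        labels_key=labels_key,
    )
    scanvi.train(max_epochs=20, n_samples_per_label=100)
    adata.obsm["X_scANVI"] = scanvi.get_latent_representation()
    adata.obs["scanvi_prediction"] = scanvi.predict()
    return scanvi
```

A typical call order would be select_hvgs(adata), then the scVI setup and training shown earlier, then adata.obs["cell_type"] = mask_labels(adata.obs["cell_type"], trusted_types) with your own list of trusted_types, and finally finetune_scanvi(adata, model).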

When to reach for scANVI or scArches

scANVI: you have cell type labels for some cells or one dataset and want semi‑supervised integration plus label transfer with uncertainty. Great for harmonizing multiple studies while leveraging annotations. (embopress.org)
scArches: you already trained a reference model (SCVI/SCANVI/TOTALVI) and need to map new samples quickly without sharing raw data or re‑integrating everything. Ideal for iterative atlas building and cross‑cohort projects. (pubmed.ncbi.nlm.nih.gov)
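
For the scArches case, a rough sketch of query mapping with scvi-tools might look like the following. The helper names, the 200-epoch budget, and the zero weight decay are assumptions borrowed from common tutorial settings, not requirements; the query is assumed to carry raw counts in the same layer and a batch column with the same name as the reference.

```python
def gene_overlap(reference_genes, query_genes):
    """Fraction of reference genes that are also present in the query."""
    ref = list(reference_genes)
    query = set(query_genes)
    return sum(g in query for g in ref) / len(ref)


def map_query(query_adata, reference_model, max_epochs=200):
    """Map a query dataset onto a trained reference SCVI model.

    `reference_model` can be a trained model object or a path to a saved one.
    Returns the query model; the latent space is stored in obsm["X_scVI"].
    """
    import scvi  # lazy import so gene_overlap works without scvi-tools installed

    # Reorder the query genes to match the reference and pad missing genes with zeros.
    scvi.model.SCVI.prepare_query_anndata(query_adata, reference_model)

    # Build a query model that reuses the reference weights.
    query_model = scvi.model.SCVI.load_query_data(query_adata, reference_model)

    # Short fine-tuning pass on the query data only.
    query_model.train(max_epochs=max_epochs, plan_kwargs={"weight_decay": 0.0})

    query_adata.obsm["X_scVI"] = query_model.get_latent_representation()
    return query_model
```

Checking gene_overlap(reference_adata.var_names, query_adata.var_names) first is a cheap sanity test: if only a small fraction of the reference genes are present in the query, much of the input will be zero-padded and the mapping deserves extra scrutiny.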

Alternatives and how to choose

Seurat integration (CCA/RPCA anchors): strong defaults in R and widely adopted for datasets with shared cell types; RPCA is the faster option for large projects. (satijalab.org)
Others you’ll see: Harmony (fast PC‑space correction), BBKNN, Scanorama, fastMNN. In general, prefer scVI/scANVI when you want a probabilistic model for counts, integrated DE, and scalable reference mapping; consider Seurat/Harmony when your pipeline is R‑first and you need tight workflow integration.

Summary: a simple decision rule

Start with scVI for count‑aware, scalable integration; add scANVI if you have labels; use scArches to grow a reference over time. Keep raw counts, register batches properly, and evaluate with biological controls (marker genes, known compositions). Then treat X_scVI as your integrated space and build the neighbor graph, UMAP, and clusters on it.
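
One simple way to check known compositions is to tabulate, for each cluster, the share of cells contributed by each batch. The sketch below uses only the standard library; the function name and the toy labels are made up for illustration.

```python
from collections import Counter, defaultdict


def batch_fractions_per_cluster(clusters, batches):
    """For each cluster, return the fraction of its cells coming from each batch."""
    counts = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        counts[cluster][batch] += 1
    return {
        cluster: {b: n / sum(c.values()) for b, n in c.items()}
        for cluster, c in counts.items()
    }


# Toy labels (illustrative only): cluster "0" is mixed, cluster "1" is batch-specific.
clusters = ["0", "0", "0", "0", "1", "1"]
batches = ["a", "b", "a", "b", "a", "a"]
fractions = batch_fractions_per_cluster(clusters, batches)
```

On real data you could call batch_fractions_per_cluster(adata.obs["leiden"], adata.obs["batch"]). A cluster dominated by one batch is not automatically a failure: it may be a cell type that only one sample contains, so check its marker genes before concluding that batch effects remain.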