Scanpy scRNA-seq QC Tutorial

Introduction

Quality control is one of the first and most important steps in any single-cell RNA-seq workflow. Before clustering cells, finding markers, or building cell type annotations, you need to remove low-quality cells and uninformative genes. If you skip this step, downstream results can be noisy, biased, or simply wrong.

In this tutorial, you will learn how to perform basic single-cell RNA-seq quality control with scanpy using concrete Python examples. The workflow focuses on common QC signals such as:

  • total counts per cell
  • number of detected genes per cell
  • fraction of mitochondrial counts
  • lowly detected genes

Summary

We will walk through a practical QC workflow in scanpy:

  1. Load a small example dataset
  2. Inspect the raw count matrix
  3. Annotate mitochondrial genes
  4. Compute QC metrics
  5. Visualize QC distributions
  6. Filter low-quality cells and genes
  7. Save the cleaned dataset

Expected outcome

By the end, you will have:

  • a cleaned AnnData object
  • a reusable QC template for your own scRNA-seq data
  • a better understanding of common QC thresholds and why they matter

Requirements

  • Basic Python knowledge
  • Familiarity with tabular data and plotting
  • Python 3.10+ recommended
  • Packages: scanpy, anndata, numpy, pandas, matplotlib
  • No GPU required

Step 1: Install and import dependencies

Start by installing the required packages in your environment.

pip install scanpy anndata matplotlib pandas numpy

Now import the libraries.

import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

It is also useful to set the logging verbosity and plotting defaults.

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=100, facecolor="white")

Step 2: Load an example single-cell dataset

scanpy provides a loader for a small PBMC dataset that is widely used for demonstrations. The first call downloads the data, so an internet connection is needed.

adata = sc.datasets.pbmc3k()
adata

You should see an AnnData object with cells as rows and genes as columns. For pbmc3k, that is 2,700 cells by 32,738 genes.

To inspect the matrix dimensions:

print(f"Cells: {adata.n_obs}")
print(f"Genes: {adata.n_vars}")

Preview the metadata tables:

print(adata.obs.head())
print(adata.var.head())

At this stage, adata.obs holds little or no cell metadata, and adata.var holds gene-level annotations such as gene IDs.

Step 3: Make gene names unique

Some datasets may contain duplicated gene names. This can cause issues later, so it is good practice to make them unique.

adata.var_names_make_unique()
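
To see whether a dataset actually contains duplicated gene names before fixing them, a quick check along the following lines can help. This is a minimal sketch: example_names is a made-up list standing in for adata.var_names, and on real data you would pass adata.var_names instead.

```python
import pandas as pd

# Hypothetical gene names standing in for adata.var_names.
example_names = pd.Index(["CD3D", "MS4A1", "CD3D", "NKG7", "MS4A1"])

# duplicated() marks every occurrence after the first one.
is_duplicate = example_names.duplicated()
duplicated_names = sorted(set(example_names[is_duplicate]))

print(f"Duplicated entries: {int(is_duplicate.sum())}")
print(f"Names affected: {duplicated_names}")
```

If the list of affected names is empty, making the names unique changes nothing, so the call is safe to keep in a template either way.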

Step 4: Annotate mitochondrial genes

A high fraction of mitochondrial RNA often indicates stressed or dying cells. To measure this, we first mark mitochondrial genes.

For many human datasets, mitochondrial genes start with MT-.

adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var[["mt"]].head()

If you work with mouse data, mitochondrial genes are often labeled with mt- instead.
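
If you want one template that handles both conventions, a case-insensitive match is one option. The sketch below uses a made-up list of names (example_names) in place of adata.var_names. It assumes your annotation uses the MT- or mt- prefix at all, which is worth checking by eye.

```python
import pandas as pd

# Hypothetical gene names; on real data use adata.var_names instead.
example_names = pd.Index(["MT-CO1", "mt-Nd1", "ACTB", "MTOR", "Gapdh"])

# Upper-case the names first so both "MT-" and "mt-" are matched.
# The trailing hyphen keeps nuclear genes such as MTOR from matching.
is_mt = example_names.str.upper().str.startswith("MT-")

for name, flag in zip(example_names, is_mt):
    print(f"{name}: {bool(flag)}")
```

On a real object the same expression can be assigned to adata.var["mt"]. It is a good idea to print how many genes were flagged, because a count of zero usually means the prefix does not match the annotation.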

Step 5: Calculate QC metrics

Now compute standard quality control metrics for every cell and gene.

sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt"],
    percent_top=None,
    log1p=False,
    inplace=True
)

This adds several useful columns to adata.obs and adata.var.

Check the new cell-level QC columns:

print(adata.obs.columns.tolist())

Common columns include:

  • n_genes_by_counts: number of detected genes per cell
  • total_counts: total UMI/counts per cell
  • pct_counts_mt: percentage of counts from mitochondrial genes
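
To make these definitions concrete, the sketch below recomputes the three metrics by hand with numpy on a tiny made-up count matrix. The matrix and the mitochondrial mask are illustrative assumptions, not values from pbmc3k.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns). Values are made up.
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 1, 9],
    [4, 4, 0, 0],
])

# Hypothetical mask: pretend the last gene is mitochondrial.
is_mt = np.array([False, False, False, True])

# Number of genes with at least one count in each cell.
n_genes_by_counts = (counts > 0).sum(axis=1)

# Sum of all counts in each cell.
total_counts = counts.sum(axis=1)

# Share of each cell's counts that comes from mitochondrial genes, in percent.
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

print(n_genes_by_counts, total_counts, pct_counts_mt)
```

The second cell has the same total counts as the first but gets 90% of them from the mitochondrial gene, which is the pattern the mitochondrial filter is meant to catch.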

Preview a few rows:

adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].head()

Step 6: Visualize the QC metrics

Before filtering anything, inspect the metric distributions. This helps you choose thresholds based on the dataset rather than guessing blindly.

Violin plots

sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True
)

These plots help identify cells with:

  • very few genes detected
  • unusually high total counts
  • high mitochondrial fraction

Scatter plots

Plot relationships between QC metrics.

sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")

These views can reveal:

  • droplets with very low counts
  • potential doublets with unusually high counts and genes
  • damaged cells with high mitochondrial content

Summary statistics

It is also helpful to inspect numeric summaries.

adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe()

Step 7: Define practical QC thresholds

QC thresholds depend on:

  • tissue type
  • protocol
  • sequencing depth
  • species
  • experiment quality

For a basic tutorial, we can use a simple rule-based filter:

  • keep cells with at least 200 genes detected
  • remove cells with more than 2,500 genes detected
  • remove cells with 5% or more mitochondrial counts
  • keep genes detected in at least 3 cells

These values are common for small PBMC examples, but you should adapt them for real datasets.
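
One way to adapt thresholds to a dataset, which this tutorial does not otherwise use, is to flag cells that lie far from the median in units of the median absolute deviation (MAD). The sketch below is an illustrative alternative, not part of the workflow above. The helper name is_outlier, the cutoff of 5 MADs, and the example values are all assumptions.

```python
import numpy as np

def is_outlier(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad

# Made-up genes-per-cell values; the last cell is an obvious outlier.
n_genes = np.array([800, 820, 790, 810, 805, 795, 4000])
outliers = is_outlier(n_genes, n_mads=5.0)

print(outliers)
```

On real data you would pass a column such as adata.obs["n_genes_by_counts"]. Note that if the MAD is zero, every value that differs from the median is flagged, so check the result against the plots before filtering on it.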

Step 8: Filter low-quality cells

First, create a copy so you keep the original raw data unchanged.

adata_qc = adata.copy()

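Apply cell-level filters. Subsetting an AnnData object returns a view, so finish with .copy() to get an independent object before modifying it in later steps.

adata_qc = adata_qc[adata_qc.obs["n_genes_by_counts"] >= 200, :]
adata_qc = adata_qc[adata_qc.obs["n_genes_by_counts"] <= 2500, :]
adata_qc = adata_qc[adata_qc.obs["pct_counts_mt"] < 5, :].copy()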

Check how many cells remain:

print(f"Cells before QC: {adata.n_obs}")
print(f"Cells after cell filtering: {adata_qc.n_obs}")
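
It can also be useful to know how many cells each individual filter removes, not just the final total. The sketch below shows one way to do that with pandas. The obs table holds made-up values standing in for adata.obs, and the names filters and keep are illustrative.

```python
import pandas as pd

# Made-up QC values standing in for adata.obs.
obs = pd.DataFrame({
    "n_genes_by_counts": [150, 900, 3000, 1200, 1100],
    "pct_counts_mt": [2.0, 3.5, 1.0, 12.0, 4.9],
})

# One boolean Series per filter; True means the cell passes.
filters = {
    "min_genes": obs["n_genes_by_counts"] >= 200,
    "max_genes": obs["n_genes_by_counts"] <= 2500,
    "max_mt": obs["pct_counts_mt"] < 5,
}

for name, passed in filters.items():
    print(f"{name}: removes {int((~passed).sum())} cells")

# A cell is kept only if it passes every filter.
keep = pd.concat(filters, axis=1).all(axis=1)
print(f"Cells kept: {int(keep.sum())} of {len(obs)}")
```

A filter that removes a large share of the cells on its own is a sign that its threshold deserves a second look.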

Step 9: Filter lowly detected genes

Genes detected in only a handful of cells carry little signal and mostly add noise to early analysis steps. The call below keeps genes detected in at least 3 cells.

sc.pp.filter_genes(adata_qc, min_cells=3)

Check the updated shape:

print(f"Cells after QC: {adata_qc.n_obs}")
print(f"Genes after QC: {adata_qc.n_vars}")
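
To illustrate what a minimum-cells rule means, the sketch below applies the same criterion by hand to a tiny made-up matrix: a gene counts as detected in a cell when its count is above zero. The matrix is an assumption for illustration only.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 4 genes (columns). Values are made up.
counts = np.array([
    [1, 0, 0, 2],
    [3, 0, 0, 1],
    [0, 5, 0, 4],
    [2, 0, 0, 0],
])

# Number of cells in which each gene has at least one count.
n_cells_per_gene = (counts > 0).sum(axis=0)

# Keep genes detected in at least 3 cells.
keep_genes = n_cells_per_gene >= 3

print(n_cells_per_gene, keep_genes)
```

The second gene has the highest single count in the matrix but is seen in only one cell, so it is dropped. The rule looks at how many cells express a gene, not at how strongly.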

Step 10: Compare before and after filtering

It is good practice to replot QC metrics on the filtered data.

sc.pl.violin(
    adata_qc,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True
)

You can also compare summary statistics:

print("Before QC")
print(adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe())

print("\nAfter QC")
print(adata_qc.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe())

This helps confirm that low-quality cells were actually removed.
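
If you prefer the two summaries side by side in one table, pandas can concatenate them. The sketch below uses small made-up tables named before and after in place of adata.obs and adata_qc.obs.

```python
import pandas as pd

# Made-up QC values standing in for adata.obs.
before = pd.DataFrame({
    "n_genes_by_counts": [150, 900, 3000, 1200, 1100],
    "pct_counts_mt": [2.0, 3.5, 1.0, 12.0, 4.9],
})

# Apply the same thresholds as in the tutorial to get the "after" table.
after = before[
    (before["n_genes_by_counts"] >= 200)
    & (before["n_genes_by_counts"] <= 2500)
    & (before["pct_counts_mt"] < 5)
]

# The dict keys become the top level of the column index.
comparison = pd.concat(
    {"before": before.describe(), "after": after.describe()},
    axis=1,
)

print(comparison.loc[["count", "min", "max"]])
```

Looking at the count, min, and max rows is usually enough to confirm that the filters did what you intended.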

Step 11: Store the raw counts for later use

Before normalization and downstream analysis, it is often useful to preserve the filtered raw counts.

adata_qc.raw = adata_qc

This allows later access to the filtered but unnormalized expression matrix.

Step 12: Save the cleaned dataset

Finally, save the processed object to disk.

adata_qc.write("pbmc3k_qc_filtered.h5ad")

You can reload it later with:

adata_qc = sc.read_h5ad("pbmc3k_qc_filtered.h5ad")

Full Example Script

Here is the complete QC workflow in one place.

import scanpy as sc
import matplotlib.pyplot as plt

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=100, facecolor="white")

# Load data
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()

# Annotate mitochondrial genes
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# Calculate QC metrics
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt"],
    percent_top=None,
    log1p=False,
    inplace=True
)

# Plot QC metrics before filtering
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True
)
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")

# Copy and filter cells
adata_qc = adata.copy()
adata_qc = adata_qc[adata_qc.obs["n_genes_by_counts"] >= 200, :]
adata_qc = adata_qc[adata_qc.obs["n_genes_by_counts"] <= 2500, :]
adata_qc = adata_qc[adata_qc.obs["pct_counts_mt"] < 5, :].copy()  # copy: subsetting returns a view

# Filter genes
sc.pp.filter_genes(adata_qc, min_cells=3)

# Store filtered raw counts
adata_qc.raw = adata_qc

# Save result
adata_qc.write("pbmc3k_qc_filtered.h5ad")

print(f"Original shape: {adata.shape}")
print(f"Filtered shape: {adata_qc.shape}")

Common QC Tips

Use thresholds as starting points, not fixed rules

A threshold like 5% mitochondrial counts may work for one PBMC dataset but fail for another sample type. Always inspect plots first.

Very high counts can indicate doublets

Cells with unusually high total_counts and n_genes_by_counts may actually be two cells captured together.

Different organisms use different mitochondrial prefixes

Human data often uses MT-, while mouse data often uses mt-. If the prefix is wrong, mitochondrial percentages will be incorrect.

Keep a copy of unfiltered data

It is often useful to compare filtered and unfiltered results during troubleshooting.

Recap

In this tutorial, you learned how to perform basic scRNA-seq QC with scanpy by:

  • loading an example dataset
  • labeling mitochondrial genes
  • computing QC metrics
  • visualizing cell quality
  • filtering poor-quality cells and weak genes
  • saving a cleaned AnnData object

This gives you a solid foundation for the next steps in single-cell analysis, such as normalization, highly variable gene selection, dimensionality reduction, clustering, and marker detection.

FAQ

1. How do I choose the right QC thresholds?

Start by plotting n_genes_by_counts, total_counts, and pct_counts_mt. Choose thresholds based on the visible distribution and your experiment type. Thresholds that work for PBMCs are often not appropriate for other tissues or protocols.

2. Why are mitochondrial genes important for QC?

A high mitochondrial fraction can indicate damaged or stressed cells. When the cell membrane is compromised, cytoplasmic RNA may be lost while mitochondrial RNA remains relatively enriched.

3. Should I filter before or after normalization?

Basic cell and gene QC should usually happen before normalization. You want to normalize a dataset that already excludes obvious low-quality cells and extremely weak genes.