Scanpy scRNA-seq Normalization

Single-cell RNA-seq datasets usually contain large differences in library size between cells. If you compare raw counts directly, highly sequenced cells can dominate the analysis and obscure real biology. Normalization makes cells more comparable, which improves downstream steps such as highly variable gene selection, PCA, neighborhood graphs, clustering, and marker detection.

In Scanpy, the most common baseline is total-count normalization followed by log1p. But that is not your only option. In this tutorial, you will implement and compare several practical normalization strategies on the same dataset.

Summary

We will walk through these steps:

  1. Load a raw scRNA-seq dataset in Scanpy.
  2. Preserve raw counts in an AnnData layer.
  3. Run light QC and filtering.
  4. Apply multiple normalization strategies:
    • total-count normalization to CP10k + log1p
    • median-depth normalization + log1p
    • total-count normalization while excluding highly expressed genes
    • Pearson residual normalization for UMI data
  5. Compare the effect of each strategy.

Expected outcome: by the end, you will have several normalized AnnData objects and a clear sense of when each method is useful.

Requirements

  • Basic Python and pandas familiarity
  • Introductory understanding of scRNA-seq count matrices
  • Python environment with:
    • scanpy
    • anndata
    • numpy
    • pandas
    • matplotlib
    • scipy
  • No GPU required
  • A laptop or workstation with 8 GB+ RAM is enough for the example below
pip install scanpy anndata numpy pandas matplotlib scipy

Step 1: Import packages and configure Scanpy

We start with a minimal setup.

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sparse

sc.settings.verbosity = 2
sc.set_figure_params(figsize=(5, 4))

Step 2: Load raw data and preserve counts

A key rule in single-cell workflows is: never lose the raw counts. Some downstream steps expect log-transformed values, while others work best on counts.

Here we load the built-in PBMC dataset and save the original matrix into a layer called counts.

adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()

# Preserve raw counts before any transformation
adata.layers["counts"] = adata.X.copy()

print(adata)

Step 3: Run light QC and filtering

Normalization does not fix low-quality cells. It is better to remove obvious outliers first.

This example keeps the filtering simple:
– remove cells with very few detected genes
– remove genes seen in very few cells
– compute mitochondrial fraction
– remove cells with high mitochondrial content

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

print(adata)

Optional quick QC plots:

sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)

Step 4: Inspect library sizes before normalization

Before changing anything, it helps to look at how different the per-cell totals are.

def cell_sums(matrix):
    if sparse.issparse(matrix):
        return np.asarray(matrix.sum(axis=1)).ravel()
    return matrix.sum(axis=1)

raw_totals = cell_sums(adata.layers["counts"])

print(pd.Series(raw_totals).describe())

A quick histogram makes the imbalance obvious:

plt.hist(raw_totals, bins=50)
plt.xlabel("Raw counts per cell")
plt.ylabel("Number of cells")
plt.title("Library sizes before normalization")
plt.show()

Step 5: Strategy 1, total-count normalization to CP10k

This is the most common Scanpy workflow.

normalize_total rescales each cell so that all cells have the same total count. Setting target_sum=1e4 gives counts per 10,000. After that, log1p compresses the dynamic range and stabilizes variance for many downstream methods.

adata_cp10k = adata.copy()

sc.pp.normalize_total(adata_cp10k, target_sum=1e4)
adata_cp10k.layers["cp10k"] = adata_cp10k.X.copy()

sc.pp.log1p(adata_cp10k)

print(adata_cp10k)

Check that the normalized layer has roughly the same total per cell:

cp10k_totals = cell_sums(adata_cp10k.layers["cp10k"])
print(pd.Series(cp10k_totals).describe())

You should see totals close to 10000 for all cells.

A common next step is highly variable gene selection on the log-transformed matrix:

sc.pp.highly_variable_genes(
    adata_cp10k,
    flavor="seurat",
    n_top_genes=2000,
)

print(adata_cp10k.var["highly_variable"].sum())

Step 6: Strategy 2, median-depth normalization

If you leave target_sum=None, Scanpy rescales each cell to the dataset median total count instead of a fixed target like 10,000. This is still total-count normalization, but the final scale is tied to the dataset itself.

adata_median = adata.copy()

sc.pp.normalize_total(adata_median)
adata_median.layers["median_norm"] = adata_median.X.copy()

sc.pp.log1p(adata_median)

median_totals = cell_sums(adata_median.layers["median_norm"])
print(pd.Series(median_totals).describe())

This strategy is useful if you want normalization to stay close to the original dataset scale.

Step 7: Strategy 3, exclude highly expressed genes

Sometimes a small number of genes dominate the counts in some cells. That can distort the size factor used for normalization. Scanpy can estimate scaling factors while ignoring very highly expressed genes.

adata_excl = adata.copy()

sc.pp.normalize_total(
    adata_excl,
    target_sum=1e4,
    exclude_highly_expressed=True,
    max_fraction=0.05,
)
adata_excl.layers["cp10k_excl_high"] = adata_excl.X.copy()

sc.pp.log1p(adata_excl)

Inspect the totals:

excl_totals = cell_sums(adata_excl.layers["cp10k_excl_high"])
print(pd.Series(excl_totals).describe())

Use this when one or a few genes are taking up a large fraction of counts and you suspect the size factors are being skewed.

Step 8: Strategy 4, Pearson residual normalization

For UMI count data, Pearson residuals are a useful alternative to the classic normalize_total + log1p approach. In Scanpy, these functions live under sc.experimental.pp.

A practical pattern is:

  1. keep raw counts in a layer
  2. select highly variable genes from counts
  3. subset to those genes
  4. normalize with Pearson residuals
  5. continue with PCA
adata_pr = adata.copy()

# HVG selection from raw counts
sc.experimental.pp.highly_variable_genes(
    adata_pr,
    flavor="pearson_residuals",
    layer="counts",
    n_top_genes=2000,
)

adata_pr = adata_pr[:, adata_pr.var["highly_variable"]].copy()

# Move raw counts into X for residual normalization
adata_pr.X = adata_pr.layers["counts"].copy()

sc.experimental.pp.normalize_pearson_residuals(adata_pr)

# Continue to PCA
sc.pp.pca(adata_pr)

print(adata_pr)
print(adata_pr.obsm["X_pca"].shape)

This gives you a normalized matrix in adata_pr.X that is ready for downstream dimensionality reduction.

Step 9: Compare strategies side by side

Now let’s summarize how the per-cell totals differ across approaches.

comparison = pd.DataFrame(
    {
        "raw_counts": pd.Series(cell_sums(adata.layers["counts"])),
        "cp10k": pd.Series(cell_sums(adata_cp10k.layers["cp10k"])),
        "median_norm": pd.Series(cell_sums(adata_median.layers["median_norm"])),
        "cp10k_excl_high": pd.Series(cell_sums(adata_excl.layers["cp10k_excl_high"])),
    }
)

summary = comparison.describe().T[["min", "50%", "max", "mean", "std"]]
print(summary)

You can also visualize the distributions:

fig, axes = plt.subplots(1, 4, figsize=(16, 3))

for ax, column in zip(axes, comparison.columns):
    ax.hist(comparison[column], bins=40)
    ax.set_title(column)
    ax.set_xlabel("Cell total")

axes[0].set_ylabel("Number of cells")
plt.tight_layout()
plt.show()

How to choose in practice

A simple rule of thumb:

  • Use CP10k + log1p if you want a robust, standard workflow that works well with many Scanpy examples.
  • Use median-depth normalization if you prefer dataset-relative scaling.
  • Use exclude_highly_expressed=True if a few genes dominate library size estimates.
  • Use Pearson residuals for UMI data when you want a count-aware alternative and are comfortable using the experimental API.

Recap

Normalization is one of the most important preprocessing choices in single-cell RNA-seq.

In this tutorial, you learned how to:

  • preserve raw counts in adata.layers["counts"]
  • run basic QC before normalization
  • apply standard total-count normalization with log1p
  • use median-depth scaling
  • reduce the effect of dominating genes during scaling
  • apply Pearson residual normalization for UMI data

A strong default for many projects is still:

  1. keep raw counts in a layer
  2. sc.pp.normalize_total(..., target_sum=1e4)
  3. sc.pp.log1p(...)

Then compare against Pearson residuals if your downstream results are sensitive to preprocessing choices.

Further Reading

FAQ

1. Should I normalize before or after filtering?

Usually after basic filtering. Remove obvious low-quality cells and rarely detected genes first, then normalize the cleaned count matrix.

2. Do I always need log1p after normalization?

Not always. For the standard total-count workflow, yes, log1p is commonly used before HVG selection, PCA, and clustering. For Pearson residual workflows, the normalization step already produces a transformed matrix, so you do not add log1p on top.

3. Should I always use Pearson residuals for UMI data?

Not necessarily. Pearson residuals can work very well, but the classic CP10k + log1p pipeline remains a strong default and is often easier to compare across tutorials and teams. A good habit is to try both on a representative subset and check whether your major biological conclusions stay consistent.

Leave a Comment