GSEA for Single-Cell Data in Scanpy

Why this matters

Single-cell RNA-seq helps you find cell populations and marker genes, but marker lists alone do not always explain the biology. Gene set enrichment analysis, or GSEA, adds a functional layer by asking which pathways, processes, or signatures are overrepresented in each cluster. In practice, this is useful for cluster annotation, hypothesis generation, and comparing cell states across conditions.

In Python, a practical workflow is to use Scanpy for clustering and marker detection, then use GSEApy to run pathway analysis on cluster-level gene rankings.

Steps and expected outcome

In this tutorial, you will:

Load a small PBMC single-cell dataset in Scanpy.
Preprocess and cluster the cells.
Compute cluster marker genes.
Build a ranked gene list for one cluster.
Run preranked GSEA on the full ranking.
Run over-representation analysis on filtered upregulated markers.
Repeat the enrichment step across all clusters.

Expected outcome: a working Scanpy object with clusters, marker genes, and pathway-level summaries that help interpret each cluster biologically.

Requirements

Basic Python and pandas knowledge
Familiarity with AnnData and Scanpy concepts
Python environment with:
- scanpy
- gseapy
- pandas
- numpy
- matplotlib
- python-igraph
- leidenalg
A laptop or CPU server is enough
No GPU is required
Internet access is helpful if you use online gene set libraries

Step 1: Install the required packages

Use your preferred environment manager. For a simple pip setup:

pip install scanpy gseapy pandas numpy matplotlib python-igraph leidenalg

Step 2: Import libraries and configure plotting

This sets up the Python session for single-cell analysis and enrichment plots.

import scanpy as sc
import pandas as pd
import numpy as np
import gseapy as gp
import matplotlib.pyplot as plt

sc.settings.verbosity = 2
sc.set_figure_params(dpi=100, facecolor="white")

Step 3: Load a small single-cell dataset

We will use the built-in PBMC example dataset so the tutorial stays easy to reproduce.

adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()

print(adata)
adata.obs.head()

You should see an AnnData object with a few thousand cells and tens of thousands of genes.

Step 4: Run basic quality control and preprocessing

Here we remove obvious low-quality cells, normalize counts, log-transform the data, and prepare the feature space used for clustering.

# Mark mitochondrial genes
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

adata = adata[adata.obs["n_genes_by_counts"] < 2500, :].copy()
adata = adata[adata.obs["pct_counts_mt"] < 5, :].copy()

# Keep raw normalized expression for marker testing later
adata.layers["counts"] = adata.X.copy()

# Normalize and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Save the log-normalized full matrix
adata.raw = adata

# Select highly variable genes for clustering
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat")
adata = adata[:, adata.var["highly_variable"]].copy()

# Scale and reduce dimensionality
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

Plot the clusters:

sc.pl.umap(adata, color="leiden", legend_loc="on data", frameon=False)

Step 5: Compute marker genes for each cluster

This is the bridge between clustering and enrichment. We compute genes that distinguish each cluster from the others.

sc.tl.rank_genes_groups(
    adata,
    groupby="leiden",
    method="wilcoxon",
    use_raw=True
)

Inspect the top markers:

sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)

Convert one cluster’s marker table into a pandas DataFrame:

cluster_key = "leiden"
cluster_id = adata.obs[cluster_key].cat.categories[0]

markers = sc.get.rank_genes_groups_df(
    adata,
    group=cluster_id,
    key="rank_genes_groups"
).dropna()

markers.head(10)

Step 6: Build a ranked gene list for preranked GSEA

Preranked GSEA is often the better choice when you want to use the full marker ranking instead of a hard cutoff.

We will rank genes by the Scanpy test score for one cluster.

rank_df = (
    markers[["names", "scores"]]
    .rename(columns={"names": "gene", "scores": "score"})
    .dropna()
)

# Make gene symbols uppercase to improve matching with pathway libraries
rank_df["gene"] = rank_df["gene"].str.upper()

# Remove duplicated gene names if any
rank_df = rank_df.groupby("gene", as_index=False)["score"].max()

# Highest scores first
rank_df = rank_df.sort_values("score", ascending=False)

rank_df.head()

Step 7: Fetch a gene set library and run preranked GSEA

In this example, we pick the most recent available GO biological process library exposed by the enrichment service.

libraries = gp.get_library_name(organism="Human")
go_bp_library = sorted([x for x in libraries if x.startswith("GO_Biological_Process")])[-1]

print("Using gene set library:", go_bp_library)

go_bp_sets = gp.get_library(name=go_bp_library, organism="Human")

Now run preranked GSEA:

pre_res = gp.prerank(
    rnk=rank_df,
    gene_sets=go_bp_sets,
    outdir=None,
    min_size=10,
    max_size=500,
    permutation_num=1000,
    seed=42,
    threads=4,
    verbose=True
)

Check the result table:

pre_res.res2d.sort_values("FDR q-val").head(10)

Useful columns include:

Term
NES
FDR q-val
Lead_genes

A positive NES usually means the gene set is enriched near the top of your ranked list for that cluster.

Step 8: Plot the top GSEA results

A quick dot plot is often the easiest summary.

gp.dotplot(
    pre_res.res2d,
    column="FDR q-val",
    title=f"Cluster {cluster_id} preranked GSEA",
    cutoff=0.25,
    top_term=10,
    figsize=(4, 6),
    cmap="viridis_r"
)
plt.show()

You can also plot enrichment curves for the top terms:

top_terms = pre_res.res2d.sort_values("FDR q-val")["Term"].head(3).tolist()
axes = pre_res.plot(terms=top_terms)
plt.show()

Step 9: Run over-representation analysis on filtered marker genes

Sometimes you want a simpler workflow based on a selected marker list. For that, filter marker genes first, then run over-representation analysis.

sc.tl.filter_rank_genes_groups(
    adata,
    key="rank_genes_groups",
    groupby="leiden",
    key_added="rank_genes_groups_filtered",
    min_in_group_fraction=0.1,
    min_fold_change=0.25,
    max_out_group_fraction=0.5,
    use_raw=True
)

filtered_markers = sc.get.rank_genes_groups_df(
    adata,
    group=cluster_id,
    key="rank_genes_groups_filtered"
).dropna()

filtered_markers.head(10)

Create an upregulated marker list:

up_genes = (
    filtered_markers.loc[filtered_markers["logfoldchanges"] > 0, "names"]
    .str.upper()
    .drop_duplicates()
    .tolist()
)

len(up_genes), up_genes[:10]

Run enrichment:

ora_res = gp.enrichr(
    gene_list=up_genes,
    gene_sets=go_bp_library,
    organism="Human",
    outdir=None
)

Inspect results:

ora_res.res2d.sort_values("Adjusted P-value").head(10)

Plot them:

gp.barplot(
    ora_res.res2d,
    column="Adjusted P-value",
    title=f"Cluster {cluster_id} marker enrichment",
    cutoff=0.05,
    top_term=10,
    figsize=(5, 6),
    color="steelblue"
)
plt.show()

Step 10: Run enrichment for every cluster

The next logical step is to automate the same analysis across all clusters.

def run_cluster_prerank(adata, cluster_id, organism="Human"):
    markers = sc.get.rank_genes_groups_df(
        adata,
        group=cluster_id,
        key="rank_genes_groups"
    ).dropna()

    rank_df = (
        markers[["names", "scores"]]
        .rename(columns={"names": "gene", "scores": "score"})
        .dropna()
    )
    rank_df["gene"] = rank_df["gene"].str.upper()
    rank_df = rank_df.groupby("gene", as_index=False)["score"].max()
    rank_df = rank_df.sort_values("score", ascending=False)

    libraries = gp.get_library_name(organism=organism)
    go_bp_library = sorted(
        [x for x in libraries if x.startswith("GO_Biological_Process")]
    )[-1]
    go_bp_sets = gp.get_library(name=go_bp_library, organism=organism)

    pre_res = gp.prerank(
        rnk=rank_df,
        gene_sets=go_bp_sets,
        outdir=None,
        min_size=10,
        max_size=500,
        permutation_num=1000,
        seed=42,
        threads=4,
        verbose=False
    )
    return pre_res


all_results = []
for cid in adata.obs["leiden"].cat.categories:
    pre_res = run_cluster_prerank(adata, cid)
    top_hits = (
        pre_res.res2d
        .sort_values("FDR q-val")
        .head(5)
        .loc[:, ["Term", "NES", "FDR q-val", "Lead_genes"]]
        .copy()
    )
    top_hits["cluster"] = cid
    all_results.append(top_hits)

cluster_pathways = pd.concat(all_results, ignore_index=True)
cluster_pathways = cluster_pathways[["cluster", "Term", "NES", "FDR q-val", "Lead_genes"]]

cluster_pathways.head(20)

This gives you a compact table of the top pathways per cluster.

Step 11: Save results for downstream reporting

Saving both the Scanpy object and enrichment tables makes the workflow easier to reproduce later.

adata.write("pbmc3k_gsea_tutorial.h5ad")

pre_res.res2d.to_csv(f"cluster_{cluster_id}_prerank_gsea.csv", index=False)
ora_res.res2d.to_csv(f"cluster_{cluster_id}_ora.csv", index=False)
cluster_pathways.to_csv("all_cluster_top_pathways.csv", index=False)

Recap

You now have a complete Python workflow for pathway analysis on single-cell clusters:

Scanpy handled preprocessing, clustering, and marker detection
rank_genes_groups produced cluster-wise differential signals
GSEApy turned those gene-level results into pathway-level interpretation
preranked GSEA used the full ranking
over-representation analysis used a filtered marker list

For exploratory single-cell work, this is a strong default pattern because it is simple, scalable, and easy to adapt to your own datasets.

FAQ

1. Should I use preranked GSEA or over-representation analysis?

Use preranked GSEA when you want to keep information from the full gene ranking. Use over-representation analysis when you want a fast, simple summary based on a cutoff-selected marker list. In single-cell tutorials, it is common to run both and compare them.

2. Why do my pathway results look noisy or overly significant?

This often happens because single-cell differential expression can treat cells as independent observations. That is useful for exploratory marker discovery, but it can overstate significance. For stronger inferential analyses across biological replicates, move to sample-aware or pseudobulk strategies.

3. Why are some expected pathways missing?

Common reasons are:
– gene identifiers do not match the pathway library
– gene symbols are not uppercase
– the cluster has weak or diffuse marker structure
– the selected gene set library is not a good fit for your biology
– the filtering thresholds are too strict

A good first fix is to verify gene naming, try a different library family, and compare preranked GSEA with marker-list enrichment.