Cell-Cell Communication in scRNA-seq

Why this matters

Cell-cell communication analysis helps you move from “which cells are present?” to “how might these cells influence each other?”. In scRNA-seq studies, receptor-ligand analysis is often the bridge between clustering and mechanism, because it highlights candidate signaling programs across cell populations. CellPhoneDB is useful here because it combines a curated interaction database with statistical testing across cell-type pairs, and it works directly with AnnData inputs such as .h5ad.

Summary

In this tutorial, you will:

  1. load a scRNA-seq dataset with Scanpy,
  2. perform basic QC and clustering,
  3. create a CellPhoneDB-compatible metadata table,
  4. export a normalized .h5ad count matrix,
  5. run CellPhoneDB statistical analysis from Python,
  6. rank and visualize significant receptor-ligand interactions.

The expected outcome is a small but practical workflow that produces clustered cells, CellPhoneDB input files, and interpretable outputs such as pvalues.csv and significant_means.csv for downstream prioritization.

Requirements

  • Skills: basic Python, pandas, and Scanpy familiarity.
  • Frameworks: Python, Scanpy, AnnData, pandas, seaborn, CellPhoneDB.
  • Input data: a human scRNA-seq dataset in .h5ad or 10x-style format, plus gene identifiers that match the mode you pass to CellPhoneDB. CellPhoneDB supports .h5ad, .h5, text matrices, or a 10x folder, with .h5ad recommended.
  • Infrastructure: a CPU-only machine is usually sufficient for this workflow; the documented execution model is standard Python plus thread-based parallelism, not GPU-specific. This is an inference from the official runtime and installation guidance.
  • Version note: current CellPhoneDB docs state Python 3.8+ support.

Step 1: Create the environment

Install the core packages first.

python -m venv sc-ccc-env
source sc-ccc-env/bin/activate

pip install scanpy anndata pandas numpy scipy matplotlib seaborn cellphonedb leidenalg python-igraph

CellPhoneDB is distributed as a Python package, and the official docs recommend working inside an isolated environment. You also need a local CellPhoneDB database zip file for the analysis call later in the tutorial.

Step 2: Load your scRNA-seq data

In this tutorial, I assume you already have an .h5ad file. If you start from 10x output, you can switch to sc.read_10x_mtx(...).

from pathlib import Path

import scanpy as sc
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sc.settings.verbosity = 2
sc.set_figure_params(figsize=(5, 5))

workdir = Path("cellphonedb_tutorial")
(workdir / "cpdb_input").mkdir(parents=True, exist_ok=True)
(workdir / "cpdb_output").mkdir(parents=True, exist_ok=True)

adata = sc.read_h5ad("data/my_scRNAseq.h5ad")
# Alternative:
# adata = sc.read_10x_mtx("data/filtered_feature_bc_matrix", var_names="gene_symbols", cache=True)

adata.var_names_make_unique()

# CellPhoneDB input advice:
# - avoid dashes in cell names
# - keep names synchronized across metadata and counts
adata.obs_names = adata.obs_names.str.replace("-", "_", regex=False)

# Preserve original counts
adata.layers["counts"] = adata.X.copy()

print(adata)
print(adata.obs.head())
print(adata.var.head())

CellPhoneDB accepts AnnData as .h5ad, which is the recommended large-input format. The docs also warn against dashes in cell names and against purely numeric cell-type names in the metadata, so it is worth fixing naming issues early. CellPhoneDB’s database is human-focused, so if your study is not human, convert genes to human ortholog identifiers before running the analysis.

Step 3: Run basic QC and preprocessing

Now create a working copy for clustering. We will keep the original object for export later.

adata_work = adata.copy()

# Annotate mitochondrial genes for common QC
adata_work.var["mt"] = adata_work.var_names.str.upper().str.startswith("MT-")

sc.pp.calculate_qc_metrics(adata_work, qc_vars=["mt"], inplace=True)

# Example QC filters — adjust for your dataset
adata_work = adata_work[
    (adata_work.obs["n_genes_by_counts"] > 200) &
    (adata_work.obs["total_counts"] > 500) &
    (adata_work.obs["pct_counts_mt"] < 15),
].copy()

# Standard Scanpy preprocessing for clustering
sc.pp.normalize_total(adata_work, target_sum=1e4)
sc.pp.log1p(adata_work)
sc.pp.highly_variable_genes(adata_work, n_top_genes=2000, flavor="seurat")
adata_work = adata_work[:, adata_work.var["highly_variable"]].copy()

sc.pp.scale(adata_work, max_value=10)
sc.tl.pca(adata_work, svd_solver="arpack")
sc.pp.neighbors(adata_work, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata_work)
sc.tl.leiden(adata_work, resolution=0.6, key_added="leiden")

This is the standard Scanpy pattern: depth normalization with normalize_total, log transform with log1p, feature selection with highly_variable_genes, then graph construction with neighbors and community detection with leiden. Scanpy documents these as the typical building blocks of clustering workflows.

Optionally inspect the embedding:

sc.pl.umap(adata_work, color=["leiden", "n_genes_by_counts", "pct_counts_mt"])

Step 4: Define the cell groups for CellPhoneDB

CellPhoneDB needs a metadata table mapping each cell to a cell type or cluster. For a first-pass analysis, Leiden clusters are fine. In a real study, you would often replace these with curated biological labels.

# Transfer cluster labels back to the full-gene object
adata = adata[adata_work.obs_names].copy()
adata.obs["leiden"] = adata_work.obs["leiden"].astype(str)

# Avoid purely numeric labels
adata.obs["cell_type"] = "cluster_" + adata.obs["leiden"]
adata.obs["cell_type"] = adata.obs["cell_type"].astype("category")

print(adata.obs["cell_type"].value_counts())

If you already have curated labels, use them instead:

# Example only
# adata.obs["cell_type"] = adata.obs["manual_annotation"].astype("category")

The CellPhoneDB metadata format is simple: one column for cell IDs and one for cell types. The official docs explicitly describe the meta file as the mapping between cells and annotated clusters or cell types.

Step 5: Export CellPhoneDB input files

A practical pattern is to write a normalized but non-log-transformed matrix for CellPhoneDB input, while keeping log-transformed data for Scanpy visualization and clustering. This mirrors the documentation example that normalizes counts before export. .h5ad is a supported input format, so we can stay fully inside the AnnData ecosystem.

# Build a CellPhoneDB export object from original counts
adata_cpdb = adata.copy()
adata_cpdb.X = adata_cpdb.layers["counts"].copy()

# Normalize counts for CellPhoneDB input
sc.pp.normalize_total(adata_cpdb, target_sum=1e4)

# Save h5ad counts file
counts_path = workdir / "cpdb_input" / "counts_normalized.h5ad"
adata_cpdb.write_h5ad(counts_path)

# Save metadata file
meta = adata_cpdb.obs[["cell_type"]].copy()
meta.index.name = "Cell"

meta_path = workdir / "cpdb_input" / "meta.tsv"
meta.to_csv(meta_path, sep="\t")

print(counts_path)
print(meta_path)
print(meta.head())

Choose the correct identifier mode for counts_data:

  • use "hgnc_symbol" if adata.var_names contains human gene symbols,
  • use "ensembl" if adata.var_names contains Ensembl gene IDs.

CellPhoneDB defaults to Ensembl-style interpretation unless told otherwise.

counts_data = "hgnc_symbol"

Step 6: Run CellPhoneDB statistical analysis

Now run the core receptor-ligand analysis from Python. You need a local CellPhoneDB database zip path.

from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

cpdb_file_path = "resources/cellphonedb.zip"
output_path = workdir / "cpdb_output"

cpdb_results = cpdb_statistical_analysis_method.call(
    cpdb_file_path=cpdb_file_path,
    meta_file_path=str(meta_path),
    counts_file_path=str(counts_path),
    counts_data=counts_data,
    output_path=str(output_path),
    iterations=1000,
    threshold=0.1,
    threads=8,
    debug_seed=42,
    pvalue=0.05,
    score_interactions=True,
)

statistical_analysis is the CellPhoneDB mode that estimates cell-type-specific receptor-ligand interactions by shuffling labels across cells. The docs describe threshold=0.1 as the default fraction of cells expressing a ligand or receptor required for testing, and this mode produces outputs such as pvalues.csv and significant_means.csv.

If you only want a quick exploratory pass without permutation testing, the non-statistical method is also available, but for most tutorials and biological prioritization I recommend starting with the statistical mode.

Step 7: Load and rank the results

The most commonly used files are:

  • pvalues.csv
  • significant_means.csv
  • means.csv
  • deconvoluted.csv

These are the core outputs documented for CellPhoneDB analyses.

means = pd.read_csv(output_path / "means.csv")
pvals = pd.read_csv(output_path / "pvalues.csv")
sig_means = pd.read_csv(output_path / "significant_means.csv")
deconv = pd.read_csv(output_path / "deconvoluted.csv")

print(sig_means.head())

Find available cell-type pair columns:

pair_cols = [c for c in sig_means.columns if "|" in c]
print(pair_cols[:10])

Inspect one pair of interest:

pair_col = pair_cols[0]   # replace with a biologically relevant pair if desired

top_hits = (
    pvals[["interacting_pair", pair_col]]
    .rename(columns={pair_col: "pvalue"})
    .merge(
        sig_means[["interacting_pair", pair_col]].rename(columns={pair_col: "mean_expr"}),
        on="interacting_pair",
        how="left",
    )
    .dropna(subset=["mean_expr"])
    .query("pvalue < 0.05")
    .sort_values(["pvalue", "mean_expr"], ascending=[True, False])
)

print(top_hits.head(20))

pvalues.csv stores significance per interaction and cell-type pair, while significant_means.csv stores the mean expression values for interactions that pass significance filtering. Also note that CellPhoneDB interactions are directional: A|B is not the same as B|A, because partner A is evaluated in the first group and partner B in the second.

Step 8: Visualize interaction burden between cell types

A very useful first plot is a heatmap of how many significant interactions exist between each ordered cell-type pair.

sig_counts = sig_means[pair_cols].notna().sum(axis=0).rename("n_significant").reset_index()
sig_counts.columns = ["pair", "n_significant"]

sig_counts[["source", "target"]] = sig_counts["pair"].str.split("|", expand=True, regex=False)

heatmap_df = sig_counts.pivot(index="source", columns="target", values="n_significant").fillna(0)

plt.figure(figsize=(8, 6))
sns.heatmap(heatmap_df, cmap="magma", annot=True, fmt=".0f")
plt.title("Number of significant interactions")
plt.xlabel("Receiver / partner B cell type")
plt.ylabel("Sender / partner A cell type")
plt.tight_layout()
plt.show()

This plot is simple but effective: it quickly tells you which cell populations behave as major senders or receivers in your dataset. Because the analysis is directional, the matrix is generally not symmetric.

Step 9: Prioritize interactions for a biological question

After the global heatmap, narrow the analysis to a hypothesis. For example:

  • stromal-to-immune signaling,
  • myeloid-to-T-cell suppression,
  • chemokine recruitment programs,
  • growth factor signaling in tumor microenvironments.

Here is a small gene-focused filter:

focus_terms = ["CXCL", "CCR", "TGFB", "IFNG", "IL6", "TNF"]

pattern = "|".join(focus_terms)

candidate_interactions = top_hits[
    top_hits["interacting_pair"].fillna("").str.upper().str.contains(pattern, regex=True)
]

print(candidate_interactions.head(30))

In practice, this final prioritization step is where biology matters most. The official output guidance recommends focusing on specific cell-type pairs and then manually reviewing interactions with stronger significance and expression support.

Recap

You now have a practical workflow for receptor-ligand analysis in scRNA-seq using Python, Scanpy, and CellPhoneDB:

  • preprocess and cluster cells with Scanpy,
  • define cell groups,
  • export CellPhoneDB-ready inputs,
  • run statistical receptor-ligand analysis,
  • inspect pvalues.csv and significant_means.csv,
  • summarize signaling structure with a heatmap.

For many projects, this is enough to produce a first shortlist of candidate communication axes for validation.

Further Reading

FAQ

1. Should I use raw counts or normalized counts for CellPhoneDB?

A common and practical approach is to keep raw counts in your AnnData object, then export a normalized but non-log-transformed matrix for CellPhoneDB. This is consistent with the documentation examples and lets you use one representation for clustering and another for communication analysis. CellPhoneDB also accepts .h5ad directly, so this export step is straightforward.

2. What if my dataset is from mouse?

CellPhoneDB’s curated interaction database is human-focused, so mouse data should be mapped to human ortholog identifiers before analysis. Also make sure your counts_data setting matches the identifiers stored in adata.var_names.

3. Why do celltypeA|celltypeB and celltypeB|celltypeA give different values?

Because the interaction model is directional. Partner A is evaluated in the first cell group and partner B in the second, so reversing the order changes the biological question and therefore the result.

Leave a Comment