Why this matters
Single-cell TCR sequencing is great at finding exact clonotypes, but exact matches are only part of the story. In many datasets, T cells with related antigen specificity can carry different yet biochemically similar receptors. That is where TCRdist becomes useful.
Instead of asking only whether two TCRs are identical, TCRdist asks how similar they are across the receptor regions that matter most for recognition. Starting from 10x Genomics V(D)J output, you can use TCRdist to move from exact clonotypes to similarity-based TCR clusters that are often easier to interpret biologically.
This is especially helpful when you want to:
- detect convergent immune responses
- group related clonotypes beyond exact sequence identity
- compare TCR structure across cell states
- project TCR similarity clusters back onto a single-cell atlas
What you will build
In this tutorial, you will go from a 10x Genomics V(D)J result folder to:
- a clean paired TRA/TRB clonotype table
- a pairwise TCRdist matrix
- similarity-based clonotype clusters
- a simple 2D cluster visualization
- an optional per-cell cluster annotation file you can join back to your single-cell object
TCRdist in one minute
TCRdist is a distance metric for TCRs. Smaller values mean two receptors are more similar. Unlike plain edit distance on CDR3 alone, TCRdist combines information from multiple receptor regions: the CDR3 loop plus the CDR1, CDR2, and CDR2.5 loops, which are determined by the V gene. In practice, that makes it better suited to grouping receptors that may recognize similar targets even when they are not exact sequence matches.
A useful mental model is:
- Cell Ranger clonotype ID = exact or near-exact grouping from the reconstruction pipeline
- TCRdist cluster = neighborhood of biologically similar receptors
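To make the contrast between exact matching and similarity concrete, here is a toy sketch. It uses a plain position-wise mismatch count between equal-length CDR3 strings, which is not the TCRdist metric (TCRdist scores substitutions and gaps across several loops). The sequences and the `mismatch_count` helper are invented for illustration.

```python
def mismatch_count(a: str, b: str) -> int:
    """Count differing positions between two equal-length strings (toy stand-in, not TCRdist)."""
    if len(a) != len(b):
        raise ValueError("toy metric only handles equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical CDR3 beta sequences, invented for this example
cdr3s = ["CASSLGQAYEQYF", "CASSLGQSYEQYF", "CASRPDRGNTEAF"]

# Exact matching treats the first two as simply "different"...
exact = cdr3s[0] == cdr3s[1]

# ...while a distance shows that one pair is much closer than the other
d_close = mismatch_count(cdr3s[0], cdr3s[1])
d_far = mismatch_count(cdr3s[0], cdr3s[2])
print(exact, d_close, d_far)
```

The first two sequences are not identical, yet they differ at a single position, which is the kind of relationship a similarity-based cluster is meant to capture.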
Requirements
- Basic Python and pandas
- Familiarity with 10x Genomics V(D)J outputs
- A CPU-only machine is enough; no GPU is required
- The dense workflow below works best for small to medium repertoires
- Python packages:
pandas, numpy, scipy, matplotlib, seaborn, umap-learn, networkx (optional), tcrdist3
Step 1: Install the packages
For a simple local setup, install the required Python packages first.
python -m pip install pandas numpy scipy matplotlib seaborn umap-learn networkx tcrdist3
If you already work in a notebook environment, restart the kernel after installation.
Step 2: Start from the right 10x file
For TCR clustering, the most practical starting point is usually filtered_contig_annotations.csv, not the high-level clonotype summary file. The contig table keeps per-cell barcode information and chain-level annotations, which makes it easier to build paired TRA/TRB receptors and later map results back to single cells.
from pathlib import Path
import pandas as pd
vdj_dir = Path("outs/vdj_t")
contigs = pd.read_csv(vdj_dir / "filtered_contig_annotations.csv")
print(contigs.shape)
print(contigs.columns.tolist())
contigs.head()
What to check here
Make sure you can see columns like:
barcode, chain, v_gene, j_gene, cdr3, cdr3_nt, productive, full_length, reads, umis
Some output versions may include extra columns, which is fine.
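If you prefer an explicit check over reading the printed column list, a small helper along these lines can report what is missing. This is a minimal sketch: `REQUIRED_COLUMNS`, `missing_columns`, and the example header are hypothetical names chosen here, and the required set is just the columns this tutorial uses later.

```python
# Columns this tutorial relies on in later steps
REQUIRED_COLUMNS = ["barcode", "chain", "v_gene", "j_gene", "cdr3", "cdr3_nt"]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their original order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Hypothetical header, standing in for contigs.columns.tolist()
example_header = ["barcode", "chain", "v_gene", "j_gene", "cdr3", "umis"]
print(missing_columns(example_header))
```

In the real workflow you would pass `contigs.columns` and stop early if the returned list is not empty.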
Step 3: Keep productive full-length TCR contigs
Now keep only productive, full-length TCR alpha and beta chains.
import numpy as np
def as_bool(series):
    return series.astype(str).str.lower().isin(["true", "1", "t", "yes"])

tcr = contigs.copy()
keep = tcr["chain"].isin(["TRA", "TRB"])
if "productive" in tcr.columns:
    keep &= as_bool(tcr["productive"])
if "full_length" in tcr.columns:
    keep &= as_bool(tcr["full_length"])
if "high_confidence" in tcr.columns:
    keep &= as_bool(tcr["high_confidence"])
tcr = tcr.loc[keep].copy()
print(tcr.shape)
tcr["chain"].value_counts()
Why this filter matters
TCRdist works best on clean receptor calls. Filtering early reduces noise from incomplete or low-confidence contigs.
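The `as_bool` helper exists because the flag columns may arrive as real booleans or as strings with varying capitalization, depending on how the file was written and read. A small sketch on made-up rows shows the combined filter treating both forms the same way; the `toy` table is hypothetical.

```python
import pandas as pd

def as_bool(series):
    # Normalize booleans and string flags to a boolean mask
    return series.astype(str).str.lower().isin(["true", "1", "t", "yes"])

# Made-up contig rows mixing boolean and string flags
toy = pd.DataFrame({
    "chain": ["TRA", "TRB", "TRB", "Multi"],
    "productive": [True, "true", "False", "True"],
    "full_length": ["True", "True", "True", "True"],
})

keep = (
    toy["chain"].isin(["TRA", "TRB"])
    & as_bool(toy["productive"])
    & as_bool(toy["full_length"])
)
kept = toy.loc[keep]
print(kept)
```

Only the first two rows survive: the third is not productive and the fourth is neither a TRA nor a TRB chain.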
Step 4: Pick one alpha and one beta chain per cell
Single cells can have multiple TRA or TRB contigs. For a first-pass tutorial, we will keep the top-ranked chain per barcode based on UMI count, then read count.
This is a simplification, but it works well for a clean introductory workflow.
sort_cols = ["barcode", "chain"]
ascending = [True, True]
for col in ["umis", "reads"]:
    if col in tcr.columns:
        sort_cols.append(col)
        ascending.append(False)

best = (
    tcr.sort_values(sort_cols, ascending=ascending)
    .drop_duplicates(subset=["barcode", "chain"], keep="first")
    .copy()
)
if "sample" not in best.columns:
    best["sample"] = "sample1"
best["cell_id"] = best["sample"].astype(str) + ":" + best["barcode"].astype(str)

alpha = (
    best.loc[best["chain"] == "TRA", [
        "sample", "barcode", "cell_id",
        "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_a_gene_raw",
        "j_gene": "j_a_gene_raw",
        "cdr3": "cdr3_a_aa",
        "cdr3_nt": "cdr3_a_nucseq",
    })
)
beta = (
    best.loc[best["chain"] == "TRB", [
        "sample", "barcode", "cell_id",
        "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_b_gene_raw",
        "j_gene": "j_b_gene_raw",
        "cdr3": "cdr3_b_aa",
        "cdr3_nt": "cdr3_b_nucseq",
    })
)

cells = alpha.merge(
    beta,
    on=["sample", "barcode", "cell_id"],
    how="inner"
).copy()
print(cells.shape)
cells.head()
What happened
You now have one paired TRA/TRB receptor per cell. Cells without both chains are excluded from the paired analysis.
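Because the inner merge drops unpaired cells silently, it can be worth counting how many cells fall into each category. One way to do that, sketched here on made-up data, is an outer merge with pandas' `indicator=True`. The `toy_alpha` and `toy_beta` frames are hypothetical stand-ins for the `alpha` and `beta` tables.

```python
import pandas as pd

# Made-up per-chain tables: one cell has only alpha, one has only beta
toy_alpha = pd.DataFrame({
    "cell_id": ["s1:AAA", "s1:CCC", "s1:GGG"],
    "cdr3_a_aa": ["CAVRDF", "CAASNF", "CALSEF"],
})
toy_beta = pd.DataFrame({
    "cell_id": ["s1:AAA", "s1:GGG", "s1:TTT"],
    "cdr3_b_aa": ["CASSLF", "CASSPF", "CASRGF"],
})

# indicator=True adds a "_merge" column saying where each row came from
audit = toy_alpha.merge(toy_beta, on="cell_id", how="outer", indicator=True)
status = audit["_merge"].astype(str).map({
    "both": "paired",
    "left_only": "alpha_only",
    "right_only": "beta_only",
})
counts = status.value_counts().to_dict()
print(counts)
```

A large share of alpha-only or beta-only cells is a hint to revisit the filters before interpreting the paired clusters.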
Step 5: Normalize V and J gene names for TCRdist
A common gotcha is gene naming. TCRdist expects IMGT-like gene names with alleles, such as TRBV7-9*01. In many 10x outputs, the allele is omitted. A practical fix is to append *01 when the allele is missing.
def add_default_allele(value):
    if pd.isna(value):
        return value
    value = str(value).strip()
    if value == "":
        return value
    return value if "*" in value else f"{value}*01"

for raw_col, clean_col in [
    ("v_a_gene_raw", "v_a_gene"),
    ("j_a_gene_raw", "j_a_gene"),
    ("v_b_gene_raw", "v_b_gene"),
    ("j_b_gene_raw", "j_b_gene"),
]:
    cells[clean_col] = cells[raw_col].map(add_default_allele)

required = [
    "v_a_gene", "j_a_gene", "cdr3_a_aa", "cdr3_a_nucseq",
    "v_b_gene", "j_b_gene", "cdr3_b_aa", "cdr3_b_nucseq"
]
cells = cells.dropna(subset=required).copy()
print(cells[required].head())
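A quick look at how the allele rule behaves on a few inputs can prevent surprises. The sketch below repeats the helper so it runs on its own; the example gene names are illustrative, and the appended *01 is a default guess, not the true allele.

```python
import pandas as pd

def add_default_allele(value):
    if pd.isna(value):
        return value
    value = str(value).strip()
    if value == "":
        return value
    return value if "*" in value else f"{value}*01"

# Illustrative inputs: missing allele, existing allele, stray whitespace, empty, missing
examples = ["TRBV7-9", "TRBV7-9*01", " TRAJ42 ", "", None]
result = [add_default_allele(v) for v in examples]
print(result)
```

Note that an empty string passes through unchanged and is not treated as missing by `dropna`, so if your table can contain empty gene names you may want to convert them to missing values first.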
Why include nucleotide CDR3 columns
If two receptors have the same amino acid sequence but different nucleotide rearrangements, keeping the nucleotide CDR3 columns in the grouping key preserves that distinction: they remain separate rows when cells are collapsed into clonotypes in the next step.
Step 6: Create a unique paired clonotype table
TCRdist works on a table of receptors. Since we want clonotype-level clustering, we will collapse identical paired receptors and keep a count column.
tcr_cols = [
    "v_a_gene", "j_a_gene", "cdr3_a_aa", "cdr3_a_nucseq",
    "v_b_gene", "j_b_gene", "cdr3_b_aa", "cdr3_b_nucseq"
]
clones = (
    cells.groupby(tcr_cols, dropna=False)
    .agg(
        count=("cell_id", "size"),
        cell_ids=("cell_id", lambda x: ";".join(sorted(x)))
    )
    .reset_index()
)
print(clones.shape)
clones.head()
Expected result
Each row is now one unique paired TCR clonotype, with:
- paired alpha and beta sequence features
- the number of cells carrying that receptor
- the original cell IDs for mapping back later
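The collapse can be checked on a tiny made-up table. This sketch uses only two key columns for brevity (the tutorial groups on eight), and the `toy_cells` data are invented. The useful invariant is that the counts add up to the number of input cells.

```python
import pandas as pd

# Made-up paired cells: A and B share a receptor, C and D are unique
toy_cells = pd.DataFrame({
    "cell_id": ["s1:A", "s1:B", "s1:C", "s1:D"],
    "cdr3_a_aa": ["CAVRDF", "CAVRDF", "CAASNF", "CAVRDF"],
    "cdr3_b_aa": ["CASSLF", "CASSLF", "CASSPF", "CASSQF"],
})
key_cols = ["cdr3_a_aa", "cdr3_b_aa"]

toy_clones = (
    toy_cells.groupby(key_cols)
    .agg(
        count=("cell_id", "size"),
        cell_ids=("cell_id", lambda x: ";".join(sorted(x))),
    )
    .reset_index()
)

# Every cell should be counted exactly once across clonotypes
total_ok = toy_clones["count"].sum() == len(toy_cells)
print(toy_clones)
print(total_ok)
```

Cell D shares its alpha CDR3 with A and B but has a different beta CDR3, so it stays a separate clonotype.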
Step 7: Compute paired TRA/TRB TCRdist
Now build a TCRrep object and compute pairwise distances.
from tcrdist.repertoire import TCRrep
import numpy as np
organism = "human" # change to "mouse" if needed
tr = TCRrep(
    cell_df=clones.copy(),
    organism=organism,
    chains=["alpha", "beta"],
    compute_distances=True,
    db_file="alphabeta_gammadelta_db.tsv"
)
pw_alpha = np.asarray(tr.pw_alpha).copy()
pw_beta = np.asarray(tr.pw_beta).copy()
pw_alpha[pw_alpha < 0] = 0
pw_beta[pw_beta < 0] = 0
pw_tcr = pw_alpha + pw_beta
print(pw_tcr.shape)
print(pw_tcr[:5, :5])
What this matrix means
- pw_tcr[i, j] is the paired TRA/TRB TCRdist between clonotype i and clonotype j
- smaller numbers mean more similar receptors
- the diagonal should be zero
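These properties can be checked programmatically before clustering. The sketch below runs on a made-up 3x3 matrix; `check_distance_matrix` is a hypothetical helper written for this example, not part of tcrdist3.

```python
import numpy as np

def check_distance_matrix(dist):
    """Report basic properties a pairwise distance matrix should have."""
    dist = np.asarray(dist)
    return {
        "square": dist.ndim == 2 and dist.shape[0] == dist.shape[1],
        "symmetric": bool(np.array_equal(dist, dist.T)),
        "zero_diagonal": bool(np.all(np.diag(dist) == 0)),
        "non_negative": bool(np.all(dist >= 0)),
    }

# Made-up distances between three clonotypes
toy = np.array([
    [0, 12, 90],
    [12, 0, 84],
    [90, 84, 0],
])
report = check_distance_matrix(toy)
print(report)

# Breaking the symmetry on purpose is detected
bad = toy.copy()
bad[0, 1] = 13
bad_report = check_distance_matrix(bad)
```

In the real workflow you would pass `pw_tcr` and investigate any property that comes back false.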
Practical note on scale
This dense workflow is well suited to tutorial-sized datasets of up to a few thousand unique paired clonotypes, because the full matrix grows with the square of the number of clonotypes. For very large repertoires, you will usually want a sparse or chunked workflow.
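A back-of-the-envelope estimate shows why scale matters. The sketch assumes 8 bytes per value (a float64 matrix), which is an assumption made here; the arrays you actually get may use a smaller integer type, and the workflow above holds the alpha, beta, and summed matrices at the same time.

```python
def dense_matrix_megabytes(n_clonotypes: int, bytes_per_value: int = 8) -> float:
    """Approximate size of one dense n x n matrix in megabytes (1 MB = 1e6 bytes)."""
    return n_clonotypes ** 2 * bytes_per_value / 1e6

# Memory grows with the square of the number of clonotypes
for n in (1_000, 5_000, 50_000):
    print(n, dense_matrix_megabytes(n), "MB")
```

Under these assumptions a ten-fold increase in clonotypes costs a hundred-fold increase in memory, which is what pushes large repertoires toward sparse or chunked approaches.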
Step 8: Turn the distance matrix into clusters
There is no single universal cutoff for TCRdist clustering. For a first pass, a fixed threshold is often good enough to explore the structure of your data.
Below, we use average-linkage hierarchical clustering with an example distance cutoff of 40.
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster
condensed = squareform(pw_tcr, checks=False)
Z = linkage(condensed, method="average")
distance_threshold = 40
cluster_labels = fcluster(Z, t=distance_threshold, criterion="distance")
clone_df = tr.clone_df.copy()
clone_df["tcrdist_cluster"] = cluster_labels
cluster_sizes = (
    clone_df.groupby("tcrdist_cluster")["count"]
    .sum()
    .rename("cluster_cells")
    .reset_index()
)
clone_df = clone_df.merge(cluster_sizes, on="tcrdist_cluster", how="left")

summary = (
    clone_df.groupby("tcrdist_cluster")
    .agg(
        n_unique_clones=("tcrdist_cluster", "size"),
        n_cells=("count", "sum"),
        example_alpha=("cdr3_a_aa", "first"),
        example_beta=("cdr3_b_aa", "first"),
    )
    .sort_values("n_cells", ascending=False)
)
print(summary.head(10))
How to interpret this
A cluster here is a group of clonotypes that are close in paired TCRdist space, not necessarily exact sequence matches.
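To see what the cutoff does, it helps to run the same SciPy calls on a matrix small enough to read by eye. The four-clonotype matrix below is made up: two tight pairs that are far from each other.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up distances: clonotypes 0/1 are close, 2/3 are close, the pairs are far apart
toy = np.array([
    [0, 10, 100, 110],
    [10, 0, 105, 100],
    [100, 105, 0, 20],
    [110, 100, 20, 0],
], dtype=float)

Z_toy = linkage(squareform(toy), method="average")

# At a cutoff of 40 the two pairs stay separate
labels_40 = fcluster(Z_toy, t=40, criterion="distance")

# At a cutoff of 150 everything merges into one cluster
labels_150 = fcluster(Z_toy, t=150, criterion="distance")
print(labels_40, labels_150)
```

With average linkage, the two pairs only merge once the cutoff exceeds their mean between-pair distance (103.75 in this toy matrix), so the cutoff directly controls how far apart two groups can be and still share a label.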
Step 9: Visualize TCR clusters in 2D
A simple and effective view is a UMAP embedding based on the precomputed TCRdist matrix.
import umap.umap_ as umap
import seaborn as sns
import matplotlib.pyplot as plt
reducer = umap.UMAP(
    metric="precomputed",
    n_neighbors=min(15, max(2, pw_tcr.shape[0] - 1)),
    min_dist=0.35,
    random_state=42,
)
embedding = reducer.fit_transform(pw_tcr)

plot_df = clone_df.copy()
plot_df["UMAP1"] = embedding[:, 0]
plot_df["UMAP2"] = embedding[:, 1]

top_clusters = (
    plot_df["tcrdist_cluster"]
    .value_counts()
    .head(12)
    .index
)
plot_df["cluster_plot"] = plot_df["tcrdist_cluster"].astype(str)
plot_df.loc[
    ~plot_df["tcrdist_cluster"].isin(top_clusters),
    "cluster_plot"
] = "other"

plt.figure(figsize=(9, 7))
sns.scatterplot(
    data=plot_df,
    x="UMAP1",
    y="UMAP2",
    hue="cluster_plot",
    size="count",
    sizes=(30, 300),
    palette="tab20",
    alpha=0.85,
    linewidth=0
)
plt.title("TCRdist clusters of paired TRA/TRB clonotypes")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
What to look for
You will usually see:
- dense islands of very similar clonotypes
- larger points for expanded clonotypes
- clusters that may cut across exact Cell Ranger clonotype IDs
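The last point can be quantified by counting how many distinct Cell Ranger clonotype IDs end up in each TCRdist cluster. This sketch assumes you carried the `raw_clonotype_id` column from the contig table into your per-cell table, which the steps above do not do; the `toy_cells` data are made up.

```python
import pandas as pd

# Made-up per-cell table with both kinds of labels
toy_cells = pd.DataFrame({
    "cell_id": ["s1:A", "s1:B", "s1:C", "s1:D", "s1:E"],
    "raw_clonotype_id": ["clonotype1", "clonotype1", "clonotype2", "clonotype3", "clonotype3"],
    "tcrdist_cluster": [1, 1, 1, 2, 2],
})

# Number of distinct exact clonotypes inside each similarity cluster
per_cluster = toy_cells.groupby("tcrdist_cluster")["raw_clonotype_id"].nunique()

# Clusters that group more than one exact clonotype
merged_clusters = per_cluster[per_cluster > 1].index.tolist()
print(per_cluster.to_dict(), merged_clusters)
```

Here cluster 1 brings together two exact clonotypes, while cluster 2 corresponds to a single one.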
Step 10: Add a heatmap for cluster structure
A reordered heatmap is helpful when you want to see block-like structure in the distance matrix.
from scipy.cluster.hierarchy import leaves_list
order = leaves_list(Z)
ordered = pw_tcr[order][:, order]
heatmap_n = min(250, ordered.shape[0])
plt.figure(figsize=(8, 7))
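sns.heatmap(
    ordered[:heatmap_n, :heatmap_n],
    cmap="viridis",
    square=True,
    cbar_kws={"label": "TCRdist"}
)
# The subset is the first heatmap_n clonotypes in dendrogram order, not the most expanded ones
plt.title(f"TCRdist heatmap (first {heatmap_n} clonotypes in dendrogram order)")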
plt.xlabel("Clonotypes")
plt.ylabel("Clonotypes")
plt.tight_layout()
plt.show()
Tip
For very large repertoires, plot only the top expanded clonotypes or one cluster at a time.
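Selecting the most expanded clonotypes comes down to picking row indices by count and slicing the matrix on both axes. A minimal sketch with made-up values follows; `top_expanded_submatrix` is a hypothetical helper name.

```python
import numpy as np

def top_expanded_submatrix(dist, counts, n_top):
    """Return indices of the n_top largest counts and the matching square submatrix."""
    counts = np.asarray(counts)
    # Stable sort on negated counts: largest first, ties keep their original order
    idx = np.argsort(-counts, kind="stable")[:n_top]
    # np.ix_ selects the same rows and columns so the result stays square
    return idx, np.asarray(dist)[np.ix_(idx, idx)]

# Made-up distances and cell counts for four clonotypes
toy_dist = np.array([
    [0, 30, 80, 95],
    [30, 0, 70, 60],
    [80, 70, 0, 15],
    [95, 60, 15, 0],
])
toy_counts = [1, 12, 3, 7]

idx, sub = top_expanded_submatrix(toy_dist, toy_counts, n_top=2)
print(idx, sub)
```

In the tutorial's variables this would correspond to passing `pw_tcr` and `clone_df["count"]`, then plotting the returned submatrix.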
Step 11: Map cluster labels back to cells
To use the result in your single-cell analysis, merge cluster labels back to the original paired cells.
cluster_map = clone_df[tcr_cols + ["tcrdist_cluster", "cluster_cells"]].copy()
cell_clusters = cells.merge(
    cluster_map,
    on=tcr_cols,
    how="left"
)
cell_clusters[[
    "cell_id",
    "sample",
    "barcode",
    "tcrdist_cluster",
    "cluster_cells",
    "cdr3_a_aa",
    "cdr3_b_aa"
]].head()
Save it for reuse:
cell_clusters.to_csv("tcrdist_cluster_by_cell.tsv", sep="\t", index=False)
clone_df.to_csv("tcrdist_cluster_by_clonotype.tsv", sep="\t", index=False)
If you use Scanpy, you can join the cell-level table into adata.obs by barcode or cell_id.
# Example only if your AnnData obs index matches cell_id
# import scanpy as sc
# adata = sc.read_h5ad("t_cells.h5ad")
# adata.obs = adata.obs.join(
#     cell_clusters.set_index("cell_id")[["tcrdist_cluster", "cluster_cells"]],
#     how="left"
# )
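Before joining labels into a larger object, it can be useful to confirm that the mapping neither duplicated cells nor left any without a cluster, for example if a clonotype was dropped upstream. The sketch below uses pandas' `validate="many_to_one"` option on made-up tables; `toy_cells` and `toy_map` are hypothetical stand-ins with only two key columns.

```python
import pandas as pd

# Made-up cells and a clonotype-to-cluster map
toy_cells = pd.DataFrame({
    "cell_id": ["s1:A", "s1:B", "s1:C"],
    "cdr3_a_aa": ["CAVRDF", "CAVRDF", "CAASNF"],
    "cdr3_b_aa": ["CASSLF", "CASSLF", "CASSPF"],
})
toy_map = pd.DataFrame({
    "cdr3_a_aa": ["CAVRDF", "CAASNF"],
    "cdr3_b_aa": ["CASSLF", "CASSPF"],
    "tcrdist_cluster": [1, 2],
})

# validate raises MergeError if a receptor appears more than once in the map
merged = toy_cells.merge(
    toy_map,
    on=["cdr3_a_aa", "cdr3_b_aa"],
    how="left",
    validate="many_to_one",
)

# Cells whose receptor was not found in the map end up with a missing label
n_unlabeled = int(merged["tcrdist_cluster"].isna().sum())
print(len(merged), n_unlabeled)
```

A row count equal to the number of input cells and zero unlabeled cells indicate a clean mapping.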
Full minimal script
If you want the whole workflow in one place, here is a compact version.
from pathlib import Path
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster
def as_bool(series):
    return series.astype(str).str.lower().isin(["true", "1", "t", "yes"])

def add_default_allele(value):
    if pd.isna(value):
        return value
    value = str(value).strip()
    if value == "":
        return value
    return value if "*" in value else f"{value}*01"

vdj_dir = Path("outs/vdj_t")
contigs = pd.read_csv(vdj_dir / "filtered_contig_annotations.csv")

keep = contigs["chain"].isin(["TRA", "TRB"])
if "productive" in contigs.columns:
    keep &= as_bool(contigs["productive"])
if "full_length" in contigs.columns:
    keep &= as_bool(contigs["full_length"])
if "high_confidence" in contigs.columns:
    keep &= as_bool(contigs["high_confidence"])
tcr = contigs.loc[keep].copy()

sort_cols = ["barcode", "chain"]
ascending = [True, True]
for col in ["umis", "reads"]:
    if col in tcr.columns:
        sort_cols.append(col)
        ascending.append(False)

best = (
    tcr.sort_values(sort_cols, ascending=ascending)
    .drop_duplicates(["barcode", "chain"], keep="first")
    .copy()
)
if "sample" not in best.columns:
    best["sample"] = "sample1"
best["cell_id"] = best["sample"].astype(str) + ":" + best["barcode"].astype(str)

alpha = (
    best.loc[best["chain"] == "TRA", [
        "sample", "barcode", "cell_id", "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_a_gene_raw",
        "j_gene": "j_a_gene_raw",
        "cdr3": "cdr3_a_aa",
        "cdr3_nt": "cdr3_a_nucseq",
    })
)
beta = (
    best.loc[best["chain"] == "TRB", [
        "sample", "barcode", "cell_id", "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_b_gene_raw",
        "j_gene": "j_b_gene_raw",
        "cdr3": "cdr3_b_aa",
        "cdr3_nt": "cdr3_b_nucseq",
    })
)
cells = alpha.merge(beta, on=["sample", "barcode", "cell_id"], how="inner").copy()

for raw_col, clean_col in [
    ("v_a_gene_raw", "v_a_gene"),
    ("j_a_gene_raw", "j_a_gene"),
    ("v_b_gene_raw", "v_b_gene"),
    ("j_b_gene_raw", "j_b_gene"),
]:
    cells[clean_col] = cells[raw_col].map(add_default_allele)

tcr_cols = [
    "v_a_gene", "j_a_gene", "cdr3_a_aa", "cdr3_a_nucseq",
    "v_b_gene", "j_b_gene", "cdr3_b_aa", "cdr3_b_nucseq"
]
cells = cells.dropna(subset=tcr_cols).copy()

clones = (
    cells.groupby(tcr_cols, dropna=False)
    .agg(
        count=("cell_id", "size"),
        cell_ids=("cell_id", lambda x: ";".join(sorted(x)))
    )
    .reset_index()
)

tr = TCRrep(
    cell_df=clones.copy(),
    organism="human",
    chains=["alpha", "beta"],
    compute_distances=True,
    db_file="alphabeta_gammadelta_db.tsv"
)
pw_tcr = np.asarray(tr.pw_alpha) + np.asarray(tr.pw_beta)
Z = linkage(squareform(pw_tcr, checks=False), method="average")
cluster_labels = fcluster(Z, t=40, criterion="distance")
clone_df = tr.clone_df.copy()
clone_df["tcrdist_cluster"] = cluster_labels
cluster_map = clone_df[tcr_cols + ["tcrdist_cluster"]].copy()
cell_clusters = cells.merge(cluster_map, on=tcr_cols, how="left")
clone_df.to_csv("tcrdist_cluster_by_clonotype.tsv", sep="\t", index=False)
cell_clusters.to_csv("tcrdist_cluster_by_cell.tsv", sep="\t", index=False)
Recap
You now have a complete starter workflow that:
- reads 10x TCR contig annotations
- filters productive full-length TRA/TRB chains
- builds one paired receptor per cell
- collapses identical receptors into clonotypes
- computes paired TCRdist
- clusters related clonotypes
- projects those clusters back into single-cell space
The key idea is simple: exact clonotypes are useful, but similarity-based TCR clusters often capture a richer immune signal.
Further Reading
- 10x Genomics V(D)J annotation outputs
- 10x Genomics V(D)J clonotyping overview
- tcrdist3 documentation
- tcrdist3 package page
- Original TCRdist paper
FAQ
1. Why not just use the Cell Ranger clonotype IDs?
Cell Ranger clonotypes are excellent for exact grouping, but they are not designed to capture broader receptor similarity. TCRdist helps you find related clonotypes that may reflect convergent recognition.
2. What if my cells have two alpha chains or two beta chains?
This tutorial keeps the top chain per barcode for simplicity. In a more advanced workflow, you can preserve secondary chains and either analyze them separately or define a custom paired-receptor representation.
3. How do I choose the TCRdist cutoff?
There is no universal cutoff. Start with a reasonable exploratory threshold such as 40, inspect the heatmap and UMAP, and tune the value based on how coarse or fine you want the clustering to be. For publication work, you should justify the threshold using your dataset and biological question.
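One simple way to explore the cutoff is to sweep a few values and watch how the number of clusters changes. The sketch below does this on a made-up four-clonotype matrix with the same SciPy calls used in Step 8; the thresholds are arbitrary illustration values, not recommendations.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up distances: 0 and 1 are very close, 2 is moderately close, 3 is far from all
toy = np.array([
    [0, 10, 50, 120],
    [10, 0, 55, 118],
    [50, 55, 0, 125],
    [120, 118, 125, 0],
], dtype=float)

Z_toy = linkage(squareform(toy), method="average")

# Cluster labels start at 1, so the maximum label equals the number of clusters
sweep = {
    t: int(fcluster(Z_toy, t=t, criterion="distance").max())
    for t in (5, 40, 80, 150)
}
print(sweep)
```

On real data, a range of cutoffs over which the cluster count changes little is often a more defensible choice than a value picked in isolation.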

