Cluster TCRs from 10x with TCRdist

Why this matters

Single-cell TCR sequencing is great at finding exact clonotypes, but exact matches are only part of the story. In many datasets, T cells with related antigen specificity can carry different yet biochemically similar receptors. That is where TCRdist becomes useful.

Instead of asking only whether two TCRs are identical, TCRdist asks how similar they are across the receptor regions that matter most for recognition. Starting from 10x Genomics V(D)J output, you can use TCRdist to move from exact clonotypes to similarity-based TCR clusters that are often easier to interpret biologically.

This is especially helpful when you want to:

detect convergent immune responses
group related clonotypes beyond exact sequence identity
compare TCR structure across cell states
project TCR similarity clusters back onto a single-cell atlas

What you will build

In this tutorial, you will go from a 10x Genomics V(D)J result folder to:

a clean paired TRA/TRB clonotype table
a pairwise TCRdist matrix
similarity-based clonotype clusters
a simple 2D cluster visualization
an optional per-cell cluster annotation file you can join back to your single-cell object

TCRdist in one minute

TCRdist is a distance metric for TCRs. Smaller values mean two receptors are more similar. Unlike plain edit distance on CDR3 alone, TCRdist combines information from multiple receptor regions, including the CDR3 loop and the V-gene-encoded CDR loops. In practice, that makes it better suited to grouping receptors that may recognize similar targets even when they are not exact sequence matches.

A useful mental model is:

Cell Ranger clonotype ID = exact or near-exact grouping from the reconstruction pipeline
TCRdist cluster = neighborhood of biologically similar receptors
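
To make the idea of combining regions concrete, here is a deliberately simplified sketch. It is not the TCRdist algorithm, which scores amino-acid substitutions and gaps; the mismatch count, the region names, the example sequences, and the weights below are all illustrative assumptions.

```python
# Toy illustration only: NOT the real TCRdist metric.
# Each receptor is a dict of region -> amino-acid string (made-up values).

def mismatch_count(a: str, b: str) -> int:
    """Count differing positions, plus the length difference as a crude gap cost."""
    shared = min(len(a), len(b))
    diffs = sum(1 for x, y in zip(a[:shared], b[:shared]) if x != y)
    return diffs + abs(len(a) - len(b))

# Assumed weights: CDR3 counts more than the V-gene-encoded loops.
WEIGHTS = {"cdr1": 1, "cdr2": 1, "cdr3": 3}

def toy_distance(tcr_a: dict, tcr_b: dict) -> int:
    """Weighted sum of per-region mismatch counts."""
    return sum(w * mismatch_count(tcr_a[r], tcr_b[r]) for r, w in WEIGHTS.items())

ref = {"cdr1": "SGHRS", "cdr2": "YFSETQ", "cdr3": "CASSLAPGATNEKLFF"}
near = {"cdr1": "SGHRS", "cdr2": "YFSETQ", "cdr3": "CASSLAPGTTNEKLFF"}
far = {"cdr1": "MNHEY", "cdr2": "SVGAGI", "cdr3": "CASSIRSSYEQYF"}

print(toy_distance(ref, ref))
print(toy_distance(ref, near))
print(toy_distance(ref, far))
```

The point of the sketch is only the shape of the calculation: an identical receptor scores zero, a single CDR3 substitution gives a small distance, and a receptor that differs in every region gives a much larger one.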

Requirements

Basic Python and pandas
Familiarity with 10x Genomics V(D)J outputs
A CPU-only machine is enough; no GPU is required
The dense workflow below is best suited to small and medium repertoires
Python packages:
- pandas
- numpy
- scipy
- matplotlib
- seaborn
- umap-learn
- networkx (optional)
- tcrdist3

Step 1: Install the packages

For a simple local setup, install the required Python packages first.

python -m pip install pandas numpy scipy matplotlib seaborn umap-learn networkx tcrdist3

If you already work in a notebook environment, restart the kernel after installation.

Step 2: Start from the right 10x file

For TCR clustering, the most practical starting point is usually filtered_contig_annotations.csv, not the high-level clonotype summary file. The contig table keeps per-cell barcode information and chain-level annotations, which makes it easier to build paired TRA/TRB receptors and later map results back to single cells.

from pathlib import Path
import pandas as pd

vdj_dir = Path("outs/vdj_t")
contigs = pd.read_csv(vdj_dir / "filtered_contig_annotations.csv")

print(contigs.shape)
print(contigs.columns.tolist())
contigs.head()

What to check here

Make sure you can see columns like:

barcode
chain
v_gene
j_gene
cdr3
cdr3_nt
productive
full_length
reads
umis

Some output versions may include extra columns, which is fine.
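
A quick programmatic check can replace eyeballing the column list. The sketch below uses a hypothetical helper, `report_columns`, and a made-up two-row table in place of the real file; the split into required and optional columns mirrors how the later steps use them.

```python
import pandas as pd

# Columns the later steps index directly; extra columns are ignored.
REQUIRED = ["barcode", "chain", "v_gene", "j_gene", "cdr3", "cdr3_nt"]
# Columns the later steps use only when present.
OPTIONAL = ["productive", "full_length", "reads", "umis"]

def report_columns(df: pd.DataFrame) -> dict:
    """Return the required and optional columns missing from a contig table."""
    return {
        "missing_required": [c for c in REQUIRED if c not in df.columns],
        "missing_optional": [c for c in OPTIONAL if c not in df.columns],
    }

# Hypothetical table standing in for filtered_contig_annotations.csv.
toy = pd.DataFrame({
    "barcode": ["AAAC-1", "AAAC-1"],
    "chain": ["TRA", "TRB"],
    "v_gene": ["TRAV12-2", "TRBV7-9"],
    "j_gene": ["TRAJ31", "TRBJ2-1"],
    "cdr3": ["CAVNDARLMF", "CASSLGQAYEQYF"],
    "cdr3_nt": ["TGTGCC", "TGTGCC"],
    "productive": [True, True],
})

report = report_columns(toy)
print(report)
```

With your own data, call `report_columns(contigs)` and stop if `missing_required` is not empty.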

Step 3: Keep productive full-length TCR contigs

Now keep only productive, full-length TCR alpha and beta chains.

import numpy as np

def as_bool(series):
    return series.astype(str).str.lower().isin(["true", "1", "t", "yes"])

tcr = contigs.copy()

keep = tcr["chain"].isin(["TRA", "TRB"])

if "productive" in tcr.columns:
    keep &= as_bool(tcr["productive"])

if "full_length" in tcr.columns:
    keep &= as_bool(tcr["full_length"])

if "high_confidence" in tcr.columns:
    keep &= as_bool(tcr["high_confidence"])

tcr = tcr.loc[keep].copy()

print(tcr.shape)
tcr["chain"].value_counts()

Why this filter matters

TCRdist works best on clean receptor calls. Filtering early reduces noise from incomplete or low-confidence contigs.
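
It can help to see how many contigs each filter removes before trusting the result. The following is a minimal sketch on made-up rows; the `remaining` dictionary is an illustrative name, and the filters are applied in the same order as in the step above.

```python
import pandas as pd

def as_bool(series: pd.Series) -> pd.Series:
    """Interpret common true-like values as True; everything else is False."""
    return series.astype(str).str.lower().isin(["true", "1", "t", "yes"])

# Hypothetical contigs; real tables have many more columns.
toy = pd.DataFrame({
    "chain": ["TRA", "TRB", "TRB", "IGH", "TRA"],
    "productive": ["True", "true", "False", "True", "None"],
    "full_length": [True, True, True, True, False],
})

steps = {
    "TRA/TRB chain": toy["chain"].isin(["TRA", "TRB"]),
    "productive": as_bool(toy["productive"]),
    "full_length": as_bool(toy["full_length"]),
}

# Apply the filters cumulatively and record how many rows survive each one.
keep = pd.Series(True, index=toy.index)
remaining = {"start": len(toy)}
for name, mask in steps.items():
    keep &= mask
    remaining[name] = int(keep.sum())

print(remaining)
```

A sharp drop at one step is a hint to inspect that column's values, since an unexpected encoding would be read as False by `as_bool`.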

Step 4: Pick one alpha and one beta chain per cell

Single cells can have multiple TRA or TRB contigs. For a first-pass tutorial, we will keep the top-ranked chain per barcode based on UMI count, then read count.

This is a simplification, but it works well for a clean introductory workflow.

# Add a sample label first so barcodes shared across samples stay separate.
if "sample" not in tcr.columns:
    tcr["sample"] = "sample1"

sort_cols = ["sample", "barcode", "chain"]
ascending = [True, True, True]

# Rank contigs within each cell and chain by UMI count, then read count.
for col in ["umis", "reads"]:
    if col in tcr.columns:
        sort_cols.append(col)
        ascending.append(False)

best = (
    tcr.sort_values(sort_cols, ascending=ascending)
       .drop_duplicates(subset=["sample", "barcode", "chain"], keep="first")
       .copy()
)

best["cell_id"] = best["sample"].astype(str) + ":" + best["barcode"].astype(str)

alpha = (
    best.loc[best["chain"] == "TRA", [
        "sample", "barcode", "cell_id",
        "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_a_gene_raw",
        "j_gene": "j_a_gene_raw",
        "cdr3": "cdr3_a_aa",
        "cdr3_nt": "cdr3_a_nucseq",
    })
)

beta = (
    best.loc[best["chain"] == "TRB", [
        "sample", "barcode", "cell_id",
        "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_b_gene_raw",
        "j_gene": "j_b_gene_raw",
        "cdr3": "cdr3_b_aa",
        "cdr3_nt": "cdr3_b_nucseq",
    })
)

cells = alpha.merge(
    beta,
    on=["sample", "barcode", "cell_id"],
    how="inner"
).copy()

print(cells.shape)
cells.head()

What happened

You now have one paired TRA/TRB receptor per cell. Cells without both chains are excluded from the paired analysis.
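
Because unpaired cells are dropped silently by the inner merge, it is worth counting them. This sketch, on made-up barcodes, labels each cell as paired, alpha-only, or beta-only; the `label` helper and the status names are illustrative.

```python
import pandas as pd

# Hypothetical filtered contigs: one row per retained chain.
toy = pd.DataFrame({
    "barcode": ["c1", "c1", "c2", "c3", "c4", "c4"],
    "chain":   ["TRA", "TRB", "TRA", "TRB", "TRA", "TRB"],
})

# One row per cell, with a True/False flag for each chain.
has_chain = (
    pd.crosstab(toy["barcode"], toy["chain"])
      .reindex(columns=["TRA", "TRB"], fill_value=0)
      .gt(0)
)

def label(row) -> str:
    """Classify a cell by which chains it carries."""
    if row["TRA"] and row["TRB"]:
        return "paired"
    return "alpha_only" if row["TRA"] else "beta_only"

status = has_chain.apply(label, axis=1)
print(status.value_counts().to_dict())
```

If a large share of cells is alpha-only or beta-only, a single-chain analysis may be a useful complement to the paired workflow.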

Step 5: Normalize V and J gene names for TCRdist

A common gotcha is gene naming. TCRdist expects IMGT-like gene names with alleles, such as TRBV7-9*01. In many 10x outputs, the allele is omitted. A practical fix is to append *01 when the allele is missing. This treats the first allele as a stand-in, so it is an approximation rather than a true allele call.

def add_default_allele(value):
    if pd.isna(value):
        return value
    value = str(value).strip()
    if value == "":
        return value
    return value if "*" in value else f"{value}*01"

for raw_col, clean_col in [
    ("v_a_gene_raw", "v_a_gene"),
    ("j_a_gene_raw", "j_a_gene"),
    ("v_b_gene_raw", "v_b_gene"),
    ("j_b_gene_raw", "j_b_gene"),
]:
    cells[clean_col] = cells[raw_col].map(add_default_allele)

required = [
    "v_a_gene", "j_a_gene", "cdr3_a_aa", "cdr3_a_nucseq",
    "v_b_gene", "j_b_gene", "cdr3_b_aa", "cdr3_b_nucseq"
]

cells = cells.dropna(subset=required).copy()

print(cells[required].head())
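
One further naming difference is worth checking, stated here as an assumption to verify against your own reference: 10x annotations tend to write the dual alpha/delta V genes without a slash (for example TRAV29DV5), whereas IMGT nomenclature writes TRAV29/DV5. The hypothetical helper `to_imgt_like` below sketches one way to restore the slash before appending the default allele.

```python
import re

# Matches names such as TRAV29DV5 or TRAV38-2DV8 (assumed 10x-style spelling).
_DUAL_AV_DV = re.compile(r"^(TRAV\d+(?:-\d+)?)(DV\d+)$")

def to_imgt_like(gene: str) -> str:
    """Insert the AV/DV slash if needed and append *01 when no allele is given."""
    gene = gene.strip()
    base, star, allele = gene.partition("*")
    base = _DUAL_AV_DV.sub(r"\1/\2", base)
    return f"{base}*{allele}" if star else f"{base}*01"

for name in ["TRBV7-9", "TRAV29DV5", "TRAV38-2DV8*01", "TRAV12-2*02"]:
    print(name, "->", to_imgt_like(name))
```

Names that still do not match the reference gene table after this step need manual review rather than another automatic rewrite.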

Why include nucleotide CDR3 columns

If two receptors have the same amino acid sequence but different nucleotide rearrangements, they arose from separate recombination events. Keeping the nucleotide columns in the table lets the next step treat them as distinct clonotypes instead of merging them.

Step 6: Create a unique paired clonotype table

TCRdist works on a table of receptors. Since we want clonotype-level clustering, we will collapse identical paired receptors and keep a count column.

tcr_cols = [
    "v_a_gene", "j_a_gene", "cdr3_a_aa", "cdr3_a_nucseq",
    "v_b_gene", "j_b_gene", "cdr3_b_aa", "cdr3_b_nucseq"
]

clones = (
    cells.groupby(tcr_cols, dropna=False)
         .agg(
             count=("cell_id", "size"),
             cell_ids=("cell_id", lambda x: ";".join(sorted(x)))
         )
         .reset_index()
)

print(clones.shape)
clones.head()

Expected result

Each row is now one unique paired TCR clonotype, with:

paired alpha and beta sequence features
the number of cells carrying that receptor
the original cell IDs for mapping back later
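
The effect of the grouping key is easy to see on a small example. In this sketch the sequences are made up and shortened, only the beta CDR3 columns are shown, and `collapse` is an illustrative helper that repeats the groupby pattern used above.

```python
import pandas as pd

# Hypothetical paired cells; only two key columns are shown for brevity.
cells_toy = pd.DataFrame({
    "cell_id": ["s1:c1", "s1:c2", "s1:c3", "s1:c4"],
    "cdr3_b_aa": ["CASSLGF", "CASSLGF", "CASSLGF", "CASSPDF"],
    "cdr3_b_nucseq": ["TGTGCCAGC", "TGTGCCAGC", "TGCGCCAGC", "TGTGCCTCA"],
})

def collapse(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Collapse cells that share the same key columns and count them."""
    return (
        df.groupby(keys, dropna=False)
          .agg(count=("cell_id", "size"))
          .reset_index()
    )

by_aa = collapse(cells_toy, ["cdr3_b_aa"])
by_aa_nt = collapse(cells_toy, ["cdr3_b_aa", "cdr3_b_nucseq"])

print(len(by_aa), len(by_aa_nt))
```

Grouping on amino acids alone yields two clonotypes, while adding the nucleotide column yields three, because one cell carries the same amino acid sequence from a different rearrangement.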

Step 7: Compute paired TRA/TRB TCRdist

Now build a TCRrep object and compute pairwise distances.

from tcrdist.repertoire import TCRrep
import numpy as np

organism = "human"   # change to "mouse" if needed

tr = TCRrep(
    cell_df=clones.copy(),
    organism=organism,
    chains=["alpha", "beta"],
    compute_distances=True,
    db_file="alphabeta_gammadelta_db.tsv"
)

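# TCRrep builds its own clone_df; the matrices below follow its row order.
pw_alpha = np.asarray(tr.pw_alpha).copy()
pw_beta = np.asarray(tr.pw_beta).copy()

# Defensive clip: distances should never be negative.
pw_alpha[pw_alpha < 0] = 0
pw_beta[pw_beta < 0] = 0

# Paired distance = alpha-chain distance + beta-chain distance.
pw_tcr = pw_alpha + pw_beta

# The matrix must have one row per clonotype in tr.clone_df.
assert pw_tcr.shape[0] == len(tr.clone_df)

print(pw_tcr.shape)
print(pw_tcr[:5, :5])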

What this matrix means

pw_tcr[i, j] is the paired TRA/TRB TCRdist between clonotype i and clonotype j
smaller numbers mean more similar receptors
the diagonal should be zero
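
These properties can be checked directly before clustering. The sketch below uses a hypothetical helper, `check_distance_matrix`, and a made-up 3 x 3 matrix.

```python
import numpy as np

def check_distance_matrix(pw, n_clones: int) -> dict:
    """Basic sanity checks for a dense pairwise distance matrix."""
    pw = np.asarray(pw)
    return {
        "square": pw.ndim == 2 and pw.shape[0] == pw.shape[1],
        "matches_clone_table": pw.shape[0] == n_clones,
        "symmetric": bool(np.array_equal(pw, pw.T)),
        "zero_diagonal": bool(np.all(np.diag(pw) == 0)),
        "non_negative": bool((pw >= 0).all()),
    }

# Hypothetical matrix for three clonotypes.
toy = np.array([[0, 24, 150], [24, 0, 162], [150, 162, 0]])
checks = check_distance_matrix(toy, n_clones=3)
print(checks)
```

In the workflow above, the equivalent call would be `check_distance_matrix(pw_tcr, len(tr.clone_df))`; any False value is a reason to stop and inspect the input table.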

Practical note on scale

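This dense workflow suits tutorial-sized datasets of up to a few thousand unique paired clonotypes. A dense matrix stores one value for every pair of clonotypes, so memory and compute grow with the square of the clonotype count. For very large repertoires, you will usually want a sparse or chunked workflow.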
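
A back-of-the-envelope estimate shows where the dense approach stops being comfortable. The sketch assumes 2 bytes per entry (a 16-bit integer matrix); check the dtype of your own matrix and adjust `bytes_per_value` accordingly. The helper name is illustrative.

```python
def dense_matrix_bytes(n_clones: int, bytes_per_value: int = 2) -> int:
    """Bytes needed for one dense n x n matrix (2 bytes assumes int16 entries)."""
    return n_clones * n_clones * bytes_per_value

for n in (1_000, 5_000, 50_000):
    mb = dense_matrix_bytes(n) / 1e6
    print(f"{n:>7,} clonotypes -> {mb:,.0f} MB per matrix")
```

Keep in mind that the workflow holds several such arrays at once (alpha, beta, and their sum), and the condensed form passed to hierarchical clustering adds roughly half a matrix more.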

Step 8: Turn the distance matrix into clusters

There is no single universal cutoff for TCRdist clustering. For a first pass, a fixed threshold is often good enough to explore the structure of your data.

Below, we use average-linkage hierarchical clustering with an example distance cutoff of 40.

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

condensed = squareform(pw_tcr, checks=False)
Z = linkage(condensed, method="average")

distance_threshold = 40
cluster_labels = fcluster(Z, t=distance_threshold, criterion="distance")

clone_df = tr.clone_df.copy()
clone_df["tcrdist_cluster"] = cluster_labels

cluster_sizes = (
    clone_df.groupby("tcrdist_cluster")["count"]
            .sum()
            .rename("cluster_cells")
            .reset_index()
)

clone_df = clone_df.merge(cluster_sizes, on="tcrdist_cluster", how="left")

summary = (
    clone_df.groupby("tcrdist_cluster")
            .agg(
                n_unique_clones=("tcrdist_cluster", "size"),
                n_cells=("count", "sum"),
                example_alpha=("cdr3_a_aa", "first"),
                example_beta=("cdr3_b_aa", "first"),
            )
            .sort_values("n_cells", ascending=False)
)

print(summary.head(10))

How to interpret this

A cluster here is a group of clonotypes that are close in paired TCRdist space, not necessarily exact sequence matches.
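
One detail of average linkage is worth seeing in numbers: the cutoff applies to the average distance between merged groups, so a cluster can contain individual pairs that are farther apart than the cutoff. The sketch below uses a made-up matrix and an illustrative helper, `max_within`, to report the widest pair in each cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical distances: clonotypes 0-2 are related, clonotype 3 is far away.
pw = np.array([
    [0, 10, 30, 200],
    [10, 0, 60, 200],
    [30, 60, 0, 200],
    [200, 200, 200, 0],
], dtype=float)

cutoff = 50
Z = linkage(squareform(pw, checks=False), method="average")
labels = fcluster(Z, t=cutoff, criterion="distance")

def max_within(pw: np.ndarray, labels: np.ndarray) -> dict:
    """Largest pairwise distance inside each cluster."""
    out = {}
    for lab in np.unique(labels):
        idx = np.where(labels == lab)[0]
        out[int(lab)] = float(pw[np.ix_(idx, idx)].max())
    return out

widest = max_within(pw, labels)
print(labels, widest)
```

Here clonotypes 0, 1, and 2 end up together at a cutoff of 50 even though clonotypes 1 and 2 are 60 apart. If every pair must fall under the cutoff, complete linkage (`method="complete"`) is the stricter choice.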

Step 9: Visualize TCR clusters in 2D

A simple and effective view is a UMAP embedding based on the precomputed TCRdist matrix.

import umap.umap_ as umap
import seaborn as sns
import matplotlib.pyplot as plt

reducer = umap.UMAP(
    metric="precomputed",
    n_neighbors=min(15, max(2, pw_tcr.shape[0] - 1)),
    min_dist=0.35,
    random_state=42,
)

embedding = reducer.fit_transform(pw_tcr)

plot_df = clone_df.copy()
plot_df["UMAP1"] = embedding[:, 0]
plot_df["UMAP2"] = embedding[:, 1]

top_clusters = (
    plot_df["tcrdist_cluster"]
    .value_counts()
    .head(12)
    .index
)

plot_df["cluster_plot"] = plot_df["tcrdist_cluster"].astype(str)
plot_df.loc[
    ~plot_df["tcrdist_cluster"].isin(top_clusters),
    "cluster_plot"
] = "other"

plt.figure(figsize=(9, 7))
sns.scatterplot(
    data=plot_df,
    x="UMAP1",
    y="UMAP2",
    hue="cluster_plot",
    size="count",
    sizes=(30, 300),
    palette="tab20",
    alpha=0.85,
    linewidth=0
)
plt.title("TCRdist clusters of paired TRA/TRB clonotypes")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()

What to look for

You will usually see:

dense islands of very similar clonotypes
larger points for expanded clonotypes
clusters that may cut across exact Cell Ranger clonotype IDs
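
To quantify the last point, you can count how many exact clonotypes each TCRdist cluster contains. This sketch assumes you carried the `raw_clonotype_id` column from the contig table through to a per-cell table, which the steps above do not do by default; the rows are made up.

```python
import pandas as pd

# Hypothetical per-cell table: raw_clonotype_id comes from Cell Ranger,
# tcrdist_cluster from the clustering step.
cells_toy = pd.DataFrame({
    "cell_id": ["s1:c1", "s1:c2", "s1:c3", "s1:c4", "s1:c5"],
    "raw_clonotype_id": ["clonotype1", "clonotype1", "clonotype2", "clonotype3", "clonotype4"],
    "tcrdist_cluster": [1, 1, 1, 1, 2],
})

per_cluster = (
    cells_toy.groupby("tcrdist_cluster")
             .agg(
                 n_cells=("cell_id", "size"),
                 n_exact_clonotypes=("raw_clonotype_id", "nunique"),
             )
             .reset_index()
)

# Clusters that bring together more than one exact clonotype.
merged = per_cluster[per_cluster["n_exact_clonotypes"] > 1]
print(merged)
```

Clusters with `n_exact_clonotypes` greater than one are the ones where similarity-based grouping adds information beyond exact matching.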

Step 10: Add a heatmap for cluster structure

A reordered heatmap is helpful when you want to see block-like structure in the distance matrix.

from scipy.cluster.hierarchy import leaves_list

order = leaves_list(Z)
ordered = pw_tcr[order][:, order]

heatmap_n = min(250, ordered.shape[0])

plt.figure(figsize=(8, 7))
sns.heatmap(
    ordered[:heatmap_n, :heatmap_n],
    cmap="viridis",
    square=True,
    cbar_kws={"label": "TCRdist"}
)
plt.title(f"TCRdist heatmap (first {heatmap_n} clonotypes in dendrogram order)")
plt.xlabel("Clonotypes")
plt.ylabel("Clonotypes")
plt.tight_layout()
plt.show()

Tip

For very large repertoires, plot only the top expanded clonotypes or one cluster at a time.
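
Selecting the most expanded clonotypes comes down to picking matching rows and columns from the distance matrix. A minimal sketch on made-up data, assuming the clonotype table and the matrix share the same row order:

```python
import numpy as np
import pandas as pd

# Hypothetical clonotype table and matching distance matrix (same row order).
clone_toy = pd.DataFrame({"count": [1, 12, 3, 7]})
pw_toy = np.array([
    [0, 90, 40, 120],
    [90, 0, 85, 30],
    [40, 85, 0, 110],
    [120, 30, 110, 0],
])

top_n = 2
# Positional indices of the most expanded clonotypes, largest first.
top_idx = np.argsort(-clone_toy["count"].to_numpy(), kind="stable")[:top_n]
# np.ix_ selects the same positions along rows and columns.
sub = pw_toy[np.ix_(top_idx, top_idx)]

print(top_idx.tolist())
print(sub)
```

The resulting `sub` matrix can be passed to the same heatmap call in place of the full matrix.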

Step 11: Map cluster labels back to cells

To use the result in your single-cell analysis, merge cluster labels back to the original paired cells.

cluster_map = clone_df[tcr_cols + ["tcrdist_cluster", "cluster_cells"]].copy()

cell_clusters = cells.merge(
    cluster_map,
    on=tcr_cols,
    how="left"
)

cell_clusters[[
    "cell_id",
    "sample",
    "barcode",
    "tcrdist_cluster",
    "cluster_cells",
    "cdr3_a_aa",
    "cdr3_b_aa"
]].head()

Save it for reuse:

cell_clusters.to_csv("tcrdist_cluster_by_cell.tsv", sep="\t", index=False)
clone_df.to_csv("tcrdist_cluster_by_clonotype.tsv", sep="\t", index=False)

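If you use Scanpy, you can join the cell-level table into adata.obs, provided the obs index uses the same identifiers as cell_id (here, sample:barcode). If your obs index holds plain barcodes, join on barcode instead.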

# Example only if your AnnData obs index matches cell_id
# import scanpy as sc
# adata = sc.read_h5ad("t_cells.h5ad")
# adata.obs = adata.obs.join(
#     cell_clusters.set_index("cell_id")[["tcrdist_cluster", "cluster_cells"]],
#     how="left"
# )
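
Before joining into a real object, it is worth confirming that the identifiers actually match. This sketch uses plain pandas with a made-up obs table, so it runs without Scanpy; the `leiden` column and the barcodes are hypothetical.

```python
import pandas as pd

# Hypothetical obs table indexed by "sample:barcode", matching how cell_id is built.
obs = pd.DataFrame(
    {"leiden": ["0", "1", "1"]},
    index=["sample1:AAAC-1", "sample1:AAAG-1", "sample1:AAAT-1"],
)

cell_clusters_toy = pd.DataFrame({
    "cell_id": ["sample1:AAAC-1", "sample1:AAAT-1", "sample1:CCCC-1"],
    "tcrdist_cluster": [4, 4, 9],
})

annot = cell_clusters_toy.set_index("cell_id")[["tcrdist_cluster"]]
joined = obs.join(annot, how="left")

# Cells that received a cluster label, and TCR rows with no matching cell.
n_matched = int(joined["tcrdist_cluster"].notna().sum())
n_unused = int((~annot.index.isin(obs.index)).sum())
print(f"{n_matched} of {len(obs)} cells annotated; {n_unused} TCR rows not in obs")
```

If `n_matched` is zero, the usual cause is an identifier format mismatch, such as a missing sample prefix or a different barcode suffix.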

Full minimal script

If you want the whole workflow in one place, here is a compact version.

from pathlib import Path
import pandas as pd
import numpy as np
from tcrdist.repertoire import TCRrep
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def as_bool(series):
    return series.astype(str).str.lower().isin(["true", "1", "t", "yes"])

def add_default_allele(value):
    if pd.isna(value):
        return value
    value = str(value).strip()
    if value == "":
        return value
    return value if "*" in value else f"{value}*01"

vdj_dir = Path("outs/vdj_t")
contigs = pd.read_csv(vdj_dir / "filtered_contig_annotations.csv")

keep = contigs["chain"].isin(["TRA", "TRB"])
if "productive" in contigs.columns:
    keep &= as_bool(contigs["productive"])
if "full_length" in contigs.columns:
    keep &= as_bool(contigs["full_length"])
if "high_confidence" in contigs.columns:
    keep &= as_bool(contigs["high_confidence"])

tcr = contigs.loc[keep].copy()

# Add a sample label first so barcodes shared across samples stay separate.
if "sample" not in tcr.columns:
    tcr["sample"] = "sample1"

sort_cols = ["sample", "barcode", "chain"]
ascending = [True, True, True]
for col in ["umis", "reads"]:
    if col in tcr.columns:
        sort_cols.append(col)
        ascending.append(False)

best = (
    tcr.sort_values(sort_cols, ascending=ascending)
       .drop_duplicates(["sample", "barcode", "chain"], keep="first")
       .copy()
)

best["cell_id"] = best["sample"].astype(str) + ":" + best["barcode"].astype(str)

alpha = (
    best.loc[best["chain"] == "TRA", [
        "sample", "barcode", "cell_id", "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_a_gene_raw",
        "j_gene": "j_a_gene_raw",
        "cdr3": "cdr3_a_aa",
        "cdr3_nt": "cdr3_a_nucseq",
    })
)

beta = (
    best.loc[best["chain"] == "TRB", [
        "sample", "barcode", "cell_id", "v_gene", "j_gene", "cdr3", "cdr3_nt"
    ]]
    .rename(columns={
        "v_gene": "v_b_gene_raw",
        "j_gene": "j_b_gene_raw",
        "cdr3": "cdr3_b_aa",
        "cdr3_nt": "cdr3_b_nucseq",
    })
)

cells = alpha.merge(beta, on=["sample", "barcode", "cell_id"], how="inner").copy()

for raw_col, clean_col in [
    ("v_a_gene_raw", "v_a_gene"),
    ("j_a_gene_raw", "j_a_gene"),
    ("v_b_gene_raw", "v_b_gene"),
    ("j_b_gene_raw", "j_b_gene"),
]:
    cells[clean_col] = cells[raw_col].map(add_default_allele)

tcr_cols = [
    "v_a_gene", "j_a_gene", "cdr3_a_aa", "cdr3_a_nucseq",
    "v_b_gene", "j_b_gene", "cdr3_b_aa", "cdr3_b_nucseq"
]

cells = cells.dropna(subset=tcr_cols).copy()

clones = (
    cells.groupby(tcr_cols, dropna=False)
         .agg(
             count=("cell_id", "size"),
             cell_ids=("cell_id", lambda x: ";".join(sorted(x)))
         )
         .reset_index()
)

tr = TCRrep(
    cell_df=clones.copy(),
    organism="human",
    chains=["alpha", "beta"],
    compute_distances=True,
    db_file="alphabeta_gammadelta_db.tsv"
)

# Clip negatives defensively, then sum the alpha and beta distances.
pw_tcr = np.asarray(tr.pw_alpha).clip(min=0) + np.asarray(tr.pw_beta).clip(min=0)

Z = linkage(squareform(pw_tcr, checks=False), method="average")
cluster_labels = fcluster(Z, t=40, criterion="distance")

clone_df = tr.clone_df.copy()
clone_df["tcrdist_cluster"] = cluster_labels

cluster_map = clone_df[tcr_cols + ["tcrdist_cluster"]].copy()
cell_clusters = cells.merge(cluster_map, on=tcr_cols, how="left")

clone_df.to_csv("tcrdist_cluster_by_clonotype.tsv", sep="\t", index=False)
cell_clusters.to_csv("tcrdist_cluster_by_cell.tsv", sep="\t", index=False)

Recap

You now have a complete starter workflow that:

reads 10x TCR contig annotations
filters productive full-length TRA/TRB chains
builds one paired receptor per cell
collapses identical receptors into clonotypes
computes paired TCRdist
clusters related clonotypes
projects those clusters back into single-cell space

The key idea is simple: exact clonotypes are useful, but similarity-based TCR clusters can reveal groups of related receptors that exact matching misses.

FAQ

1. Why not just use the Cell Ranger clonotype IDs?

Cell Ranger clonotypes are excellent for exact grouping, but they are not designed to capture broader receptor similarity. TCRdist helps you find related clonotypes that may reflect convergent recognition.

2. What if my cells have two alpha chains or two beta chains?

This tutorial keeps the top chain per barcode for simplicity. In a more advanced workflow, you can preserve secondary chains and either analyze them separately or define a custom paired-receptor representation.
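
As a starting point for that more advanced workflow, the sketch below ranks chains within each cell by UMI count and lays the top two out side by side. The rows are made up, and the `chain_rank` and `slot` names are illustrative rather than a standard representation.

```python
import pandas as pd

# Hypothetical filtered contigs for two cells.
toy = pd.DataFrame({
    "barcode": ["c1", "c1", "c1", "c2", "c2"],
    "chain":   ["TRA", "TRA", "TRB", "TRA", "TRB"],
    "cdr3":    ["CAVRDF", "CALSGF", "CASSLF", "CAGQNF", "CASSPF"],
    "umis":    [3, 9, 5, 4, 6],
})

ranked = toy.sort_values(
    ["barcode", "chain", "umis"], ascending=[True, True, False]
).copy()
# Rank 1 = most UMIs within each cell and chain; rank 2 = secondary chain.
ranked["chain_rank"] = ranked.groupby(["barcode", "chain"]).cumcount() + 1

# One row per cell, one column per chain slot (TRA_1, TRA_2, TRB_1, ...).
wide = (
    ranked[ranked["chain_rank"] <= 2]
    .assign(slot=lambda d: d["chain"] + "_" + d["chain_rank"].astype(str))
    .pivot(index="barcode", columns="slot", values="cdr3")
)
print(wide)
```

Cells with a single chain of a given type simply have a missing value in the second slot, which makes dual-chain cells easy to select.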

3. How do I choose the TCRdist cutoff?

There is no universal cutoff. Start with a reasonable exploratory threshold such as 40, inspect the heatmap and UMAP, and tune the value based on how coarse or fine you want the clustering to be. For publication work, you should justify the threshold using your dataset and biological question.
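
A simple way to explore the choice is to sweep several cutoffs and watch how the number of clusters and singletons changes. The sketch below does this on a made-up five-clonotype matrix; the cutoff values and the `sweep` helper are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical distance matrix for five clonotypes.
pw = np.array([
    [0, 12, 35, 150, 160],
    [12, 0, 30, 155, 150],
    [35, 30, 0, 140, 145],
    [150, 155, 140, 0, 70],
    [160, 150, 145, 70, 0],
], dtype=float)

Z = linkage(squareform(pw, checks=False), method="average")

def sweep(Z: np.ndarray, cutoffs: list) -> list:
    """Cluster at each cutoff and summarize the resulting partition."""
    rows = []
    for t in cutoffs:
        labels = fcluster(Z, t=t, criterion="distance")
        sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
        rows.append({
            "cutoff": t,
            "n_clusters": int(len(sizes)),
            "n_singletons": int((sizes == 1).sum()),
        })
    return rows

results = sweep(Z, cutoffs=[20, 40, 80])
for row in results:
    print(row)
```

A range of cutoffs over which the cluster count stays stable is often a more defensible choice than a single value picked in advance.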
