Clustering and Cell Type Identification in scRNA‑seq: from Graphs to Labels

Jonathan Alles

EVOBYTE Digital Biology


Introduction

You’ve run quality control and normalization on your single‑cell RNA‑seq (scRNA‑seq) data. Now comes the surprisingly human task: turning clouds of points into meaningful cell types. In this Part 3 of our Single‑Cell Intro series, we’ll demystify scRNA‑seq clustering, explain how dimensionality reduction fits in (and when it doesn’t), and walk from clusters to cell type annotation with practical tips and tiny code snippets. We’ll use terms like PCA, k‑nearest neighbor (KNN) graph, Louvain/Leiden, UMAP, t‑SNE, marker genes, and reference mapping—each highly relevant because they’re the backbone of modern single‑cell analysis pipelines used across pharma, biotech, and translational research.

How scRNA‑seq clustering actually works (and why graphs win)

Most pipelines build a KNN graph of cells in a low‑dimensional space (typically PCA), then detect “communities” on that graph. Graph‑based methods such as Louvain and Leiden are industry standards because they scale, are robust to noise, and align well with discrete biological populations. In practice, frameworks like Seurat and Scanpy construct the neighbor graph from principal components, then run community detection with a tunable resolution parameter that controls granularity; higher resolution yields more clusters and can help surface rare cell states.
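To make the graph-building step concrete, here is a minimal NumPy sketch of the idea: project cells onto principal components, then link each cell to its k nearest neighbors. The toy data, the function names (pca_scores, knn_adjacency), and the choice to symmetrize the graph are assumptions for illustration; Seurat and Scanpy use their own, more refined graph constructions.

```python
import numpy as np

def pca_scores(X, n_pcs):
    """Project rows of X (cells x genes) onto the top n_pcs principal components."""
    Xc = X - X.mean(axis=0)                          # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]                  # cell coordinates in PC space

def knn_adjacency(Z, k):
    """Symmetric boolean adjacency linking each cell to its k nearest neighbors in Z."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                     # a cell is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]               # indices of the k closest cells
    A = np.zeros(d2.shape, dtype=bool)
    A[np.repeat(np.arange(len(Z)), k), nn.ravel()] = True
    return A | A.T                                   # keep an edge if either cell picked the other

rng = np.random.default_rng(0)
# Toy data (assumed, not real counts): two well-separated populations, 50 cells x 100 genes each
X = np.vstack([rng.normal(0.0, 1.0, (50, 100)),
               rng.normal(3.0, 1.0, (50, 100))])
Z = pca_scores(X, n_pcs=10)
A = knn_adjacency(Z, k=10)
within = A[:50, :50].sum() + A[50:, 50:].sum()
print(f"fraction of edges within a population: {within / A.sum():.2f}")
```

Community detection then operates on a graph like A rather than on the raw expression matrix, which is why well-separated populations tend to show up as densely connected groups.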

Among community‑detection algorithms, Leiden is often preferred because it addresses a known weakness of Louvain, which can return poorly connected or even internally disconnected communities. Leiden guarantees connected communities and typically yields more stable partitions, which is useful when subtle subtypes matter, as in immunology or oncology discovery programs.
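What a connectivity problem looks like can be checked on any clustering result. The sketch below is a hypothetical helper (not part of any package) that flags communities whose members cannot all reach one another through edges inside the community. The toy graph and partition are invented for illustration.

```python
from collections import deque

def disconnected_communities(edges, membership):
    """Return labels of communities that are not internally connected,
    using only edges whose two endpoints share the same label."""
    adj = {node: set() for node in membership}
    for u, v in edges:
        if membership[u] == membership[v]:      # keep within-community edges only
            adj[u].add(v)
            adj[v].add(u)
    members = {}
    for node, label in membership.items():
        members.setdefault(label, []).append(node)
    bad = []
    for label, nodes in members.items():
        seen = {nodes[0]}
        queue = deque([nodes[0]])
        while queue:                            # breadth-first search inside the community
            for nxt in adj[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(seen) < len(nodes):              # some members were never reached
            bad.append(label)
    return sorted(bad)

# Toy graph: a path 0-1-2-3-4, where node 2 is the bridge
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
# Hypothetical partition in which the bridge node ended up in another cluster
membership = {0: "a", 1: "a", 3: "a", 4: "a", 2: "b"}
print(disconnected_communities(edges, membership))  # ['a']
```

Community "a" is flagged because, once node 2 is assigned elsewhere, cells 0 and 1 have no path to cells 3 and 4 inside their own cluster.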

A minimal Seurat example shows the gist:

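library(Seurat)
# Assumes `pbmc` is a Seurat object on which RunPCA() has already been run
pbmc <- FindNeighbors(pbmc, dims = 1:30)      # neighbor graph on the first 30 PCs
pbmc <- FindClusters(pbmc, resolution = 0.8)  # granularity knob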

Adjust dims (PCs) and resolution based on data size and expected heterogeneity. Think of resolution as a hypothesis switch: are you after broad cell families or fine subtypes?
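Why a higher resolution yields more clusters can be seen in a modularity‑style score with a resolution term γ. The sketch below assumes that form of the objective; the exact quality function differs between tools and settings, and the toy graph (four triangles in a ring) is made up for illustration.

```python
def modularity(edges, membership, gamma=1.0):
    """Modularity with a resolution term gamma:
    Q = sum over communities c of [ e_c / m - gamma * (d_c / (2 m))**2 ],
    where e_c = edges inside c, d_c = summed degree of c, m = total edges."""
    m = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - gamma * (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy graph: four triangles joined in a ring by single edges
edges = []
for t in range(4):
    a, b, c = 3 * t, 3 * t + 1, 3 * t + 2
    edges += [(a, b), (b, c), (a, c)]           # triangle t
    edges.append((c, (3 * (t + 1)) % 12))       # link to the next triangle

fine = {n: n // 3 for n in range(12)}           # one cluster per triangle
coarse = {n: n // 6 for n in range(12)}         # neighboring triangles merged in pairs

for gamma in (0.25, 1.0):
    print(gamma, modularity(edges, fine, gamma), modularity(edges, coarse, gamma))
```

On this toy graph, γ = 0.25 scores the merged partition higher (0.75 vs 0.6875), while γ = 1.0 favors one cluster per triangle (0.5 vs 0.375). Raising the resolution increases the penalty on large communities, so the optimizer splits them.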

Dimensionality reduction vs clustering: visualize here, cluster there

Dimensionality reduction (DR) techniques like UMAP and t‑SNE compress high‑dimensional expression profiles for visualization. They’re superb for exploring structure, but they are not the substrate you should cluster on. Cluster on the PCA‑based neighbor graph; then use UMAP/t‑SNE to display results. This separation matters because the geometry in 2D embeddings can distort distances and densities, especially at global scales.

UMAP is fast, preserves local neighborhoods well, and often maintains more global structure than t‑SNE, which makes it a popular default for scRNA‑seq plots.
t‑SNE remains excellent for revealing local cluster structure, but its global layout (the relative positions of clusters and the distances between them) should be interpreted cautiously; best‑practice guides describe how to avoid common pitfalls.

In short: compute neighbors in PCA space, detect communities (Leiden/Louvain), and visualize those communities with UMAP or t‑SNE. This workflow gives you reproducible clusters and interpretable figures for reports and manuscripts.
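One rough way to see how much a 2D view can differ from the space you cluster in is to compare each cell’s nearest neighbors in both. The sketch below is a diagnostic under stated assumptions: the data are random stand‑ins for 30 PC scores, and the “2D view” simply keeps two coordinates. UMAP and t‑SNE preserve neighborhoods far better than this crude projection, so the numbers only illustrate the measurement, not those methods.

```python
import numpy as np

def knn_indices(Z, k):
    """Indices of the k nearest neighbors (Euclidean) of every row of Z."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(d2, np.inf)                # exclude each cell from its own list
    return np.argsort(d2, axis=1)[:, :k]

def mean_neighbor_overlap(Z_high, Z_low, k=15):
    """Average fraction of each cell's k neighbors shared between two spaces."""
    nn_high, nn_low = knn_indices(Z_high, k), knn_indices(Z_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(1)
Z30 = rng.normal(size=(300, 30))   # toy stand-in for 30 PC scores
Z2 = Z30[:, :2]                    # toy stand-in for a 2D view: keep two coordinates
overlap = mean_neighbor_overlap(Z30, Z2)
print(f"shared neighbors between 30-D and 2-D: {overlap:.0%}")
```

With Scanpy, the same function could be applied to adata.obsm['X_pca'] and adata.obsm['X_umap'] to gauge how faithfully a given plot reflects the neighborhoods used for clustering.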

From clustering to cell type annotation: markers, references, and automation

Clustering groups similar cells; annotation tells you what those groups are. Start simple: score canonical marker genes per cluster, then name clusters with domain knowledge. For peripheral blood, for example, MS4A1 suggests B cells, NKG7 and GNLY suggest NK cells, LST1 and LYZ suggest monocytes, and CCR7 together with IL7R points to naive T cells. Packages streamline this by testing for differentially expressed genes (DEGs) per cluster and ranking candidate markers. Best‑practice tutorials emphasize that biological plausibility trumps any single numeric cutoff: over‑clustering looks neat but hampers interpretation.

A tiny Scanpy example that assigns labels from simple marker logic:

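import scanpy as sc

adata = sc.datasets.pbmc3k()                  # example data (raw counts); swap in your own AnnData
sc.pp.normalize_total(adata, target_sum=1e4)  # simple normalization, just as an example recipe
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30)
sc.pp.neighbors(adata, n_pcs=30)              # Leiden needs the KNN graph built first
sc.tl.leiden(adata, resolution=0.8)
adata.obs['celltype'] = 'unknown'
# Per-cell thresholds on log-normalized expression; crude (NKG7 is also found in cytotoxic T cells)
adata.obs.loc[adata.obs_vector('MS4A1') > 1, 'celltype'] = 'B cell'
adata.obs.loc[adata.obs_vector('NKG7') > 1, 'celltype'] = 'NK cell'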

This won’t replace expert review, but it’s a fast first pass that pairs well with manual curation.
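Scoring markers per cluster rather than per cell is usually more robust to dropout. Here is a minimal NumPy sketch of that idea. The marker sets come from the PBMC examples above, while the function name annotate_clusters, the min_score cutoff, and the toy expression values are assumptions for illustration.

```python
import numpy as np

# Marker sets taken from the PBMC examples in the text
MARKERS = {
    "B cell": ["MS4A1"],
    "NK cell": ["NKG7"],
    "Monocyte": ["LST1", "LYZ"],
    "Naive T cell": ["CCR7", "IL7R"],
}

def annotate_clusters(expr, genes, clusters, markers=MARKERS, min_score=0.5):
    """Label each cluster by the marker set with the highest mean expression.

    expr: cells x genes array of log-normalized values; genes: column names;
    clusters: one cluster id per cell. min_score is an arbitrary illustrative cutoff.
    """
    col = {g: i for i, g in enumerate(genes)}
    clusters = np.asarray(clusters)
    labels = {}
    for c in np.unique(clusters):
        mean_expr = expr[clusters == c].mean(axis=0)       # cluster-average profile
        scores = {}
        for name, gs in markers.items():
            idx = [col[g] for g in gs if g in col]         # ignore markers missing from the data
            if idx:
                scores[name] = float(mean_expr[idx].mean())
        best = max(scores, key=scores.get, default=None)
        labels[str(c)] = best if best is not None and scores[best] >= min_score else "unknown"
    return labels

# Toy data: 6 cells x 7 genes, three clusters of two cells each
genes = ["MS4A1", "NKG7", "LST1", "LYZ", "CCR7", "IL7R", "ACTB"]
expr = np.array([
    [2.5, 0.0, 0.1, 0.2, 0.3, 0.2, 3.0],
    [2.0, 0.1, 0.0, 0.1, 0.2, 0.4, 3.1],
    [0.0, 0.2, 2.2, 3.0, 0.0, 0.1, 3.0],
    [0.1, 0.1, 1.8, 2.6, 0.1, 0.0, 2.9],
    [0.0, 0.1, 0.1, 0.2, 0.1, 0.3, 3.0],
    [0.1, 0.0, 0.0, 0.1, 0.2, 0.2, 3.2],
])
clusters = ["0", "0", "1", "1", "2", "2"]
labels = annotate_clusters(expr, genes, clusters)
print(labels)
```

Clusters whose best score falls below the cutoff stay "unknown", which makes it obvious where manual curation is still needed.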

When scale and consistency matter (think multi‑cohort projects or automated data pipelines), bring in reference mapping. Tools like SingleR transfer labels from a curated reference to your query dataset; Azimuth (built on Seurat’s reference‑mapping workflow) maps cells onto annotated references such as its multimodal PBMC reference; and CellTypist provides fast logistic‑regression models for common tissues, especially immune. These approaches boost reproducibility and are increasingly used in production workflows, though they’re only as good as their references. Always sanity‑check with markers and DEGs.
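The core idea of label transfer can be sketched in a few lines: compare each query cell with reference profiles and take the best match, leaving poorly matched cells unassigned. This is a deliberately simplified stand‑in, not the algorithm of any of the tools named above. The function name transfer_labels, the min_corr cutoff, and the toy profiles are assumptions.

```python
import numpy as np

def transfer_labels(query, ref_profiles, ref_labels, min_corr=0.5):
    """Assign each query cell the label of its best-correlated reference profile.

    query: cells x genes; ref_profiles: labels x genes (same gene order).
    Cells whose best Pearson correlation is below min_corr stay 'unassigned'.
    """
    q = query - query.mean(axis=1, keepdims=True)
    r = ref_profiles - ref_profiles.mean(axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    corr = q @ r.T                                   # Pearson correlation, cells x labels
    labels = np.array(ref_labels, dtype=object)[corr.argmax(axis=1)]
    labels[corr.max(axis=1) < min_corr] = "unassigned"
    return labels, corr

# Toy reference (gene order: MS4A1, NKG7, LYZ, CCR7); values are invented
ref_labels = ["B cell", "NK cell", "Monocyte"]
ref_profiles = np.array([[3.0, 0.0, 0.0, 1.0],
                         [0.0, 3.0, 0.0, 0.0],
                         [0.0, 0.0, 3.0, 0.0]])
query = np.array([[2.5, 0.1, 0.2, 0.8],    # B-like cell
                  [0.1, 2.8, 0.3, 0.1],    # NK-like cell
                  [0.2, 0.1, 0.1, 2.9]])   # T-like cell, a type absent from the reference
labels, corr = transfer_labels(query, ref_profiles, ref_labels)
print(list(labels))  # ['B cell', 'NK cell', 'unassigned']
```

The third cell shows why references matter: a cell type missing from the reference can only be mislabeled or left unassigned, so a confidence cutoff plus a marker check is a sensible safeguard.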

Summary / Takeaways

Build clusters on a PCA‑derived KNN graph, then visualize with UMAP or t‑SNE. This keeps your communities reproducible and your figures intuitive.
Treat the resolution parameter as a biological dial. Start coarse, then zoom in where markers support meaningful subclusters.
Annotate clusters with a mix of marker‑based reasoning and reference mapping (SingleR, Azimuth, CellTypist). Automate where it helps, but validate with expert knowledge.

Next step: pressure‑test your annotations with differential expression and pathway enrichment to confirm biology and surface novel states worth follow‑up.