By EVOBYTE, your partner in bioinformatics
Introduction
If you’ve ever been handed a folder of FASTQ files and asked to “run the single‑cell pipeline,” this guide is for you. In scRNA‑seq, good results start with understanding the core data formats, then applying sensible quality control (QC), normalization, and (if needed) batch correction. We’ll walk from raw reads to an analysis‑ready expression matrix, demystify common metrics like “percent mitochondrial,” and explain where tools like Cell Ranger and h5ad fit into a modern workflow.
From FASTQ to expression matrices: the file formats you’ll meet
Sequencing kicks off with FASTQ: a plain‑text format that stores each read and its per‑base quality scores. It’s the de facto starting point for most next‑generation sequencing, including scRNA‑seq. After alignment, reads are typically written to BAM, a compressed, indexed binary form of SAM that carries alignment info plus tags for cell barcodes and UMIs in single‑cell libraries. These tags (e.g., CB for the corrected cell barcode, UB for the corrected UMI) are critical for attributing reads to the right cell and molecule.
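If you want to confirm those tags made it into your BAM, a quick look with pysam is enough. A minimal sketch (assumes pysam is installed; the BAM filename is a placeholder for whatever your pipeline wrote):
import pysam

# Open a position-sorted single-cell BAM (placeholder path).
bam = pysam.AlignmentFile("possorted_genome_bam.bam", "rb")
for read in bam.head(1000):  # inspect only the first reads
    if read.has_tag("CB") and read.has_tag("UB"):
        # Corrected cell barcode (CB) and corrected UMI (UB) attached by the pipeline.
        print(read.query_name, read.get_tag("CB"), read.get_tag("UB"))
        break
bam.close()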
From there, the core product for downstream analysis is the expression matrix: counts of molecules (UMIs) per gene per cell. 10x Genomics pipelines output this as either a sparse MEX directory (matrix.mtx.gz + features.tsv.gz + barcodes.tsv.gz) or a compact HDF5 (.h5) file; both encode a “gene‑barcode” or “digital gene expression (DGE)” matrix. Many Python workflows prefer h5ad, the AnnData container that keeps counts, cell metadata (obs), gene metadata (var), and analysis results together in one file. This keeps your data tidy and speeds I/O on large datasets.
Platform pipelines like Cell Ranger automate this journey. Given FASTQs, cellranger count performs alignment, barcode error correction, UMI de‑duplication, and cell calling, then writes a filtered feature‑barcode matrix you can load into Seurat or Scanpy. Crucially, Cell Ranger annotates BAMs with barcode/UMI tags and applies transcript‑aware rules when counting reads, which is why most labs treat it as the “source of truth” for raw processing.
Example (minimal):
cellranger count --id=sampleA --transcriptome=/refs/GRCh38 --fastqs=/data/fastq --sample=sampleA --create-bam=true
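Once the run finishes, a typical next step is to load the filtered matrix into Python and keep it as h5ad. A minimal sketch with Scanpy (paths assume the default cellranger count output layout under the --id directory):
import scanpy as sc

# Read the filtered feature-barcode matrix; the MEX directory or the .h5 file both work.
adata = sc.read_10x_mtx("sampleA/outs/filtered_feature_bc_matrix/", var_names="gene_symbols")
# Alternative: adata = sc.read_10x_h5("sampleA/outs/filtered_feature_bc_matrix.h5")

adata.var_names_make_unique()   # gene symbols can collide
print(adata)                    # AnnData: cells x genes plus obs/var metadata

# One file holding counts and metadata, fast to reload for later steps.
adata.write_h5ad("sampleA_raw.h5ad")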
Quality control metrics that matter (and why)
QC sifts real cells from empty droplets and damaged captures. Four metrics show up everywhere:
- Number of UMIs per cell (often nCount_RNA) and number of genes per cell (nFeature_RNA). Very low values suggest poor capture; extreme highs can flag doublets.
- Fraction of aligned reads: proportion of reads mapping confidently to the transcriptome; low alignment may signal contamination or reference mismatch.
- Fraction mitochondrial (percent.mt): the share of UMIs from mitochondrial genes (e.g., MT‑prefixed in human). High values often indicate stressed or lysed cells.
- Biology‑specific fractions: hemoglobin genes (e.g., HBB) spike in red blood cells; ribosomal genes (RPS/RPL) can dominate in some preparations—both are useful monitors during filtering.
In practice, analysts visualize distributions and set study‑specific thresholds instead of hard rules. Seurat, Scanpy, and Bioconductor’s OSCA recommend exploring percent.mt alongside nFeature_RNA and nCount_RNA to avoid removing bona fide cell types with naturally unusual profiles.
Quick example (Seurat):
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
You’d then inspect violin/scatter plots and subset cells by sensible cutoffs for your dataset and species.
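The Scanpy counterpart follows the same logic. A sketch under human gene-naming conventions; the cutoff values at the end are placeholders to tune per dataset, not recommendations:
import scanpy as sc

adata = sc.read_h5ad("sampleA_raw.h5ad")   # hypothetical file from the loading step above

# Flag gene categories by symbol prefix (human conventions; adjust per species).
adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
adata.var["hb"] = adata.var_names.str.startswith(("HBA", "HBB"))

# Adds total_counts, n_genes_by_counts, pct_counts_mt, pct_counts_ribo, pct_counts_hb to adata.obs.
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt", "ribo", "hb"], percent_top=None, log1p=False, inplace=True)

# Look at the distributions before deciding anything.
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True)

# Illustrative cutoffs only; choose yours from the plots for your tissue and species.
adata = adata[(adata.obs["n_genes_by_counts"] > 200) & (adata.obs["pct_counts_mt"] < 15)].copy()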
Normalization, scaling, and batch correction in scRNA‑seq
Even after QC, cells differ in sequencing depth. That’s why we normalize. A common baseline is library‑size normalization to a target sum (e.g., counts per 10k) followed by log1p; this stabilizes variance enough for PCA and clustering in many datasets. Python users will recognize this as normalize_total + log1p in Scanpy; R users may use NormalizeData in Seurat.
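In code, this baseline is only a few lines. A sketch continuing from the QC'd object above (it assumes raw counts still sit in adata.X and stashes them in a layer before overwriting):
# Preserve raw counts for later differential expression.
adata.layers["counts"] = adata.X.copy()

# Library-size normalization to 10,000 counts per cell, then log1p.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)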
For UMI data, sctransform (SCTransform in Seurat) often performs better by modeling counts with a regularized negative binomial and returning variance‑stabilized residuals. It removes depth effects without ad‑hoc log transforms and tends to yield cleaner separation for downstream steps. In Seurat v5, sctransform v2 is the default for many workflows.
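Scanpy has no direct SCTransform port, but its experimental analytic Pearson-residuals normalization works in a similar spirit, returning variance-stabilized residuals under a negative binomial noise model. A sketch, run on a copy so the log-normalized matrix stays available (a related method, not a drop-in replacement):
# Residuals are computed from raw counts, so restore them to X on a copy first.
adata_pr = adata.copy()
adata_pr.X = adata_pr.layers["counts"].copy()
sc.experimental.pp.normalize_pearson_residuals(adata_pr)   # replaces X with Pearson residuals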
Scaling is separate: it centers and scales features (genes) so highly expressed genes don’t dominate PCs purely by magnitude. Many pipelines scale the matrix used for PCA while keeping raw counts in a separate layer for differential expression.
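A common ordering in Scanpy looks like the sketch below: variable genes are chosen on the log-normalized data, the matrix used for PCA is scaled, and the "counts" layer from the previous step keeps the raw values untouched (the parameter values are conventional choices, not requirements):
# Select highly variable genes, scale the working matrix, and run PCA.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata.raw = adata                                     # keep the full log-normalized matrix
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata, max_value=10)                      # center and unit-variance per gene, clipped at 10
sc.tl.pca(adata, n_comps=50)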
Finally, a short definition of batch correction: it’s the process of removing technical differences between runs, lanes, chemistries, or labs so that the same cell type looks similar across “batches.” Methods range from graph-based neighbor correction (BBKNN) to anchor-based integration (Seurat) to embedding-level alignment (Harmony). Recent evaluations caution that over-correction can introduce artifacts, so method choice and diagnostics matter. Harmony has performed strongly in at least one 2025 benchmark, but you should still validate against known biology and the uncorrected views.
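As one concrete option, Harmony can be applied to the PCA embedding through Scanpy's external wrapper. A sketch assuming a "batch" column exists in adata.obs and the harmonypy package is installed:
# Correct the PCA embedding across batches; the uncorrected X_pca is kept alongside it.
sc.external.pp.harmony_integrate(adata, key="batch")   # writes adata.obsm["X_pca_harmony"]

# Downstream steps then use the corrected embedding.
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)
Comparing UMAPs and marker-gene expression built with and without the corrected embedding is the simplest check against over-correction.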
Summary / Takeaways
- Know your files: FASTQ and BAM carry reads and alignments; DGE matrices (MEX/HDF5) feed analysis; h5ad bundles data and metadata for efficient iteration.
- Treat QC as exploratory: use UMIs, genes per cell, alignment, percent mitochondrial, and context‑specific markers like HBB or ribosomal content to find reasonable cutoffs.
- Normalize before you compare cells: start with library‑size + log1p or adopt sctransform for UMI data; then scale for PCA.
- Be conservative with batch correction: it’s powerful but can distort signals—always check embeddings and marker genes both before and after correction.
Have a dataset on your desk right now? Start by loading the feature‑barcode matrix, compute percent.mt, plot QC metrics, and decide thresholds with your team. The rest of the workflow, from normalization and scaling to integration, will go much more smoothly once QC is solid.