NGS File Format Primer: FASTQ, SAM/BAM, CRAM

Jonathan Alles

EVOBYTE Digital Biology

Introduction

If next‑generation sequencing (NGS) is a data factory, then file formats are its shipping labels. They tell you what’s inside, where it came from, and how to move it downstream. In this short guide, we’ll follow a read’s journey from the sequencer to downstream analysis, unpack what lives inside the most common formats, and show quick ways to peek under the hood with standard tools. Along the way, we’ll point to software—like STAR for alignment and SAMtools for inspection—that turns these files from opaque blobs into useful, searchable data.

From signals to sequences: FASTQ as the raw read container

Your first stop after a run is almost always FASTQ. Think of a FASTQ file as a box with four small cards per read: one for the read name (a line starting with @), one for the nucleotide sequence, one divider (a line starting with +), and one for base qualities. Those per‑base quality scores use Phred encoding: each score is Q = -10 × log10(p), where p is the estimated probability that the base call is wrong, and it is stored as a single ASCII character (the score plus 33 in the common Phred+33 convention). In practice, FASTQ usually arrives compressed as .fastq.gz, often as a pair of files (R1/R2) for paired‑end libraries. The structure is simple but powerful: four lines per read, repeated millions to billions of times. That predictability is what makes FASTQ friendly to streaming and to quick command‑line checks.
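To make the four-card picture concrete, here is a minimal Python sketch that splits one record and decodes its quality string. It assumes Phred+33 encoding and a well-formed, unwrapped four-line record; the function names and the toy read are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into name, sequence, and Phred scores."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33, so score = ASCII code of the character minus 33
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


def error_probability(q):
    """Convert a Phred score Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical toy record, not real sequencer output
record = ["@read1\n", "ACGT\n", "+\n", "II5#\n"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
print([error_probability(q) for q in scores])
```

In this toy record the character I decodes to Q40 (a 1 in 10,000 chance of error) and # to Q2, which is why a tail of # characters is a familiar sign of low-quality bases.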

When someone says “sequencing output,” they almost always mean FASTQ. Platforms differ in how they package the files, but the underlying contract is the same: sequences plus qualities, ready for alignment or k‑mer–based steps like QC and contamination checks. If you only remember one thing, remember this: FASTQ is the raw truth your pipeline begins with.

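Example: a fast sanity check that the file decompresses and that the first record has the expected four‑line structure. If zcat on your system only understands .Z files, as on some macOS versions, use gzip -cd instead.

zcat sample_R1.fastq.gz | head -n 4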

From reads to coordinates: SAM, BAM, and CRAM as alignment outputs

Once you align reads to a reference genome, you move from bare sequences into coordinate space. The Sequence Alignment/Map (SAM) format is a tab‑delimited text format that describes each read’s alignment: where it lands (reference name and 1‑based position), how it aligns (the CIGAR string), how confident the placement is (MAPQ), a bitwise FLAG that encodes properties like proper pairing and strand, and optional tags. Because SAM is verbose, we almost always store alignments as BAM, the compressed binary equivalent. For large cohorts and cloud workflows, CRAM goes a step further by encoding reads relative to the reference, often saving 30–60% over BAM; the trade‑off is that decoding a CRAM file normally requires access to the same reference sequence. All three share the same conceptual schema; they differ mostly in storage efficiency and random‑access behavior.
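The per-read fields are easier to remember once you have decoded one by hand. The sketch below, in Python, unpacks the FLAG bits and walks the CIGAR string of a single SAM line to find where the alignment ends on the reference. The read itself is invented, and the helper names are illustrative rather than part of any library.

```python
import re

# FLAG bits as defined in the SAM specification
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Only these operations advance the position on the reference
REFERENCE_CONSUMING = set("MDN=X")


def decode_flag(flag):
    """Return the names of all FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar)
               if op in REFERENCE_CONSUMING)


# Hypothetical spliced read: 50 matched bases, a 200-base intron, 50 more
sam_line = "read1\t99\tchr1\t1000\t255\t50M200N50M\t=\t1300\t400\t*\t*"
fields = sam_line.split("\t")
flag, pos, mapq, cigar = int(fields[1]), int(fields[3]), int(fields[4]), fields[5]

flag_names = decode_flag(flag)
end = pos + reference_span(cigar) - 1  # POS is 1-based, so the end is inclusive
print(flag_names, pos, end, mapq)
```

The N operation is what lets a spliced aligner describe a read that skips an intron: the read is only 100 bases long, yet it spans 300 bases of reference.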

Where do these files come from? Aligners produce them. BWA and Bowtie 2 write SAM, which is usually piped straight into samtools for conversion to BAM. In RNA‑seq, STAR is a popular choice because it handles spliced alignments at speed and can write BAM directly, optionally sorted by coordinate. That saves an extra sort step and gets you closer to analysis‑ready files in one command.

Example: running STAR to align paired‑end reads and emit a coordinate‑sorted BAM you can index immediately.

STAR --runThreadN 8 \
     --genomeDir /ref/STAR_GRCh38/ \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix sample1_

This produces sample1_Aligned.sortedByCoord.out.bam plus helpful side products, such as the splice‑junction table sample1_SJ.out.tab and the run logs.

What’s inside an alignment file? In addition to the per‑read fields, SAM, BAM, and CRAM all carry a header that lists the reference contigs (@SQ lines), the sort order (the SO field of the @HD line), and read‑group (@RG) metadata. The header is critical for consistent indexing and for tools that rely on sample/library information. When you sort by coordinate and index (creating a .bai for BAM or .crai for CRAM), you unlock fast region‑based queries and genome‑browser viewing.
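As a rough illustration of what the header holds, this Python sketch groups header lines by record type and pulls out the sort order, contig names, and sample names. The header text is a made-up example of the kind samtools view -H prints, and parse_sam_header is a hypothetical helper, not a library function.

```python
def parse_sam_header(header_text):
    """Group SAM header lines by record type (HD, SQ, RG, ...)."""
    header = {}
    for line in header_text.splitlines():
        # Skip anything that is not a header line, plus free-text @CO comments
        if not line.startswith("@") or line.startswith("@CO"):
            continue
        record_type, *fields = line.split("\t")
        tags = dict(field.split(":", 1) for field in fields)
        header.setdefault(record_type[1:], []).append(tags)
    return header


# Hypothetical header for one sample sequenced on two lanes
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2\tLN:242193529",
    "@RG\tID:run1.lane1\tSM:sampleA\tLB:libA\tPL:ILLUMINA",
    "@RG\tID:run1.lane2\tSM:sampleA\tLB:libA\tPL:ILLUMINA",
])

header = parse_sam_header(example_header)
sort_order = header["HD"][0].get("SO", "unknown")
contigs = [sq["SN"] for sq in header["SQ"]]
samples = {rg.get("SM") for rg in header.get("RG", [])}
print(sort_order, contigs, samples)
```

Two read groups that share one SM value, as here, are what you would expect when a single sample was split across lanes; more than one sample name in a file you believed to be single-sample is worth a second look.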

Variants and features: VCF, BED, and gene annotations in context

After alignment and processing, many pipelines distill evidence into variants. The Variant Call Format (VCF) captures these as one record per variant site, with fixed fields (chromosome, 1‑based position, ID, reference and alternate alleles, quality, and filter) and extensible INFO and FORMAT sections for annotations and per‑sample genotypes. BCF is VCF’s binary counterpart, used with an index for speed at scale. Most modern tools speak VCF v4.2 or v4.3; check that your callers and annotation tools agree on the version.
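A single VCF data line shows how the fixed fields and the extensible sections fit together. The Python sketch below splits one invented record into its parts; it assumes tab-separated columns, ignores the ## meta lines entirely, and uses hypothetical sample names.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO, and per-sample values."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO holds key=value pairs plus bare flags, separated by semicolons
    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True

    # FORMAT names the colon-separated keys used in every sample column
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    genotypes = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles may share one record
        "filter": filt,
        "info": info_dict,
        "genotypes": genotypes,
    }


# Hypothetical multi-allelic site with two samples
vcf_line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;DB\tGT:DP\t0/1:14\t1/2:16"
variant = parse_vcf_record(vcf_line, ["s1", "s2"])
print(variant)
```

Genotype indices refer back to the allele list: 0 is REF, 1 is the first ALT allele, and so on, so 1/2 for s2 means one copy of G and one of T.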

Not every file describes alignments or variants. You’ll also encounter feature files used to define intervals or annotations. BED lists intervals (zero‑based, half‑open) and is perfect for targets, blacklists, or track visualization. GTF/GFF3 describe gene models with one‑based, inclusive coordinates and are essential for counting and transcript‑aware analysis. You don’t align to these formats, but you do join against them constantly. Keeping their coordinate conventions straight, especially BED’s zero‑based starts, prevents subtle off‑by‑one bugs in downstream statistics and plots.
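The off-by-one risk is easiest to see with numbers. This short Python sketch converts between BED-style intervals (zero-based, half-open) and the one-based, inclusive intervals used by GTF/GFF3; the function names and the example exon are illustrative.

```python
def bed_to_one_based(start, end):
    """Zero-based half-open [start, end) to one-based inclusive [start, end]."""
    return start + 1, end


def one_based_to_bed(start, end):
    """One-based inclusive [start, end] to zero-based half-open [start, end)."""
    return start - 1, end


def bed_length(start, end):
    """In BED, the interval length is simply end minus start."""
    return end - start


# Hypothetical exon covering bases 100 through 200 of a chromosome
gtf_start, gtf_end = 100, 200
bed_start, bed_end = one_based_to_bed(gtf_start, gtf_end)

print(bed_start, bed_end)                   # BED coordinates for the same exon
print(bed_length(bed_start, bed_end))       # length from BED
print(gtf_end - gtf_start + 1)              # same length from GTF
print(bed_to_one_based(bed_start, bed_end)) # round trip back to GTF
```

Only the start coordinate changes between the two conventions. Forgetting that shift moves every interval by one base, and forgetting the +1 in the one-based length formula undercounts each feature by one base.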

Peeking inside with SAMtools

Even if you live in Python or R, lightweight command‑line checks save entire afternoons. SAMtools is the Swiss‑army knife for SAM/BAM/CRAM. Use samtools view to stream alignments and samtools flagstat for alignment summaries; add an index and you can query a single gene or exon in seconds. For example, samtools view -H prints only the header so you can confirm read‑group tags before merging replicates, and samtools view -q 30 keeps only alignments with a MAPQ of at least 30. For a quick integrity test, samtools quickcheck reports files that look truncated. These micro‑checks catch problems early: truncated files, missing @SQ lines, wrong sort order, or mis‑labeled samples.
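If you want to see the logic behind those summaries, the Python sketch below tallies a few counts from SAM text, roughly in the spirit of samtools flagstat and the -q filter. It is a simplified stand-in rather than a reimplementation, and the three alignment lines are invented.

```python
def summarize(sam_lines, min_mapq=30):
    """Count records, mapped records, and records at or above a MAPQ cutoff."""
    total = mapped = passing = 0
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignments
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        total += 1
        if not flag & 0x4:        # bit 0x4 set means the read is unmapped
            mapped += 1
        if mapq >= min_mapq:      # same cutoff idea as `samtools view -q`
            passing += 1
    return {"total": total, "mapped": mapped, "mapq_pass": passing}


# Hypothetical SAM text: one confident hit, one ambiguous hit, one unmapped read
sam_text = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
summary = summarize(sam_text)
print(summary)
```

On real data you could feed it the decompressed stream, for example by piping samtools view output into a script that passes sys.stdin to summarize. Keep in mind that MAPQ scales differ between aligners, so a cutoff of 30 does not mean the same thing everywhere.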

Visualization closes the loop. Load a coordinate‑sorted, indexed BAM into IGV alongside its VCF and BED targets to eyeball alignments at candidate loci. Watching the CIGAR strings dance across splice junctions or indel sites turns abstract QC metrics into concrete intuition about data quality and library prep. When something looks off—say, wildly uneven coverage or unexpected strand bias—those same formats make it easy to subset, re‑index, or re‑align just the affected regions.

Summary / Takeaways

NGS file formats map neatly onto the pipeline. FASTQ is the sequencer’s raw output. Aligners such as STAR emit SAM/BAM or directly coordinate‑sorted BAM, which you’ll index for random access or compress further as CRAM. Variant callers produce VCF, while BED and GTF/GFF3 provide intervals and annotations that guide interpretation. With SAMtools, you can inspect headers, count alignments, or extract regions in seconds. Keep this mental model close, and your next “what is this file?” moment becomes a two‑minute detour instead of a lost afternoon.