By EVOBYTE Your partner in bioinformatics
Introduction
If you’ve ever merged an RNA‑seq results table with a pathway database and watched half the rows vanish, you’ve felt the pain of gene ID mismatches. In computational biology, clean annotations are the glue that holds analyses together. Yet different databases use different identifiers, names change over time, and transcript versions drift. In this primer, we’ll clarify what “gene annotation” means, introduce the major naming authorities—HGNC, Ensembl, and RefSeq—explain why mappings can fail, and show simple, dependable ways to convert among identifiers without losing data or provenance.
What “gene annotation” means—and why genome builds matter
Gene annotation has two layers. First are identifiers, such as stable IDs and symbols that label a biological entity. Second are feature coordinates and attributes (exons, transcripts, and biotypes), often delivered as GTF or GFF3 files tied to a specific reference genome build. When results don’t line up, it’s often because the genome build (for example, GRCh37 versus GRCh38) or the gene model release differs between tools. Start every project by writing down three items: the species, the genome build, and the annotation release. Everything else flows from those choices.
HGNC vs. Ensembl vs. RefSeq
HGNC, short for the HUGO Gene Nomenclature Committee, is the official authority for human gene symbols and names. It curates approved symbols (like TP53) and tracks withdrawn or alias symbols so that the community speaks a consistent language. HGNC symbols are human‑specific and human‑readable, which makes them great for figures and reports but sometimes brittle in code when names change.
Ensembl is a genome annotation project that assigns stable, machine‑friendly identifiers across species. A typical human Ensembl gene ID looks like ENSG00000141510, with related transcript (ENST…) and protein (ENSP…) IDs. The “stable” part helps across releases, while version numbers (e.g., ENST00000269305.9) reflect model updates. Ensembl also distributes complete gene sets as GTF/GFF3 and provides xrefs that map to HGNC, NCBI Gene, and UniProt.
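To make the stable ID versus version distinction tangible, here is a minimal Python sketch that splits a versioned Ensembl identifier into its two parts. The function name and the regular expression are illustrative assumptions, not part of any Ensembl tooling; the pattern assumes the usual layout of a prefix, a G/T/P type letter, and eleven digits.

```python
import re
from typing import Optional, Tuple

# Assumed layout: "ENS" + optional species infix (e.g. MUS) + G/T/P + 11 digits,
# optionally followed by ".<version>".
ENSEMBL_ID = re.compile(r"^(ENS[A-Z]*[GTP]\d{11})(?:\.(\d+))?$")


def split_ensembl_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Return (stable_id, version); version is None when no suffix is present."""
    match = ENSEMBL_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an Ensembl gene/transcript/protein ID: {identifier!r}")
    stable_id, version = match.groups()
    return stable_id, (int(version) if version is not None else None)


print(split_ensembl_version("ENST00000269305.9"))  # ('ENST00000269305', 9)
print(split_ensembl_version("ENSG00000141510"))    # ('ENSG00000141510', None)
```

Joining on the unversioned stable ID while storing the version in a separate column is one way to keep tables mergeable without throwing away the model history.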
RefSeq, maintained by NCBI, provides curated reference sequences and accessions for transcripts and proteins, and it is tightly linked to the NCBI Gene database. You’ll recognize the accession prefixes: NM_ for mRNAs, NR_ for non‑coding RNAs, and NP_ for proteins. Genes themselves are keyed by a numeric NCBI Gene ID (formerly called Entrez Gene ID), such as 7157 for TP53. RefSeq emphasizes a stable, curated backbone for genomic resources and is widely used by clinical and regulatory communities.
GENCODE is worth naming because it supplies comprehensive human and mouse gene models that are integrated with Ensembl and widely mirrored in major workflows. When someone says “Ensembl/GENCODE gene set,” they usually mean the same gene models, with the same coordinates and biotypes, distributed through Ensembl’s infrastructure.
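Because each authority has a recognizable identifier shape, a quick sanity check on an incoming column can tell you what you are actually holding before you attempt a join. The sketch below is a rough heuristic under assumed patterns for human identifiers; the labels and the function name are made up for illustration, and anything that matches no pattern is simply reported as a possible symbol.

```python
import re

# Ordered (label, pattern) pairs; the first pattern that matches wins.
ID_PATTERNS = [
    ("ensembl_gene", re.compile(r"^ENSG\d{11}(\.\d+)?$")),
    ("ensembl_transcript", re.compile(r"^ENST\d{11}(\.\d+)?$")),
    ("ensembl_protein", re.compile(r"^ENSP\d{11}(\.\d+)?$")),
    ("refseq_mrna", re.compile(r"^NM_\d+(\.\d+)?$")),
    ("refseq_ncrna", re.compile(r"^NR_\d+(\.\d+)?$")),
    ("refseq_protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("ncbi_gene_id", re.compile(r"^\d+$")),
]


def classify_identifier(value: str) -> str:
    """Guess the identifier family from its shape alone (no database lookup)."""
    text = value.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.match(text):
            return label
    return "symbol_or_unknown"


for example in ["ENSG00000141510", "NM_000546.6", "7157", "TP53"]:
    print(example, "->", classify_identifier(example))
```

A column that yields a mix of labels is usually a sign that two sources were concatenated somewhere upstream.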
Why ID mappings drift and break analyses
Even when everyone is “right,” IDs can fall out of sync. Symbols change as biology evolves; for instance, an alias used in a legacy dataset might now be withdrawn. Genes can split or merge as evidence improves, turning one‑to‑one mappings into one‑to‑many. Transcript versions update as UTRs are extended or splice junctions refined, so NM_ or ENST accessions gain new suffixes. And because annotations are tied to a specific reference, moving between genome builds can rearrange coordinates and sometimes gene models. Finally, species matters: HGNC symbols are human‑specific, while Ensembl IDs are species‑scoped; the same “symbol” in mouse may point to a different locus or not exist at all.
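One way to catch these failure modes early is to audit a mapping table for anything that is not one‑to‑one before you join on it. The following Python sketch groups source and target IDs and reports unmapped, one‑to‑many, and many‑to‑one cases. The function name and the placeholder identifiers (GENE_A, ID_1, and so on) are hypothetical and stand in for whatever your cross‑reference table contains.

```python
from collections import defaultdict


def audit_mapping(pairs):
    """Summarize (source_id, target_id) pairs that are not clean one-to-one."""
    forward = defaultdict(set)  # source -> set of targets
    reverse = defaultdict(set)  # target -> set of sources
    for source, target in pairs:
        targets = forward[source]  # creates an empty set for unseen sources
        if target:                 # None or "" means "no mapping found"
            targets.add(target)
            reverse[target].add(source)
    return {
        "unmapped": sorted(s for s, t in forward.items() if not t),
        "one_to_many": {s: sorted(t) for s, t in forward.items() if len(t) > 1},
        "many_to_one": {t: sorted(s) for t, s in reverse.items() if len(s) > 1},
    }


# Placeholder IDs, not real genes.
pairs = [
    ("GENE_A", "ID_1"),
    ("GENE_B", "ID_2"),
    ("GENE_B", "ID_3"),  # one source, two targets
    ("GENE_C", None),    # no target at all
    ("GENE_D", "ID_1"),  # two sources share one target
]
report = audit_mapping(pairs)
print(report["unmapped"])     # ['GENE_C']
print(report["one_to_many"])  # {'GENE_B': ['ID_2', 'ID_3']}
print(report["many_to_one"])  # {'ID_1': ['GENE_A', 'GENE_D']}
```

How you resolve each category is a scientific decision; the point of the audit is that the decision gets made on purpose rather than by a silent join.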
A practical mindset helps. Treat symbols as labels for communication and stable IDs (Ensembl or NCBI Gene) as keys for computation. Always pin the mapping to a particular release and keep a copy of the cross‑reference table you used, so you can reproduce results later.
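Keeping a copy of the cross‑reference table can be as simple as writing it out next to a small manifest. The sketch below serializes a mapping as TSV with rows in a fixed order and records species, genome build, release, and a SHA‑256 checksum. The helper names and the manifest fields are assumptions chosen for illustration; the release label reuses Ensembl 110 from the R example in the next section.

```python
import csv
import hashlib
import io
import json


def serialize_xrefs(rows, fieldnames):
    """Render mapping rows as TSV text, sorted so the checksum is order-independent."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in sorted(rows, key=lambda r: tuple(str(r[f]) for f in fieldnames)):
        writer.writerow(row)
    return buffer.getvalue()


def build_manifest(tsv_text, species, genome_build, annotation_release):
    """Describe the table: provenance fields plus a SHA-256 of its exact content."""
    return {
        "species": species,
        "genome_build": genome_build,
        "annotation_release": annotation_release,
        "sha256": hashlib.sha256(tsv_text.encode("utf-8")).hexdigest(),
        "n_rows": tsv_text.count("\n") - 1,  # every line ends in "\n"; minus header
    }


FIELDS = ["hgnc_symbol", "ensembl_gene_id", "ncbi_gene_id"]
rows = [
    {"hgnc_symbol": "TP53", "ensembl_gene_id": "ENSG00000141510", "ncbi_gene_id": "7157"},
    {"hgnc_symbol": "EGFR", "ensembl_gene_id": "ENSG00000146648", "ncbi_gene_id": "1956"},
]
tsv = serialize_xrefs(rows, FIELDS)
manifest = build_manifest(tsv, "Homo sapiens", "GRCh38", "Ensembl 110")
print(tsv)
print(json.dumps(manifest, indent=2))
```

Writing `tsv` and the manifest to files in your repository gives you something concrete to diff when a rerun produces different numbers.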
Converting among HGNC, Ensembl, RefSeq: two compact workflows
Let’s make this concrete. Imagine you have differential expression results with HGNC symbols, but your enrichment tool expects NCBI Gene IDs. Or, you’ve downloaded a pathway gene set with RefSeq NM_ accessions and want Ensembl IDs to match your quantification outputs. Here are two reliable ways to translate.
In R, biomaRt taps Ensembl’s BioMart service and returns clean cross‑references. You can map from symbols to Ensembl and NCBI Gene in a few lines. Notice how we also supply the Ensembl release and species to lock the mapping:
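# R: map HGNC symbols → Ensembl Gene ID and NCBI Gene ID
library(biomaRt)

# Pin the Ensembl release so the mapping can be regenerated later
mart <- useEnsembl(
  biomart = "genes",
  dataset = "hsapiens_gene_ensembl",  # human gene set
  version = 110                       # pick your release
)

symbols <- c("TP53", "EGFR", "BRCA1")
xrefs <- getBM(
  attributes = c("hgnc_symbol", "ensembl_gene_id", "entrezgene_id"),
  filters    = "hgnc_symbol",
  values     = symbols,
  mart       = mart
)
head(xrefs)

# Symbols with no hit are absent from the result; list them explicitly
setdiff(symbols, xrefs$hgnc_symbol)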
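For a Python route, one option is the mygene client for the mygene.info service. The sketch below is an assumption about how such a lookup might be written, not a reconstruction of any particular pipeline. `fetch_hits` needs network access and the `mygene` package, while `flatten_hits` is a hypothetical helper that works offline on hits of the shape the service is assumed to return: a dict or a list of dicts under `ensembl`, and a `notfound` flag for misses.

```python
def fetch_hits(symbols):
    """Query mygene.info for human symbols (needs network and `pip install mygene`)."""
    import mygene  # imported lazily so flatten_hits below works without it

    client = mygene.MyGeneInfo()
    return client.querymany(
        symbols,
        scopes="symbol",
        fields="symbol,ensembl.gene,entrezgene",
        species="human",
    )


def flatten_hits(hits):
    """Turn mygene-style hits into (query, ensembl_gene_id, ncbi_gene_id) rows."""
    rows = []
    for hit in hits:
        if hit.get("notfound"):
            rows.append((hit["query"], None, None))  # keep unmatched symbols visible
            continue
        ensembl = hit.get("ensembl", [])
        if isinstance(ensembl, dict):  # a single match is assumed to arrive as a dict
            ensembl = [ensembl]
        gene_ids = [entry["gene"] for entry in ensembl] or [None]
        for gene_id in gene_ids:
            rows.append((hit["query"], gene_id, hit.get("entrezgene")))
    return rows


# Hand-written sample in the assumed response shape, so this runs offline.
sample_hits = [
    {"query": "TP53", "entrezgene": "7157", "ensembl": {"gene": "ENSG00000141510"}},
    {"query": "NOT_A_GENE", "notfound": True},
]
for row in flatten_hits(sample_hits):
    print(row)
```

As far as I know, mygene.info serves current annotations rather than an archived release, so it is worth recording the query date alongside the output. Emitting a row for symbols that were not found keeps the row count honest when the table is merged back into your results.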
These tiny snippets hide a lot of good practice. You’re converting via an authoritative service, requesting just the fields you need, scoping to species, and, in the R example, fixing the release so the mapping can be regenerated later. If you prefer point‑and‑click, Ensembl’s BioMart web interface can export the same cross‑references alongside gene biotypes and chromosome locations. For clinical pipelines or regulatory documentation, NCBI E‑utilities and RefSeq release notes let you pin exact versions of NM_/NP_ accessions and track historical changes.
A quick word on GTF files
Quantification tools and splicing analysis depend on annotation files as much as on IDs. GTF and GFF3 files carry gene, transcript, and exon features with attributes like gene_id, transcript_id, gene_name, and biotype. Mismatches happen when counts are generated with one file and downstream steps assume another. To stay aligned, download the GTF/GFF3 from the same source and release as your ID mapping, keep the checksum, and document it in your project’s README. If you ever need to swap in a different gene set—say, moving from Ensembl to RefSeq—re‑quantifying with the matching GTF avoids quiet but painful discrepancies.
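To see where gene_id, gene_name, and the other attributes actually live, it helps to parse one GTF record by hand. The sketch below splits the nine tab‑separated columns and unpacks the attribute column into a dictionary. The function names are illustrative and the sample line is made up (it is not a real gene), and the parser assumes the common `key "value";` attribute layout.

```python
import re

# key "quoted value";  or  key bare_value;  (final semicolon optional)
ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*(?:;|$)')


def parse_gtf_attributes(column):
    """Return {key: [values]}; keys such as `tag` may repeat within one record."""
    attributes = {}
    for key, quoted, bare in ATTRIBUTE.findall(column):
        attributes.setdefault(key, []).append(quoted if quoted else bare)
    return attributes


def parse_gtf_line(line):
    """Split one GTF record into its core fields plus parsed attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": parse_gtf_attributes(fields[8]),
    }


# Made-up record for illustration only.
line = (
    "1\texample\tgene\t1000\t2000\t.\t+\t.\t"
    'gene_id "GENE0001"; gene_name "EXAMPLE1"; gene_biotype "protein_coding"; '
    'tag "a"; tag "b";'
)
record = parse_gtf_line(line)
print(record["attributes"]["gene_id"])  # ['GENE0001']
print(record["attributes"]["tag"])      # ['a', 'b']
```

Comparing the set of gene_id values in your GTF against the IDs in your mapping table is a cheap check that both came from the same release.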
Summary / Takeaways
Gene annotation sounds bureaucratic, but it’s central to trustworthy results. Use HGNC symbols to communicate with humans and rely on stable identifiers, such as Ensembl gene IDs or NCBI Gene IDs, as the keys in your data joins. Anchor every mapping to a species, a genome build, and a release, and keep a copy of the cross‑reference you used. When in doubt, convert through authoritative services rather than ad‑hoc spreadsheets. As a next step, take one of your recent analyses, regenerate its ID mapping with an explicit release, and add the mapping file and version notes to your repo. Your future self—and your reviewers—will thank you.
Further Reading
HGNC: HUGO Gene Nomenclature Committee
Ensembl BioMart and identifiers
NCBI Gene and RefSeq overview
GENCODE gene sets for human and mouse
mygene.info API documentation
