dbVar for Computational Biologists: Guide to structural variation data

Jonathan Alles

EVOBYTE Digital Biology


Introduction: why structural variation and dbVar matter

When a copy-number change or inversion explains a tough phenotype, it often appears as a broad signal rather than a neat single-nucleotide variant. That’s where structural variation becomes the story, and dbVar becomes the library. dbVar is NCBI’s database dedicated to human genomic structural variants, bringing together clinical submissions and large surveys so you can search, visualize, and download the events that reshape chromosomes. Think of it as the place to check whether your suspected deletion has been seen before, how it was called, and what it might mean.

Structural variation (SV): a quick, working definition

Structural variation refers to genomic alterations typically 50 base pairs or larger that change structure or dosage. In practice, you’ll encounter deletions, insertions, duplications, inversions, balanced translocations, multiallelic copy-number changes, and more complex patterns. While dbVar can store events of many sizes, variants shorter than 50 bp are usually handled by dbSNP, and dbVar focuses on larger events that are harder to detect and interpret. Importantly, the database now concentrates on human data; new non‑human submissions were discontinued in 2017. If you’re working outside human, submit to DGVa at EMBL‑EBI instead, and rely on dbVar’s FTP archive for historical non‑human datasets.

Inside dbVar: studies, asserted regions (nsv) and supporting calls (nssv)

dbVar organizes data the way researchers produce it: by study. A study aggregates a coherent set of samples, assays, and analyses and receives an accession like nstd### (for NCBI submissions) or estd### (for EBI submissions). Within a study, two related objects capture the call: an asserted variant region (sv) and its supporting variant calls (ssv). You’ll see accessions prefixed accordingly: nsv or esv for asserted regions, nssv or essv for supporting calls. The asserted region is the author’s best statement of where the event lies; supporting calls record the specific evidence underlying it, such as array segments or read‑based signatures. This hierarchy mirrors the experimental process and lets you trace from a summary variant down to the signals that justified it. Unlike dbSNP’s rs identifiers, there isn’t yet a universal, cross‑study “reference” structural variant because breakpoint precision and calling methods still vary widely.
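
To make the accession scheme concrete, here is a minimal sketch that maps a dbVar-style accession to its submitting archive and its level in the hierarchy. The function name classify_accession and the assumption that an accession is simply a prefix followed by digits are illustrative choices, not part of any dbVar API.

```python
import re

# Prefix -> (submitting archive, object level), following the naming described above.
_PREFIXES = {
    "nstd": ("NCBI", "study"),
    "estd": ("EBI", "study"),
    "nssv": ("NCBI", "supporting call"),
    "essv": ("EBI", "supporting call"),
    "nsv": ("NCBI", "asserted region"),
    "esv": ("EBI", "asserted region"),
}

# Assumption: an accession is one of the prefixes above followed only by digits.
_ACCESSION = re.compile(r"^(nstd|estd|nssv|essv|nsv|esv)(\d+)$")


def classify_accession(accession):
    """Return (archive, level) for a dbVar-style accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip().lower())
    if match is None:
        return None
    return _PREFIXES[match.group(1)]


print(classify_accession("nstd102"))
```

A helper like this is handy when a results table mixes regions and supporting calls and you want to split them before counting or plotting.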

What’s in the database: clinical, common, and long‑read SV collections

dbVar aggregates millions of submitted structural variants from well‑known efforts, including 1000 Genomes and gnomAD, alongside clinically interpreted SVs funneled through ClinVar. If you want a single place to start for patient‑relevant events, load the Clinical Structural Variants study nstd102, which tracks ClinVar’s large variants with assertions like Benign or Pathogenic. For population background, NCBI also curates common SVs in nstd186. Long‑read datasets and broad genome‑wide surveys are highlighted in the Structural Variation Data Hub, making it easy to find the cohorts most relevant to your analysis. Together, these collections let you ask simple but powerful questions: Is my 500 kb deletion a known benign CNV, or does it overlap a pathogenic duplication hotspot?

Accessing dbVar in practice: browse, query, and download

You can explore dbVar visually or programmatically, and it pays to combine both.

Start in a browser when context matters. Variation Viewer and the dbVar Genome Browser show SV tracks in genomic context alongside genes and short variants. You can search by gene, coordinate range, or phenotype, toggle between GRCh37 and GRCh38, and zoom from chromosome to exon. A handy trick is to add the nstd102 track to instantly see clinically interpreted SVs in your region of interest. If you prefer UCSC or GDV, load the public dbVar track hub and keep your environment consistent across projects. Hovering over a feature reveals its length and type; clicking the dbVar accession (for example, nsv… or nssv…) opens the variant page with details and links to related resources.

When you need automation or scale, switch to Entrez E‑utilities. dbVar is an Entrez database, so you can script searches with ESearch, summarize with ESummary, and fetch records or links just as you would for PubMed. The pattern is familiar: send a query, capture UIDs, then retrieve structured summaries for downstream parsing. Here’s a tiny Biopython example that searches study nstd102 and prints the raw XML summaries of the first five variants. Replace the query with your gene or coordinate filter as needed.

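from Bio import Entrez

# NCBI asks for a contact address on every E-utilities request
Entrez.email = "you@example.com"

# Search dbVar for clinical SVs in study nstd102 and keep the hits on the history server
handle = Entrez.esearch(db="dbvar",
                        term='("variant"[Object Type] AND nstd102)',
                        usehistory="y")
search = Entrez.read(handle)
handle.close()

# Summarize the first five hits via WebEnv/QueryKey instead of an explicit ID list
summ = Entrez.esummary(db="dbvar",
                       webenv=search["WebEnv"],
                       query_key=search["QueryKey"],
                       retmax=5, retmode="xml")
raw = summ.read()
summ.close()

# Depending on the Biopython version, the handle yields bytes or str
print(raw.decode("utf-8") if isinstance(raw, bytes) else raw)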

This pattern scales to hundreds of thousands of records using WebEnv and QueryKey without juggling explicit ID lists.
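
One way to page through a large result set is to walk it in fixed-size windows using the retstart and retmax parameters of ESummary. The sketch below assumes the ESearch result was created with usehistory="y" and that Entrez.email is already set; the helper names batch_windows and fetch_summaries and the default batch size of 500 are illustrative choices, not NCBI recommendations.

```python
def batch_windows(total, batch_size=500):
    """Yield (retstart, retmax) pairs that cover `total` records in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_summaries(search_result, batch_size=500):
    """Yield raw ESummary payloads for an ESearch result made with usehistory="y"."""
    from Bio import Entrez  # imported lazily so batch_windows works without Biopython

    total = int(search_result["Count"])
    for retstart, retmax in batch_windows(total, batch_size):
        handle = Entrez.esummary(
            db="dbvar",
            webenv=search_result["WebEnv"],
            query_key=search_result["QueryKey"],
            retstart=retstart,
            retmax=retmax,
        )
        try:
            yield handle.read()  # raw XML; parse or write to disk downstream
        finally:
            handle.close()


# The windowing logic can be checked offline, without any network call.
print(list(batch_windows(1200, 500)))
```

Keeping the windowing separate from the network call makes it easy to test the paging logic and to add pauses or retries between requests later.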

For bulk analysis, download standardized files from the dbVar FTP site. You’ll find VCF, GVF, and TSV bundles by study and assembly, with data for both GRCh37 and GRCh38. Because files are bgzipped and indexed, you can query them remotely with tabix without downloading the full file; tabix fetches only the small index and the blocks that cover your region. For example, this grabs all nstd102 variant regions on chromosome 1, positions 1 to 1,000,000, in GRCh38 directly from NCBI:

tabix -h https://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/vcf/nstd102.GRCh38.variant_region.vcf.gz 1:1-1000000

You can then intersect those results with your own targets using bedtools or stream them into a notebook for annotation. The FTP layout was redesigned to make this easier, with aggregated files by assembly and consistent naming.
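
If you would rather stay in Python than shell out to bedtools, the intersection step can be sketched in a few lines. This example assumes the VCF records carry an END key in the INFO column, as is conventional for structural variants, and that targets use 1-based inclusive coordinates like the VCF (BED files are 0-based and half-open, so convert first). The function names, variant IDs, and gene names are hypothetical.

```python
def parse_sv_line(line):
    """Parse one VCF data line into (chrom, start, end, variant_id, info)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id = fields[0], int(fields[1]), fields[2]
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    # Fall back to POS when no END is given (treated as a single-position event).
    end = int(info.get("END", pos))
    return chrom, pos, end, variant_id, info


def overlapping(vcf_lines, targets):
    """Return (variant_id, target_name) pairs whose intervals overlap."""
    hits = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        chrom, start, end, variant_id, _ = parse_sv_line(line)
        for t_chrom, t_start, t_end, name in targets:
            if chrom == t_chrom and start <= t_end and t_start <= end:
                hits.append((variant_id, name))
    return hits


# Made-up records in the shape tabix would stream back.
example_lines = [
    "##fileformat=VCFv4.1",
    "1\t10001\texample_sv_1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=20000",
    "1\t500000\texample_sv_2\tN\t<DUP>\t.\t.\tSVTYPE=DUP;END=600000",
]
example_targets = [("1", 15000, 16000, "geneA"), ("1", 700000, 800000, "geneB")]
print(overlapping(example_lines, example_targets))
```

The nested loop is fine for a handful of targets; for genome-wide target lists an interval tree or bedtools will be faster.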

A few practical notes help avoid confusion. dbVar synchronizes with DGVa at EMBL‑EBI, so the same study may appear with different prefixes depending on where it was submitted; treat nstd and estd as parallel doors to the same dataset. Also remember that “asserted regions” (nsv/esv) and “supporting calls” (nssv/essv) are both useful: the region provides a consolidated footprint, while the supporting calls expose the raw evidence, calling method, and placement details you may need for quality assessment or visualization. Finally, because structural variant calling remains heterogeneous, breakpoint precision varies; don’t be surprised if multiple asserted regions in different studies overlap rather than match exactly.
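
Because regions from different studies rarely share exact breakpoints, a tolerant comparison is usually more useful than an equality check. One common convention, which is not something dbVar itself prescribes, is reciprocal overlap: the shared span must cover at least a chosen fraction of both intervals. The sketch below assumes 1-based inclusive coordinates on the same chromosome and assembly; the function name and the 0.5 threshold in the usage line are illustrative.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions for intervals A and B.

    Coordinates are 1-based and inclusive; 0.0 means the intervals are disjoint,
    1.0 means they are identical.
    """
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    if shared <= 0:
        return 0.0
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return min(shared / len_a, shared / len_b)


# Treat two regions as "the same event" when each covers at least half of the other.
same_event = reciprocal_overlap(100, 199, 150, 249) >= 0.5
print(same_event)
```

Using the minimum of the two fractions keeps a small event from being matched to a much larger region that merely contains it.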

Summary / Takeaways

dbVar gives computational biologists a practical way to reason about large genomic changes. Use it to check your calls against clinical collections like nstd102, to gauge background frequency with curated common SVs, and to visualize events in context before you commit to interpretation. Start in the browser to orient yourself, then automate searches with Entrez and stream VCF or GVF from FTP for scalable workflows. As you analyze, lean on the data hierarchy: asserted regions give a clean footprint, and supporting calls give method‑level detail. Keep in mind that dbVar focuses on human data and syncs with DGVa. What question could you answer today by overlaying nstd102 with your region of interest and a gene list you care about?