Introduction
Most of our DNA does not code for proteins. Yet this “non-coding” majority is the control room that decides when, where, and how genes turn on. For years, data scientists and biologists have tried to read that control logic—the regulatory code—directly from sequence. Now, AlphaGenome steps in as a foundation model for genomics, trained to map raw DNA to a rich set of molecular readouts, and to predict how single-base changes ripple through that system. First announced on June 25, 2025, and later published in Nature in January 2026, AlphaGenome brings long-range context and base-pair resolution together in one model, moving sequence-to-function prediction from niche benchmarks toward everyday analysis.
AlphaGenome: a unified sequence-to-function model
At its core, AlphaGenome is a multimodal deep learning model that takes up to one megabase (1,000,000 bases) of DNA as input and predicts diverse molecular signals that encode gene regulation. Those outputs include gene expression profiles, splice junction usage, chromatin accessibility and histone marks, transcription factor binding, and even 3D chromatin contact maps—all at base-level or assay-appropriate resolution. Because it sees a very long window, the model can capture distal enhancers that may sit hundreds of kilobases away from their target genes, while still resolving the single-nucleotide syntax that defines splice sites or transcription factor motifs.
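To make the input/output contract concrete, here is a shape-level sketch in NumPy. The 1 Mb window length (2**20 bases) matches the model's input size; the 2,048 bp contact-map bin size and the specific arrays are illustrative stand-ins, not values taken from the model itself:

```python
import numpy as np

SEQ_LEN = 1_048_576  # the ~1 Mb window the model ingests (2**20 bases)

# One-hot encode a DNA window: shape (length, 4) for A/C/G/T
rng = np.random.default_rng(0)
sequence = rng.integers(0, 4, size=SEQ_LEN)
one_hot = np.eye(4)[sequence]

# Illustrative output shapes: base-resolution tracks vs. binned 2D contact maps
rna_seq_track = np.zeros(SEQ_LEN)         # 1 bp resolution expression signal
bin_size = 2048                           # illustrative contact-map bin size
n_bins = SEQ_LEN // bin_size
contact_map = np.zeros((n_bins, n_bins))  # binned pairwise contact matrix

print(one_hot.shape, rna_seq_track.shape, contact_map.shape)
```

The asymmetry is the point: some modalities (expression, splicing) live at single-base resolution, while others (3D contacts) are intrinsically binned, and one model emits both.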
Unlike earlier task-specific models, AlphaGenome is trained once to predict many modalities across many cell types and tissues, in both human and mouse. That makes it a genuine foundation model for regulatory genomics: a single system that learns a general representation of DNA sequence useful for a wide variety of downstream tasks, from variant interpretation to enhancer–promoter linking. In cross-benchmark evaluations, it matched or surpassed specialized state-of-the-art tools while offering a single interface for sequence-to-function queries.
Inside the model: how it learns the regulatory code
When people talk about the regulatory code, they mean the grammar that connects DNA letters to regulatory outcomes—motifs and motif combinations, spacing rules, nucleosome positioning preferences, long-range enhancer contacts, and context that differs by cell type. AlphaGenome is designed to learn that grammar by predicting what experimental assays measure. To do that, it trains on large public resources—ENCODE, GTEx, FANTOM5, 4D Nucleome, and related compendia—so it repeatedly sees how sequence patterns correlate with measured expression, accessibility, histone marks, transcription factor binding, and 3D structure across hundreds of cell and tissue contexts. Over time, those correlations become an internal language for regulation.
The architecture blends local and global reasoning. A convolutional encoder detects short motifs and local syntax. A transformer “tower” moves information across long distances, so signals from an enhancer can inform the representation near a gene’s promoter. A decoder then produces the different tracks (modalities) at appropriate resolutions, including explicit modeling of splice junctions. Training is staged: large ensembles of “teacher” models are first trained on 1 Mb windows at base-pair resolution; then a student model is distilled from those teachers to deliver near-ensemble performance at much lower inference cost. This combination—long context, base resolution, and distillation—turns out to be crucial for regulatory variant effect prediction.
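The convolutional front end is, at heart, a bank of learned motif detectors. A toy version of one detector is a mean-centred position weight matrix slid along the sequence, which is exactly what a 1-D convolution over one-hot DNA computes (the kernel and sequence here are made up for illustration):

```python
import numpy as np

def one_hot(seq):
    """Encode an A/C/G/T string as a (len, 4) matrix."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return np.eye(4)[[idx[b] for b in seq]]

def scan(one_hot_seq, kernel):
    """Slide a motif kernel along the sequence (a 1-D convolution)."""
    k = len(kernel)
    return np.array([
        np.sum(one_hot_seq[i:i + k] * kernel)
        for i in range(len(one_hot_seq) - k + 1)
    ])

# A toy "TATA" detector: mean-centred one-hot, like a learned filter
kernel = one_hot("TATA") - 0.25

seq = "GGGCGTATAGCCG"
scores = scan(one_hot(seq), kernel)
print(int(scores.argmax()))  # index where "TATA" begins
```

A real encoder stacks hundreds of such filters and learns them from data; the transformer tower then lets a strong motif hit influence representations hundreds of kilobases away.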
Just as important is the multi-task objective. Because AlphaGenome predicts many modalities at once, it benefits from cross-modal structure. Accessibility and expression reinforce each other; splicing patterns constrain where exons start and end; contact maps bolt on the 3D geometry that enhancer–promoter pairs depend on. Ablation studies in the Nature paper show that the full multimodal model generally outperforms single-modality variants and that training on true 1 Mb sequences yields better results than shorter windows, even if you later run inference on smaller intervals for speed.
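A multi-task objective of this kind boils down to a weighted sum of per-modality losses. The sketch below uses mean squared error and made-up weights; the real per-track losses and weightings are training details not reproduced here:

```python
import numpy as np

def multimodal_loss(preds, targets, weights):
    """Weighted sum of per-modality losses, as in multi-task training.

    preds/targets: dicts mapping modality name -> np.ndarray track.
    weights: hypothetical per-modality weights.
    """
    total = 0.0
    for name, w in weights.items():
        # Mean squared error as a stand-in for the per-track loss
        total += w * float(np.mean((preds[name] - targets[name]) ** 2))
    return total

preds = {"rna_seq": np.array([1.0, 2.0]), "atac": np.array([0.5, 0.5])}
targets = {"rna_seq": np.array([1.0, 1.0]), "atac": np.array([0.0, 1.0])}
weights = {"rna_seq": 1.0, "atac": 0.5}
print(multimodal_loss(preds, targets, weights))  # 0.625
```

Because every modality contributes gradient signal to the same shared representation, an error in one track (say, accessibility) can be corrected by structure learned from another (say, expression).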
From predictions to practice
AlphaGenome’s headline use case is variant effect prediction. Given a candidate single-nucleotide variant (SNV), the model can contrast predictions for the reference and alternate allele to estimate how that change affects expression, splicing, chromatin accessibility, and more. In rare-disease analysis, this can help triage non-coding variants that would otherwise sit in the “uncertain significance” bucket. Early stories from hackathons and research collaborations suggest that such AI-guided triage can shorten the path from whole-genome sequencing to plausible molecular mechanisms, especially when paired with domain expertise and targeted wet-lab follow-up.
Because it outputs 3D contact maps and promoter–enhancer features, the model also improves enhancer–gene linking. In the Nature study, AlphaGenome-derived features boosted the ENCODE rE2G enhancer–promoter model, a task where long-range context really matters. That provides a practical route to prioritize regulatory elements for CRISPRi/CRISPRa perturbations in disease-relevant cell types, before spending time and money in the lab.
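The simplest way to use a predicted contact map for linking is to read out the row for the promoter's bin and rank candidate enhancer bins by contact strength. This is a toy version of that idea (the matrix and bin indices are invented; real linking models like rE2G combine many more features):

```python
import numpy as np

def rank_enhancers_by_contact(contact_map, promoter_bin, candidate_bins):
    """Order candidate enhancer bins by predicted contact with the promoter.

    contact_map: symmetric (n_bins, n_bins) matrix of predicted contacts.
    """
    contacts = contact_map[promoter_bin, candidate_bins]
    order = np.argsort(contacts)[::-1]  # strongest contact first
    return [candidate_bins[i] for i in order]

# Toy 6-bin contact map: bin 4 touches the promoter (bin 1) most strongly
cm = np.full((6, 6), 0.1)
cm[1, 4] = cm[4, 1] = 0.9
cm[1, 2] = cm[2, 1] = 0.3

print(rank_enhancers_by_contact(cm, promoter_bin=1, candidate_bins=[2, 4, 5]))
```

In practice you would intersect this ranking with predicted accessibility and expression shifts before nominating elements for CRISPRi/CRISPRa follow-up.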
We’re also starting to see targeted case studies where AlphaGenome narrows the search space for functional non-coding variants and those leads validate at the bench. For example, a recent paper used AlphaGenome to prioritize regulatory candidates underlying RHD expression differences and confirmed selected predictions with base-editing assays, pointing toward more scalable non-coding variant interrogation in blood-group genetics. Expect more of these “AI proposes, lab disposes” loops, especially in loci where a grab bag of variants all look plausible on paper.
Finally, the same mechanics make AlphaGenome interesting for synthetic biology. If you can forecast splicing, promoter strength, or enhancer activity across tissues, you can begin to design regulatory sequences to achieve cell-type-specific expression or to minimize off-target activity. Designs will still need iteration, but an accurate oracle can reduce the number of constructs you test and focus your exploration on the most promising sequence neighborhoods.
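The design loop this enables is essentially black-box optimization against a scoring oracle. Here is a minimal greedy sketch; the GC-content scorer is a deliberately trivial stand-in for a real model prediction such as promoter strength in a target tissue:

```python
import random

BASES = "ACGT"

def greedy_design(seq, score, n_rounds=50, seed=0):
    """Greedy sequence design against a scoring oracle.

    score: any callable mapping a sequence string to a number; in practice
    this would be a model prediction, not the toy scorer used below.
    """
    rng = random.Random(seed)
    best, best_score = seq, score(seq)
    for _ in range(n_rounds):
        pos = rng.randrange(len(best))
        base = rng.choice(BASES)
        candidate = best[:pos] + base + best[pos + 1:]
        s = score(candidate)
        if s > best_score:  # keep the mutation only if the oracle likes it
            best, best_score = candidate, s
    return best, best_score

# Toy oracle: reward GC content (a stand-in for a real model score)
gc = lambda s: s.count("G") + s.count("C")
designed, final = greedy_design("ATATATATAT", gc)
print(designed, final)
```

An accurate oracle makes each accepted mutation more likely to survive wet-lab validation, which is exactly where the "fewer constructs tested" savings come from.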
Quick start: a tiny API walkthrough you can reproduce
You don’t need to train anything from scratch to try these ideas. DeepMind provides a research API and client that let you submit a 1 Mb interval and, optionally, one or more variants to score. Here’s a minimal example that asks for RNA-seq predictions around a variant, then plots the reference versus alternate signal.
from alphagenome.data import genome
from alphagenome.models import dna_client

API_KEY = "YOUR_API_KEY"
model = dna_client.create(API_KEY)

# A 1 Mb interval around your locus of interest
interval = genome.Interval(chromosome="chr22", start=35677410, end=36725986)

# A single-nucleotide variant to score
variant = genome.Variant(chromosome="chr22", position=36201698,
                         reference_bases="A", alternate_bases="C")

outputs = model.predict_variant(
    interval=interval,
    variant=variant,
    ontology_terms=["UBERON:0001157"],  # tissue/cell-type hint
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
)
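Once you have the reference and alternate tracks, a one-number summary of the variant's effect is often enough for triage. The helper below is a generic sketch over NumPy arrays; the commented attribute paths follow the client's variant-output structure, but check them against your client version:

```python
import numpy as np

def variant_delta(ref_track, alt_track, eps=1e-6):
    """Summarize a variant's effect as log2 fold change of total signal."""
    return float(np.log2((alt_track.sum() + eps) / (ref_track.sum() + eps)))

# With the API, the tracks would come from the variant output, e.g.
#   ref = outputs.reference.rna_seq.values
#   alt = outputs.alternate.rna_seq.values
# (attribute names as in the published client; verify for your version).

ref = np.array([1.0, 2.0, 1.0])
alt = np.array([2.0, 4.0, 2.0])
print(variant_delta(ref, alt))  # ~1.0: the alternate allele doubles the signal
```

Plotting `ref` and `alt` on a shared axis around the variant position then makes the locus-level story visible at a glance.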
If you’re exploring promoter–enhancer interactions, you can request contact maps for the same interval. That makes it straightforward to align putative enhancers with predicted 3D proximity and then look for consistent shifts in expression or accessibility when you toggle variants in silico.
outputs = model.predict_interval(
    interval=interval,
    ontology_terms=["UBERON:0001157"],
    requested_outputs=[dna_client.OutputType.CONTACT_MAPS,
                       dna_client.OutputType.ATAC],
)

# Access the predicted contact maps and accessibility track
contact = outputs.contact_maps.values
access = outputs.atac.values
Under the hood, the client takes care of batching, resolution, and returning appropriately downsampled tracks where needed. In practice, teams often start with sequence-only predictions to scan a region, then switch to variant scoring at specific sites to quantify impact across modalities. For speed-sensitive loops, the distilled student model behind the API keeps inference practical on large batches without sacrificing much accuracy.
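The scan-then-score workflow usually ends with a ranking step: score each candidate variant, then keep the few with the largest absolute effect for closer inspection. A minimal sketch, with hypothetical variant names and delta scores:

```python
def triage(variant_scores, top_k=3):
    """Rank candidate variants by absolute predicted effect, keep the top k."""
    ranked = sorted(variant_scores.items(),
                    key=lambda kv: abs(kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical per-variant log2 fold-change scores from a scoring loop
scores = {"chr22:36201698A>C": 1.2, "chr22:36300001G>T": -0.1,
          "chr22:36410512C>G": -2.3, "chr22:36550000T>A": 0.4}
print(triage(scores, top_k=2))
```

Note the absolute value: a strong predicted decrease in expression is just as interesting a lead as a strong increase.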
Limits, caveats, and what’s next
AlphaGenome is powerful, but it isn’t a crystal ball for complex disease. It predicts molecular consequences of sequence and variants; it does not directly predict traits or patient outcomes, which depend on gene networks, development, environment, and many other layers. Also, although 1 Mb windows capture most enhancer–gene pairs, very long-range or multi-locus effects can slip through. And like any model trained on public data, biases in assay coverage, tissue representation, and species balance can shape what it learns. DeepMind explicitly positions the tool for research use, with non-commercial access via API and open resources for the community. Clinical interpretation still requires orthogonal evidence and, ideally, experimental validation.
With that said, the direction of travel is clear. The combination of long context, base resolution, and multimodal learning is a template other DNA foundation models are following. As new datasets expand across species and cell states, we should expect better zero-shot generalization, richer cell-type conditioning, and tighter integration with wet-lab design tools. In short, the regulatory code is becoming legible at scale, and AlphaGenome is one of the first robust readers.
More on foundation models for computational biology
A primer on the Nucleotide Transformer: Read More
Nicheformer for spatial omics: Read More
Further Reading
- AlphaGenome: AI for better understanding the genome — Google DeepMind (June 25, 2025; updated January 2026)
- Advancing regulatory variant effect prediction with AlphaGenome — Nature (January 2026)
- AlphaGenome API and client — GitHub repository
- How DeepMind’s genome AI could help solve rare-disease mysteries — Nature News (January 30, 2026)
- AlphaGenome-enabled analysis of non-coding regulatory variants underlying RHD expression — PubMed
