Evo2 and DNA LLMs: Perplexity, Prompting, Genomic Design

Jonathan Alles

EVOBYTE Digital Biology

Introduction

If you’ve ever stared at a wall of A, C, G, and T and wondered how on earth anyone can make sense of it, you’re not alone. Genomes are long, noisy, and packed with context. What’s changing—fast—is our ability to read those sequences the way we read natural language. Foundation models trained on vast corpora of DNA are beginning to recognize motifs, syntax, and long-range dependencies that humans rarely catch unaided. That shift is why many teams now talk about “DNA as language.”

In this post, we explore Evo2, a recent generation of genomic language models, and unpack what concepts like perplexity and DNA prompting actually mean in practice. We’ll talk about why perplexity is a useful but limited signal, how “oracle” predictors can be combined with language models conceptually, and why codon choices remain more nuanced than any single score suggests. Most importantly, we’ll ground the discussion in responsible, high-level principles suitable for research and education, not step-by-step bioengineering instructions.

Evo2 and genomic language models: what changes when DNA gets an LLM

Large language models for genomes, often called gLMs, take familiar NLP ideas and apply them to nucleotide sequences. Instead of words and sentences, they operate on tokens derived from DNA: sometimes single nucleotides, sometimes fixed k-mers, and sometimes variable-length tokens learned from the data. The goal is to capture statistical regularities across massive sequence libraries so the model can assign a probability to what “comes next” or what “fits” a given context.
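To make the idea of fixed k-mer tokens concrete, here is a minimal sketch on an arbitrary synthetic string. The function name `kmer_tokens` and its `stride` parameter are illustrative assumptions, not the tokenizer of any particular model.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split seq into k-mers.

    stride=1 yields overlapping tokens; stride=k yields non-overlapping ones.
    Trailing characters that do not fill a whole k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


toy = "ACGTACGTAC"  # arbitrary synthetic string, not a real sequence
overlapping = kmer_tokens(toy, k=3, stride=1)
non_overlapping = kmer_tokens(toy, k=3, stride=3)
print(overlapping)      # 8 overlapping 3-mers
print(non_overlapping)  # ['ACG', 'TAC', 'GTA']
```

The same ten characters become eight tokens or three, depending on the stride, which is one reason token counts (and anything averaged per token) depend on the tokenization scheme.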

Evo2 is a new family of such gLMs trained at unusually large scale. Public materials describe training on a multi-domain corpus that spans bacteria, archaea, eukaryotes, and bacteriophages (viruses that infect eukaryotes were reportedly excluded on safety grounds), with architecture choices optimized for very long context windows (think hundreds of thousands to a million tokens) so the model can, in principle, connect distant regulatory regions or capture multi-gene patterns. Those details matter because regulatory biology rarely fits into a short-window view. Long context gives the model room to notice promoter–enhancer coordination, operon organization, and other structural relationships that don’t sit side by side in linear sequence.

Genomic LLMs don’t exist in isolation. Earlier DNA language models such as DNABERT and the Nucleotide Transformer family showed that transformer-style pretraining on nucleotide sequences can recover known binding motifs and improve classification of regulatory elements. These studies established that “DNA as language” is more than a metaphor—it can be operationalized into embeddings, attention patterns, and predictive heads that capture biologically meaningful signals.

At the same time, purpose-built predictors like Enformer—trained to map sequence windows to gene expression and chromatin readouts—illustrate a complementary path: models that don’t generate DNA per se but predict measurable functional outputs. This distinction sets the stage for “oracle-in-the-loop” ideas we’ll return to shortly.

Perplexity for genomes: what the number means—and what it doesn’t

If you’ve used language models before, you’ve met perplexity. It is the exponential of the negative mean per-token log-likelihood, so lower perplexity means the model finds the sequence more “expected,” given its training distribution and current context. In text, lower perplexity tends to track grammaticality and idiomatic phrasing. In genomics, perplexity can, in broad strokes, reflect whether a DNA segment looks typical of its species, genomic neighborhood, or functional class as captured by the model.
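Written out under the usual autoregressive convention, for a sequence of N tokens, the definition looks like this. The symbols are notation introduced here for illustration, not taken from the Evo2 materials.

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln p_\theta\!\left(x_i \mid x_{<i}\right) \right)
```

A model that spreads probability uniformly over a vocabulary of size V has perplexity exactly V, so for a four-symbol alphabet tokenized one symbol at a time, a value near 4 means the model is doing no better than chance.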

That’s a helpful intuition, but it’s crucial to keep the boundaries clear. A low-perplexity DNA sequence is not automatically “better,” “fitter,” or “higher expression.” It simply aligns with patterns the model has internalized. For conserved motifs, a section of promoter, or an archetypal splice junction, that alignment can be biologically meaningful. For synthetic constructs or cross-species transfers, the relationship may weaken or flip, because the training distribution and the design intent diverge.

This is why researchers treat perplexity and related log-probability metrics as exploratory signals rather than direct objectives. In protein modeling, likelihood-based scores have correlated with mutational fitness in some benchmarks, but those successes depended on well-matched data and careful evaluation. The same caution applies to genomics: use perplexity to highlight candidates worth deeper study, not to assert function.

A tiny code sketch can demystify the idea without touching real biological sequences:

import math

def perplexity(logprobs):
    """Return exp of the negative mean log-probability (natural log)."""
    # logprobs: list of per-token log probabilities (natural log)
    if not logprobs:
        raise ValueError("logprobs must contain at least one value")
    mean_neg_logp = -sum(logprobs) / len(logprobs)
    return math.exp(mean_neg_logp)

# Example on a generic token stream (not DNA):
# Suppose a model assigns these log-probabilities to a short string
toy_logps = [-0.2, -0.4, -0.1, -0.3]  # higher (less negative) is more expected
print(perplexity(toy_logps))  # exp(0.25), about 1.284; lower = more model-expected

The point isn’t the number itself; it’s the comparison. If two candidate strings are scored under the same model and context, the one with lower perplexity is, to that model, more in-distribution. In biological research, that can inspire hypotheses about motif presence, context compatibility, or taxon-specific syntax—hypotheses that then require independent validation and, ideally, orthogonal measurement.
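To see that comparison in action, here is a minimal sketch that scores two toy strings under the same model. The character-bigram model, the add-one smoothing, the two-letter alphabet, and all function names are illustrative assumptions: a stand-in for a real gLM, not a description of how Evo2 works.

```python
import math
from collections import Counter


def train_bigram(corpus, alphabet):
    """Count character bigrams over a list of training strings."""
    pair_counts, prev_counts = Counter(), Counter()
    for text in corpus:
        for prev, cur in zip(text, text[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return pair_counts, prev_counts, len(alphabet)


def bigram_logprobs(text, model):
    """Per-character log p(cur | prev) with add-one smoothing.

    The first character has no predecessor and is skipped.
    """
    pair_counts, prev_counts, vocab_size = model
    return [
        math.log((pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + vocab_size))
        for prev, cur in zip(text, text[1:])
    ]


def perplexity(logprobs):
    return math.exp(-sum(logprobs) / len(logprobs))


# The toy training data is strictly alternating, so alternation is "in-distribution".
model = train_bigram(["abababab", "abab"], alphabet="ab")
ppl_alternating = perplexity(bigram_logprobs("ababab", model))
ppl_clumped = perplexity(bigram_logprobs("aabbaa", model))
print(ppl_alternating, ppl_clumped)  # the alternating string scores lower
```

Both strings use the same letters in the same proportions; only their arrangement differs, and the model's preference reflects nothing more than what it saw in training.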

DNA prompting with oracle functions: a conceptual stack for responsible exploration

“Prompting” is the art of telling a model what you want. In genomic language modeling, that can mean conditioning on flanks, providing masked regions, or otherwise structuring the context to elicit plausible continuations. The promise is to turn open-ended generation into guided exploration: stay compatible with these neighborhoods, respect this motif, avoid that pattern.
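One way to picture conditioning on flanks with an autoregressive model is to score each candidate, followed by the right flank, given the left flank. The sketch below assumes a hypothetical `token_logprob(context, token)` callable and a toy three-letter model that simply favors repeating the previous character; none of these names come from a real library.

```python
import math


def toy_token_logprob(context, token):
    """Toy stand-in for a model over the alphabet 'xyz': favors repeating the last character."""
    if not context:
        return math.log(1 / 3)  # uniform when there is no context
    return math.log(0.8) if token == context[-1] else math.log(0.1)


def conditional_logprob(token_logprob, prefix, continuation):
    """Sum of log p(token | everything before it) over the continuation."""
    total, context = 0.0, prefix
    for token in continuation:
        total += token_logprob(context, token)
        context += token
    return total


def rank_infills(token_logprob, left, right, candidates):
    """Order candidates by how well (candidate + right flank) follows the left flank."""
    scored = {
        cand: conditional_logprob(token_logprob, left, cand + right)
        for cand in candidates
    }
    return sorted(candidates, key=scored.get, reverse=True)


ranking = rank_infills(toy_token_logprob, left="xx", right="yy",
                       candidates=["xy", "zz", "yx"])
print(ranking)  # ['xy', 'zz', 'yx']
```

Because the right flank is part of what gets scored, a candidate that joins smoothly onto both sides ranks highest, which is the sense in which the flanks act as a prompt.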

On its own, prompting is descriptive. It shapes what the model thinks is appropriate. To make it predictive, researchers often imagine pairing a generator with an oracle. In this setting, “oracle” means any function that assigns a score to a sequence based on a measurable property: a neural predictor of expression from sequence, a regression model of stability, or even an empirical assay readout. Enformer is a canonical example of such a predictor for regulatory genomics; it maps sequence context to expression-related tracks, offering a proxy score for how a sequence might behave. The generator proposes; the oracle scores; the loop repeats. Conceptually, this separation echoes reinforcement learning or Bayesian optimization ideas from ML, though the biology adds layers of nuance.

Because this topic sits close to practical sequence design, it’s worth emphasizing boundaries. The following snippet is intentionally generic; it illustrates the choreography without applying it to biological sequences. It’s a sketch of how one might wrap any autoregressive model with an external scoring function for generic strings:

def propose(model, context, n=4):
    # Return n generic string candidates conditional on context.
    # model.sample is a hypothetical interface, assumed here to return
    # the context extended by one sampled continuation.
    return [model.sample(context) for _ in range(n)]

def select(candidates, score_fn):
    # Pick the candidate with the best (highest) external score
    return max(candidates, key=score_fn)

def iterate(model, context, score_fn, steps=3):
    # Repeat propose-then-select, feeding each winner back in as the new context
    seq = context
    for _ in range(steps):
        cands = propose(model, seq)
        seq = select(cands, score_fn)
    return seq

As a research pattern, “generator + oracle” is appealing because it combines descriptive priors (what looks plausible) with target-aware feedback (what looks promising). In genomics, however, “promising” must always be defined carefully, tied to ethical oversight, bounded by biosafety rules, and verified with robust controls. High-level frameworks help us reason; they must not be mistaken for instructions to manipulate living systems.
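A minimal way to picture "prior plus feedback" is a weighted blend of two scores over generic strings. Everything in this sketch is a toy assumption: `toy_prior_logp` stands in for a language-model likelihood, `toy_oracle` for a property predictor, and the linear `weight` is just one of many possible ways to combine them.

```python
def toy_prior_logp(text):
    """Toy stand-in for a language-model score: penalize adjacent repeated characters."""
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return -0.5 * repeats


def toy_oracle(text):
    """Toy stand-in for a property predictor: count occurrences of 'q'."""
    return float(text.count("q"))


def combined_score(text, prior_logp, oracle, weight):
    """Weighted blend: weight near 1 trusts the prior, weight near 0 trusts the oracle."""
    return weight * prior_logp(text) + (1.0 - weight) * oracle(text)


def best_candidate(candidates, prior_logp, oracle, weight):
    return max(candidates, key=lambda c: combined_score(c, prior_logp, oracle, weight))


candidates = ["abab", "aaqq", "abqa"]
prior_heavy = best_candidate(candidates, toy_prior_logp, toy_oracle, weight=0.8)
oracle_heavy = best_candidate(candidates, toy_prior_logp, toy_oracle, weight=0.2)
print(prior_heavy, oracle_heavy)  # abqa aaqq
```

The winner changes with the weight, which shows how much the outcome depends on a modeling choice rather than on the data. In any real setting the two scores would also sit on different scales and would need normalizing before a blend like this meant anything.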

Codon choice, context, and the limits of “optimization”

Whenever codons come up, the conversation tends to slide into “optimize for expression.” That framing hides complexity. Codon usage does influence translation, but its effects are entangled with mRNA structure, start-site context, codon pair preferences, and tRNA availability. Classic work showed that local mRNA folding near the 5′ end can be a stronger determinant of bacterial expression than raw codon adaptation, and that N‑terminal codon bias likely reflects pressures on initiation and early elongation, not a one-dimensional “more common is always better” rule. These lessons generalize: sequence features interact, and the very same change can help or hurt depending on where it lands and what surrounds it.

So where does perplexity fit? Imagine you’re comparing two synonymous fragments in the same genomic neighborhood. If a genomic LLM consistently assigns lower perplexity to one variant, that may suggest it better matches distributional patterns the model learned for that context—perhaps reflecting species-specific codon usage, local motif avoidance, or higher-order syntax. That’s an interesting clue. But it is not a substitute for known design principles, nor a guarantee of higher expression. It simply says, “given what I’ve seen, this looks more like the sequences around here.”

That’s why many researchers treat perplexity as a sanity check rather than a steering wheel. Use it to flag outliers that look off-distribution. Use it to prioritize what to read more about. Combine it, conceptually, with predictors that specialize in the property you actually care about. Then, whatever the model suggests, remember that empirical validation (wet-lab or orthogonal computational evidence) is the arbiter. Even sophisticated predictors trained on sequence-to-expression tasks can struggle when you move far from their training regime, change organisms, or alter expression conditions. The more distribution shift you introduce, the wider your uncertainty should be.
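Flagging outliers can be as simple as asking which scores sit far above the rest of a batch. The sketch below uses a median-based (modified z-score) rule on made-up perplexity values; the function name, the item labels, and the numbers are all hypothetical, and the 0.6745 and 3.5 constants are a common convention rather than anything specific to genomics.

```python
import statistics


def flag_high_perplexity(ppl_by_name, threshold=3.5):
    """Return names whose perplexity is unusually high relative to the batch.

    Uses the modified z-score, 0.6745 * (x - median) / MAD, which is less
    distorted by the outliers themselves than a mean/standard-deviation rule.
    """
    values = list(ppl_by_name.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be called an outlier
    return [
        name
        for name, value in ppl_by_name.items()
        if 0.6745 * (value - median) / mad > threshold
    ]


# Made-up perplexities for six generic strings scored under one model
toy_scores = {"s1": 1.20, "s2": 1.30, "s3": 1.25, "s4": 1.28, "s5": 1.22, "s6": 3.90}
flagged = flag_high_perplexity(toy_scores)
print(flagged)  # ['s6']
```

A flag like this only says "look closer"; it carries no information about why an item is unusual or whether that matters.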

It’s also worth noting that the tokenization strategy in a gLM can influence what “perplexity” is sensitive to. Fixed k-mers emphasize local motifs; byte-pair or unigram tokenization can capture variable-length patterns that resemble “words.” Different schemes change how the model encodes codon boundaries, repetitive elements, or GC-rich tracts. That’s one reason perplexity comparisons are most meaningful within the same model family and tokenization setup, not across them. Foundational works like DNABERT and follow-on analyses have explored how tokenization interacts with motif discovery and downstream interpretability.
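A small arithmetic sketch shows why per-token perplexity does not transfer across tokenizations. It assumes, purely for illustration, that two hypothetical tokenizers assign the same total log-probability to the same 12-character string; the numbers are invented.

```python
import math


def per_token_perplexity(total_logprob, n_tokens):
    """exp of the negative mean log-probability per token (natural log)."""
    return math.exp(-total_logprob / n_tokens)


def bits_per_char(total_logprob, n_chars):
    """Negative log-probability per character, converted from nats to bits."""
    return -total_logprob / (n_chars * math.log(2))


total_logprob = -6.0  # invented total log-probability (nats) for one string
n_chars = 12

ppl_single_char = per_token_perplexity(total_logprob, n_tokens=12)  # 1 char per token
ppl_three_mers = per_token_perplexity(total_logprob, n_tokens=4)    # 3 chars per token
bpc = bits_per_char(total_logprob, n_chars)

print(ppl_single_char, ppl_three_mers)  # same string, different per-token perplexity
print(bpc)                              # per-character figure ignores token count
```

The identical total likelihood yields a higher per-token perplexity under the coarser tokenization simply because there are fewer tokens to average over. Normalizing per character removes that particular artifact, though it does not make different models otherwise comparable.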

Summary / Takeaways

Evo2 and its peers are ushering in a new phase for “DNA as language.” Trained across the tree of life with long context windows and modern architectures, these genomic language models can capture patterns that once required painstaking manual curation. They offer a common substrate—tokenization, embeddings, and likelihoods—on which many genomics tasks can be framed.

Perplexity is a useful lens in that world, but it’s not a magic score. Lower perplexity means “more in-distribution to this model,” not “better biology.” As a result, perplexity shines as a comparative heuristic for triage and hypothesis generation, especially when the sequences and context match the model’s training scope. Treat it as an invitation to look closer, not as an endpoint.

The conceptual marriage of DNA prompting and oracle functions is powerful. A generator proposes candidates that respect learned syntax; a predictor offers property-specific feedback. Together, they form a loop for guided exploration. Yet precisely because that pattern can edge toward sequence design, it carries ethical obligations. Any application that might influence biological function—especially expression—should be bounded by rigorous oversight, conservative assumptions, and clear safety review. This post deliberately keeps discussion at a high level to support learning and responsible research culture.

Codon decisions remind us why that caution matters. Decades of work show that expression is shaped by a web of interacting features—mRNA structure near the start codon, codon pair preferences, and species-specific context among them. Genomic LLMs add a new perspective on that web, revealing distributional cues that can complement specialized predictors. But the hard biological questions haven’t turned into simple one-number answers. They’ve become, perhaps, a little easier to ask—and a lot more important to validate.

If you’re exploring this space as a data scientist, a practical next step is to get comfortable with the basic objects these models manipulate: tokenizers for nucleotides, attention maps over long contexts, and log-likelihoods under different prompts. Read the foundational papers. Inspect how tokenization shapes what the model notices. Experiment—safely and synthetically—with perplexity on toy alphabets to build intuition. And when you do shift from intuition to biology, bring the right partners to the table: domain experts, safety officers, and reviewers who will help you draw the line between curiosity and capability.