
Sequence design with the Nucleotide Transformer


Jonathan Alles

EVOBYTE Digital Biology

By EVOBYTE, your partner in bioinformatics

Introduction

If you’ve ever tweaked a promoter or swapped a few codons and watched expression jump, you’ve felt how sensitive DNA is to context. The right nucleotides in the right order can unlock transcription factor binding, stabilize mRNA, or smooth ribosome traffic. What’s changed recently is that we now have foundation models trained directly on genomes that can read this context at scale. The Nucleotide Transformer is one of the most capable of these models, and it’s surprisingly practical for everyday sequence design and optimization.

Meet the Nucleotide Transformer (NT): a DNA/RNA foundation model

The Nucleotide Transformer is a family of masked language models (MLMs) for DNA (and in some variants RNA) that learn the “grammar” of nucleic acids by predicting masked tokens in long sequences. Different sizes exist, from compact models to multi‑billion‑parameter variants, with training that spans the human reference, thousands of human genomes, and hundreds of non‑human species. This diversity helps the model internalize regulatory motifs, compositional biases, and long‑range dependencies that shape gene regulation. In benchmarking, NT models match or surpass specialized architectures on many genomics prediction tasks, including promoters and splicing, while operating on kilobase contexts.

Under the hood, NT uses a tokenizer that prefers 6‑mer tokens but falls back to single nucleotides when needed, and it’s exposed through familiar Hugging Face interfaces. That means you can load a model and start extracting embeddings or token probabilities with just a few lines of Python, without retraining a custom architecture.
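To build intuition for what the tokenizer does, here is a simplified sketch of the 6-mer-first scheme with single-nucleotide fallback. This is an illustration only, not the actual Hugging Face tokenizer (which handles its vocabulary and special tokens internally); `toy_nt_tokenize` is a hypothetical helper.

```python
def toy_nt_tokenize(seq: str) -> list[str]:
    """Greedy 6-mer tokenization with single-nucleotide fallback.

    Simplified sketch: emit a 6-mer chunk when it consists only of
    A/C/G/T, otherwise fall back to one character at a time.
    """
    tokens, i = [], 0
    while i < len(seq):
        chunk = seq[i:i + 6]
        if len(chunk) == 6 and set(chunk) <= set("ACGT"):
            tokens.append(chunk)
            i += 6
        else:
            tokens.append(seq[i])  # e.g. trailing bases or an 'N'
            i += 1
    return tokens

print(toy_nt_tokenize("ACGTACGTACGTA"))  # two 6-mers, then a single base
```

The practical consequence is that a kilobase of clean DNA costs only ~170 tokens, which is how NT fits long contexts into a standard transformer window.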

From variant effect prediction to sequence design

Because NT captures how nucleotides co‑occur across real genomes, its token probabilities form a powerful “naturalness prior.” Sequences that fit the learned distribution tend to preserve motifs, spacing, and base composition that biology often prefers. This prior already proves useful for variant effect scoring, enhancer and promoter modeling, and splice site prediction, where NT has shown competitive accuracy after light fine‑tuning or even simple probing. For design problems, you can turn the same prior into a critic that scores candidate edits or into a generator that proposes likely alternatives in masked regions. Either way, you exploit the model’s sense of plausible sequence context to avoid brittle, out‑of‑distribution designs.
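To make the "critic" idea concrete, variant effect scoring boils down to masking the variant position, reading off the model's probabilities over bases, and comparing reference versus alternate alleles. The sketch below shows only that final comparison; the probabilities are made-up placeholders standing in for a masked NT forward pass, and `llr_score` is a hypothetical helper, not part of the NT API.

```python
import math

def llr_score(probs: dict[str, float], ref: str, alt: str) -> float:
    """Log-likelihood ratio of alternate vs. reference allele.

    probs: model probabilities over bases at the masked position.
    Negative values mean the model finds the variant less natural.
    """
    return math.log(probs[alt]) - math.log(probs[ref])

# Made-up probabilities standing in for a masked-LM forward pass.
masked_position_probs = {"A": 0.70, "C": 0.10, "G": 0.15, "T": 0.05}
print(llr_score(masked_position_probs, ref="A", alt="T"))  # negative: disfavored
```

The same quantity, summed over edited positions, gives you a cheap first-pass ranking of candidate designs before any wet-lab work.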

Designing for expression: using NT as a prior, a generator, and a glue layer

Expression optimization is rarely about a single knob. Promoter strength depends on motif presence and spacing; 5′ UTRs shape ribosome loading; codon choices influence elongation and mRNA stability; GC content and repeats affect synthesis and cloning. NT won’t replace your task‑specific predictor, but it plays three helpful roles.

First, as a prior, NT can rank candidates by pseudo‑perplexity, nudging you toward sequences whose local and long‑range patterns resemble high‑fitness neighborhoods. Second, as a generator, a masked‑LM pass over a promoter or 5′ UTR can suggest plausible motif‑preserving substitutions while you enforce constraints like GC windows or prohibited sites. Third, as glue, NT’s embeddings are strong features for lightweight regressors trained on assay data, letting you build compact expression predictors without engineering dozens of handcrafted sequence features. Benchmarks show that diversity and scale in NT’s pretraining improve downstream performance, which is handy when your assay dataset is modest.
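The "glue" role is the easiest to prototype. A minimal sketch, assuming you have already mean-pooled NT hidden states into one vector per sequence: fit a closed-form ridge regressor on those features. The data below is synthetic; in practice `X` would hold NT embeddings and `y` your measured expression values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: in practice X = mean-pooled NT hidden states
# (one row per sequence), y = assay-measured expression.
n_seqs, emb_dim = 200, 32
X = rng.normal(size=(n_seqs, emb_dim))
true_w = rng.normal(size=emb_dim)
y = X @ true_w + 0.1 * rng.normal(size=n_seqs)

# Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(emb_dim), X.T @ y)

pred = X @ w
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"in-sample R^2: {r2:.3f}")
```

A regularized linear head like this is often enough when the dataset is a few hundred MPRA measurements; the heavy lifting already happened in NT's pretraining.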

Hands‑on: Python snippets for NT‑guided scoring and optimization

Let’s start by computing a simple NT score. Pseudo‑perplexity approximates how expected a sequence looks to the model: lower is “more natural.” We’ll use a public NT checkpoint via Hugging Face.

# pip install transformers>=4.39 torch --quiet
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch, math

ckpt = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt).eval()
mask_id = tok.mask_token_id

@torch.inference_mode()
def nt_pseudo_perplexity(seq: str) -> float:
    if mask_id is None:  # some checkpoints may lack an explicit [MASK]
        raise ValueError("tokenizer has no mask token")
    enc = tok(seq, return_tensors="pt")
    ids = enc["input_ids"][0]
    attn = enc["attention_mask"][0].bool()
    idxs = torch.where(attn)[0]  # include only non-pad tokens
    special = {tok.cls_token_id, tok.sep_token_id, tok.pad_token_id}
    logp_sum, n = 0.0, 0
    for i in idxs:
        tid = ids[i].item()  # compare Python ints, not tensors
        if tid in special:
            continue
        masked = ids.clone()
        masked[i] = mask_id  # mask one token at a time
        out = model(masked.unsqueeze(0)).logits[0, i]
        logp = out.log_softmax(-1)[tid].item()
        logp_sum += logp
        n += 1
    return math.exp(-logp_sum / max(n, 1))  # lower is better

print(nt_pseudo_perplexity("TTGACATATAATACGACTCACTATAGGG"))  # toy promoter-like sequence

Now wrap that score into a tiny in‑silico directed evolution loop. This example “optimizes” a 5′ UTR for plausibility while softly steering GC toward a target as a crude proxy for expression stability. In practice, you would replace the gc_score with a learned expression predictor (for your organism and assay) and keep NT as a prior to discourage weird, hard‑to‑clone candidates.

import math
import random

ALPHABET = "ACGT"

def gc_score(seq, target=0.5):
    gc = sum(b in "GC" for b in seq) / len(seq)
    return -abs(gc - target)

def propose(seq, k=1):
    s = list(seq)
    for _ in range(k):
        i = random.randrange(len(s))
        s[i] = random.choice([b for b in ALPHABET if b != s[i]])
    return "".join(s)

def fitness(seq):
    # lower pseudo-perplexity and on-target GC both raise fitness
    return -nt_pseudo_perplexity(seq) + 0.1 * gc_score(seq)

def optimize(seq, steps=200, temp=1.0):
    best, best_f = seq, fitness(seq)
    cur, cur_f = best, best_f
    for t in range(steps):
        cand = propose(cur, k=1 + (t // 50))  # slightly bolder over time
        f = fitness(cand)
        # Metropolis-style acceptance: take improvements, and sometimes
        # accept worse moves to escape local optima
        if f > cur_f or random.random() < math.exp((f - cur_f) / max(1e-6, temp)):
            cur, cur_f = cand, f
            if f > best_f:
                best, best_f = cand, f
    return best, best_f

seed_utr = "AAAGAAGAGAGAGAGAGAAAGAGAAAGAG"  # toy 5' UTR-length sequence
designed, score = optimize(seed_utr)
print(designed, score)

These snippets are intentionally simple. They show the basic pattern you can extend: plug in your organism‑specific expression model (for example, a small regressor trained on MPRA data using NT embeddings), add hard constraints for motifs you must keep or avoid, and let NT guide proposals toward biologically plausible neighborhoods. NT’s Hugging Face model cards and GitHub repo include examples for extracting embeddings and fine‑tuning if you want to build a better expression head.
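Hard constraints fit naturally as a rejection filter run on every candidate before scoring. A minimal sketch: the restriction sites and GC bounds below are placeholders you would tailor to your cloning strategy, and `passes_constraints` is a hypothetical helper for the loop above.

```python
def passes_constraints(seq: str,
                       forbidden=("GAATTC", "GGATCC"),  # e.g. EcoRI/BamHI sites
                       required=(),                     # motifs that must stay
                       gc_bounds=(0.30, 0.70),
                       window=20) -> bool:
    """Hard go/no-go filter to run on every candidate before scoring."""
    if any(m in seq for m in forbidden):
        return False
    if any(m not in seq for m in required):
        return False
    # check GC content in every sliding window of the given size
    for i in range(max(1, len(seq) - window + 1)):
        win = seq[i:i + window]
        gc = sum(b in "GC" for b in win) / len(win)
        if not (gc_bounds[0] <= gc <= gc_bounds[1]):
            return False
    return True

print(passes_constraints("ATGCATGCATGCATGCATGCATGC"))  # True
print(passes_constraints("ATGGAATTCATGCATGCATGCATG"))  # False: EcoRI site
```

Rejecting a candidate inside `propose` (resampling until the filter passes) keeps the search space clean without touching the scoring function.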

Summary / Takeaways

Sequence design gets easier when you combine two instincts: optimize for your assay’s target and stay close to the genome’s learned grammar. The Nucleotide Transformer gives you that grammar in a form you can query from Python today. Use it as a prior to keep designs realistic, as a generator to propose motif‑aware edits, or as features for a lightweight expression predictor. As with any in‑silico design, the best results come from closing the loop with wet‑lab data and retraining the task head. But even before you fine‑tune, NT can help you avoid brittle sequences and move faster toward constructs that express the way you intended.

More on BioAI and Foundation Models

Check our previous post on models for single-cell biology:

From scGPT to scConcept: Models for Single-Cell Biology

