# Evolutionary Diffusion: Technical Overview
Trevor Bedford — 2026-04-09
This document describes the full pipeline for the evolutionary diffusion project: how training data is produced, how the model is architected and trained, and how we evaluate it. It is intended as the primary reference for understanding how the pieces fit together.
## Overview
The goal is to model evolution as trajectories through sequence space using a discrete diffusion language model. Given a sequence of ancestral states, the model predicts what comes next — either forward in time (predicting descendants) or between contemporaneous sequences (predicting relatives).
The pipeline has three stages:
```
NCBI genomes (viruses, bacteria, fungi)
        ↓
Marker gene extraction + phylogenetic tree building
        (bac120, odb, rdrp)
        ↓
Nextstrain Auspice JSON trees
        ↓
Trajectory extraction + train/test split
        (trajectories)
        ↓
Compressed FASTA shards → S3
        ↓
Preprocessing to JSONL (raw, VCF, or EMD format)
        (diffusion-language-model)
        ↓
Continued pre-training of LLaDA2 16B discrete diffusion model
        (diffusion-language-model)
        ↓
Evaluation: trajectory metrics + downstream benchmarks
        (pegasus-evals)
```
## Training data
### Data sources
Training data comes from phylogenetic trees built across the tree of life. Each tree provides evolutionary trajectories — sequences of mutations along branches from ancestor to descendant.
| Domain | Repo | Gene set | Organisms | Genes | Trees |
|---|---|---|---|---|---|
| Viruses | rdrp | RdRp catalytic domain | 3 families (Paramyxoviridae, Flaviviridae, Picornaviridae) | 3 | 3 main + 18 subtrees |
| Bacteria | bac120 | GTDB bac120 single-copy genes | 5 phyla (Cyanobacteriota through Pseudomonadota) | 120 | ~485 trees across completed phyla |
| Fungi | odb | OrthoDB/BUSCO single-copy orthologs | Fungi (666 genomes) | 1122 | 1115 trees |
All training data is coding sequence (CDS) — no non-coding DNA. For eukaryotes, the longest splice form is used with introns stripped.
### Marker gene extraction
Each data repo follows a similar pattern: download genomes from NCBI, extract marker genes, then build per-marker phylogenetic trees.
bac120 (pegasus-research/bac120): Downloads bacterial genomes by phylum from NCBI RefSeq. Runs GTDB-Tk to identify the 120 single-copy bacterial marker genes. Reorganizes output into per-marker FASTA files. Current scale:
| Phylum | Genomes | Status |
|---|---|---|
| Cyanobacteriota | 2492 | Complete |
| Bacteroidota | 22,236 | Complete |
| Actinomycetota | 49,808 | Complete |
| Bacillota | 122,669 | Trees done, trajectories pending |
| Pseudomonadota | 249,229 | Not started |
odb (pegasus-research/odb): Downloads eukaryotic genomes by clade. Runs compleasm (a fast BUSCO reimplementation using miniprot) to identify single-copy orthologs from the fungi_odb12 lineage. Extracts CDS via gffread. 666 fungal genomes processed, 1122 genes extracted, 1115 trees successfully built.
rdrp (pegasus-research/rdrp): Downloads complete viral genomes by family from NCBI. Extracts the conserved RdRp domain using family-specific strategies (L protein Domain V for Paramyxoviridae, NS5 for Flaviviridae, 3D polymerase for Picornaviridae). Extraction yields vary: 64% for Paramyxoviridae, 45% for Flaviviridae, 22% for Picornaviridae.
### Phylogenetic tree building
All three repos use the same augur pipeline: filter → align (MAFFT) → tree → refine → ancestral reconstruction → export to Auspice JSON. Tree building methods evolved during the project:
- IQ-TREE: Used initially, but runs out of memory above ~50k sequences.
- RAxML-NG: 5-6x less memory than IQ-TREE. Used for medium datasets.
- VeryFastTree (VFT): Confirmed in the Mar 18 meeting to provide acceptable accuracy (within ~0.5% likelihood of RAxML) with orders-of-magnitude better speed and 10x less memory. Now the default for large datasets.
Alignments of diverse phyla (e.g., Bacteroidota) inflate 10-15x relative to raw sequence length. A two-stage trimming approach handles this: first trim to columns with >0.1% occupancy for tree building, then apply a second trim before ancestral reconstruction.
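The occupancy trim can be sketched as follows (illustrative only; the actual pipeline uses its own tooling, and the function name is hypothetical):

```python
def trim_columns(alignment, min_occupancy=0.001):
    """Keep only the alignment columns whose fraction of non-gap
    characters exceeds min_occupancy (0.1% for the tree-building trim)."""
    n_rows = len(alignment)
    keep = [
        c for c in range(len(alignment[0]))
        if sum(row[c] != "-" for row in alignment) / n_rows > min_occupancy
    ]
    return ["".join(row[c] for c in keep) for row in alignment]

# toy alignment: two all-gap columns and one sparse column
aln = ["ATG---A", "ATG---C", "AT-A--C"]
```

The second-stage trim would be the same operation with a higher occupancy threshold.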
### Trajectory generation
The trajectories (blab/trajectories) repo converts Auspice JSON trees into training data. It auto-discovers trees from the bac120, odb, and rdrp repos.
Forwards trajectories trace the path from root to tip. Each trajectory is a FASTA file with one sequence per node along the path. Headers encode branch Hamming distance and cumulative distance from root. Zero-distance branches (where no mutations occurred) are skipped.
Pairwise trajectories pair two tip sequences with their Hamming distance. Pairwise data is ~9x more abundant than forwards data and avoids error-prone ancestral state reconstruction. The Jan 22 meeting established pairwise as the primary data format, following the PEINT approach of using tip-to-tip "cherries."
Train/test split. The clade excision strategy ensures statistical independence: randomly select seed tips, walk back N mutations up the tree, and hold out the entire descended subtree as test data. Default: 10% test, walk back 5 mutations, max 1% of tree per excised clade. This means test trajectories represent genuinely unseen evolutionary lineages.
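The excision walk can be sketched as follows, assuming the tree is available as parent/children maps with per-branch mutation counts (all names here are hypothetical, not the repo's actual API):

```python
def excise_clade(parent, children, branch_muts, seed_tip, walk_back=5):
    """Walk back from a seed tip until `walk_back` mutations have
    accumulated, then hold out every tip under that ancestor as test data."""
    node, accumulated = seed_tip, 0
    while accumulated < walk_back and node in parent:
        accumulated += branch_muts[node]  # mutations on the branch above `node`
        node = parent[node]
    stack, tips = [node], []
    while stack:
        n = stack.pop()
        kids = children.get(n, [])
        if not kids:
            tips.append(n)  # leaf: goes into the held-out set
        stack.extend(kids)
    return sorted(tips)

# toy tree: root -> (A, t3), A -> (t1, t2)
parent = {"t1": "A", "t2": "A", "A": "root", "t3": "root"}
children = {"root": ["A", "t3"], "A": ["t1", "t2"]}
branch_muts = {"t1": 2, "t2": 3, "A": 4, "t3": 1}
```

The 1%-of-tree cap would then reject excisions whose tip set exceeds that fraction of all tips.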
Output. Trajectories are streamed directly into compressed tar.zst shards (up to 10,000 files per shard) and uploaded to s3://pegasus-training-data/trajectories/.
Datasets in the trajectories repo:
| Dataset | Organism | Gene | Sequences | Alignment length |
|---|---|---|---|---|
| spike-xs | SARS-CoV-2 | spike S1 | 10,195 | 2055 bp |
| spike-sm | SARS-CoV-2 | spike S1 | 34,707 | 2055 bp |
| spike-lg | SARS-CoV-2 | spike S1 | ~8M | 2055 bp |
| flu-h3-xs | Influenza H3N2 | HA1 | 10,263 | 987 bp |
| n450-xs | Measles | N450 | 2429 | 450 bp |
| rdrp-paramyxoviridae-xs | Paramyxoviridae | L Domain V | 3985 | 1653 bp |
| rdrp-flaviviridae-xs | Flaviviridae | NS5 RdRp | 4785 | 1884 bp |
| rdrp-picornaviridae-xs | Picornaviridae | 3D polymerase | 2627 | 1386 bp |
| cytb-xs | Mammals | cytochrome b | 5059 | 1140 bp |
| bac120-cyano-* | Cyanobacteriota | 123 GTDB markers | ~2500 each | 389–11,348 bp |
| bac120-bacteroidota-* | Bacteroidota | 124 GTDB markers | ~22,000 each | 658–41,943 bp |
Totals across all datasets (excluding spike-lg):
- ~1250 distinct genes across viruses (7), bacteria (120), fungi (1122), and mammals (1)
- ~4M total sequences in phylogenetic trees (~74k from viral/mammalian datasets, ~3.0M from bac120 Cyanobacteriota + Bacteroidota + Actinomycetota, ~740k from odb fungi)
- ~5 billion nucleotides of aligned sequence data
The spike-lg UShER dataset adds an additional ~8M sequences but is a single gene. Bacillota and Pseudomonadota bac120 data will substantially increase these totals once complete.
## Data representations
A critical insight from the Jan 16 meeting: when training on raw sequences, ~90% of compute is spent learning to copy, since successive nodes in a trajectory differ by only a handful of mutations. This led to the development of compact mutation representations.
### Raw sequence format
Full DNA sequences separated by >, with generation numbers:
```
0_ATCGATCG>1_ATCGATCG>2_ATCAATCG
```
Simple but wasteful — the model must learn that most positions are unchanged.
### VCF-like format
Reference sequence plus compact variant descriptions per generation. Positions are center-relative (0 = middle of sequence, negative = left, positive = right):
```
CGGGCACGT|1:-744C>T|2:-744C>T,967C>T|3:-744C>T,967C>T,-307+TTACTTGCT
```
Variant notation: SNPs as `posREF>ALT`, insertions as `pos+ALT`, deletions as `pos-REF`, identical generations as `=`. This reframes evolution prediction as a "coding problem" that exploits the language model's generalization abilities, analogous to how mutation-annotated trees (MATs) already represent changes compactly.
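A minimal encoder for the SNP portion of this format might look like the following (illustrative; the real format also covers insertions, deletions, and multi-generation chaining, and this helper name is an assumption):

```python
def encode_snps(reference, descendant):
    """Encode substitutions between two aligned, equal-length sequences
    as comma-joined center-relative posREF>ALT variants (0 = middle)."""
    assert len(reference) == len(descendant)
    center = len(reference) // 2
    variants = [
        f"{i - center}{r}>{a}"
        for i, (r, a) in enumerate(zip(reference, descendant))
        if r != a
    ]
    # identical generations collapse to "=" per the notation above
    return ",".join(variants) if variants else "="
```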
### EMD (embedding) format
Like VCF but each mutation includes flanking sequence context (4-12 bp on each side, adaptively sized for uniqueness):
```
[ATCG]G>A[TCGA]
```
This makes mutations more interpretable to the language model by providing local sequence context.
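A sketch of the adaptive flank sizing, assuming uniqueness is checked by substring search against the reference (the function name and fallback behavior are assumptions):

```python
def emd_flanks(reference, pos, alt, min_flank=4, max_flank=12):
    """Annotate the SNP at `pos` with the smallest flanking context
    (min_flank..max_flank bp per side) that occurs exactly once in the
    reference, falling back to max_flank if no size is unique."""
    for k in range(min_flank, max_flank + 1):
        left = reference[max(0, pos - k):pos]
        right = reference[pos + 1:pos + 1 + k]
        if reference.count(left + reference[pos] + right) == 1:
            break  # this context uniquely locates the mutation
    return f"[{left}]{reference[pos]}>{alt}[{right}]"
```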
### Structural diffusion format
A unidiff-style representation with hunks. Less commonly used than VCF and EMD.
## Model architecture
### Primary model: LLaDA2.0-mini
The primary model is a 16B-parameter Mixture-of-Experts (MoE) discrete diffusion language model based on LLaDA2.0-mini.
Architecture:
- Hidden size: 1024
- Layers: 24
- Attention heads: 16 (head dimension 64)
- MoE: 16 experts with 2 active per token (fused GPU implementation)
- Max sequence length: 16,384 tokens
- Vocab size: 30,592
- Position encoding: RoPE (Rotary Position Embeddings, theta=10,000)
- Activation: SiLU
Diffusion mechanism: masked block diffusion. Unlike autoregressive models that generate tokens left-to-right, the model uses masked diffusion:
- During training, tokens in the input sequence are randomly masked (masking ratio sampled from 0.3–0.8 range).
- The model learns to predict the masked tokens given the unmasked context.
- A three-part composite attention mask enables this:
- Block diagonal: self-attention within noised blocks
- Offset block causal: cross-attention for conditional context
- Block causal: attention to clean (unmasked) tokens
- Loss is cross-entropy on masked positions only.
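The masking and loss steps above can be illustrated with plain Python standing in for the tensor implementation (a toy sketch: the composite attention mask and block structure are omitted):

```python
import math
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, lo=0.3, hi=0.8):
    """Corrupt a sequence by masking a random fraction of tokens,
    with the masking ratio drawn uniformly from [lo, hi]."""
    ratio = rng.uniform(lo, hi)
    n_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return corrupted, sorted(positions)

def masked_ce_loss(predicted_probs, targets, positions):
    """Cross-entropy averaged over masked positions only."""
    return -sum(math.log(predicted_probs[i][targets[i]])
                for i in positions) / len(positions)

rng = random.Random(0)
tokens = list("ATCGATCG")
corrupted, positions = mask_tokens(tokens, rng)
# toy "model" that puts probability 0.9 on the true token everywhere
probs = [{t: 0.9} for t in tokens]
loss = masked_ce_loss(probs, tokens, positions)
```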
Inference is non-autoregressive: the model generates an entire block of tokens simultaneously, then iteratively refines them over multiple steps.
- Initialize output with mask tokens.
- Divide into 32-token blocks.
- For each block, perform 32 refinement iterations:
- Forward pass with block-diagonal attention.
- Sample tokens; accept those above 95% confidence threshold.
- Number of transferred tokens increases with each step.
- Stop at EOS or max length.
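The refinement loop above can be sketched with a stand-in for the model's forward pass (toy code, not the actual decoder; the demo uses a lower confidence threshold than the 95% described so the block fills within the step budget):

```python
import random

MASK = None  # sentinel for a not-yet-generated token

def refine_block(block_len, propose, steps=32, threshold=0.95):
    """Iteratively denoise one block: start fully masked and, on each
    refinement step, accept proposed tokens whose confidence clears the
    threshold. `propose(i)` stands in for the model's forward pass and
    returns a (token, confidence) pair for position i."""
    block = [MASK] * block_len
    for _ in range(steps):
        for i, tok in enumerate(block):
            if tok is MASK:
                candidate, confidence = propose(i)
                if confidence > threshold:
                    block[i] = candidate
        if MASK not in block:
            break  # fully generated before exhausting the step budget
    return block

rng = random.Random(1)
def toy_propose(i):
    # deterministic token per position, random confidence
    return "ACGT"[i % 4], rng.random()

out = refine_block(8, toy_propose, steps=32, threshold=0.5)
```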
### Comparison baseline: Qwen3-4B
A 4B-parameter standard autoregressive transformer used as a comparison baseline. 36 layers, 2560 hidden size, 32 attention heads. Trained with standard next-token prediction loss. This comparison addresses the open architectural question (raised Mar 26): does discrete diffusion offer advantages over autoregressive generation for evolutionary sequence prediction?
## Training
### Continued pre-training
The model undergoes continued pre-training on evolutionary trajectory data. This is not training from scratch — the LLaDA2.0-mini base model already has language capabilities, and we adapt it to evolutionary sequences.
Training configuration:
| Parameter | Value |
|---|---|
| Learning rate | 1.0e-5 (constant) |
| Optimizer | AdamW (fused) |
| Weight decay | 0.1 |
| Gradient clipping | max norm 1.0 |
| Global batch size | 8 (micro batch 1) |
| Sequence length | 2248 tokens |
| Noise range | 0.3–0.8 (masking ratio) |
| Block size | 32 tokens |
| LR warmup | 3% of steps (linear) |
| LR schedule | Cosine decay |
| Precision | bfloat16 |
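A sketch of a linear-warmup-plus-cosine schedule matching the table's warmup and decay entries (the `min_lr` floor is an assumption; note the table also lists the peak rate as constant, so this reflects one reading of the configuration):

```python
import math

def lr_at(step, total_steps, peak_lr=1.0e-5, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine
    decay from peak_lr down to min_lr over the remaining steps."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```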
Distributed training uses PyTorch FSDP2 (Fully Sharded Data Parallel v2) with gradient checkpointing for memory efficiency. Training runs on H100 GPUs.
Training results (LLaDA2-mini, 3 epochs, 50,625 total steps):
- Loss dropped from 0.950 → 0.010 (rapid decrease in first 1000 steps, plateau after 10,000).
- Stable training with well-behaved gradients.
- One epoch over all current data takes ~100 hours.
- Standard practice: avoid more than one epoch to prevent memorization, though current runs use 3 epochs.
### Training stages
- Continued pre-training: Masked diffusion on evolutionary trajectories (current stage).
- Supervised fine-tuning (SFT): Planned for longer trajectories and specific tasks.
- Reinforcement learning: DPO has already pushed cancer driver classification accuracy to ~80%. PPO-style RL with external validators as the reward function was identified as necessary for tasks with limited training data (e.g., anti-bacterial peptide generation, where SFT was insufficient and ESM outperformed zero-shot).
### Data format experiments
Multiple training configurations are tested across data representations (raw, VCF, EMD, structural diffusion) and trajectory types (forward, pairwise). The Qwen3-4B baseline has 9 configuration variants covering these combinations. Pairwise and forward trajectories are unified into a single training format — the model receives a reference sequence plus historical trajectory and predicts tip mutations.
## Evaluation
### Trajectory prediction metrics
The pegasus-evals toolkit implements evaluation in a two-file mode: ground truth JSONL + prediction JSONL, matched by line order.
Sequence-level metrics:
- Hamming distance (character-level mismatch count)
- Edit distance (Levenshtein)
- Identity percent
- Alignment score (match +1, mismatch -1, gap -2)
- Length difference
Biology-aware metrics:
- BLOSUM62 and PAM250 substitution matrix scores
- GC content difference
- Hydrophobicity difference (Kyte-Doolittle scale)
- Charge difference
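A minimal version of the two-file evaluation for the first two sequence-level metrics (the JSONL field name `sequence` is an assumption, not pegasus-evals' actual schema):

```python
import json

def hamming(a, b):
    """Character-level mismatches over the shared prefix, plus the
    length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def evaluate(truth_jsonl, pred_jsonl, key="sequence"):
    """Score predictions against ground truth, matched by line order."""
    results = []
    for t_line, p_line in zip(truth_jsonl.splitlines(), pred_jsonl.splitlines()):
        t, p = json.loads(t_line)[key], json.loads(p_line)[key]
        d = hamming(t, p)
        identity = 100.0 * (1 - d / max(len(t), len(p)))
        results.append({"hamming": d, "identity_pct": round(identity, 2)})
    return results

truth = '{"sequence": "ATCGATCG"}\n{"sequence": "ATCG"}'
preds = '{"sequence": "ATCAATCG"}\n{"sequence": "ATCG"}'
results = evaluate(truth, preds)
```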
Current trajectory results (LLaDA2-mini, Hamming distance):
| Dataset | Organism | Mean | Std |
|---|---|---|---|
| spike-xs | SARS-CoV-2 | 1.72 | 3.62 |
| spike-sm | SARS-CoV-2 | 1.92 | 3.75 |
| cytb-xs | Mammals | 49.6 | 194.9 |
| n450-xs | Measles | 120.0 | 165.0 |
Excellent performance on spike sequences (mean error <2 bases) but weaker on more divergent datasets. A known issue: the model's default behavior is to copy the previous generation's mutations, reflecting the dominant pattern in the training data.
### Downstream benchmarks
DMS (deep mutational scanning) correlation. Variant effects are scored from model likelihoods. Changing the likelihood schema to predict P(reference | mutation) rather than P(mutation | reference) significantly improved performance, potentially placing the model second among tested models.
Cancer driver classification. Positive results: the model demonstrated zero-shot ability to classify mutations as pathogenic vs benign, even without human genome training data. DPO-type RL pushed accuracy to ~80%.
Drug resistance. Added to the benchmark suite; results pending.
Anti-bacterial peptide generation. SFT alone was insufficient; ESM outperformed in zero-shot. Conclusion: RL with external validator reward is needed for generative tasks with limited training data.
### Nucleotide frequency baseline (current priority)
The Apr 9 meeting identified a critical gap: the model does not currently correlate with nucleotide frequencies in multiple sequence alignments. A model trained on evolutionary trajectories should, at minimum, assign higher probability to commonly observed mutations. The team's hypothesis is that Evo 2's advantage on DMS tasks may simply reflect its ability to capture nucleotide frequencies.
Immediate actions:
- Rayan: reconstruct MSA nucleotide frequencies for bac120 markers
- Trevor: write up CTMC math connecting equilibrium frequencies to model inference
- Team: evaluate zero-shot model correlation with these frequencies
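One plausible shape for that math, sketched here as an assumption rather than the actual writeup: for a CTMC substitution model with rate matrix $Q$, the equilibrium frequencies $\pi$ tie long-branch transition probabilities to MSA column frequencies.

```latex
P(t) = e^{Qt}, \qquad \pi Q = 0, \qquad \sum_i \pi_i = 1,
\qquad \lim_{t \to \infty} P(t)_{ij} = \pi_j
```

Under this view, the model's predictive distribution at a site should approach that site's equilibrium frequencies as divergence grows, which is what the zero-shot correlation check probes.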
This is now the baseline capability to establish before returning to DMS or other downstream tasks.
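The MSA frequency reconstruction could look like the following sketch (illustrative; gap handling and alphabet are assumptions):

```python
from collections import Counter

def column_frequencies(alignment, alphabet="ACGT"):
    """Per-column nucleotide frequencies of an MSA, ignoring gaps and
    ambiguity codes: the quantity the model's zero-shot probabilities
    would be correlated against."""
    freqs = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c in alphabet)
        total = sum(counts.values())
        freqs.append({b: (counts[b] / total if total else 0.0)
                      for b in alphabet})
    return freqs

msa = ["ATCG", "ATCG", "ATAG", "A-CG"]
freqs = column_frequencies(msa)
```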
### Evaluation infrastructure
The pegasus-evals toolkit provides:
- Inference runner supporting all data representations (raw, VCF, EMD, sdiff) for both diffusion and autoregressive models
- Synthetic data generators for controlled experiments (random mutagenesis, directed evolution, neutral drift)
- Format auto-detection and conversion between representations
- Baseline implementations (random sequence, random mutation, identity, majority class)
- Stratified analysis by prompt type, generation number, mutation count, and activity class
## Current status and open questions
As of April 9, 2026:
Immediate priorities:
- Nucleotide frequency baseline evaluation
- CTMC math writeup connecting equilibrium frequencies to model inference
- Synthetic lattice protein data for architecture comparison
- Mammalian CDS dataset preparation
Ongoing:
- Continued pre-training with expanded data (estimated 2-3 weeks + 1 week post-training)
- DMS validation replication following the Evo 1 protocol
- Downstream benchmark dashboard assembly
- Unified evolutionary trajectory benchmark from compiled test splits