# Evolutionary Diffusion: Technical Overview
Trevor Bedford — 2026-04-09
This document describes the full pipeline for the evolutionary diffusion project: how training data is produced, how the model is architected and trained, and how we evaluate it. It is intended as the primary reference for understanding how the pieces fit together.
## Overview
The goal is to model evolution as trajectories through sequence space using a discrete diffusion language model. Given a sequence of ancestral states, the model predicts what comes next — either forward in time (predicting descendants) or between contemporaneous sequences (predicting relatives).
The pipeline has three stages:
```
NCBI genomes (viruses, bacteria, fungi)
        ↓
Marker gene extraction + phylogenetic tree building
        (bac120, odb, rdrp)
        ↓
Nextstrain Auspice JSON trees
        ↓
Trajectory extraction + train/test split
        (trajectories)
        ↓
Compressed FASTA shards → S3
        ↓
Preprocessing to JSONL (raw, VCF, or EMD format)
        (diffusion-language-model)
        ↓
Continued pre-training of LLaDA2 16B discrete diffusion model
        (diffusion-language-model)
        ↓
Evaluation: trajectory metrics + downstream benchmarks
        (pegasus-evals)
```
## Training data
### Data sources
Training data comes from phylogenetic trees built across the tree of life. Each tree provides evolutionary trajectories — sequences of mutations along branches from ancestor to descendant.
| Domain | Repo | Gene set | Organisms | Genes | Trees |
|---|---|---|---|---|---|
| Viruses | rdrp | RdRp catalytic domain | 3 families (Paramyxoviridae, Flaviviridae, Picornaviridae) | 3 | 3 main + 18 subtrees |
| Bacteria | bac120 | GTDB bac120 single-copy genes | 5 phyla (Cyanobacteriota through Pseudomonadota) | 120 | ~485 trees across completed phyla |
| Fungi | odb | OrthoDB/BUSCO single-copy orthologs | Fungi (666 genomes) | 1122 | 1115 trees |
All training data is coding sequence (CDS) — no non-coding DNA. For eukaryotes, the longest splice form is used with introns stripped.
### Marker gene extraction
Each data repo follows a similar pattern: download genomes from NCBI, extract marker genes, then build per-marker phylogenetic trees.
bac120 (pegasus-research/bac120): Downloads bacterial genomes by phylum from NCBI RefSeq. Runs GTDB-Tk to identify the 120 single-copy bacterial marker genes. Reorganizes output into per-marker FASTA files. Current scale:
| Phylum | Genomes | Status |
|---|---|---|
| Cyanobacteriota | 2492 | Complete |
| Bacteroidota | 22,236 | Complete |
| Actinomycetota | 49,808 | Complete |
| Bacillota | 122,669 | Trees done, trajectories pending |
| Pseudomonadota | 249,229 | Not started |
odb (pegasus-research/odb): Downloads eukaryotic genomes by clade. Runs compleasm (a fast BUSCO reimplementation using miniprot) to identify single-copy orthologs from the fungi_odb12 lineage. Extracts CDS via gffread. 666 fungal genomes processed, 1122 genes extracted, 1115 trees successfully built.
rdrp (pegasus-research/rdrp): Downloads complete viral genomes by family from NCBI. Extracts the conserved RdRp domain using family-specific strategies (L protein Domain V for Paramyxoviridae, NS5 for Flaviviridae, 3D polymerase for Picornaviridae). Extraction yields vary: 64% for Paramyxoviridae, 45% for Flaviviridae, 22% for Picornaviridae.
### Phylogenetic tree building
All three repos use the same augur pipeline: filter → align (MAFFT) → tree → refine → ancestral reconstruction → export to Auspice JSON. Tree building methods evolved during the project:
- IQ-TREE: Used initially, but runs out of memory above ~50k sequences.
- RAxML-NG: 5-6x less memory than IQ-TREE. Used for medium datasets.
- VeryFastTree (VFT): Confirmed in the Mar 18 meeting to provide acceptable accuracy (within ~0.5% likelihood of RAxML) with orders-of-magnitude better speed and 10x less memory. Now the default for large datasets.
Alignments of diverse phyla (e.g., Bacteroidota) inflate 10-15x relative to raw sequence length. A two-stage trimming approach handles this: first trim to columns with >0.1% occupancy for tree building, then apply a second trim before ancestral reconstruction.
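The occupancy trim can be sketched as follows (illustrative only; the actual pipeline uses its own tooling, and the function name is hypothetical):

```python
def trim_columns(alignment, min_occupancy=0.001):
    """Keep only the alignment columns whose fraction of non-gap
    characters exceeds min_occupancy (0.1% for the tree-building trim)."""
    n_rows = len(alignment)
    keep = [
        c for c in range(len(alignment[0]))
        if sum(row[c] != "-" for row in alignment) / n_rows > min_occupancy
    ]
    return ["".join(row[c] for c in keep) for row in alignment]

# toy alignment: two all-gap columns and one sparse column
aln = ["ATG---A", "ATG---C", "AT-A--C"]
```

The second-stage trim would be the same operation with a higher occupancy threshold.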
### Trajectory generation
The trajectories (blab/trajectories) repo converts Auspice JSON trees into training data. It auto-discovers trees from the bac120, odb, and rdrp repos.
Forwards trajectories trace the path from root to tip. Each trajectory is a FASTA file with one sequence per node along the path. Headers encode branch Hamming distance and cumulative distance from root. Zero-distance branches (where no mutations occurred) are skipped.
Pairwise trajectories pair two tip sequences with their Hamming distance. Pairwise data is ~9x more abundant than forwards data and avoids error-prone ancestral state reconstruction. The Jan 22 meeting established pairwise as the primary data format, following the PEINT approach of using tip-to-tip "cherries."
Train/test split. The clade excision strategy ensures statistical independence: randomly select seed tips, walk back N mutations up the tree, and hold out the entire descended subtree as test data. Default: 10% test, walk back 5 mutations, max 1% of tree per excised clade. This means test trajectories represent genuinely unseen evolutionary lineages.
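The excision walk can be sketched as follows, assuming the tree is available as parent/children maps with per-branch mutation counts (all names here are hypothetical, not the repo's actual API):

```python
def excise_clade(parent, children, branch_muts, seed_tip, walk_back=5):
    """Walk back from a seed tip until `walk_back` mutations have
    accumulated, then hold out every tip under that ancestor as test data."""
    node, accumulated = seed_tip, 0
    while accumulated < walk_back and node in parent:
        accumulated += branch_muts[node]  # mutations on the branch above `node`
        node = parent[node]
    stack, tips = [node], []
    while stack:
        n = stack.pop()
        kids = children.get(n, [])
        if not kids:
            tips.append(n)  # leaf: goes into the held-out set
        stack.extend(kids)
    return sorted(tips)

# toy tree: root -> (A, t3), A -> (t1, t2)
parent = {"t1": "A", "t2": "A", "A": "root", "t3": "root"}
children = {"root": ["A", "t3"], "A": ["t1", "t2"]}
branch_muts = {"t1": 2, "t2": 3, "A": 4, "t3": 1}
```

The 1%-of-tree cap would then reject excisions whose tip set exceeds that fraction of all tips.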
Output. Trajectories are streamed directly into compressed tar.zst shards (up to 10,000 files per shard) and uploaded to s3://pegasus-training-data/trajectories/.
Datasets in the trajectories repo:
| Dataset | Organism | Gene | Sequences | Alignment length |
|---|---|---|---|---|
| spike-xs | SARS-CoV-2 | spike S1 | 10,195 | 2055 bp |
| spike-sm | SARS-CoV-2 | spike S1 | 34,707 | 2055 bp |
| spike-lg | SARS-CoV-2 | spike S1 | ~8M | 2055 bp |
| flu-h3-xs | Influenza H3N2 | HA1 | 10,263 | 987 bp |
| n450-xs | Measles | N450 | 2429 | 450 bp |
| rdrp-paramyxoviridae-xs | Paramyxoviridae | L Domain V | 3985 | 1653 bp |
| rdrp-flaviviridae-xs | Flaviviridae | NS5 RdRp | 4785 | 1884 bp |
| rdrp-picornaviridae-xs | Picornaviridae | 3D polymerase | 2627 | 1386 bp |
| cytb-xs | Mammals | cytochrome b | 5059 | 1140 bp |
| bac120-cyano-* | Cyanobacteriota | 123 GTDB markers | ~2500 each | 389–11,348 bp |
| bac120-bacteroidota-* | Bacteroidota | 124 GTDB markers | ~22,000 each | 658–41,943 bp |
Totals across all datasets (excluding spike-lg):
- ~1250 distinct genes across viruses (7), bacteria (120), fungi (1122), and mammals (1)
- ~4M total sequences in phylogenetic trees (~74k from viral/mammalian datasets, ~3.0M from bac120 Cyanobacteriota + Bacteroidota + Actinomycetota, ~740k from odb fungi)
- ~5 billion nucleotides of aligned sequence data
The spike-lg UShER dataset adds an additional ~8M sequences but is a single gene. Bacillota and Pseudomonadota bac120 data will substantially increase these totals once complete.
## Data representations
A critical insight from the Jan 16 meeting: when training on raw sequences, ~90% of compute is spent learning to copy, since successive nodes in a trajectory differ by only a handful of mutations. This led to the development of compact mutation representations.
### Raw sequence format
Full DNA sequences separated by >, with generation numbers:
```
0_ATCGATCG>1_ATCGATCG>2_ATCAATCG
```
Simple but wasteful — the model must learn that most positions are unchanged.
### VCF-like format
Reference sequence plus compact variant descriptions per generation. Positions are center-relative (0 = middle of sequence, negative = left, positive = right):
```
CGGGCACGT|1:-744C>T|2:-744C>T,967C>T|3:-744C>T,967C>T,-307+TTACTTGCT
```
Variant notation: SNPs as `posREF>ALT`, insertions as `pos+ALT`, deletions as `pos-REF`, identical generations as `=`. This reframes evolution prediction as a "coding problem" that exploits the language model's generalization abilities, analogous to how mutation-annotated trees (MATs) already represent changes compactly.
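A minimal encoder for the SNP portion of this format might look like the following (illustrative; the real format also covers insertions, deletions, and multi-generation chaining, and this helper name is an assumption):

```python
def encode_snps(reference, descendant):
    """Encode substitutions between two aligned, equal-length sequences
    as comma-joined center-relative posREF>ALT variants (0 = middle)."""
    assert len(reference) == len(descendant)
    center = len(reference) // 2
    variants = [
        f"{i - center}{r}>{a}"
        for i, (r, a) in enumerate(zip(reference, descendant))
        if r != a
    ]
    # identical generations collapse to "=" per the notation above
    return ",".join(variants) if variants else "="
```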
### EMD (embedding) format
Like VCF but each mutation includes flanking sequence context (4-12 bp on each side, adaptively sized for uniqueness):
```
[ATCG]G>A[TCGA]
```
This makes mutations more interpretable to the language model by providing local sequence context.
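A sketch of the adaptive flank sizing, assuming uniqueness is checked by substring search against the reference (the function name and fallback behavior are assumptions):

```python
def emd_flanks(reference, pos, alt, min_flank=4, max_flank=12):
    """Annotate the SNP at `pos` with the smallest flanking context
    (min_flank..max_flank bp per side) that occurs exactly once in the
    reference, falling back to max_flank if no size is unique."""
    for k in range(min_flank, max_flank + 1):
        left = reference[max(0, pos - k):pos]
        right = reference[pos + 1:pos + 1 + k]
        if reference.count(left + reference[pos] + right) == 1:
            break  # this context uniquely locates the mutation
    return f"[{left}]{reference[pos]}>{alt}[{right}]"
```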
### Structural diffusion format
A unidiff-style representation with hunks. Less commonly used than VCF and EMD.
## Model architecture
### Primary model: LLaDA2.0-mini
The primary model is a 16B-parameter Mixture-of-Experts (MoE) discrete diffusion language model based on LLaDA2.0-mini.
Architecture:
- Hidden size: 1024
- Layers: 24
- Attention heads: 16 (head dimension 64)
- MoE: 16 experts with 2 active per token (fused GPU implementation)
- Max sequence length: 16,384 tokens
- Vocab size: 30,592
- Position encoding: RoPE (Rotary Position Embeddings, theta=10,000)
- Activation: SiLU
Diffusion mechanism: masked block diffusion. Unlike autoregressive models that generate tokens left-to-right, the model uses masked diffusion:
- During training, tokens in the input sequence are randomly masked (masking ratio sampled from 0.3–0.8 range).
- The model learns to predict the masked tokens given the unmasked context.
- A three-part composite attention mask enables this:
- Block diagonal: self-attention within noised blocks
- Offset block causal: cross-attention for conditional context
- Block causal: attention to clean (unmasked) tokens
- Loss is cross-entropy on masked positions only.
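The masking and loss steps above can be illustrated with plain Python standing in for the tensor implementation (a toy sketch: the composite attention mask and block structure are omitted):

```python
import math
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, lo=0.3, hi=0.8):
    """Corrupt a sequence by masking a random fraction of tokens,
    with the masking ratio drawn uniformly from [lo, hi]."""
    ratio = rng.uniform(lo, hi)
    n_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return corrupted, sorted(positions)

def masked_ce_loss(predicted_probs, targets, positions):
    """Cross-entropy averaged over masked positions only."""
    return -sum(math.log(predicted_probs[i][targets[i]])
                for i in positions) / len(positions)

rng = random.Random(0)
tokens = list("ATCGATCG")
corrupted, positions = mask_tokens(tokens, rng)
# toy "model" that puts probability 0.9 on the true token everywhere
probs = [{t: 0.9} for t in tokens]
loss = masked_ce_loss(probs, tokens, positions)
```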
Inference is non-autoregressive: the model generates an entire block of tokens simultaneously, then iteratively refines them over multiple steps.
- Initialize output with mask tokens.
- Divide into 32-token blocks.
- For each block, perform 32 refinement iterations:
- Forward pass with block-diagonal attention.
- Sample tokens; accept those above 95% confidence threshold.
- Number of transferred tokens increases with each step.
- Stop at EOS or max length.
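The refinement loop above can be sketched with a stand-in for the model's forward pass (toy code, not the actual decoder; the demo uses a lower confidence threshold than the 95% described so the block fills within the step budget):

```python
import random

MASK = None  # sentinel for a not-yet-generated token

def refine_block(block_len, propose, steps=32, threshold=0.95):
    """Iteratively denoise one block: start fully masked and, on each
    refinement step, accept proposed tokens whose confidence clears the
    threshold. `propose(i)` stands in for the model's forward pass and
    returns a (token, confidence) pair for position i."""
    block = [MASK] * block_len
    for _ in range(steps):
        for i, tok in enumerate(block):
            if tok is MASK:
                candidate, confidence = propose(i)
                if confidence > threshold:
                    block[i] = candidate
        if MASK not in block:
            break  # fully generated before exhausting the step budget
    return block

rng = random.Random(1)
def toy_propose(i):
    # deterministic token per position, random confidence
    return "ACGT"[i % 4], rng.random()

out = refine_block(8, toy_propose, steps=32, threshold=0.5)
```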
### Comparison baseline: Qwen3-4B
A 4B-parameter standard autoregressive transformer used as a comparison baseline. 36 layers, 2560 hidden size, 32 attention heads. Trained with standard next-token prediction loss. This comparison addresses the open architectural question (raised Mar 26): does discrete diffusion offer advantages over autoregressive generation for evolutionary sequence prediction?
## Training
### Continued pre-training
The model undergoes continued pre-training on evolutionary trajectory data. This is not training from scratch — the LLaDA2.0-mini base model already has language capabilities, and we adapt it to evolutionary sequences.
Training configuration:
| Parameter | Value |
|---|---|
| Learning rate | 1.0e-5 (constant) |
| Optimizer | AdamW (fused) |
| Weight decay | 0.1 |
| Gradient clipping | max norm 1.0 |
| Global batch size | 8 (micro batch 1) |
| Sequence length | 2248 tokens |
| Noise range | 0.3–0.8 (masking ratio) |
| Block size | 32 tokens |
| LR warmup | 3% of steps (linear) |
| LR schedule | Cosine decay |
| Precision | bfloat16 |
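A sketch of a linear-warmup-plus-cosine schedule matching the table's warmup and decay entries (the `min_lr` floor is an assumption; note the table also lists the peak rate as constant, so this reflects one reading of the configuration):

```python
import math

def lr_at(step, total_steps, peak_lr=1.0e-5, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine
    decay from peak_lr down to min_lr over the remaining steps."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```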
Distributed training uses PyTorch FSDP2 (Fully Sharded Data Parallel v2) with gradient checkpointing for memory efficiency. Training runs on H100 GPUs.
Training results (LLaDA2-mini, 3 epochs, 50,625 total steps):
- Loss dropped from 0.950 → 0.010 (rapid decrease in first 1000 steps, plateau after 10,000).
- Stable training with well-behaved gradients.
- One epoch over all current data takes ~100 hours.
- Standard practice: avoid more than one epoch to prevent memorization, though current runs use 3 epochs.
### Training stages
- Continued pre-training: Masked diffusion on evolutionary trajectories (current stage).
- Supervised fine-tuning (SFT): Planned for longer trajectories and specific tasks.
- Reinforcement learning: DPO has already pushed cancer driver classification accuracy to ~80%. PPO-style RL with external validators as the reward function was identified as necessary for tasks with limited training data (e.g., anti-bacterial peptide generation, where SFT was insufficient and ESM outperformed zero-shot).
### Data format experiments
Multiple training configurations are tested across data representations (raw, VCF, EMD, structural diffusion) and trajectory types (forward, pairwise). The Qwen3-4B baseline has 9 configuration variants covering these combinations. Pairwise and forward trajectories are unified into a single training format — the model receives a reference sequence plus historical trajectory and predicts tip mutations.
## Evaluation
### Trajectory prediction metrics
The pegasus-evals toolkit implements evaluation in a two-file mode: ground truth JSONL + prediction JSONL, matched by line order.
Sequence-level metrics:
- Hamming distance (character-level mismatch count)
- Edit distance (Levenshtein)
- Identity percent
- Alignment score (match +1, mismatch -1, gap -2)
- Length difference
Biology-aware metrics:
- BLOSUM62 and PAM250 substitution matrix scores
- GC content difference
- Hydrophobicity difference (Kyte-Doolittle scale)
- Charge difference
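A minimal version of the two-file evaluation for the first two sequence-level metrics (the JSONL field name `sequence` is an assumption, not pegasus-evals' actual schema):

```python
import json

def hamming(a, b):
    """Character-level mismatches over the shared prefix, plus the
    length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def evaluate(truth_jsonl, pred_jsonl, key="sequence"):
    """Score predictions against ground truth, matched by line order."""
    results = []
    for t_line, p_line in zip(truth_jsonl.splitlines(), pred_jsonl.splitlines()):
        t, p = json.loads(t_line)[key], json.loads(p_line)[key]
        d = hamming(t, p)
        identity = 100.0 * (1 - d / max(len(t), len(p)))
        results.append({"hamming": d, "identity_pct": round(identity, 2)})
    return results

truth = '{"sequence": "ATCGATCG"}\n{"sequence": "ATCG"}'
preds = '{"sequence": "ATCAATCG"}\n{"sequence": "ATCG"}'
results = evaluate(truth, preds)
```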
Current trajectory results (LLaDA2-mini, Hamming distance):
| Dataset | Organism | Mean | Std |
|---|---|---|---|
| spike-xs | SARS-CoV-2 | 1.72 | 3.62 |
| spike-sm | SARS-CoV-2 | 1.92 | 3.75 |
| cytb-xs | Mammals | 49.6 | 194.9 |
| n450-xs | Measles | 120.0 | 165.0 |
Excellent performance on spike sequences (mean error <2 bases) but weaker on more divergent datasets. A known issue: the model's default behavior is to copy the previous generation's mutations, reflecting the dominant pattern in the training data.
### Downstream benchmarks
DMS (deep mutational scanning) correlation. Variant effects are scored from model likelihoods. Changing the likelihood schema to predict P(reference | mutation) rather than P(mutation | reference) significantly improved performance, potentially placing the model second among tested models.
Cancer driver classification. Positive results: the model demonstrated zero-shot ability to classify mutations as pathogenic vs benign, even without human genome training data. DPO-type RL pushed accuracy to ~80%.
Drug resistance. Added to the benchmark suite; results pending.
Anti-bacterial peptide generation. SFT alone was insufficient; ESM outperformed in zero-shot. Conclusion: RL with external validator reward is needed for generative tasks with limited training data.
### Nucleotide frequency baseline (current priority)
The Apr 9 meeting identified a critical gap: the model does not currently correlate with nucleotide frequencies in multiple sequence alignments. A model trained on evolutionary trajectories should, at minimum, assign higher probability to commonly observed mutations. The team's hypothesis is that Evo 2's advantage on DMS tasks may simply reflect its ability to capture nucleotide frequencies.
Immediate actions:
- Rayan: reconstruct MSA nucleotide frequencies for bac120 markers
- Trevor: write up CTMC math connecting equilibrium frequencies to model inference
- Team: evaluate zero-shot model correlation with these frequencies
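One plausible shape for that math, sketched here as an assumption rather than the actual writeup: for a CTMC substitution model with rate matrix $Q$, the equilibrium frequencies $\pi$ tie long-branch transition probabilities to MSA column frequencies.

```latex
P(t) = e^{Qt}, \qquad \pi Q = 0, \qquad \sum_i \pi_i = 1,
\qquad \lim_{t \to \infty} P(t)_{ij} = \pi_j
```

Under this view, the model's predictive distribution at a site should approach that site's equilibrium frequencies as divergence grows, which is what the zero-shot correlation check probes.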
This is now the baseline capability to establish before returning to DMS or other downstream tasks.
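The MSA frequency reconstruction could look like the following sketch (illustrative; gap handling and alphabet are assumptions):

```python
from collections import Counter

def column_frequencies(alignment, alphabet="ACGT"):
    """Per-column nucleotide frequencies of an MSA, ignoring gaps and
    ambiguity codes: the quantity the model's zero-shot probabilities
    would be correlated against."""
    freqs = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c in alphabet)
        total = sum(counts.values())
        freqs.append({b: (counts[b] / total if total else 0.0)
                      for b in alphabet})
    return freqs

msa = ["ATCG", "ATCG", "ATAG", "A-CG"]
freqs = column_frequencies(msa)
```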
### Evaluation infrastructure
The pegasus-evals toolkit provides:
- Inference runner supporting all data representations (raw, VCF, EMD, sdiff) for both diffusion and autoregressive models
- Synthetic data generators for controlled experiments (random mutagenesis, directed evolution, neutral drift)
- Format auto-detection and conversion between representations
- Baseline implementations (random sequence, random mutation, identity, majority class)
- Stratified analysis by prompt type, generation number, mutation count, and activity class
## Current status and open questions
As of April 9, 2026:
Immediate priorities:
- Nucleotide frequency baseline evaluation
- CTMC math writeup connecting equilibrium frequencies to model inference
- Synthetic lattice protein data for architecture comparison
- Mammalian CDS dataset preparation
Ongoing:
- Continued pre-training with expanded data (estimated 2-3 weeks + 1 week post-training)
- DMS validation replication following the Evo 1 protocol
- Downstream benchmark dashboard assembly
- Unified evolutionary trajectory benchmark from compiled test splits