Benchmarks and Results¶
Tracking evaluation results for the evolutionary diffusion model. Updated as new benchmarks are run.
Last updated: 2026-04-09
Models¶
| Model | Architecture | Parameters | Training data | Status |
|---|---|---|---|---|
| LLaDA2-mini | MoE discrete diffusion | 16B | Combined evo trajectories | Primary model, 3 epochs (50,625 steps) |
| Qwen3-4B | Autoregressive transformer | 4B | Forward raw trajectories | Comparison baseline, 1 epoch |
Trajectory prediction¶
Hamming distance between predicted and ground truth sequences (lower is better). Results from LLaDA2-mini after 3 epochs of training.
| Dataset | Organism | Mean Hamming | Std | Samples |
|---|---|---|---|---|
| spike-xs | SARS-CoV-2 | 1.72 | 3.62 | 100 |
| spike-sm | SARS-CoV-2 | 1.92 | 3.75 | 100 |
| cytb-xs | Mammals | 49.6 | 194.9 | 71 |
| n450-xs | Measles | 120.0 | 165.0 | 100 |
Observations:
- Excellent performance on SARS-CoV-2 spike sequences (mean error <2 bases), where successive sequences differ by few mutations.
- Weaker on more divergent datasets (cytb, n450) where mutation complexity is higher.
- High standard deviation across all datasets suggests inconsistent prediction quality across samples.
- The model's default behavior is to copy the previous generation's mutation — reflecting the dominant pattern in training data (Mar 18 meeting).
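The Hamming metric reported in the table above can be sketched as follows. Function names are illustrative, not the pegasus-evals API; the convention for unequal-length sequences here (count the length difference as extra mismatches) is an assumption.

```python
def hamming_distance(pred: str, truth: str) -> int:
    """Count positions where two sequences differ.

    Unequal-length sequences are compared over the shared prefix, with
    the length difference added as extra mismatches (one simple
    convention; the toolkit may handle length mismatch differently).
    """
    core = sum(a != b for a, b in zip(pred, truth))
    return core + abs(len(pred) - len(truth))


def mean_hamming(pairs) -> float:
    """Mean Hamming distance over (prediction, ground_truth) pairs."""
    return sum(hamming_distance(p, t) for p, t in pairs) / len(pairs)
```

The per-dataset numbers above are the mean and standard deviation of this distance over the listed sample counts.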
Downstream benchmarks¶
Cancer driver gene classification¶
Mar 26 meeting
The model classifies mutations as pathogenic or benign by predicting P(reference | mutation) — the probability of the reference genome given the mutated sequence.
| Method | Result | Notes |
|---|---|---|
| Zero-shot (LLaDA2-mini) | Significant separation between benign/pathogenic groups | No human genome training data |
| + DPO post-training | ~80% | DPO-type reinforcement learning |
This is a positive result: the model learned enough about evolutionary constraints from non-human training data to distinguish pathogenic from benign mutations in human cancer drivers.
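The classification rule amounts to thresholding the model's likelihood score. A minimal sketch with a stand-in score (the real value would come from the model's P(reference | mutation); the direction of the decision and the threshold calibration are assumptions, not stated in the source):

```python
def classify_mutation(log_p_ref_given_mut: float, threshold: float) -> str:
    """Label a mutation from a log P(reference | mutation) score.

    Assumed intuition: if the reference remains highly probable given
    the mutated sequence, the site tolerates the change (benign); a low
    score suggests a constrained site (pathogenic). The threshold would
    be calibrated on labeled cancer-driver data.
    """
    return "benign" if log_p_ref_given_mut >= threshold else "pathogenic"
```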
Anti-bacterial peptide generation¶
Mar 26 meeting
| Method | Performance | Notes |
|---|---|---|
| SFT (LLaDA2-mini) | Below ESM zero-shot | Insufficient for generative tasks |
| ESM (zero-shot) | Better than SFT | Strong protein-level baseline |
Conclusion: RL with an external validator as reward function (PPO) is needed for tasks with limited training data. SFT alone is insufficient when the model hasn't seen the target domain during pre-training.
DMS (deep mutational scanning) correlation¶
Apr 9 meeting
Evaluating correlation between model-predicted likelihoods and experimental DMS fitness measurements.
| Likelihood schema | Performance | Notes |
|---|---|---|
| P(mutation \| reference) | Low across all models including Evo 2 7B | Original approach |
| P(reference \| mutation) | Significantly improved | Potentially second among tested models |
Key finding: The likelihood direction matters. Predicting the probability of the wild-type sequence given a mutation (rather than the reverse) is a better proxy for fitness effects.
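Correlation with DMS fitness measurements is conventionally reported as Spearman rank correlation between model scores and experimental fitness. A self-contained stdlib sketch (toy data only, not benchmark numbers):

```python
def _ranks(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly monotone relationship between scores and fitness gives rho = 1 regardless of the likelihood scale, which is why rank correlation is the standard choice here.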
Complication: DMS experiments measure amino acid substitution effects, but the model predicts nucleotide-level changes. The codon table means single amino acid changes require 1–3 nucleotide mutations, and the model is trained to predict immediate next-step changes, not multi-step substitutions.
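The codon complication can be made concrete: the same amino acid substitution can cost anywhere from one to three nucleotide changes depending on the starting codon. A small illustration using standard-codon-table codons:

```python
def nt_changes(codon_a: str, codon_b: str) -> int:
    """Nucleotide Hamming distance between two three-base codons."""
    return sum(a != b for a, b in zip(codon_a, codon_b))

# Leu (CTG) -> Pro (CCG) is reachable in one nucleotide change,
# while Leu (CTG) -> Lys (AAG) requires two. A nucleotide-level
# next-step model sees these as very different events even though
# DMS scores both as single amino acid substitutions.
```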
Drug resistance¶
Added to the benchmark suite (Apr 2 meeting). Results pending.
Nucleotide frequency correlation¶
Apr 9 meeting — current top priority
Status: Failing. The model does not currently correlate with nucleotide frequencies in multiple sequence alignments. This is the most consequential finding to date.
Why this matters: A model trained on evolutionary trajectories should, at minimum, assign higher probability to commonly observed mutations at each site. If Evo 2's advantage on DMS tasks comes from capturing site-specific nucleotide frequencies (which Pegasus currently lacks), then DMS performance improvements are premature.
Immediate actions:
- Rayan: reconstruct MSA nucleotide frequencies for bac120 markers
- Trevor: write up CTMC math connecting equilibrium frequencies to model inference
- Team: evaluate zero-shot model correlation with reconstructed frequencies
This is the baseline capability to establish before returning to DMS or other downstream benchmarks.
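Per-site nucleotide frequencies from an MSA reduce to a column-wise count. A sketch under simplifying assumptions (equal-length pre-aligned sequences as plain strings; gaps ignored in the per-site normalization; the real bac120 pipeline will need alignment parsing and its own gap policy):

```python
from collections import Counter

def site_frequencies(msa, alphabet="ACGT"):
    """Per-column nucleotide frequencies for aligned sequences.

    msa: list of equal-length aligned sequence strings. Characters
    outside the alphabet (e.g. '-') are excluded from each site's
    denominator.
    """
    n_sites = len(msa[0])
    freqs = []
    for i in range(n_sites):
        counts = Counter(seq[i] for seq in msa if seq[i] in alphabet)
        total = sum(counts.values()) or 1  # avoid div-by-zero on all-gap columns
        freqs.append({nt: counts[nt] / total for nt in alphabet})
    return freqs
```

The zero-shot check would then correlate these empirical frequencies against the model's per-site probabilities.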
Training metrics¶
LLaDA2-mini training run (3 epochs, 50,625 steps):
| Metric | Start | 10,000 steps | End (50,625 steps) |
|---|---|---|---|
| Loss | 0.950 | 0.019 | 0.010 |
| Grad norm | 7.12 | 0.10 | 0.20 |
| Learning rate | 8.69e-9 | ~5e-6 | 2.81e-6 |
One epoch over all current data takes ~100 hours. Standard practice for large autoregressive models is to avoid more than one epoch to prevent memorization.
Open evaluation questions¶
- Diffusion vs autoregressive: Head-to-head comparison using synthetic lattice protein data to isolate architectural effects. Not yet started (proposed Mar 26).
- Nucleotide frequency baseline: Must establish before other benchmarks are meaningful (Apr 9).
- Unified benchmark suite: Compiling all existing test splits into a standardized evolutionary trajectory benchmark (Zehui + Rayan, ongoing).
Available metrics¶
The pegasus-evals toolkit implements:
Sequence-level: Hamming distance, edit distance (Levenshtein), identity percent, alignment score, length difference
Biology-aware: BLOSUM62, PAM250, GC content difference, hydrophobicity difference (Kyte-Doolittle), charge difference
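As an example of the biology-aware metrics, GC content difference compares the fraction of G/C bases between two sequences (illustrative implementation, not the pegasus-evals source):

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def gc_content_difference(pred: str, ref: str) -> float:
    """Absolute difference in GC fraction between prediction and reference."""
    return abs(gc_content(pred) - gc_content(ref))
```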
See the Data Formats reference for how predictions are encoded and compared.