
Benchmarks and Results

Tracking evaluation results for the evolutionary diffusion model. Updated as new benchmarks are run.

Last updated: 2026-04-09

Models

| Model | Architecture | Parameters | Training data | Status |
|---|---|---|---|---|
| LLaDA2-mini | MoE discrete diffusion | 16B | Combined evo trajectories | Primary model, 3 epochs (50,625 steps) |
| Qwen3-4B | Autoregressive transformer | 4B | Forward raw trajectories | Comparison baseline, 1 epoch |

Trajectory prediction

Hamming distance between predicted and ground truth sequences (lower is better). Results from LLaDA2-mini after 3 epochs of training.

| Dataset | Organism | Mean Hamming | Std | Samples |
|---|---|---|---|---|
| spike-xs | SARS-CoV-2 | 1.72 | 3.62 | 100 |
| spike-sm | SARS-CoV-2 | 1.92 | 3.75 | 100 |
| cytb-xs | Mammals | 49.6 | 194.9 | 71 |
| n450-xs | Measles | 120.0 | 165.0 | 100 |
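The metric above can be sketched in a few lines (a simplified version; the pegasus-evals implementation may differ, e.g. in how it handles length mismatches):

```python
def hamming_distance(pred: str, truth: str) -> int:
    """Count positions where the predicted sequence differs from ground truth.

    Assumes equal-length sequences; sequences of different length would need
    an edit-distance metric (Levenshtein) instead.
    """
    if len(pred) != len(truth):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(pred, truth))

# Toy example (not benchmark data): a perfect copy scores 0,
# and each mismatched base adds 1.
assert hamming_distance("ACGT", "ACGT") == 0
assert hamming_distance("ACGT", "ACGA") == 1
```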

Observations:

  • Excellent performance on SARS-CoV-2 spike sequences (mean error <2 bases), where successive sequences differ by few mutations.
  • Weaker on more divergent datasets (cytb, n450) where mutation complexity is higher.
  • High standard deviation across all datasets suggests inconsistent prediction quality across samples.
  • The model's default behavior is to copy the previous generation's mutation — reflecting the dominant pattern in training data (Mar 18 meeting).

Downstream benchmarks

Cancer driver gene classification

Mar 26 meeting

The model classifies mutations as pathogenic or benign by predicting P(reference | mutation) — the probability of the reference genome given the mutated sequence.

| Method | Accuracy | Notes |
|---|---|---|
| Zero-shot (LLaDA2-mini) | Significant separation between benign/pathogenic groups | No human genome training data |
| + DPO post-training | ~80% | DPO-type reinforcement learning |

This is a positive result: the model learned enough about evolutionary constraints from non-human training data to distinguish pathogenic from benign mutations in human cancer drivers.
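The scoring scheme can be sketched as follows. The `log_prob` interface, the stub scorer, and the threshold are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass

@dataclass
class StubModel:
    """Stand-in scorer: here log-probability simply falls with Hamming
    distance from the reference. A real model would score sequences."""
    def log_prob(self, target: str, given: str) -> float:
        return -float(sum(a != b for a, b in zip(target, given)))

def classify_mutation(model, reference: str, mutated: str, threshold: float) -> str:
    """Classify by P(reference | mutation): if the wild type remains highly
    probable given the mutated sequence, call the mutation benign."""
    score = model.log_prob(target=reference, given=mutated)
    return "benign" if score >= threshold else "pathogenic"

model = StubModel()
assert classify_mutation(model, "ACGT", "ACGT", threshold=-0.5) == "benign"
assert classify_mutation(model, "ACGT", "TTTT", threshold=-0.5) == "pathogenic"
```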

Anti-bacterial peptide generation

Mar 26 meeting

| Method | Performance | Notes |
|---|---|---|
| SFT (LLaDA2-mini) | Below ESM zero-shot | Insufficient for generative tasks |
| ESM (zero-shot) | Better than SFT | Strong protein-level baseline |

Conclusion: RL with an external validator as reward function (PPO) is needed for tasks with limited training data. SFT alone is insufficient when the model hasn't seen the target domain during pre-training.

DMS (deep mutational scanning) correlation

Apr 9 meeting

Evaluating correlation between model-predicted likelihoods and experimental DMS fitness measurements.

| Likelihood schema | Performance | Notes |
|---|---|---|
| P(mutation \| reference) | Low across all models including Evo 2 7B | Original approach |
| P(reference \| mutation) | Significantly improved | Potentially second among tested models |

Key finding: The likelihood direction matters. Predicting the probability of the wild-type sequence given a mutation (rather than the reverse) is a better proxy for fitness effects.

Complication: DMS experiments measure amino acid substitution effects, but the model predicts nucleotide-level changes. The codon table means single amino acid changes require 1–3 nucleotide mutations, and the model is trained to predict immediate next-step changes, not multi-step substitutions.
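The codon-level gap can be made concrete. The codon pairs below are illustrative examples from the standard genetic code, not from the DMS data:

```python
def nt_changes(codon_a: str, codon_b: str) -> int:
    """Number of nucleotide positions that differ between two codons."""
    return sum(a != b for a, b in zip(codon_a, codon_b))

# A single amino acid substitution can require 1, 2, or 3 nucleotide changes:
assert nt_changes("GAT", "GAA") == 1  # Asp -> Glu: one base
assert nt_changes("AAA", "ACC") == 2  # Lys -> Thr: two bases
assert nt_changes("TGG", "ACC") == 3  # Trp -> Thr: all three bases
```

A model trained on single-step trajectories sees the one-base case directly, but the two- and three-base cases only as multi-step paths, which is exactly the mismatch described above.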

Drug resistance

Added to the benchmark suite (Apr 2 meeting). Results pending.

Nucleotide frequency correlation

Apr 9 meeting — current top priority

Status: Failing. The model does not currently correlate with nucleotide frequencies in multiple sequence alignments. This is the most consequential finding to date.

Why this matters: A model trained on evolutionary trajectories should, at minimum, assign higher probability to commonly observed mutations at each site. If Evo 2's advantage on DMS tasks comes from capturing site-specific nucleotide frequencies (which Pegasus currently lacks), then DMS performance improvements are premature.

Immediate actions:

  • Rayan: reconstruct MSA nucleotide frequencies for bac120 markers
  • Trevor: write up CTMC math connecting equilibrium frequencies to model inference
  • Team: evaluate zero-shot model correlation with reconstructed frequencies
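For reference, the standard relations such a writeup would likely build on (a sketch of textbook CTMC theory, not the writeup itself): a substitution process with rate matrix Q has equilibrium nucleotide frequencies π satisfying

```latex
% Stationary distribution of a CTMC with rate matrix Q (rows sum to 0):
\pi Q = 0, \qquad \sum_{i \in \{A,C,G,T\}} \pi_i = 1
% Under time reversibility (detailed balance):
\pi_i \, q_{ij} = \pi_j \, q_{ji} \quad \text{for all } i \neq j
```

so site-specific equilibrium frequencies are the quantities the model's per-site distribution should approximate at stationarity.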

This is the baseline capability to establish before returning to DMS or other downstream benchmarks.
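The zero-shot check itself is simple to state: compute per-column frequencies from an MSA and correlate them with the model's per-site probabilities. A minimal sketch, using a toy alignment and treating the model scores as a stand-in vector:

```python
from collections import Counter

ALPHABET = "ACGT"

def site_frequencies(msa: list[str]) -> list[dict[str, float]]:
    """Per-column nucleotide frequencies, ignoring gaps and ambiguity codes."""
    freqs = []
    for col in zip(*msa):
        counts = Counter(c for c in col if c in ALPHABET)
        total = sum(counts.values()) or 1
        freqs.append({nt: counts[nt] / total for nt in ALPHABET})
    return freqs

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

msa = ["ACGT", "ACGA", "ACCT"]  # toy alignment, not bac120 data
flat_freqs = [f[nt] for f in site_frequencies(msa) for nt in ALPHABET]
# `model_probs` would come from the model's per-site distribution;
# a perfectly calibrated model matches the frequencies exactly.
model_probs = flat_freqs
assert abs(pearson(flat_freqs, model_probs) - 1.0) < 1e-9
```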

Training metrics

LLaDA2-mini training run (3 epochs, 50,625 steps):

| Metric | Start | 10,000 steps | End (50,625 steps) |
|---|---|---|---|
| Loss | 0.950 | 0.019 | 0.010 |
| Grad norm | 7.12 | 0.10 | 0.20 |
| Learning rate | 8.69e-9 | ~5e-6 | 2.81e-6 |

One epoch over all current data takes ~100 hours. Standard practice for large autoregressive models is to avoid more than one epoch to prevent memorization.

Open evaluation questions

  • Diffusion vs autoregressive: Head-to-head comparison using synthetic lattice protein data to isolate architectural effects. Not yet started (proposed Mar 26).
  • Nucleotide frequency baseline: Must establish before other benchmarks are meaningful (Apr 9).
  • Unified benchmark suite: Compiling all existing test splits into a standardized evolutionary trajectory benchmark (Zehui + Rayan, ongoing).

Available metrics

The pegasus-evals toolkit implements:

Sequence-level: Hamming distance, edit distance (Levenshtein), identity percent, alignment score, length difference

Biology-aware: BLOSUM62, PAM250, GC content difference, hydrophobicity difference (Kyte-Doolittle), charge difference
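As an example of the biology-aware group, GC content difference can be computed as follows (a sketch; the pegasus-evals implementation may handle ambiguity codes or gaps differently):

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_content_difference(pred: str, truth: str) -> float:
    """Absolute difference in GC fraction between prediction and ground truth."""
    return abs(gc_content(pred) - gc_content(truth))

assert gc_content("GGCC") == 1.0
assert gc_content_difference("GGCC", "GGAT") == 0.5
```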

See the Data Formats reference for how predictions are encoded and compared.