Benchmarks and Results¶
Tracking evaluation results for the evolutionary diffusion model. Updated as new benchmarks are run.
Last updated: 2026-04-09
Models¶
| Model | Architecture | Parameters | Training data | Status |
|---|---|---|---|---|
| LLaDA2-mini | MoE discrete diffusion | 16B | Combined evo trajectories | Primary model, 3 epochs (50,625 steps) |
| Qwen3-4B | Autoregressive transformer | 4B | Forward raw trajectories | Comparison baseline, 1 epoch |
Trajectory prediction¶
Hamming distance between predicted and ground truth sequences (lower is better). Results from LLaDA2-mini after 3 epochs of training.
| Dataset | Organism | Mean Hamming | Std | Samples |
|---|---|---|---|---|
| spike-xs | SARS-CoV-2 | 1.72 | 3.62 | 100 |
| spike-sm | SARS-CoV-2 | 1.92 | 3.75 | 100 |
| cytb-xs | Mammals | 49.6 | 194.9 | 71 |
| n450-xs | Measles | 120.0 | 165.0 | 100 |
Observations:
- Excellent performance on SARS-CoV-2 spike sequences (mean error <2 bases), where successive sequences differ by few mutations.
- Weaker on more divergent datasets (cytb, n450) where mutation complexity is higher.
- High standard deviation across all datasets suggests inconsistent prediction quality across samples.
- The model's default behavior is to copy the previous generation's mutation — reflecting the dominant pattern in training data (Mar 18 meeting).
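The Hamming metric reported in the table above can be sketched as follows. Function names are illustrative, not the pegasus-evals API; the convention for unequal-length sequences here (count the length difference as extra mismatches) is an assumption.

```python
def hamming_distance(pred: str, truth: str) -> int:
    """Count positions where two sequences differ.

    Unequal-length sequences are compared over the shared prefix, with
    the length difference added as extra mismatches (one simple
    convention; the toolkit may handle length mismatch differently).
    """
    core = sum(a != b for a, b in zip(pred, truth))
    return core + abs(len(pred) - len(truth))


def mean_hamming(pairs) -> float:
    """Mean Hamming distance over (prediction, ground_truth) pairs."""
    return sum(hamming_distance(p, t) for p, t in pairs) / len(pairs)
```

The per-dataset numbers above are the mean and standard deviation of this distance over the listed sample counts.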
Downstream benchmarks¶
Cancer driver gene classification¶
Mar 26 meeting
The model classifies mutations as pathogenic or benign by predicting P(reference | mutation) — the probability of the reference genome given the mutated sequence.
| Method | Result | Notes |
|---|---|---|
| Zero-shot (LLaDA2-mini) | Significant separation between benign/pathogenic groups | No human genome training data |
| + DPO post-training | ~80% | DPO-type reinforcement learning |
This is a positive result: the model learned enough about evolutionary constraints from non-human training data to distinguish pathogenic from benign mutations in human cancer drivers.
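The classification rule amounts to thresholding the model's likelihood score. A minimal sketch with a stand-in score (the real value would come from the model's P(reference | mutation); the direction of the decision and the threshold calibration are assumptions, not stated in the source):

```python
def classify_mutation(log_p_ref_given_mut: float, threshold: float) -> str:
    """Label a mutation from a log P(reference | mutation) score.

    Assumed intuition: if the reference remains highly probable given
    the mutated sequence, the site tolerates the change (benign); a low
    score suggests a constrained site (pathogenic). The threshold would
    be calibrated on labeled cancer-driver data.
    """
    return "benign" if log_p_ref_given_mut >= threshold else "pathogenic"
```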
Anti-bacterial peptide generation¶
Mar 26 meeting
| Method | Performance | Notes |
|---|---|---|
| SFT (LLaDA2-mini) | Below ESM zero-shot | Insufficient for generative tasks |
| ESM (zero-shot) | Better than SFT | Strong protein-level baseline |
Conclusion: RL with an external validator as reward function (PPO) is needed for tasks with limited training data. SFT alone is insufficient when the model hasn't seen the target domain during pre-training.
DMS (deep mutational scanning) correlation¶
Apr 9 meeting
Evaluating correlation between model-predicted likelihoods and experimental DMS fitness measurements.
| Likelihood schema | Performance | Notes |
|---|---|---|
| P(mutation \| reference) | Low across all models including Evo 2 7B | Original approach |
| P(reference \| mutation) | Significantly improved | Potentially second among tested models |
Key finding: The likelihood direction matters. Predicting the probability of the wild-type sequence given a mutation (rather than the reverse) is a better proxy for fitness effects.
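Correlation with DMS fitness measurements is conventionally reported as Spearman rank correlation between model scores and experimental fitness. A self-contained stdlib sketch (toy data only, not benchmark numbers):

```python
def _ranks(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly monotone relationship between scores and fitness gives rho = 1 regardless of the likelihood scale, which is why rank correlation is the standard choice here.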
Complication: DMS experiments measure amino acid substitution effects, but the model predicts nucleotide-level changes. The codon table means single amino acid changes require 1–3 nucleotide mutations, and the model is trained to predict immediate next-step changes, not multi-step substitutions.
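The codon complication can be made concrete: the same amino acid substitution can cost anywhere from one to three nucleotide changes depending on the starting codon. A small illustration using standard-codon-table codons:

```python
def nt_changes(codon_a: str, codon_b: str) -> int:
    """Nucleotide Hamming distance between two three-base codons."""
    return sum(a != b for a, b in zip(codon_a, codon_b))

# Leu (CTG) -> Pro (CCG) is reachable in one nucleotide change,
# while Leu (CTG) -> Lys (AAG) requires two. A nucleotide-level
# next-step model sees these as very different events even though
# DMS scores both as single amino acid substitutions.
```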
Drug resistance¶
Added to the benchmark suite (Apr 2 meeting). Results pending.
Nucleotide frequency correlation¶
Apr 9 meeting — current top priority
Status: Failing. The model does not currently correlate with nucleotide frequencies in multiple sequence alignments. This is the most consequential finding to date.
Why this matters: A model trained on evolutionary trajectories should, at minimum, assign higher probability to commonly observed mutations at each site. If Evo 2's advantage on DMS tasks comes from capturing site-specific nucleotide frequencies (which Pegasus currently lacks), then DMS performance improvements are premature.
Immediate actions:
- Rayan: reconstruct MSA nucleotide frequencies for bac120 markers
- Trevor: write up CTMC math connecting equilibrium frequencies to model inference
- Team: evaluate zero-shot model correlation with reconstructed frequencies
This is the baseline capability to establish before returning to DMS or other downstream benchmarks.
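Per-site nucleotide frequencies from an MSA reduce to a column-wise count. A sketch under simplifying assumptions (equal-length pre-aligned sequences as plain strings; gaps ignored in the per-site normalization; the real bac120 pipeline will need alignment parsing and its own gap policy):

```python
from collections import Counter

def site_frequencies(msa, alphabet="ACGT"):
    """Per-column nucleotide frequencies for aligned sequences.

    msa: list of equal-length aligned sequence strings. Characters
    outside the alphabet (e.g. '-') are excluded from each site's
    denominator.
    """
    n_sites = len(msa[0])
    freqs = []
    for i in range(n_sites):
        counts = Counter(seq[i] for seq in msa if seq[i] in alphabet)
        total = sum(counts.values()) or 1  # avoid div-by-zero on all-gap columns
        freqs.append({nt: counts[nt] / total for nt in alphabet})
    return freqs
```

The zero-shot check would then correlate these empirical frequencies against the model's per-site probabilities.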
Training metrics¶
LLaDA2-mini training run (3 epochs, 50,625 steps):
| Metric | Start | 10,000 steps | End (50,625 steps) |
|---|---|---|---|
| Loss | 0.950 | 0.019 | 0.010 |
| Grad norm | 7.12 | 0.10 | 0.20 |
| Learning rate | 8.69e-9 | ~5e-6 | 2.81e-6 |
One epoch over all current data takes ~100 hours. Standard practice for large autoregressive models is to avoid more than one epoch to prevent memorization.
Open evaluation questions¶
- Diffusion vs autoregressive: Head-to-head comparison using synthetic lattice protein data to isolate architectural effects. Not yet started (proposed Mar 26).
- Nucleotide frequency baseline: Must establish before other benchmarks are meaningful (Apr 9).
- Unified benchmark suite: Compiling all existing test splits into a standardized evolutionary trajectory benchmark (Zehui + Rayan, ongoing).
Available metrics¶
The pegasus-evals toolkit implements:
Sequence-level: Hamming distance, edit distance (Levenshtein), identity percent, alignment score, length difference
Biology-aware: BLOSUM62, PAM250, GC content difference, hydrophobicity difference (Kyte-Doolittle), charge difference
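As an example of the biology-aware metrics, GC content difference compares the fraction of G/C bases between two sequences (illustrative implementation, not the pegasus-evals source):

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def gc_content_difference(pred: str, ref: str) -> float:
    """Absolute difference in GC fraction between prediction and reference."""
    return abs(gc_content(pred) - gc_content(ref))
```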
See the Data Formats reference for how predictions are encoded and compared.