Skip to content

Evolutionary Diffusion: Synthetic Data

Trevor Bedford — 2026-05-18

This document describes the use of lattice proteins as synthetic ground-truth data for evaluating the evolutionary diffusion model. The lattice protein system provides an exactly known fitness landscape over sequence space, enabling controlled evaluation questions that are impossible with empirical data. The implementation lives in the github.com/blab/trellis repo.

The evaluation gap

The evolutionary diffusion model is trained on trajectories extracted from real phylogenetic trees — bacteria, fungi, and viruses (see Technical Overview). When evaluating on real data, stochasticity becomes a challenge. We have a number of realized train (and test) trajectories and it's possible to see that the model predicts test outcomes that correspond well to test data. However, we believe that test data is actually just a (small) subset of the potential evolutionary realizations. If the model produces a novel trajectory, it could be a legitimate trajectory that's not been sampled by evolution, or it could be a faulty prediction.

What is needed is a system where fitness is exactly known for every possible sequence, so evaluation can directly test whether the model has learned the relationship between sequence, structure, and fitness. The simple Miyazawa-Jernigan lattice protein model has a unique property that no more realistic approach can match: exact, deterministic, efficiently computable fitness for every possible sequence.

Alternatives considered:

  • Protein language model oracles (e.g., ESM-2 pseudolikelihood as fitness): testing one model's predictions against another model's predictions is a much weaker evaluation than testing against mathematically exact fitness values, and this is significantly complicated by lack of an obvious fitness function in PLMs.
  • Physics-based energy functions (e.g., Rosetta): orders of magnitude slower (minutes per sequence vs. seconds), the energy function is still parameterized, and stochastic sampling means the same sequence can give different energies on repeated evaluations.
  • Molecular dynamics FEP: gold standard for binding free energy but completely impractical for trajectory generation — each fitness evaluation takes hours of GPU time.

If the diffusion model cannot learn to predict the next step in a lattice protein trajectory — where the fitness landscape is smooth, deterministic, and has clear structure — it certainly won't succeed on real evolutionary data where the landscape is noisier and more complex. The lattice evaluation is a necessary condition for trusting the model on harder problems, though not sufficient on its own — the empirical benchmarks (Benchmarks) test whether it works on real biology.

Lattice protein model

Sequence, structure, and energy

A lattice protein is an amino acid sequence placed as a self-avoiding walk on a 2D square lattice. Each residue occupies one lattice site, consecutive residues are adjacent on the lattice, and no two residues share a site. When two non-bonded residues (not neighbors along the chain) happen to occupy adjacent lattice sites, they form a contact.

Each contact contributes energy determined by the Miyazawa-Jernigan (1985) contact potential — a 20×20 symmetric matrix of pairwise amino acid interaction energies derived from observed protein crystal structures. The MJ matrix is the only parameter in the model. Values range from −6.85 kT (Phe-Phe, strongly attractive) to +0.13 kT (Lys-Lys, slightly repulsive). The total energy of a conformation is the sum of MJ energies over all contacts:

\[E = \sum_{\text{contacts } (i,j)} \text{MJ}(a_i, a_j)\]

Exact folding

For short chains (≤20 residues on a 2D lattice), all possible self-avoiding walk conformations can be enumerated exhaustively. The trellis pipeline pre-enumerates every conformation for a given chain length and ligand once, extracts the contact pairs for each conformation, and stores them in compressed arrays. Scoring a new sequence then reduces to summing MJ matrix lookups over these precomputed contact lists — no geometry needs to be re-discovered. This inner scoring loop is compiled via Numba JIT for speed.

Because every conformation is scored, the partition function is exact:

\[Z = \sum_c \exp(-E_c / T)\]

where the sum runs over all conformations and \(T\) is the temperature in reduced units. The native conformation (lowest energy) is identified without approximation. The fraction folded — the Boltzmann probability of the native state — is

\[f(T) = \frac{\exp(-E_\text{native} / T)}{Z}\]

This exactness is the key advantage over all alternatives. There is no sampling noise, no convergence question, no parameterization beyond the MJ matrix itself.

Following Bloom et al. (2004), mean-field pruning removes low-contact conformations from the scored set (they contribute to \(Z\) via a cumulant expansion correction), reducing the number of conformations scored by approximately 7× without meaningful loss of accuracy. Enumeration is a one-time cost amortized over the thousands of fold calls in each SSWM trajectory.

Ligand binding and fitness

A short peptide ligand (e.g., a 4-residue sequence like "FWYL") is fixed at a position adjacent to the protein on the lattice. When the protein folds, protein-ligand contacts are scored using the same MJ matrix. Fitness combines folding stability and binding affinity following the literature-standard formulation from Bloom, Wilke, Arnold & Adami (2004):

\[\text{fitness} = f(T) \times (-E_\text{bind}(C_\text{native}))\]

where \(f(T)\) is the fraction folded and \(E_\text{bind}(C_\text{native})\) is the binding energy of the native conformation to the ligand (negated so that stronger binding gives higher fitness). This separates fitness into two components: stability (how reliably does the protein fold?) and function (how well does the folded protein bind?). An unstable protein has \(f(T) \approx 0\) and scores near zero regardless of its binding capacity.

Each ligand sequence defines a different fitness landscape over protein sequence space. Changing the ligand changes which protein sequences are fit — the same protein that binds well to one ligand may bind poorly to another. This means generating trajectories across many different ligands produces training data that samples diverse fitness landscapes, not just one.

Evolutionary simulation

Genetic code

Evolution in trellis operates at the DNA level, not the protein level. Each amino acid is encoded by a codon (3 nucleotides) via the standard genetic code. A 20-amino-acid protein is encoded by a 60-nucleotide DNA sequence. Single-nucleotide mutations produce three categories with different fitness consequences:

  • Synonymous: the codon changes but encodes the same amino acid. No fitness change. The genetic code is degenerate — most amino acids are encoded by 2–6 codons.
  • Nonsynonymous (missense): the codon encodes a different amino acid. Fitness changes depending on how the substitution affects folding and ligand binding.
  • Nonsense: the mutation introduces a stop codon. Lethal — fitness drops to zero.

This creates the same dN/dS structure as real molecular evolution, making synthetic trajectories directly comparable to the training data from bac120, odb, and rdrp. The model must learn that not all nucleotide changes matter equally — the same challenge it faces on real biological data.

SSWM dynamics

Trajectories are generated under the Strong Selection Weak Mutation (SSWM) regime: mutations are rare enough that each one either fixes in the population or is lost before the next arises. The population is always monomorphic (one genotype at a time). At each evolutionary step:

  1. Enumerate all single-nucleotide mutations of the current 60 nt sequence (3 alternatives at each of 60 positions = 180 possible point mutations).
  2. Deduplicate by translated amino acid sequence (~60 unique proteins from ~180 DNA mutations, because the genetic code is degenerate). Fold each unique protein once.
  3. Compute the selection coefficient for each mutation: \(s = w_\text{mutant} - w_\text{current}\), where \(w\) is fitness.
  4. Compute the fixation probability using the Kimura (1962) formula:
\[P_\text{fix}(s) = \frac{1 - e^{-2s}}{1 - e^{-2N_e s}}\]

where \(N_e\) is the effective population size. Beneficial mutations (\(s > 0\)) fix with probability greater than drift; deleterious mutations (\(s < 0\)) fix with probability less than drift; neutral mutations (\(s = 0\)) fix with probability \(1 / 2N_e\).

  1. Sample which mutation fixes, proportional to fixation probabilities.

This produces adaptive evolutionary trajectories that climb the fitness landscape, with occasional neutral substitutions and rare slightly deleterious fixations — the same dynamics expected in real microbial populations.

Trajectory output

Each trajectory is written as a FASTA file with the DNA sequence at each step, using the same format produced by blab/trajectories from real phylogenetic trees. Headers encode cumulative and per-branch Hamming distances. This means the existing preprocessing pipeline (FASTA → JSONL in raw, VCF, or EMD format) works identically for synthetic and real data — no special handling needed. Trajectories are packaged as compressed tar.zst shards for storage.

Planned dataset

Parameter Value
Ligands 1000 random 4-mer peptides
Trajectories per ligand 1000 (each from a random 20-mer protein start)
Steps per trajectory 50 SSWM substitutions
Total trajectories 1,000,000
Total sequence frames ~50,000,000
Sequence length 60 nt (20 amino acids)
Effective population size 1000
Temperature 1.0 (reduced units)
Output format FASTA → tar.zst shards

1000 different ligands means 1000 different fitness landscapes. This tests whether the model can generalize across landscapes rather than memorizing one. 1000 trajectories per ligand from random starting sequences samples diverse paths through each landscape — different starting points climb different ridges. 50 steps is long enough to observe meaningful fitness climbs and plateau effects where the protein approaches a local optimum.

Trajectories are independent and embarrassingly parallel. Each worker process loads its own conformation database and fitness cache with no shared mutable state. Per-trajectory RNG streams are created via NumPy's SeedSequence.spawn() for deterministic reproducibility regardless of worker count.

Evaluation questions

The synthetic data enables several evaluation questions that are impossible with empirical data:

Does the model assign higher probability to beneficial mutations? For each sequence in a trajectory, the exact fitness of every possible single-nucleotide mutant is known. We can test whether the model's predicted probability for a mutation correlates with its Kimura fixation probability under SSWM.

Does the model capture epistasis? The lattice protein fitness landscape is epistatic: the fitness effect of a mutation depends on which other mutations are present, because contact energies depend on the full set of residues in contact in the folded structure. We can test whether the model's predictions change appropriately when the background sequence changes.

Does the model distinguish synonymous from nonsynonymous changes? The genetic code layer means some nucleotide mutations change the protein and some do not. A model that has learned evolutionary constraints should assign similar probabilities to synonymous variants at a codon and different probabilities to nonsynonymous variants based on their structural consequences.

Can the model predict multiple steps along a fitness gradient? Given the first \(K\) steps of a trajectory, can the model predict steps \(K+1\) through \(K+N\) in a way that continues climbing the fitness landscape? This tests whether the model has learned the direction of adaptive evolution, not just single-step copying.

Does discrete diffusion outperform autoregressive generation? By running both LLaDA2-mini and Qwen3-4B on identical synthetic data, we can isolate the effect of generation architecture from the data confounds that complicate comparisons on real biological data.

Interactive visualization

An interactive visualization shows the lattice protein model in action: a fitness-versus-step trajectory plot alongside small-multiple panels of native conformations on the 2D lattice, with protein-ligand contacts, backbone structure, and mutation annotations. This makes it possible to build intuition for how sequence changes propagate through structure to affect fitness.

References

Bloom JD, Wilke CO, Arnold FH, Adami C. "Stability and the evolvability of function in a model protein." Biophysical Journal 86:2758–2764 (2004). doi:10.1016/S0006-3495(04)74325-7

Kimura M. "On the probability of fixation of mutant genes in a population." Genetics 47:713–719 (1962). doi:10.1093/genetics/47.6.713

Miyazawa S, Jernigan RL. "Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation." Macromolecules 18:534–552 (1985). doi:10.1021/ma00145a028