
Evolutionary Diffusion: Meeting Notes

Trevor Bedford — 2026-04-09

Synthesized from Company meeting 2026 hand-written notes and Gemini auto-transcribed meeting recordings from Jan–Apr 2026.


Summary of open threads (as of Apr 9)

Immediate priorities:

- Nucleotide frequency baseline evaluation (Rayan: reconstruct MSA frequencies; team: evaluate zero-shot model correlation)
- CTMC math writeup connecting equilibrium frequencies to model inference (Trevor)
- Synthetic lattice protein data for architecture comparison (Trevor)
- Mammalian CDS dataset preparation, ~1–2 weeks (Rayan)

Ongoing:

- Continued pre-training with expanded data (Zehui, 2–3 weeks + 1 week post-training)
- DMS validation replication following the Evo 1 protocol (Trevor)
- Downstream benchmark dashboard assembly (Zehui)
- Unified evolutionary trajectory benchmark from compiled test splits (Zehui + Rayan)

Architectural question: Head-to-head comparison of discrete diffusion vs. autoregressive transformer, using lattice protein toy data to isolate framework effects — not yet started.

Deferred:

- Bio-agentic systems infrastructure proposal (Zehui, LaTeX draft in progress)
- Genome mining / metagenomic interpretability collaboration with Brian Hie (Trevor, proposal scaffolding)
- Wet lab integration (eventual, possibly via Sanjay's Hutch lab or contract)


Apr 9 — Nucleotide frequency crisis and recalibration

Gemini transcript

The most consequential meeting to date for project direction. Zehui presented updated DMS results: baseline performance was low across all models including Evo 2 7B. However, changing the likelihood schema — predicting the likelihood of the reference genome rather than the alternative sequence — significantly improved the Pegasus model's performance, potentially placing it second among tested models.

Rayan raised a critical concern: the model does not currently correlate with nucleotide frequencies in multiple sequence alignments. This is surprising — a model trained on evolutionary trajectories should, at minimum, assign higher probability to commonly observed mutations. Rayan argued that Evo's advantage on DMS tasks may simply reflect its ability to capture nucleotide frequencies, which Pegasus currently lacks.
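As a concrete reference point for this check, per-site nucleotide frequencies can be tallied from an alignment in a few lines. This is only a sketch; the function name, toy alignment, and handling of gaps are illustrative, not the team's actual pipeline:

```python
from collections import Counter

NUCS = "ACGT"

def msa_frequencies(alignment):
    """Per-site nucleotide frequencies from an MSA (equal-length strings).
    Gaps and ambiguous characters are excluded from the denominator."""
    n_sites = len(alignment[0])
    freqs = []
    for i in range(n_sites):
        column = [seq[i] for seq in alignment if seq[i] in NUCS]
        counts = Counter(column)
        total = len(column)
        freqs.append({n: counts[n] / total for n in NUCS} if total
                     else {n: 0.0 for n in NUCS})
    return freqs

# Toy alignment: site 1 is polymorphic (C in 3 of 4 sequences).
aln = ["ACGT", "ATGT", "ACGT", "ACGT"]
f = msa_frequencies(aln)
print(f[1]["C"])  # 0.75
print(f[1]["T"])  # 0.25
```

The sanity check Rayan describes would then be whether the model's zero-shot per-site probabilities rank sites similarly to these observed frequencies.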

The team agreed to reprioritize: nucleotide frequency prediction is now the baseline capability to establish before returning to DMS or other downstream tasks. Rayan will reconstruct MSA nucleotide frequencies for a few bac120 markers for evaluation. Trevor committed to writing up the CTMC math connecting equilibrium frequencies to model inference.
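A minimal sketch of the CTMC connection, assuming a standard reversible substitution model (the actual writeup may differ): for a rate matrix $Q$ with non-negative off-diagonal entries and rows summing to zero, the equilibrium frequencies $\pi$ satisfy

$$
\pi Q = 0, \qquad \sum_i \pi_i = 1, \qquad \lim_{t \to \infty} e^{Qt} = \mathbf{1}\,\pi,
$$

so on long branches the conditional distribution at a site approaches $\pi$ regardless of the parent state, while on short branches $P(t) = e^{Qt} \approx I + Qt$. A model that has internalized the process should therefore reproduce MSA column frequencies in the long-branch limit — which is exactly the correlation the team now wants to establish first.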

Additional complexities discussed: because of the codon table, a single amino acid change in a DMS experiment can require anywhere from one to three nucleotide mutations, while the model is trained to predict immediate next-step nucleotide changes, not multi-step amino acid substitutions.
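The 1–3 nucleotide range follows directly from the standard genetic code. A small sketch (the helper name is hypothetical; the codon table itself is the standard one) computes the minimum substitutions needed for a given amino acid change:

```python
from itertools import product

# Standard genetic code, DNA codons in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[i]
               for i, (a, b, c) in enumerate(product(BASES, repeat=3))}

def min_nt_changes(aa_from, aa_to):
    """Minimum nucleotide substitutions to convert one amino acid to
    another, minimized over all codon pairs encoding them."""
    codons_from = [c for c, aa in CODON_TABLE.items() if aa == aa_from]
    codons_to = [c for c, aa in CODON_TABLE.items() if aa == aa_to]
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_from for c2 in codons_to)

print(min_nt_changes("D", "E"))  # 1  (e.g. GAT -> GAA)
print(min_nt_changes("F", "K"))  # 3  (no shared codon positions)
```

So a DMS amino acid substitution like F→K corresponds to a three-step nucleotide trajectory under the model's framing, not a single prediction.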

Training timeline: Zehui estimated 2–3 weeks for continued pre-training with new data, plus 1 week for post-training. One epoch over all data takes ~100 hours. Standard practice for large autoregressive models is to avoid more than one epoch to prevent memorization.

Team update: Sanjay's contract is executed; Will's is pending UW legal review. Trevor to draft a project summary for onboarding. The team discussed establishing a docs repo (Google Drive → markdown export once pegasus-docs is set up).

Apr 2 — Hiring solidifies, fungi data, agentic systems proposal

Gemini transcript

Contract offers extended to Will DeWitt, Sanjay Srivatsan, and Tami Lieberman. Will and Sanjay expected to start within weeks; Tami deferred until October (maternity leave). Will to focus on evolutionary trajectories; Sanjay to work on genome mining and cell modeling.

Rayan finished generating fungi trees for OrthoDB data. Observations: fungi data is an order of magnitude more mutated than spike data; the largest dataset (~50k nodes) can't be loaded by Auspice. Rayan deployed a Claude Code agent on Slack for cloud monitoring.

Zehui updated downstream benchmarks: removed two protein-based datasets that didn't make sense, added cancer driver and drug resistance datasets. Discussion of converting protein DMS experiments back to nucleotide-level mutations for model comparison — complicated because DMS data often provides only protein sequences.

Zehui proposed a 2-year parallel research project on bio-agentic systems infrastructure: (1) sandboxed containers for bio-agentic work, (2) communication layer for multi-agent coordination, (3) harness engineering for optimizing interactions. Trevor expressed strong interest and asked for a written proposal.

Branding exercise started with consultant Ray Ueno (via Gabe), beginning with working name, tagline, and brand guidelines.

Mar 26 — Cancer driver results, antibacterial peptides, and the architecture question

Gemini transcript

Zehui presented results on two downstream tasks:

Antibacterial peptide generation: SFT was insufficient, and ESM outperformed the model zero-shot. Conclusion: RL with an external validator as the reward function (PPO) is needed for tasks with limited training data.

Cancer driver gene classification: Positive results. The model demonstrated zero-shot ability to classify mutations as pathogenic or benign by predicting the probability of the reference genome given the mutation — a significant difference between benign and pathogenic groups even without human genome training data. DPO-type RL pushed accuracy to ~80%.

Trevor emphasized the need for a head-to-head comparison between discrete diffusion and standard autoregressive transformers, proposing a toy lattice protein dataset to isolate architectural effects from data confounds.

Rayan noted that diffusion models for fitness landscape prediction are starting to appear in the broader field. Fungi trees are complete; a large ~1000-tree dataset is the next step. Zehui proposed compiling all existing test splits into a unified evolutionary trajectory benchmark.

Compute strategy discussed: the team should not rely on the Starfish server room (construction delays). Two options: purchase and collocate a server, or rent long-term from a cloud provider. Estimated compute budget: ~$1M/year out of $5–10M total. Valve's recommendation (via prior conversation): start simple, no Slurm, add scheduling only when resource contention appears.

Mar 18 — VFT confirmed, unified training format, classification benchmarks

Gemini transcript

Major milestone on tree construction: Rayan's comparison confirmed VeryFastTree (VFT) provides acceptable accuracy (within ~0.5% of the RAxML likelihood) with orders-of-magnitude better speed. The team concluded VFT is sufficient for all current datasets.

Zehui reported on training a smaller 4B-parameter model for one epoch over two weeks. Performance dropped after removing tail duplication; some backward bac120 examples had missing reference sequences. The model's default behavior is to copy the previous generation's mutation — reflecting the dominant pattern in training data but highlighting the need for a better evaluation metric.

Key architectural development: pairwise and forward trajectories are now unified into a single training format. The model receives a reference sequence plus historical trajectory and predicts tip mutations. Pairwise data is ~9x more abundant. Rayan raised the contrarian view that pairwise prediction ("who is your brother") and forward prediction ("who is your son") are fundamentally different tasks; Zehui argued the model can learn the distinction from context.

Zehui proposed using the model as a representation learner with a classification head for tasks like pathogenic/benign mutation classification. Trevor suggested the logit ratio between ref→alt and alt→ref might indicate fitness. New benchmark directions discussed: drug resistance, DMS effects for E. coli and other bacteria, following the Evo paper protocol.
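Trevor's logit-ratio suggestion can be sketched as a simple score. Everything here is illustrative: `log_prob` stands in for whatever model interface exposes conditional sequence likelihoods, and the toy model exists only to make the example runnable:

```python
def logit_ratio_score(log_prob, ref_seq, alt_seq):
    """log P(ref -> alt) - log P(alt -> ref).
    Positive: the mutation fixes more readily than it reverts (favored).
    Negative: reversion is preferred (purifying selection).
    `log_prob(src, dst)` is a hypothetical model interface."""
    return log_prob(ref_seq, alt_seq) - log_prob(alt_seq, ref_seq)

# Toy stand-in model that simply prefers A-rich targets (illustration only).
def toy_log_prob(src, dst):
    return 0.1 * dst.count("A") - 0.1 * src.count("A")

print(logit_ratio_score(toy_log_prob, "ACGT", "AAGT"))  # positive: C2A favored
```

For the pathogenic/benign task this score could feed a threshold or a lightweight classification head, as in Zehui's representation-learning proposal.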

All current training data confirmed to be coding regions only — no non-coding DNA. For eukaryotes, the team decided to use the longest splice form (CDS with introns stripped). Trevor raised concern about large eukaryotic genes (up to 100kb with introns) possibly needing windowed chunking.

Mar 9 — Post-training pipeline and evaluation dimensions

Zehui demoed the lambda-resource-manager and identified a tip duplication issue (trajectories PR #15). The team discussed pairwise evaluation and established the need for multiple downstream benchmarks to develop "true capabilities."

Post-training pipeline outlined: first step is supervised fine-tuning (SFT), potentially for longer trajectories. The team identified at least 4 evaluation dimensions: trajectory prediction, branch length estimation, DMS correlation, and contact maps. Trevor wrote a Slack thread on simulated data as a complementary evaluation approach. Rayan to continue generating bac120 data with VFT comparison.

Infrastructure discussion: Zehui proposed building AutoResearch-style infrastructure giving Claude Code access to GPU, with safe operating constraints. They referenced AReaL as an example.

Feb 27 — Gaps resolution

Short meeting. Trevor described two separate data quality issues: (1) spurious gaps in SARS-CoV-2 data that can be partially repaired, and (2) genuine length variation where phylogenetic ancestral reconstruction doesn't assign gaps correctly. Decision: keep gaps in pairwise comparisons.

Feb 19 — Tree comparison challenges and gap handling

Rayan reported on IQ-TREE vs RAxML tree comparisons: trees differ between methods and computational trade-offs vary. Trevor proposed direct likelihood comparison between topologies. For trees with 50k+ tips, the team discussed clustering-based methods to split into smaller subtrees.

Training data status across the tree of life: viruses (2 markers, millions of leaves each); bacteria/bac120 (Cyanobacteria 2k genomes, Bacteroidetes 22k, Actinobacteria 50k, Pseudomonadota 250k planned); fungi (thousands of markers, ~20k leaves each).

Gap handling was resolved: true gaps are rare in most datasets (SARS-CoV-2 is an exception with many spurious gaps); pairwise comparisons will keep gaps in, following the PEINT approach of using tip-to-tip "cherries."
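For readers unfamiliar with the term, a "cherry" is a pair of tips sharing a direct parent. A minimal sketch of extracting them, assuming the tree is available as a child-to-parent map (an encoding chosen here purely for illustration):

```python
from collections import defaultdict

def cherries(parent_of, tips):
    """Return tip-to-tip 'cherries': pairs of tips sharing a direct
    parent, as in the PEINT-style pairwise comparisons described above.
    `parent_of` maps node -> parent; `tips` is the set of leaf names."""
    children = defaultdict(list)
    for node in tips:
        children[parent_of[node]].append(node)
    return sorted(tuple(sorted(kids))
                  for kids in children.values() if len(kids) == 2)

# Toy tree: ((A,B)n1,(C,D)n2)root
parent = {"A": "n1", "B": "n1", "C": "n2", "D": "n2",
          "n1": "root", "n2": "root"}
print(cherries(parent, {"A", "B", "C", "D"}))  # [('A', 'B'), ('C', 'D')]
```

Comparing within a cherry keeps both sequences observed at the tips, which is what lets the pipeline keep gaps without relying on ancestral reconstruction to place them.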

Feb 12 — Data representation debates

Trevor worked on integrating trajectories train/test data into his previous latent diffusion repo (cov-diffusion PR #2). Rayan reported IQ-TREE working for 20k samples but breaking at 50k. Zehui proposed a 4th data representation: replacing positions with context from raw sequence, inspired by how language models generate diffs rather than complete files.

Zehui raised the idea of using the model for out-of-domain data generation (e.g., tumor evolution) and the need for RL with a reward function. The team discussed direct ESM comparison and the importance of scalability. Zehui advocated for building agentic infrastructure for biological AI — "discrete diffusion brings representation of knowledge to the same level of language." The first mention of mkdocs-material for central documentation also occurred in this meeting.

Feb 5 — UShER data, hiring plans, and evaluation thinking

Trevor got the ~8M-sample UShER tree for SARS-CoV-2 spike into training data and planned to shift personally from training data to evaluation and the latent diffusion model. Hiring discussions: plans for affiliate (~20%) contracts with Tami Lieberman, Will DeWitt, and Sanjay Srivatsan. Zehui presented "Evolutionary Mutation Language" as a formalization of the compact data format.

Data quality issues surfaced: some test trajectories had length 1 (related to train/test split logic) and zero-length branches between final node and tip. Trevor took action to fix both. Zehui raised concerns that multi-tasking (forward + pairwise training jointly) might harm performance and noted the need for someone with strong engineering skills to package tools and host inference APIs.

Jan 22 — Train/test splits and pairwise trajectories

Trevor demoed the train/test split implementation in blab/trajectories PR #3. He proposed splitting forward trajectories (root-to-tip) from pairwise trajectories (tip-to-tip). Benefits of pairwise: (1) more training data, (2) higher accuracy since it skips error-prone ancestral state reconstruction, (3) handles indels and complex mutations naturally.

Action items established: Trevor to work on pairwise train/test sets, Rayan to expand RdRp repo with more families, Zehui to explore VCF-based training and decoder-only training. Zehui suggested adding taxonomy labels to FASTA headers. Trevor proposed auspice.json as the primary input format. The team acknowledged the need for a better evaluation metric beyond Hamming distance and for prioritization across the growing list of tasks.

Jan 16 — First model results and the VCF representation idea

Gemini transcript

Zehui demonstrated the first training results. The approach: continue pre-training a 16B-parameter Llama 2 mini diffusion language model on evolutionary trajectory triples (A→B→C), running on 4 H100 GPUs. Training used ~100k triples with max token length 2048; one epoch took about 5 hours. The model showed smooth training and perfect memorization on one training example.

A critical problem emerged: ~90% of compute was spent learning to copy, since successive nodes in a trajectory differ by only a handful of mutations. Zehui proposed reformulating the data as a compact VCF-like representation — positions and substitutions only — reframing evolution prediction as a "coding problem" that exploits the language model's generalization abilities. Trevor agreed this resembles how mutation-annotated trees (MATs) already represent changes compactly.
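One hypothetical take on such a compact representation, to make the idea concrete (the actual format adopted may differ; this assumes aligned, equal-length parent/child sequences):

```python
def compact_diff(parent_seq, child_seq):
    """Emit only (position, ref_base, alt_base) for changed sites,
    instead of repeating the full child sequence."""
    return [(i, p, c)
            for i, (p, c) in enumerate(zip(parent_seq, child_seq))
            if p != c]

def apply_diff(parent_seq, diff):
    """Reconstruct the child sequence from the parent plus the diff."""
    seq = list(parent_seq)
    for pos, ref, alt in diff:
        assert seq[pos] == ref, "diff does not match parent sequence"
        seq[pos] = alt
    return "".join(seq)

parent = "ACGTACGT"
child  = "ACCTACGA"
d = compact_diff(parent, child)
print(d)                               # [(2, 'G', 'C'), (7, 'T', 'A')]
print(apply_diff(parent, d) == child)  # True
```

The payoff is exactly the one discussed: the model spends its capacity on the handful of changed positions rather than on copying the unchanged ~99% of the sequence.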

The team discussed train/test splitting. Trevor proposed a clade excision strategy: walk back from a random tip, excise the subtree, and hold it out. He took this as an action item. Rayan noted the massive data available: 8M tips in the UShER SARS-CoV-2 tree, 200M sequences for RdRp. Zehui set a target of publishing a paper within six months.
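The clade excision strategy can be sketched roughly as follows. The tree encoding (parent/children dicts), the `n_back` parameter, and the function name are all simplifying assumptions for illustration, not the implementation Trevor took on:

```python
import random

def excise_clade(parent_of, children_of, tips, n_back=1, rng=random):
    """Pick a random tip, walk `n_back` steps toward the root, and hold
    out every tip under that ancestor as the test set."""
    node = rng.choice(sorted(tips))
    for _ in range(n_back):
        if node in parent_of:
            node = parent_of[node]
    # Collect all tips beneath the chosen ancestor.
    stack, held_out = [node], set()
    while stack:
        cur = stack.pop()
        if cur in tips:
            held_out.add(cur)
        stack.extend(children_of.get(cur, []))
    return held_out, tips - held_out

# Toy tree: ((A,B)n1,(C,D)n2)root — excising one cherry holds out 2 tips.
parent = {"A": "n1", "B": "n1", "C": "n2", "D": "n2",
          "n1": "root", "n2": "root"}
kids = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
test_tips, train_tips = excise_clade(parent, kids, {"A", "B", "C", "D"},
                                     n_back=1, rng=random.Random(0))
print(len(test_tips), len(train_tips))  # 2 2
```

Holding out a whole subtree, rather than random tips, prevents near-duplicate neighbors of a test sequence from leaking into training.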

Jan 8 — Project kickoff

The first meeting established the data format for evo-diffusion training data and Trevor's plans to generate SARS-CoV-2 spike protein training data, starting with RdRp for a single clade. Compute infrastructure was discussed: B200 GPUs are roughly 2x faster than H200 and worth the expense; Zehui is most familiar with Azure but open to AWS; a 1xB200 cloud instance runs roughly $5k/month. Zehui presented a slide deck on diffusion language models covering architecture and implementation.