Evolutionary Diffusion
This project uses variational autoencoders and conditional flow matching to treat evolution as vectors through latent space. The result is a whole-genome mutation/selection process from which we can learn the tree of life.
- Technical Overview — comprehensive description of the training data pipeline, model architecture, and evaluation
- Data Formats — reference for raw, VCF, EMD, and sdiff data representations
- Benchmarks — current evaluation results and downstream benchmark tracking
- Synthetic Data — lattice protein synthetic data for controlled model evaluation
- Meeting Notes — chronological summary of project meetings
- Initial Directions — original project scoping discussion
Goals
- Model historical evolution as paths through latent space
- Train flow/diffusion models to recapitulate and project evolutionary trajectories
- Enable out-of-sample forward and reverse evolution prediction
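As a rough illustration of the first two goals, the sketch below builds one conditional flow matching training example from a parent-child pair of latent vectors. It assumes the simplest straight-line (linear interpolation) path between the two latents, which may differ from the path the project actually uses. The names `cfm_training_example` and `mse` and the toy 3-dimensional latents are hypothetical.

```python
def cfm_training_example(z_parent, z_child, t):
    """Return (z_t, target_velocity) for a straight-line path, with t in [0, 1].

    Assumption: z_t = (1 - t) * z_parent + t * z_child, so the target
    velocity is the constant vector z_child - z_parent.
    """
    z_t = [(1.0 - t) * p + t * c for p, c in zip(z_parent, z_child)]
    velocity = [c - p for p, c in zip(z_parent, z_child)]
    return z_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)


# Hypothetical latent codes for a parent and its child sequence
z_parent = [0.0, 1.0, -1.0]
z_child = [1.0, 1.0, 0.0]

z_t, v_target = cfm_training_example(z_parent, z_child, t=0.25)
# A model would be trained so that model(z_t, t) approximates v_target,
# for example by minimising mse(model(z_t, t), v_target).
```

Under this assumption, integrating the learned velocity field forward from a latent would project evolution forward, and integrating it backward would project toward ancestors.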
Repos
Training data
- trajectories (currently in blab, will move to pegasus-research) — Provisions evolutionary sequence trajectories from Nextstrain trees. Extracts parent-child sequence pairs from augur phylogenetic reconstructions for use as diffusion model training data.
- bac120 — GTDB bac120 bacterial marker gene extraction and phylogenetic tree building. Extracts 120 single-copy marker genes from bacterial genomes across 5 phyla. Four phyla complete (Cyanobacteriota, Bacteroidota, Actinomycetota, Bacillota — 485 trees, ~23M tips on S3); Pseudomonadota in progress. Output to S3 at s3://pegasus-training-data/trajectories/.
- odb — OrthoDB marker gene phylogenetics for eukaryotes. Extracts single-copy marker genes from eukaryotic genomes using compleasm, currently targeting fungi (666 genomes, 1122 markers). Builds independent phylogenetic trees per marker via the augur pipeline.
- rdrp — Viral RdRp extraction and phylogenetic tree building. Extracts RNA-dependent RNA polymerase sequences from viral genomes across Paramyxoviridae, Flaviviridae, and Picornaviridae. Builds cross-family phylogenetic trees targeting the conserved RdRp catalytic domain.
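The parent-child pairing that these repos feed into training can be sketched as a walk over a tree with reconstructed sequences at every node. This is a minimal sketch, not the trajectories code: the nested-dict tree with `name` and `children` keys and the `sequences` mapping are hypothetical stand-ins, and the real augur output files may be structured differently.

```python
def parent_child_pairs(node, sequences):
    """Yield (parent_seq, child_seq) for every edge below `node`, depth-first."""
    for child in node.get("children", []):
        yield sequences[node["name"]], sequences[child["name"]]
        yield from parent_child_pairs(child, sequences)


# Hypothetical toy tree: internal nodes carry inferred ancestral sequences
tree = {
    "name": "root",
    "children": [
        {"name": "n1", "children": [{"name": "A"}, {"name": "B"}]},
        {"name": "C"},
    ],
}
sequences = {
    "root": "ACGT",
    "n1": "ACGA",
    "A": "ACGA",
    "B": "TCGA",
    "C": "ACGT",
}

pairs = list(parent_child_pairs(tree, sequences))
```

The recursion is fine for small examples; a tree with very long root-to-tip paths would likely need an explicit stack instead.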
Model training
- diffusion-language-model — Discrete diffusion language model fine-tuning for evolutionary sequence generation. Fine-tunes discrete diffusion models (based on dFactory/LLaDA) on evolutionary trajectory data produced by bac120, odb, and rdrp.
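Masked discrete diffusion models in the LLaDA family are trained by corrupting a sequence with a mask token and learning to recover the original tokens. The sketch below shows only that corruption step, assuming each position is masked independently with probability `t`. The `MASK` symbol and `mask_sequence` function are hypothetical and are not part of the dFactory or LLaDA code.

```python
import random

MASK = "?"  # hypothetical mask symbol, chosen only for illustration


def mask_sequence(seq, t, rng):
    """Replace each character with MASK independently with probability t."""
    return "".join(MASK if rng.random() < t else ch for ch in seq)


rng = random.Random(0)  # seeded so the example is repeatable
original = "MKTAYIAKQR"
noisy = mask_sequence(original, 0.5, rng)
# Training would ask the model to predict `original` at the masked positions.
```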
Synthetic data
- trellis — Lattice protein evolutionary trajectory simulator. Generates synthetic training and evaluation data with exactly known fitness landscapes using the Miyazawa-Jernigan contact potential on 2D lattice proteins.
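To make the lattice model concrete, here is a minimal sketch of a contact energy on a 2D square lattice: residues that sit on adjacent lattice sites but are not consecutive in the chain contribute a pairwise energy. This is not the trellis implementation. The function name is hypothetical, and `toy_potential` holds placeholder values over a two-letter alphabet, not the published Miyazawa-Jernigan table.

```python
def contact_energy(sequence, coords, potential):
    """Sum pairwise contact energies over non-bonded lattice neighbours."""
    energy = 0.0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 2, n):  # skip residues bonded along the chain
            (xi, yi), (xj, yj) = coords[i], coords[j]
            if abs(xi - xj) + abs(yi - yj) == 1:  # adjacent lattice sites
                pair = tuple(sorted((sequence[i], sequence[j])))
                energy += potential[pair]
    return energy


# Placeholder energies for illustration only (NOT real Miyazawa-Jernigan values)
toy_potential = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}

# A 4-residue chain folded around a unit square: residues 0 and 3 are in contact
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
energy = contact_energy("HPPH", coords, toy_potential)
```

Because the energy of every conformation can be computed directly like this, fitness in such a model is known exactly rather than estimated.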
Evaluation
- pegasus-evals — Evaluation toolkit for evolutionary genomics models. Provides metrics (Hamming, edit distance, identity, alignment score, BLOSUM62, PAM250), synthetic trajectory generators for test datasets, and an inference runner for trained models.
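For reference, three of the listed metrics can be sketched in a few lines of plain Python. These are illustrative definitions, not the pegasus-evals API; the function names are hypothetical, and `identity` here assumes the two sequences are already aligned to equal, non-zero length.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def identity(a, b):
    """Fraction of matching positions between two aligned sequences."""
    return 1.0 - hamming(a, b) / len(a)


def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


predicted, observed = "ACGT", "ACGA"
mismatches = hamming(predicted, observed)
shared = identity(predicted, observed)
```

Hamming distance and identity only make sense for aligned sequences, while edit distance also handles insertions and deletions.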