Evolutionary Diffusion
This project uses variational autoencoders and conditional flow matching to treat evolution as vectors through latent space. This gives a whole-genome mutation and selection process from which we can learn the tree of life.
See Initial Directions for the original project scoping discussion, Technical Overview for a comprehensive description of the training data pipeline, model architecture, and evaluation, and Meeting Notes for a chronological summary of project meetings.
Goals
- Model historical evolution as paths through latent space
- Train flow/diffusion models to recapitulate and project evolutionary trajectories
- Enable out-of-sample forward and reverse evolution prediction
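To make the "paths through latent space" idea concrete, here is a minimal sketch of how one training example could be built for conditional flow matching. It assumes the simplest straight-line path between a parent latent and a child latent, which is one common choice and not necessarily the one this project uses. The function name `cfm_training_example` and the list-of-floats latent representation are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def cfm_training_example(
    z_parent: List[float],
    z_child: List[float],
    t: Optional[float] = None,
) -> Tuple[float, List[float], List[float]]:
    """Build one (t, z_t, target_velocity) triple for a parent -> child branch.

    Assumption: a straight-line path z_t = (1 - t) * z_parent + t * z_child,
    whose velocity is constant and equal to z_child - z_parent.
    """
    if len(z_parent) != len(z_child):
        raise ValueError("parent and child latents must have the same dimension")
    if t is None:
        t = random.random()  # sample a time uniformly in [0, 1)
    # Point on the straight line between the two latents at time t.
    z_t = [(1.0 - t) * p + t * c for p, c in zip(z_parent, z_child)]
    # Regression target for the velocity model at (t, z_t).
    velocity = [c - p for p, c in zip(z_parent, z_child)]
    return t, z_t, velocity


t, z_t, velocity = cfm_training_example([0.0, 2.0], [1.0, 0.0], t=0.25)
```

A velocity model trained to regress `velocity` from `(t, z_t)` could then be integrated forward from a latent to project evolution, or backward to reverse it, in line with the goals above.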
Repos
Training data
- trajectories (currently in blab, will move to pegasus-research) — Provisions evolutionary sequence trajectories from Nextstrain trees. Extracts parent-child sequence pairs from augur phylogenetic reconstructions for use as diffusion model training data.
- bac120 — GTDB bac120 bacterial marker gene extraction and phylogenetic tree building. Extracts 120 single-copy marker genes from bacterial genomes across phyla (Cyanobacteriota, Bacteroidota, Actinomycetota, Bacillota, Pseudomonadota), builds per-gene phylogenetic trees, and generates trajectory data for ML training. Output to S3 at s3://pegasus-training-data/trajectories/.
- odb — OrthoDB marker gene phylogenetics for eukaryotes. Extracts single-copy marker genes from eukaryotic genomes using compleasm, currently targeting fungi (666 genomes, 1122 markers). Builds independent phylogenetic trees per marker via the augur pipeline.
- rdrp — Viral RdRp extraction and phylogenetic tree building. Extracts RNA-dependent RNA polymerase sequences from viral genomes across Paramyxoviridae, Flaviviridae, and Picornaviridae. Builds cross-family phylogenetic trees targeting the conserved RdRp catalytic domain.
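As an illustration of the parent-child pair extraction that these repos feed into training, the sketch below walks a tree and emits one pair per branch. It is not the code from the trajectories repo. It assumes a hypothetical in-memory layout: a nested dict with `name` and `children` keys for the tree, and a separate dict mapping node names (including internal nodes with reconstructed sequences) to sequences.

```python
from typing import Dict, Iterator, Tuple


def parent_child_pairs(
    tree: dict, sequences: Dict[str, str]
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (parent_name, child_name, parent_seq, child_seq) for each branch.

    Branches where either endpoint has no sequence are skipped.
    """
    stack = [tree]  # iterative traversal avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        for child in node.get("children", []):
            if node["name"] in sequences and child["name"] in sequences:
                yield (
                    node["name"],
                    child["name"],
                    sequences[node["name"]],
                    sequences[child["name"]],
                )
            stack.append(child)


# Toy example with made-up node names and sequences.
toy_tree = {
    "name": "root",
    "children": [
        {"name": "n1", "children": [{"name": "leafA"}, {"name": "leafB"}]},
        {"name": "leafC"},
    ],
}
toy_seqs = {"root": "ACGT", "n1": "ACGA", "leafA": "ACGA", "leafB": "TCGA", "leafC": "ACGT"}
pairs = list(parent_child_pairs(toy_tree, toy_seqs))
```

Each tuple in `pairs` is one step of an evolutionary trajectory. Chaining pairs from the root to a tip would give a full path.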
Model training
- diffusion-language-model — Discrete diffusion language model fine-tuning for evolutionary sequence generation. Fine-tunes discrete diffusion models (based on dFactory/LLaDA) on evolutionary trajectory data produced by bac120, odb, and rdrp.
Evaluation
- pegasus-evals — Evaluation toolkit for evolutionary genomics models. Provides metrics (Hamming, edit distance, identity, alignment score, BLOSUM62, PAM250), synthetic trajectory generators for test datasets, and an inference runner for trained models.
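For readers unfamiliar with the simpler metrics named above, the following sketch shows textbook definitions of Hamming distance, identity, and edit distance. These are illustrative functions with hypothetical names, not the pegasus-evals API, and identity is assumed here to mean the fraction of matching positions between two equal-length aligned sequences.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes pre-aligned, non-empty input)."""
    if not a:
        raise ValueError("identity is undefined for empty sequences")
    return 1.0 - hamming(a, b) / len(a)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion from a
                    curr[j - 1] + 1,         # insertion into a
                    prev[j - 1] + (x != y),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]
```

Hamming distance and identity only apply to sequences of equal length, while edit distance also handles insertions and deletions, which is why an evaluation toolkit would typically offer both.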