Evolutionary Diffusion

This project uses variational autoencoders and conditional flow matching to treat evolution as vectors through latent space. This gives a whole-genome mutation/selection process from which we can learn the tree of life.

See Initial Directions for the original project scoping discussion, Technical Overview for a comprehensive description of the training data pipeline, model architecture, and evaluation, and Meeting Notes for a chronological summary of project meetings.

Goals

  • Model historical evolution as paths through latent space
  • Train flow/diffusion models to recapitulate and project evolutionary trajectories
  • Enable out-of-sample forward and reverse evolution prediction
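
As a rough illustration of the first two goals, the sketch below shows one common way to set up conditional flow matching on latent vectors: a straight-line path from a parent latent to a child latent, with the constant difference between them as the regression target. The linear path, the function names, and the Euler integrator are assumptions made for illustration and are not taken from the project's code. The latents would come from an encoder that is not shown here.

```python
import numpy as np


def cfm_training_example(z_parent, z_child, t):
    """Return the point on the straight parent->child path at time t
    and the velocity a flow model would be trained to predict there.

    Assumes a linear interpolation path (one common choice, not
    necessarily the one used in this project).
    """
    z_parent = np.asarray(z_parent, dtype=float)
    z_child = np.asarray(z_child, dtype=float)
    z_t = (1.0 - t) * z_parent + t * z_child  # position along the path
    v_target = z_child - z_parent             # constant velocity of that path
    return z_t, v_target


def euler_project(z_start, velocity_fn, n_steps=10, reverse=False):
    """Integrate a velocity field with fixed Euler steps.

    Forward integration runs t from 0 to 1; reverse=True runs t from 1
    back to 0, moving against the velocity field.
    """
    z = np.asarray(z_start, dtype=float).copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        if reverse:
            z = z - dt * velocity_fn(z, 1.0 - i * dt)
        else:
            z = z + dt * velocity_fn(z, i * dt)
    return z
```

In this framing, a trained model would stand in for `velocity_fn`; forward integration corresponds to projecting a lineage forward, and reverse integration to stepping back toward an ancestor.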

Repos

Training data

  • trajectories (currently in blab, will move to pegasus-research) — Provisions evolutionary sequence trajectories from Nextstrain trees. Extracts parent-child sequence pairs from augur phylogenetic reconstructions for use as diffusion model training data.
  • bac120 — GTDB bac120 bacterial marker gene extraction and phylogenetic tree building. Extracts 120 single-copy marker genes from bacterial genomes across phyla (Cyanobacteriota, Bacteroidota, Actinomycetota, Bacillota, Pseudomonadota), builds per-gene phylogenetic trees, and generates trajectory data for ML training. Output to S3 at s3://pegasus-training-data/trajectories/.
  • odb — OrthoDB marker gene phylogenetics for eukaryotes. Extracts single-copy marker genes from eukaryotic genomes using compleasm, currently targeting fungi (666 genomes, 1122 markers). Builds independent phylogenetic trees per marker via the augur pipeline.
  • rdrp — Viral RdRp extraction and phylogenetic tree building. Extracts RNA-dependent RNA polymerase sequences from viral genomes across Paramyxoviridae, Flaviviridae, and Picornaviridae. Builds cross-family phylogenetic trees targeting the conserved RdRp catalytic domain.
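
The parent-child pair extraction described for trajectories can be pictured with a minimal sketch. It assumes a hypothetical in-memory layout: a nested dict tree with `name` and `children` keys, plus a separate dict mapping every node name (including internal, reconstructed nodes) to its sequence. The actual file formats and function names used by the repos may differ.

```python
def parent_child_pairs(tree, sequences):
    """Yield (parent_name, child_name, parent_seq, child_seq) for each edge.

    `tree` is a nested dict with "name" and optional "children" keys;
    `sequences` maps node names to sequences. Both layouts are assumed
    for illustration only.
    """
    stack = [tree]
    while stack:
        node = stack.pop()
        for child in node.get("children", []):
            yield (
                node["name"],
                child["name"],
                sequences[node["name"]],
                sequences[child["name"]],
            )
            stack.append(child)  # visit the child's own edges later


# Tiny example tree with two internal nodes and two tips.
example_tree = {
    "name": "root",
    "children": [
        {"name": "node_1", "children": [{"name": "tip_A"}]},
        {"name": "tip_B"},
    ],
}
example_sequences = {
    "root": "ACGT",
    "node_1": "ACGA",
    "tip_A": "ACCA",
    "tip_B": "TCGT",
}
pairs = list(parent_child_pairs(example_tree, example_sequences))
```

Each edge of the tree yields one training pair, so a tree with N nodes contributes N - 1 pairs.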

Model training

  • diffusion-language-model — Discrete diffusion language model fine-tuning for evolutionary sequence generation. Fine-tunes discrete diffusion models (based on dFactory/LLaDA) on evolutionary trajectory data produced by bac120, odb, and rdrp.

Evaluation

  • pegasus-evals — Evaluation toolkit for evolutionary genomics models. Provides metrics (Hamming, edit distance, identity, alignment score, BLOSUM62, PAM250), synthetic trajectory generators for test datasets, and an inference runner for trained models.
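
For orientation, three of the listed metrics (Hamming distance, identity, and edit distance) can be written in a few lines. This is a minimal sketch with hypothetical function names; it is not the pegasus-evals API, and the substitution-matrix scores (BLOSUM62, PAM250) are not covered.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length (aligned) sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def identity(a, b):
    """Fraction of positions that match between two aligned sequences."""
    if len(a) == 0:
        raise ValueError("identity is undefined for empty sequences")
    return 1.0 - hamming(a, b) / len(a)


def edit_distance(a, b):
    """Levenshtein distance: minimum number of substitutions, insertions,
    and deletions needed to turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Hamming distance and identity assume the sequences are already aligned, while edit distance also handles sequences of different lengths.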