Initial Research Directions¶
Trevor Bedford — 2025-12-15, decision 2025-12-22
Decision¶
We will dive directly into Direction 2 on evolutionary diffusion, based on the following factors:
- The simple fact that I'm inherently most excited about this research direction
- I was most worried about being able to build something without too much architecture R&D, but Zehui's comments about video generation models make me hopeful that we can try to directly lift this approach
- I'm obliquely worried about getting scooped on the metagenomic SAE project as there are now multiple papers training language models on metagenomic data and there are multiple papers using SAEs on top of protein and DNA language models
- Relatedly, the novelty of the evolutionary diffusion model is clearer
- A model that projects evolution forward and backwards in time (perhaps guided by environmental factors) directly aligns with potential applications and forms a nice foundation for future work
- It seems that Rayan and I would have plenty to do in terms of provisioning training data
Overview¶
My central theme is one of forecasting / prediction of evolutionary and ecological systems. I'm assuming that the models produced by Pegasus would be able to:
- Project how genomes will change forward in time if they continue to follow current selective pressures or if they are steered by new pressures (change in temperature; change in host immune landscape)
- Project genomes backwards in time to reconstruct common ancestors
- Project how assemblages of genomes will change if they continue to follow current trends
- Project stable assemblages of genomes given specified abiotic environment
Direction 1¶
Sparse Autoencoders for Metagenomic Interpretability, discussed 2025-12-17¶
For background, please see the slides from the July 26 lab group talk on "Interpreting the environmental virome at scale" and the similarly titled grant proposal from July 13.
This project aims to identify viral dark matter in metagenomic samples. It would use the existing and further developing Lungfish wastewater dataset produced by colleagues Dave O'Connor, Shelby O'Connor and Marc Johnson.
Concretely, I believe this would progress as:
- Fine tune Evo2 on viral data
  - Assemble viral genomic dataset
  - Fine tune Evo2 7B parameter model
    - Or use a different genome language model that has a good grasp of viral genomes
  - Save weights locally for further use (but don't share weights publicly because of dual use concerns)
  - Rayan: summing all virus genomes in GenBank gives gigabases of data
  - Zehui: alternatively, pre-train (or fine tune) a transformer model directly on metagenomic data
  - Zehui: the Evo2 codebase is difficult to work with; other frameworks would be easier. We also want supervised fine-tuning, reinforcement learning for improving consistency of the model, and RLHF to align the model distribution to handle human questions better; basing the codebase on Evo2 makes post-training difficult. The Allen Institute has a post-training codebase. The role of pretraining for human language is to memorize language; the role of post-training for a human language model is to improve its capability on certain tasks (including Q&A), but it can also be to generate more context, perhaps a latent internal understanding
  - Models that emit tokens besides ATGC or amino acids: ChatNT directly outputs labels and text; OmniDNA directly generates non-genomic tokens, i.e. meaningful and grammatical text for functional description of DNA sequences
  - General direction of importing techniques from the CS literature
- Harvest activations from Lungfish data
  - Sample a fraction of the Lungfish dataset and house it locally
  - Probably assemble contigs
  - Pipe contigs through the fine-tuned Evo2 to assemble ~1B token embeddings

- Train sparse autoencoder (SAE) on token embeddings
  - The SAE shouldn't need anything lower down in the transformer stack
  - Explicitly include taxonomic and functional prediction heads alongside the SAE reconstruction loss
  - Zehui: generally you can add regularization to an SAE or VAE; this could work, but might not get optimal prediction accuracy alongside the reconstruction loss. In stage 1, perform normal SAE reconstruction training; in stage 2, include the additional heads
  - Zehui: if the main goal is classification, why are we using an SAE?
  - Zehui: the SAE is not part of the Evo 2 codebase

- Analyze resulting SAE features
  - Rayan: separate the issue of feature interpretability from the issue of contigs that don't blast (or only have remote hits to a database)
  - Zehui: what is the ultimate goal?
  - Trevor: project features through space and time
  - Zehui: exactly how would we map features to biological entities? We could do class mapping in various fashions
  - Rayan: there are many tools that take a DNA sequence and tell you what it is; we could embark on a journey to develop a better tool that takes a metagenome and tells you the closest taxonomic entity
  - Zehui: vector database to do vector matching / dot product
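Zehui's two-stage suggestion could be sketched roughly as follows. This is a minimal illustration under loud assumptions: the layer sizes, the ReLU + L1 sparsity scheme, and the single taxonomic head are all placeholders, not a settled design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAEWithHeads(nn.Module):
    """Sparse autoencoder over token embeddings, with an optional
    taxonomic prediction head (all sizes hypothetical)."""

    def __init__(self, d_model=512, d_hidden=4096, n_taxa=100, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.tax_head = nn.Linear(d_hidden, n_taxa)  # stage-2 auxiliary head
        self.l1_coeff = l1_coeff

    def forward(self, x):
        feats = F.relu(self.encoder(x))  # non-negative sparse features
        recon = self.decoder(feats)
        return feats, recon

    def loss(self, x, tax_labels=None):
        feats, recon = self(x)
        # Reconstruction loss plus L1 sparsity penalty on features
        loss = F.mse_loss(recon, x) + self.l1_coeff * feats.abs().mean()
        if tax_labels is not None:  # stage 2: add the prediction head
            loss = loss + F.cross_entropy(self.tax_head(feats), tax_labels)
        return loss

torch.manual_seed(0)
sae = SAEWithHeads()
x = torch.randn(8, 512)  # stand-in batch of token embeddings

# Stage 1: reconstruction + sparsity only
stage1 = sae.loss(x)

# Stage 2: include the taxonomic head alongside reconstruction
labels = torch.randint(0, 100, (8,))
stage2 = sae.loss(x, tax_labels=labels)
```

The point of the staging is that the reconstruction objective is learned cleanly first, and the heads only shape the feature space afterwards.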

The challenge here will really be in how to actually turn an SAE feature into biological insight. We could easily have a situation where we have a pile of contigs that all light up a particular feature and none of the contigs blast, or if they blast it might not be clear what's actually shared. Or if there's a novel taxonomic group it won't be included in the SAE training and we could miss it. However I'd hope that we could semi-manually identify some interesting features to highlight and then we could ship this as a general resource. I'd hope we could just use the amazing Neuronpedia for visualization / exploration. They say they allow custom upload of SAEs.
Pros¶
- Shovel-ready, there will be decisions that need to be made and technical challenges, but the full path here seems pretty clear
- Could be an immediate flag for org credibility
- Would be nice to promote Lungfish work and we could write a paper with Dave, Shelby and Marc
- With current team, I believe we could all immediately get to work
Cons¶
- This seems like a one-off project; I don't know what we'd build on top of it
- Consequently, I don't know whether this effort would be better devoted to longer-term aims
General notes:
- Zehui: 6 months is a lot of time for the simple approach; we could have something more meaningful instead. Virus-specific training would be very good work by itself. Small models will be fast: 16 A100 GPUs could train a 1B token model in 1 week
- Rayan: what would we apply the inference on? We could pre-blast everything that we'd run inference on; a bit worried about a fishing expedition [no no, that's fine, it's the fun part of science]
- Zehui: the core technical issue is to use the SAE so that patterns in metagenomes emerge by themselves; we want to do something very smart so that embeddings (or SAE features) reveal patterns. Focus on the core technical challenge: we need some automatic method to describe patterns
Direction 2¶
Evolutionary Diffusion Foundation Model, discussed 2025-12-19¶
For background, please see slides from my Oct 28 lab presentation, and please see my github.com/blab/cov-diffusion repo.
The idea here is that we can treat historical evolution as paths through latent space. Each point in latent space corresponds to a gene (or genome) sequence and we can map between genome sequence x and latent point z with a variational autoencoder (VAE). Here's a simple example from SARS-CoV-2.

I currently have this as a very simple linear autoencoder, but Zehui has a nicer architecture in DiscDiff. We'll need a more general purpose VAE that can map different-length genomes into a shared latent space. This may use transformer model embeddings or a convolutional model. I generally consider this VAE problem to be separable from the diffusion problem.
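As a purely illustrative sketch of why the convolutional route handles varying lengths: a fully convolutional encoder/decoder maps one-hot DNA to a latent whose length scales with input length. Channel sizes, kernels, and strides here are hypothetical, not a proposed architecture.

```python
import torch
import torch.nn as nn

class ConvSeqVAE(nn.Module):
    """Minimal convolutional VAE over one-hot DNA (A,C,G,T = 4 channels).
    Fully convolutional, so different input lengths produce latents of
    proportional length (stride 4 means a 4x downsampling)."""

    def __init__(self, d_latent=16):
        super().__init__()
        # Encoder emits 2*d_latent channels: mean and log-variance
        self.enc = nn.Conv1d(4, 2 * d_latent, kernel_size=9, stride=4, padding=4)
        self.dec = nn.ConvTranspose1d(d_latent, 4, kernel_size=8, stride=4, padding=2)

    def encode(self, x):                 # x: [B, 4, L]
        mu, logvar = self.enc(x).chunk(2, dim=1)
        return mu, logvar                # each [B, d_latent, L/4]

    def decode(self, z):
        return self.dec(z)               # logits over A,C,G,T, back to length L

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decode(z), mu, logvar

torch.manual_seed(0)
vae = ConvSeqVAE()
x_short = torch.randn(2, 4, 400)         # stand-ins for one-hot genomes
x_long = torch.randn(2, 4, 1600)
z_short, _ = vae.encode(x_short)         # latent length 100
z_long, _ = vae.encode(x_long)           # latent length 400
```

Note the remaining problem Zehui raises below: latents of different lengths are fine for reconstruction, but a unified fixed-length latent (as my linear autoencoder produces) is harder.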
We can construct historical trajectories in latent space and then train a flow model or diffusion model to recapitulate evolutionary trajectories. Here I've used a simple vector field computed from parent/child node pairs.
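The parent/child vector field is numerically very simple; this toy numpy version uses synthetic stand-in latent coordinates (the real z values would come from the VAE) and estimates the field at a query point as the mean displacement of nearby branches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D latent coordinates for 500 parent/child node pairs;
# children drift slightly in the +x direction plus noise
z_parent = rng.normal(size=(500, 2))
z_child = z_parent + 0.1 * rng.normal(size=(500, 2)) + np.array([0.05, 0.0])

displacements = z_child - z_parent       # one "velocity" sample per branch

def field_at(z, k=20):
    """Crude empirical vector field: average displacement over the
    k parent points nearest to query point z."""
    dists = np.linalg.norm(z_parent - z, axis=1)
    nearest = np.argsort(dists)[:k]
    return displacements[nearest].mean(axis=0)

v = field_at(np.zeros(2))                # field at the latent origin
```

A learned flow or diffusion model replaces this nearest-neighbor averaging with a network that interpolates and extrapolates the field smoothly.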

We have an unobserved process issue that can be approached in a couple ways:
- Reconstruct ancestors using traditional phylogenetic approaches, this would give a series of sequences that mutate from one to another, but you won't catch all mutations and the reconstruction won't be fully accurate (this is what I've done in github.com/blab/cov-diffusion)
- Somehow train directly on the partially observed tip sequences. In this case the model would produce alignments rather than individual genomes.
I would generally look to video generation models as an analogy here. Here we have observations of individual frames (genomes) and a sequence of frames we'd like to model and project forwards. However, my understanding is that modern video generation models treat the entire video [T, H, W, C] as a latent object, with a VAE to convert between pixel and latent space, and then denoise to identify regions of latent space corresponding to consistent videos. I'm not sure how much of this to directly borrow, but I'm assuming knowledge of approaches to video generation would be helpful.
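To make the video analogy concrete: an aligned trajectory of T genomes of length L becomes a single [T, L, C] tensor, and the forward diffusion process noises that whole object jointly rather than frame by frame. This toy sketch uses a simple linear alpha-bar schedule and made-up shapes, just to fix the tensor picture.

```python
import numpy as np

rng = np.random.default_rng(1)

T, L, C = 16, 120, 4  # time points, alignment length, A/C/G/T channels
# One-hot "trajectory of genomes" as a single [T, L, C] object
traj = np.eye(C)[rng.integers(0, C, size=(T, L))]

def noise(x0, t, n_steps=100):
    """Forward diffusion step under a linear alpha-bar schedule.
    The entire trajectory tensor is noised at once, so temporal
    structure is part of the object being modeled."""
    alpha_bar = 1.0 - t / n_steps
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x_mid = noise(traj, t=50)   # half-noised trajectory
x_end = noise(traj, t=99)   # almost pure noise
```

Generation then runs the reverse: start from noise shaped [T, L, C] and denoise to a full, temporally consistent trajectory, exactly as Zehui describes for video below.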
I'd imagine that something like all mitochondrial genomes across all Eukaryotes (n = 33k) would be a good initial dataset beyond the fairly simple example I have here with SARS-CoV-2 spike protein. Or we could use some standard taxonomic genes like cytochrome b. But eventually I'd position this as modeling the entire tree of life.
Concretely, I believe this would progress as:
- Identify an appropriate VAE that works for (longish) DNA sequences of varying lengths
- Identify an appropriate flow model to transport gene sequences through latent space
- Gather a few training datasets of different scopes
- Train model
- Central task becomes out-of-sample forward or reverse evolution
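For the flow-model step in the plan above, plain conditional flow matching between ancestor and descendant latents is one possible starting point. Everything in this sketch is a placeholder: random stand-in latents instead of VAE embeddings, and a tiny MLP instead of a real architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in latents: z0 = ancestor embeddings, z1 = descendant embeddings
z0 = torch.randn(256, 8)
z1 = z0 + 0.2 * torch.randn(256, 8)

# Tiny velocity network v(z, t); input is latent (8) concatenated with t (1)
net = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def fm_step():
    """One conditional flow-matching step: sample t, interpolate z_t
    along the pair, regress the network onto the straight-line
    velocity z1 - z0."""
    t = torch.rand(z0.shape[0], 1)
    zt = (1 - t) * z0 + t * z1
    target = z1 - z0
    pred = net(torch.cat([zt, t], dim=1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

losses = [fm_step() for _ in range(200)]
```

At inference, integrating dz/dt = v(z, t) forwards gives out-of-sample forward evolution; integrating backwards gives ancestral reconstruction, which is exactly the central task above.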
Pros¶
- Exactly aligned with larger remit
- I can see a variety of useful applications of this sort of model including pathogen forecasting, but also in synthetic biology
- Shipping a v1 that's a more expansive version of what I've already done seems very approachable, but I don't know what it would really take to make a convincing foundation model
Cons¶
- Not as shovel-ready; there's a lot of R&D here
- In particular, I'm not sure what Rayan would want to be doing immediately
Notes:
- Zehui: in the general video case, you start from a full video of noise and denoise to get a full video, rather than going frame-by-frame; we could do this the same as video generation, treating the data as temporal [T × sequence length]
- Zehui: diffusion translates from a Gaussian distribution to the data distribution; in Trevor's case, we want a model that converts from one distribution to another distribution
- Zehui: how to encode sequences of different lengths? A convolutional VAE can generalize: it can take an image much larger than the training data, and the tokenizer works fine because it learns local relationships. Different lengths of input will have different lengths of embedding; options include padding or repeating the last frame. A unified VAE converting to the same length of embedding is challenging
- Zehui: usually called 3D VAE
- Trevor: what would the challenge be in architecture?
- Zehui: we would need to adapt an existing video tokenizer, get a high reconstruction rate for a collection of sequences, then train the diffusion model; data preparation would be very important
- Rayan: what's the data scale?
- Zehui: need millions of clips, have 500k RdRps clustered at 90% identity, 1–10M individual examples
- Zehui: time dependency (i.e. epistasis); train a diffusion model and we can generate
- Rayan: SARS-CoV-2 is one end of the data (lots of sequences, not diverse), RdRp is the other end of the data (lots of sequences, hugely diverse), maybe something like hemagglutinin that's in between would be useful
- Zehui: mapping to time
- Rayan: how much do we worry about dual use and publishing?
General discussion¶
- Zehui: we have lots of resources and should aim bigger. If the company is going to grow and solve problems in genomics, we want strong platforms for data generation, training, etc., and a project that will produce systems that can be re-used later. Direction 2 could take 6 months if we have sufficient people; direction 1 could take 3 months
- Rayan: not going to try to say whether one or two is better, but what sort of press release could we write? Direction 1: Pegasus has discovered new species in the ocean. Direction 2: we've built a time machine
- Zehui: Evo 2 might be restrictive for future work, given
Future work¶
Metagenomic Diffusion Foundation Model¶
I believe solving this problem for temporal evolution of individual genomes would be the necessary first step to tackling forecasting / steering of metagenomic data. I'm imagining a similar approach for metagenomics, but where each metagenomic sample would be embedded into a latent space. You then have spatiotemporal trajectories through this latent space to train on.
I worry about data complexity relative to data quantity, both for the VAE and for the diffusion model, but hopefully this would still at least be approachable.