Metagenomic Interpretability

Harvest activations from genome language models such as Evo 2, run on metagenomic reads, and use them to train a token-level sparse autoencoder (SAE). The SAE is trained to reconstruct token embeddings, with auxiliary heads that predict taxonomy (NCBI taxonomy labels) and function (Pfam domains). The aim is a window into metagenomic dark matter.

See Initial Directions for the full project scoping discussion.

Goals

  • Fine-tune or pre-train a genome language model on viral/metagenomic data
  • Harvest model activations on Lungfish wastewater data
  • Train a sparse autoencoder on token embeddings with taxonomic and functional heads
  • Identify and characterize novel biological features in metagenomic data
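
To make the SAE goal concrete, here is a minimal NumPy sketch of one possible forward pass and combined loss. It is an illustration under stated assumptions, not the project's implementation: every name (`init_params`, `forward`, `sae_loss`, the weight keys, the loss weights) is hypothetical, taxonomy is assumed to be one label per token (softmax), and Pfam is assumed to be multi-label per token (sigmoid). Random arrays would stand in for real harvested activations.

```python
import numpy as np

def init_params(rng, d_model, d_latent, n_taxa, n_pfam, scale=0.1):
    """Hypothetical parameter layout: encoder, decoder, and two linear heads."""
    return {
        "W_enc": rng.normal(0.0, scale, (d_model, d_latent)),
        "b_enc": np.zeros(d_latent),
        "W_dec": rng.normal(0.0, scale, (d_latent, d_model)),
        "b_dec": np.zeros(d_model),
        "W_tax": rng.normal(0.0, scale, (d_latent, n_taxa)),  # taxonomic head
        "W_fn": rng.normal(0.0, scale, (d_latent, n_pfam)),   # functional head
    }

def forward(params, x):
    """x: (n_tokens, d_model) token embeddings harvested from the language model."""
    # ReLU encoder gives non-negative latent features.
    z = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    x_hat = z @ params["W_dec"] + params["b_dec"]  # reconstruction
    tax_logits = z @ params["W_tax"]               # one taxon per token (assumed)
    fn_logits = z @ params["W_fn"]                 # multi-label Pfam (assumed)
    return z, x_hat, tax_logits, fn_logits

def sae_loss(params, x, tax_labels, pfam_labels, l1=1e-3, w_tax=1.0, w_fn=1.0):
    """Reconstruction + L1 sparsity + taxonomic CE + functional BCE."""
    z, x_hat, tax_logits, fn_logits = forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    # Numerically stable softmax cross-entropy over taxa.
    shifted = tax_logits - tax_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    tax_ce = -np.mean(log_probs[np.arange(x.shape[0]), tax_labels])
    # Numerically stable sigmoid binary cross-entropy over Pfam domains.
    fn_bce = np.mean(
        np.maximum(fn_logits, 0.0)
        - fn_logits * pfam_labels
        + np.log1p(np.exp(-np.abs(fn_logits)))
    )
    total = recon + l1 * sparsity + w_tax * tax_ce + w_fn * fn_bce
    return {"total": total, "recon": recon, "sparsity": sparsity,
            "tax_ce": tax_ce, "fn_bce": fn_bce}
```

In this sketch the heads read from the sparse latent `z` rather than from the raw embedding, so the supervised terms push individual SAE features toward taxonomically and functionally meaningful directions. Whether the heads should be weighted equally, or attached at all during early training, is a design choice left open here.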

Repos