Metagenomic Interpretability
Harvest activations from a genome language model such as Evo 2, run on metagenomic reads, and use them to train a token-level sparse autoencoder (SAE). The SAE is trained to reconstruct the embeddings while two auxiliary heads predict taxonomy (labels from NCBI Taxonomy) and function (labels from Pfam domains). The resulting features give a window into metagenomic dark matter: sequences that existing reference databases leave unannotated.
See Initial Directions for the full project scoping discussion.
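As a rough illustration of the training objective described above, the sketch below computes four loss terms for a single token embedding: reconstruction error, a sparsity penalty, a taxonomic cross-entropy, and a multi-label Pfam loss. The ReLU encoder, the L1 penalty, the layer sizes, and every name (`sae_losses`, `init_params`, `W_tax`, `W_pfam`, and so on) are assumptions made for illustration, not the project's implementation. Plain Python is used so the sketch runs without dependencies; a real training loop would use a tensor library.

```python
import math
import random

def matvec(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_params(d_model, d_latent, n_taxa, n_pfam, scale=0.1, seed=0):
    """Gaussian weights and zero biases (hypothetical shapes and names)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_enc": mat(d_latent, d_model), "b_enc": [0.0] * d_latent,
        "W_dec": mat(d_model, d_latent), "b_dec": [0.0] * d_model,
        "W_tax": mat(n_taxa, d_latent), "b_tax": [0.0] * n_taxa,
        "W_pfam": mat(n_pfam, d_latent), "b_pfam": [0.0] * n_pfam,
    }

def sae_losses(x, params, taxon_id, pfam_targets, l1_coeff=1e-3):
    """Loss terms for one token embedding x."""
    # Encoder: ReLU keeps the latent code non-negative.
    z = [max(0.0, v) for v in matvec(params["W_enc"], x, params["b_enc"])]
    # Decoder: reconstruct the embedding from the latent code.
    x_hat = matvec(params["W_dec"], z, params["b_dec"])
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # L1 penalty on the (non-negative) code encourages sparsity.
    sparsity = l1_coeff * sum(z)
    # Taxonomic head: softmax cross-entropy over taxon classes.
    logits = matvec(params["W_tax"], z, params["b_tax"])
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    taxonomy = log_norm - logits[taxon_id]
    # Functional head: one sigmoid per Pfam domain, since a token can
    # fall in several domains or in none (numerically stable BCE).
    function = 0.0
    pfam_logits = matvec(params["W_pfam"], z, params["b_pfam"])
    for logit, target in zip(pfam_logits, pfam_targets):
        function += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    function /= len(pfam_targets)
    return {"recon": recon, "sparsity": sparsity,
            "taxonomy": taxonomy, "function": function}

# Usage: one 4-dimensional embedding, 5 taxa, 3 Pfam domains.
params = init_params(d_model=4, d_latent=8, n_taxa=5, n_pfam=3)
losses = sae_losses([1.0, -1.0, 2.0, 0.0], params, taxon_id=2, pfam_targets=[1, 0, 0])
total = sum(losses.values())  # equal weighting is itself an assumption
```

How the four terms are weighted against each other is a design choice this sketch leaves open; weighting the heads too heavily could pull the features toward known labels and away from the unannotated signal the project is after.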
Goals
- Fine-tune or pre-train a genome language model on viral/metagenomic data
- Harvest model activations on Lungfish wastewater data
- Train a sparse autoencoder on token embeddings, with taxonomic and functional heads
- Identify and characterize novel biological features in metagenomic data
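The last goal does not say how candidate novel features would be surfaced. One simple heuristic, offered here as an assumption rather than the project's method, is to rank SAE features by the fraction of their active tokens that carry no Pfam annotation. The function name `rank_unannotated_features`, its inputs, and the `min_active` threshold are all hypothetical.

```python
def rank_unannotated_features(activations, has_pfam_hit, min_active=5):
    """Rank SAE features by how often they fire on unannotated tokens.

    activations:  list of per-token lists of feature activations
    has_pfam_hit: per-token booleans, True if the token lies in a Pfam domain
    Returns (unannotated_fraction, feature_index, n_active) tuples,
    highest fraction first.
    """
    n_features = len(activations[0])
    scores = []
    for j in range(n_features):
        # Tokens on which feature j is active at all.
        active = [i for i, row in enumerate(activations) if row[j] > 0.0]
        if len(active) < min_active:
            continue  # too few tokens to say anything about this feature
        unannotated = sum(1 for i in active if not has_pfam_hit[i])
        scores.append((unannotated / len(active), j, len(active)))
    return sorted(scores, reverse=True)

# Usage: feature 0 fires only inside Pfam domains, feature 1 only outside.
acts = [[0.9, 0.0], [0.4, 0.0], [0.7, 0.0], [0.0, 0.3], [0.0, 0.8], [0.0, 0.5]]
hits = [True, True, True, False, False, False]
ranking = rank_unannotated_features(acts, hits, min_active=3)
```

A high fraction only flags a feature for closer inspection: it may track low-complexity or noisy sequence rather than unknown biology, so top-activating reads would still need manual characterization.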