Notes on Compute

Trevor Bedford — 2025-12-17

As with other AI applications, the main resources besides talent will be compute and storage. We need storage for large training corpora; even if data is largely sourced from places like GenBank or Lungfish, we still need to (at least temporarily) house it alongside heavy compute. And we require a sizable cluster of modern GPUs to do the large matrix computations inherent to deep learning.

I believe we can get a good sense of the broad requirements by assuming we'd target a model as complex as the current state-of-the-art Evo 2 7B parameter model developed by Brian Hie and colleagues at the Arc Institute, alongside estimates of what working with Lungfish data would look like. As a further comparison point, I looked at METAGENE-1 by Oliver Liu and colleagues, a similarly sized 7B parameter model with a shorter context window.

Storage

Storage for Evo 2. The Evo 2 7B parameter model is trained on ~9.3 trillion DNA base pairs, which at roughly one byte per base requires ~10TB of storage uncompressed. This is modest compared to typical storage available in a modern compute cluster.

Storage for METAGENE-1. This model was trained on ~370 billion tokens, an order of magnitude smaller corpus than Evo 2's, so storage requirements are correspondingly modest.

Storage for Lungfish. The raw FASTQ datasets from environmental metagenomics are huge, with ~1B reads per site per week. Each sample is roughly ~60 GB gz compressed. Assuming 75 sites, 1 year of data is expected to be roughly 75 × 52 × 60 GB = 234 TB compressed. However, for training we'd certainly subsample a fraction of this data and train on that.
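The Lungfish storage arithmetic above can be sketched directly (assumptions taken from the text: 75 sites, 52 weeks, ~60 GB compressed per site-week):

```python
# Back-of-envelope Lungfish storage estimate.
# Assumptions (from the text): 75 sites, 52 weeks/year,
# ~60 GB gzip-compressed FASTQ per site per week.
sites = 75
weeks_per_year = 52
gb_per_sample = 60

total_gb = sites * weeks_per_year * gb_per_sample
total_tb = total_gb / 1000
print(f"~{total_tb:.0f} TB compressed per year")  # ~234 TB
```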

General conclusion. Given the above, I'd ballpark storage requirements of ~100TB of fast storage. This is roughly $30k per year from cloud providers, but we probably don't need this continually available and so yearly cost should be south of this.
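The ~$30k/year figure checks out against typical cloud pricing; a minimal sketch, assuming ~$0.025 per GB-month (a representative rate for general-purpose SSD-class cloud storage, not a quote from any specific provider):

```python
# Rough cloud storage cost check.
# Assumption: ~$0.025 per GB-month for fast storage (illustrative rate).
tb = 100
price_per_gb_month = 0.025

yearly_cost = tb * 1000 * price_per_gb_month * 12
print(f"~${yearly_cost:,.0f}/year for {tb} TB")  # ~$30,000/year
```

If the storage only needs to be provisioned around training runs, actual spend would scale down proportionally.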

Compute

Compute for Evo 2. Training a 7B parameter model is non-trivial, with the expectation of ~256 H100 GPUs running for ~30 days. The original team used NVIDIA DGX Cloud for this application (presumably comped by NVIDIA). At current rates this works out to ~$920k ($5 per hour × 256 GPUs × 720 hours). This is a rough ballpark, but planning for $1M to $1.5M for a model of this scope seems broadly reasonable.
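The Evo 2 cost estimate above is just rate × GPUs × hours; as a sketch, using the figures from the text:

```python
# Cloud training cost for an Evo 2-scale run.
# Figures from the text: $5/GPU-hour, 256 H100s, ~30 days (~720 hours).
rate_per_gpu_hour = 5
gpus = 256
hours = 720

gpu_hours = gpus * hours
cost = rate_per_gpu_hour * gpu_hours
print(f"{gpu_hours:,} GPU-hours, ~${cost / 1e6:.2f}M")  # 184,320 GPU-hours, ~$0.92M
```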

Compute for METAGENE-1. This used 32 H100 GPUs to train a 7B parameter model. Based on the reported MFU and token count, I'd expect roughly 1 month of compute on 32 H100 GPUs. I believe the difference between Evo 2 and METAGENE-1 is that METAGENE-1 uses a shorter 512 token context and trains on a smaller corpus. One month of 32 H100 GPUs is expected to cost ~$110k in cloud compute ($5 per hour × 32 GPUs × 720 hours).

Compute for Lungfish. The Sparse Autoencoder project that I gamed out ballparked 1B tokens to train on. Generating these tokens requires ~400 GPU hours and training the sparse autoencoder requires ~650 GPU hours. This would total ~$5k ($5 per hour × 1,050 hours).

General conclusion. Given the above, I'd ballpark training of a large model at ~200k GPU hours, or $1M to $2M, which might be best expressed as a yearly cost. But spending will likely be lumpy, with large training runs as uncommon occurrences.

Probably budget something like $1M a year in compute out of $10M per year in spend. These numbers largely match what I learned from Claude/ChatGPT about general expectations: foundation model companies spend 20-40% of burn rate on compute, and Hugging Face (2022, ~30 people) spent ~$1M/year on compute.

Recommendation

Recommend onsite 8xH100 server + supplement with cloud when necessary. It's been extremely helpful having our current 1xH100 server, which I purchased for ~$45k in November 2024 and which is administered by Fred Hutch Scientific Computing. It doesn't require spinning up anything in the cloud, and our data sits on the server. This gives very low overhead for running computational experiments and has helped my lab get stuff done. This local / low-overhead approach was also recommended by Jeremy Cowles, John McDonald and Chris Boyd in our August meeting.

A server with 8x H100 GPUs and 100 TB of storage would cost ~$350k and would give an always-available 5,840 GPU hours/month. This dedicated server would then be supplemented with cloud compute as needed when training big models. That is, it would take ~2.6 years to train Evo 2 on this server alone, but it would exist to run experiments and train smaller models.
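The on-prem capacity and the 2.6-year figure follow from the same arithmetic as above (assuming ~730 hours in a month and the ~184k GPU-hour Evo 2 run from the Compute section):

```python
# Dedicated 8xH100 server: always-available GPU-hours per month,
# and how long an Evo 2-scale run would take on it alone.
# Assumptions: ~730 hours/month; Evo 2 run = 256 GPUs x 720 hours.
gpus = 8
hours_per_month = 730

monthly_gpu_hours = gpus * hours_per_month       # 5,840 GPU-hours/month
evo2_gpu_hours = 256 * 720                       # ~184,320 GPU-hours
years_to_train = evo2_gpu_hours / monthly_gpu_hours / 12
print(f"{monthly_gpu_hours:,} GPU-hours/month; ~{years_to_train:.1f} years for Evo 2")
```

This is why the recommendation is to burst to the cloud for full-scale pretraining while keeping the server for everything else.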

This 8xH100 server comfortably supports day-to-day experimentation, finetuning, and smaller pretraining runs (e.g. "pilot METAGENE-1-style" models). Occasional large-scale training runs (full-corpus 7B pretrain) would burst to the cloud, one or a few times per year, with budgeted spend of $1–2M/year in compute as outlined above.

I believe we should plan to purchase the ~$350k server and then house it in the Starfish server room.

This 8xH100 server seems to be a pretty good "building block" of compute, and we could expand to a second 8xH100 server in the (near) future if warranted. This would continue the pattern of always-available dedicated compute, expanding to cloud when training larger models.