S3 Storage¶
All training data and model weights are stored in AWS S3 across two buckets. Access requires AWS credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY); set them with aws configure or export them as environment variables.
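A minimal setup sketch (placeholder values; assumes the AWS CLI is installed):
aws configure                       # interactive prompt for access key, secret key, and default region
export AWS_ACCESS_KEY_ID=...        # or export the credentials directly in the current shell
export AWS_SECRET_ACCESS_KEY=...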
s3://pegasus-training-data¶
Evolutionary trajectory training data, organized under the trajectories/ prefix. Total size: ~900 GB.
Path structure¶
trajectories/
    spike-xs/                       # SARS-CoV-2 spike S1 (10,195 seqs)
    spike-sm/                       # SARS-CoV-2 spike S1 (34,707 seqs)
    spike-lg/                       # SARS-CoV-2 spike S1 (~8M seqs, UShER)
    flu-h3-xs/                      # Influenza H3N2 HA1
    n450-xs/                        # Measles N450
    cytb-xs/                        # Mammalian cytochrome b
    rdrp-paramyxoviridae-xs/        # Paramyxoviridae RdRp
    rdrp-flaviviridae-xs/           # Flaviviridae RdRp
    rdrp-picornaviridae-xs/         # Picornaviridae RdRp
    bac120-cyano-{marker}/          # 123 Cyanobacteriota marker genes
    bac120-bacteroidota-{marker}/   # 124 Bacteroidota marker genes
    bac120-actinomycetota-{marker}/ # 116 Actinomycetota marker genes
    bac120-bacillota-{marker}/      # 122 Bacillota marker genes
    odb-{marker_id}-sm/             # ~1120 fungal OrthoDB marker genes
Each dataset directory contains compressed tar.zst shards:
forwards-train-000.tar.zst # Root-to-tip trajectory shards
forwards-train-001.tar.zst
forwards-test-000.tar.zst
pairwise-train-000.tar.zst # Tip-to-tip pair shards
pairwise-train-001.tar.zst
pairwise-test-000.tar.zst
Each shard contains up to 10,000 FASTA trajectory files.
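To spot-check a shard locally, download one and list its contents (assumes zstd is installed; any shard path from the layout above works):
aws s3 cp s3://pegasus-training-data/trajectories/spike-xs/forwards-train-000.tar.zst .
zstd -dc forwards-train-000.tar.zst | tar -tf - | head    # first few FASTA trajectory files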
Current contents¶
Verified against S3 as of 2026-04-09. Total: 1607 dataset directories.
| Category | Datasets | Source repo |
|---|---|---|
| bac120-actinomycetota | 116 | bac120 |
| bac120-bacteroidota | 124 | bac120 |
| bac120-bacillota | 122 | bac120 |
| bac120-cyano | 123 | bac120 |
| odb (fungi) | ~1120 | odb |
| Viral/mammalian | 9 | trajectories, rdrp |
Uploaded by the trajectories repo via snakemake upload.
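A typical invocation from a checkout of that repo (the core count here is illustrative):
snakemake --cores 8 upload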
s3://pegasus-model-weights¶
Model checkpoints, training data in JSONL format, predictions, and per-instance workspaces. Total size: ~480 GB.
Path structure¶
diffusion_language_model/
    model_raw/                # LLaDA2-mini trained on raw format
    model_vcf/                # LLaDA2-mini trained on VCF format
    model_emd/                # LLaDA2-mini trained on EMD format
    dry_run_jsonl/            # Test JSONL data
    qwen_4B_partial_data/     # Qwen3-4B checkpoints + data + predictions
        jsonl/                # Preprocessed JSONL by format
        predictions/          # Model predictions
    small_ablations/
        jsonl-{raw,vcf,emd,sdiff}/          # JSONL training data by format (forward)
        jsonl-{raw,vcf,emd,sdiff}-pairwise/ # JSONL training data by format (pairwise)
        llada2_mini_2026_02_16_fwd_{raw,vcf,emd,sdiff}/ # LLaDA2 ablation checkpoints
        llada2_mini_2026_02_16_pw_{raw,sdiff}/          # LLaDA2 pairwise checkpoints
predictions/
    predictions_v2.jsonl      # Model predictions (~3.6 GB)
workspaces/
    {instance_id}/            # Per-instance Lambda Labs workspace syncs
Managed by diffusion-language-model and lambda-resource-manager.
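To pull down a specific instance's workspace (path per the layout above; replace {instance_id} with an actual instance ID):
aws s3 sync s3://pegasus-model-weights/workspaces/{instance_id}/ ./workspace/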
Common commands¶
List datasets:
aws s3 ls s3://pegasus-training-data/trajectories/
Download a specific dataset:
aws s3 sync s3://pegasus-training-data/trajectories/spike-xs/ ./spike-xs/
Download model checkpoints:
aws s3 sync s3://pegasus-model-weights/diffusion_language_model/ ./diffusion_language_model/
List model ablation checkpoints:
aws s3 ls s3://pegasus-model-weights/diffusion_language_model/small_ablations/
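Download a single ablation checkpoint (any of the directories listed in the layout above), for example:
aws s3 sync s3://pegasus-model-weights/diffusion_language_model/small_ablations/llada2_mini_2026_02_16_fwd_raw/ ./llada2_mini_2026_02_16_fwd_raw/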