S3 Storage

All training data and model weights are stored in AWS S3 across two buckets. Access requires AWS credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) — set them via environment variables or with aws configure.
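Running aws configure writes a shared credentials file that all later aws commands pick up. A minimal sketch of the resulting ~/.aws/credentials (values are placeholders):

```ini
[default]
aws_access_key_id = AKIA...PLACEHOLDER
aws_secret_access_key = PLACEHOLDER
```

Environment variables, when set, take precedence over this file.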

s3://pegasus-training-data

Evolutionary trajectory training data, organized under the trajectories/ prefix. Total size: ~900 GB.

Path structure

trajectories/
  spike-xs/                     # SARS-CoV-2 spike S1 (10,195 seqs)
  spike-sm/                     # SARS-CoV-2 spike S1 (34,707 seqs)
  spike-lg/                     # SARS-CoV-2 spike S1 (~8M seqs, UShER)
  flu-h3-xs/                    # Influenza H3N2 HA1
  n450-xs/                      # Measles N450
  cytb-xs/                      # Mammalian cytochrome b
  rdrp-paramyxoviridae-xs/      # Paramyxoviridae RdRp
  rdrp-flaviviridae-xs/         # Flaviviridae RdRp
  rdrp-picornaviridae-xs/       # Picornaviridae RdRp
  bac120-cyano-{marker}/        # 123 Cyanobacteriota marker genes
  bac120-bacteroidota-{marker}/ # 124 Bacteroidota marker genes
  bac120-actinomycetota-{marker}/ # 116 Actinomycetota marker genes
  bac120-bacillota-{marker}/    # 122 Bacillota marker genes
  odb-{marker_id}-sm/           # ~1120 fungal OrthoDB marker genes

Each dataset directory contains zstd-compressed tar shards (.tar.zst):

forwards-train-000.tar.zst     # Root-to-tip trajectory shards
forwards-train-001.tar.zst
forwards-test-000.tar.zst
pairwise-train-000.tar.zst     # Tip-to-tip pair shards
pairwise-train-001.tar.zst
pairwise-test-000.tar.zst

Each shard contains up to 10,000 FASTA trajectory files.
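The shard naming above ({kind}-{split}-{index}.tar.zst) can be parsed mechanically. A minimal sketch — the parse_shard helper is illustrative, not part of any repo here:

```python
import re

# Shards follow {kind}-{split}-{index}.tar.zst, e.g. forwards-train-000.tar.zst
SHARD_RE = re.compile(r"^(forwards|pairwise)-(train|test)-(\d{3})\.tar\.zst$")

def parse_shard(name: str) -> tuple[str, str, int]:
    """Split a shard filename into (kind, split, index)."""
    m = SHARD_RE.match(name)
    if m is None:
        raise ValueError(f"not a shard name: {name!r}")
    kind, split, idx = m.groups()
    return kind, split, int(idx)

print(parse_shard("forwards-train-000.tar.zst"))  # ('forwards', 'train', 0)
```

Useful for grouping shards by split before extraction.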

Current contents

Verified against S3 as of 2026-04-09. Total: 1607 dataset directories.

Category                Datasets   Source repo
bac120-actinomycetota   116        bac120
bac120-bacteroidota     124        bac120
bac120-bacillota        122        bac120
bac120-cyano            123        bac120
odb (fungi)             ~1120      odb
Viral/mammalian         9          trajectories, rdrp

Uploaded from the trajectories repo via its snakemake upload rule.

s3://pegasus-model-weights

Model checkpoints, training data in JSONL format, predictions, and per-instance workspaces. Total size: ~480 GB.

Path structure

diffusion_language_model/
  model_raw/                        # LLaDA2-mini trained on raw format
  model_vcf/                        # LLaDA2-mini trained on VCF format
  model_emd/                        # LLaDA2-mini trained on EMD format
  dry_run_jsonl/                    # Test JSONL data
  qwen_4B_partial_data/             # Qwen3-4B checkpoints + data + predictions
    jsonl/                          # Preprocessed JSONL by format
    predictions/                    # Model predictions
  small_ablations/
    jsonl-{raw,vcf,emd,sdiff}/      # JSONL training data by format (forward)
    jsonl-{raw,vcf,emd,sdiff}-pairwise/ # JSONL training data by format (pairwise)
    llada2_mini_2026_02_16_fwd_{raw,vcf,emd,sdiff}/ # LLaDA2 ablation checkpoints
    llada2_mini_2026_02_16_pw_{raw,sdiff}/           # LLaDA2 pairwise checkpoints
predictions/
  predictions_v2.jsonl              # Model predictions (~3.6 GB)
workspaces/
  {instance_id}/                    # Per-instance Lambda Labs workspace syncs
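The {raw,vcf,emd,sdiff} patterns in the tree above are brace shorthand for sibling directories. Expanding them into concrete S3 prefixes can be sketched as follows — the expand_braces helper is illustrative, not part of any repo here:

```python
import itertools
import re

def expand_braces(pattern: str) -> list[str]:
    """Expand each {a,b,c} group in a path pattern into all combinations."""
    groups = re.findall(r"\{([^}]*)\}", pattern)
    template = re.sub(r"\{[^}]*\}", "{}", pattern)
    return [template.format(*combo)
            for combo in itertools.product(*(g.split(",") for g in groups))]

prefix = "diffusion_language_model/small_ablations/"
for p in expand_braces(prefix + "jsonl-{raw,vcf,emd,sdiff}/"):
    print(p)  # four concrete prefixes, one per format
```

Handy for scripting downloads of a subset of the ablation checkpoints.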

Managed by the diffusion-language-model and lambda-resource-manager repos.

Common commands

List datasets:

aws s3 ls s3://pegasus-training-data/trajectories/

Download a specific dataset:

aws s3 sync s3://pegasus-training-data/trajectories/spike-xs/ ./spike-xs/
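aws s3 sync also accepts --exclude/--include filters (applied in order), so a partial download — say, only the forward-trajectory shards — is possible. A sketch that only assembles the command string, since actually running it needs credentials:

```shell
# Build (but do not run) a filtered sync command; --exclude/--include
# are standard aws-cli filters, evaluated left to right.
dataset="spike-xs"
cmd="aws s3 sync s3://pegasus-training-data/trajectories/${dataset}/ ./${dataset}/ --exclude '*' --include 'forwards-*'"
echo "$cmd"
```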

Download model checkpoints:

aws s3 sync s3://pegasus-model-weights/diffusion_language_model/ ./diffusion_language_model/

List model ablation checkpoints:

aws s3 ls s3://pegasus-model-weights/diffusion_language_model/small_ablations/
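For a quick size check, aws s3 ls supports --recursive --summarize, which prints object counts and total size directly. Per-object sizes (the third column of the listing) can also be summed by hand; a sketch using sample listing output in place of a live call, which would need credentials:

```shell
# Sum byte sizes (column 3) from sample `aws s3 ls --recursive` output.
awk '{ total += $3 } END { print total }' <<'EOF'
2026-04-09 12:00:00    1048576 trajectories/spike-xs/forwards-train-000.tar.zst
2026-04-09 12:00:01    2097152 trajectories/spike-xs/forwards-train-001.tar.zst
EOF
```

For these two sample objects the pipeline prints 3145728 (bytes).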