Data Formats Reference
Quick reference for the data representations used in the evolutionary diffusion pipeline. The model is trained on evolutionary trajectories — sequences of mutations along phylogenetic branches — encoded in one of several formats.
All four formats below are under active evaluation in an ablation study comparing their effects on model performance. No single format has been settled on yet — the diffusion-language-model repo has parallel training configs for each.
Input: FASTA trajectories
All formats start from FASTA files produced by the trajectories repo. Each file contains an ordered series of DNA sequences along a phylogenetic path.
Forwards trajectories (root-to-tip):
>NODE_0000|0|0
ATCGATCGATCG...
>NODE_0012|3|3
ATCAATCGATCG...
>TIP_NAME|5|2
ATCAATCAATCG...
Header format: >{node_name}|{cumulative_hamming}|{branch_hamming}
- cumulative_hamming: total mutations from root
- branch_hamming: mutations on this branch only
- Gaps (-) and ambiguous bases (N) are ignored in distance calculations
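The header fields and the gap/ambiguity rule can be sketched as follows. This is a minimal illustration, not the trajectories repo's code: `parse_header` and `hamming` are hypothetical names, and the sketch assumes the sequences are aligned to equal length and that a site is skipped whenever either sequence has `-` or `N` there.

```python
from typing import NamedTuple


class Header(NamedTuple):
    node_name: str
    cumulative_hamming: int
    branch_hamming: int


def parse_header(line: str) -> Header:
    """Parse '>{node_name}|{cumulative_hamming}|{branch_hamming}'."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # Split from the right so a '|' inside the node name is tolerated.
    name, cumulative, branch = line[1:].strip().rsplit("|", 2)
    return Header(name, int(cumulative), int(branch))


def hamming(a: str, b: str, ignored: str = "-N") -> int:
    """Count mismatches, skipping sites where either base is a gap or N."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in ignored and y not in ignored
    )
```

For example, `hamming("ATC-ATNG", "ATCGATCA")` counts only the final site, because the gap and the `N` are skipped.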
Pairwise trajectories (tip-to-tip):
>TipA|0|0
ATCGATCGATCG...
>TipB|5|5
ATCAATCAATCG...
Each file holds two tip sequences; the second header records the Hamming distance between them. Pairwise data is ~9x more abundant than forwards data.
Training format: JSONL
All representations are converted to JSONL with a chat-style message structure for model training:
{"messages": [{"role": "user", "content": "evodiff_spike-xs"}, {"role": "assistant", "content": "<formatted trajectory>"}]}
The user content is a dataset label (prompt). The assistant content is the trajectory in one of the formats below.
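A record of this shape can be produced with the standard `json` module. The helper name `to_jsonl_record` is hypothetical, and the sketch assumes one example per line with no additional fields.

```python
import json


def to_jsonl_record(label: str, trajectory: str) -> str:
    """Serialize one training example as a single JSONL line."""
    record = {
        "messages": [
            {"role": "user", "content": label},
            {"role": "assistant", "content": trajectory},
        ]
    }
    # json.dumps never emits raw newlines, so the result is one JSONL line.
    return json.dumps(record)


line = to_jsonl_record("evodiff_spike-xs", "0_ATCGATCGATCG|1_ATCAATCGATCG")
```

Writing a dataset is then a matter of writing each such line followed by a newline.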
Raw format
Full DNA sequences with generation numbers, separated by |:
0_ATCGATCGATCG|1_ATCAATCGATCG|2_ATCAATCAATCG
Format: {generation}_{sequence}|{generation}_{sequence}|...
Simple and lossless, but wasteful: successive sequences typically differ by only a handful of mutations, so ~90% of tokens are redundant copies of the previous sequence.
Variants: fwd_raw (forwards), pw_raw (pairwise)
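A round trip through the raw format might look like this. Both function names are hypothetical, and the encoder assumes generation numbers run consecutively from 0, as in the example above.

```python
def encode_raw(sequences: list[str]) -> str:
    """Join sequences as '{generation}_{sequence}' fields separated by '|'."""
    return "|".join(f"{gen}_{seq}" for gen, seq in enumerate(sequences))


def decode_raw(text: str) -> list[tuple[int, str]]:
    """Split a raw trajectory back into (generation, sequence) pairs."""
    pairs = []
    for field in text.split("|"):
        # Split on the first '_' only; the rest of the field is the sequence.
        gen, seq = field.split("_", 1)
        pairs.append((int(gen), seq))
    return pairs
```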
VCF format
Reference sequence plus compact variant descriptions per generation. Only mutations are encoded, not the full sequence.
ATCGATCGATCG|1:4G>A|2:4G>A,8G>A
Format: {reference}|{gen}:{variants}|{gen}:{variants}|...
Identical generations (no new mutations) use =:
ATCGATCGATCG|1:4G>A|2:=|3:4G>A,8G>A
Position convention
Positions are center-relative: position 0 is the center of the reference sequence, negative positions are left of center, positive positions are right of center.
For a 1000 bp reference (center at position 500):
| Absolute position | Center-relative |
|---|---|
| 100 | -400 |
| 500 | 0 |
| 700 | +200 |
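The table rows follow from taking the center as the integer half of the reference length. A small sketch of the arithmetic, with hypothetical function names; whether absolute positions are 0- or 1-based is not stated here, so the code only performs the offset conversion.

```python
def to_absolute(offset: int, ref_len: int) -> int:
    """Convert a center-relative offset to an absolute position."""
    return ref_len // 2 + offset


def to_center_relative(abs_pos: int, ref_len: int) -> int:
    """Inverse of to_absolute."""
    return abs_pos - ref_len // 2
```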
Variant types
| Type | Syntax | Example | Meaning |
|---|---|---|---|
| SNP | posREF>ALT | -400C>T | C→T at position -400 |
| MNP | posREF>ALT | -5TT>GG | TT→GG starting at -5 |
| Insertion | pos+ALT | -307+TTAC | Insert TTAC after position -307 |
| Deletion | pos-REF | 100-ATG | Delete ATG at position 100 |
Multiple variants in a generation are comma-separated.
Variants: fwd_vcf (forwards), pw_vcf (pairwise)
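The variant syntax above is regular enough to parse with a single pattern. The sketch below is an illustration only: `Variant`, `parse_variant`, and `parse_vcf_trajectory` are hypothetical names, positions are kept exactly as written (no conversion to absolute coordinates), and an `=` generation is represented as `None`.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    pos: int   # position exactly as written in the token
    ref: str   # bases removed ('' for an insertion)
    alt: str   # bases added ('' for a deletion)
    kind: str  # 'sub' (SNP or MNP), 'ins', or 'del'


_TOKEN = re.compile(
    r"(?P<pos>[+-]?\d+)"
    r"(?:(?P<ref>[A-Za-z]+)>(?P<alt>[A-Za-z]+)"  # posREF>ALT
    r"|\+(?P<ins_alt>[A-Za-z]+)"                 # pos+ALT
    r"|-(?P<del_ref>[A-Za-z]+))"                 # pos-REF
)


def parse_variant(token: str) -> Variant:
    m = _TOKEN.fullmatch(token)
    if m is None:
        raise ValueError(f"unrecognized variant: {token!r}")
    pos = int(m["pos"])
    if m["ins_alt"] is not None:
        return Variant(pos, "", m["ins_alt"], "ins")
    if m["del_ref"] is not None:
        return Variant(pos, m["del_ref"], "", "del")
    return Variant(pos, m["ref"], m["alt"], "sub")


def parse_vcf_trajectory(text: str):
    """Return (reference, [(generation, variants or None), ...])."""
    reference, *fields = text.split("|")
    generations = []
    for field in fields:
        gen, _, body = field.partition(":")
        if body == "=":
            variants = None  # no new mutations in this generation
        else:
            variants = [parse_variant(t) for t in body.split(",")]
        generations.append((int(gen), variants))
    return reference, generations
```

Because the leading sign belongs to the position, `-307+TTAC` parses as an insertion at -307 and `100-ATG` as a deletion at 100.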
EMD (embedding) format
Like VCF, but each mutation includes flanking sequence context on both sides. Context windows are adaptively sized (4–12 bp) to uniquely identify the mutation position.
ATCGATCGATCG|1:[ATCG]G>A[TCGA]|2:[ATCG]G>A[TCGA],[GATC]G>A[ATCG]
Format: {reference}|{gen}:[left_ctx]REF>ALT[right_ctx]|...
Insertions and deletions follow the same pattern:
| Type | Example |
|---|---|
| SNP | [ATCG]G>A[TCGA] |
| Insertion | [ATCG]+TTAC[TCGA] |
| Deletion | [ATCG]-ATG[TCGA] |
The context makes mutations more interpretable to the language model by providing surrounding sequence, analogous to how code diff tools show context lines.
Variants: fwd_emd (forwards), pw_emd (pairwise)
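Since EMD variants carry no explicit position, a consumer has to find the site by matching the context. The sketch below parses one token and searches a sequence for `left + REF + right`; the names `parse_emd_variant` and `locate` are hypothetical, and matching against the unmutated sequence is an assumption of this example rather than documented behavior.

```python
import re

_EMD = re.compile(
    r"\[(?P<left>[A-Za-z]*)\]"
    r"(?:(?P<ref>[A-Za-z]+)>(?P<alt>[A-Za-z]+)"  # [left]REF>ALT[right]
    r"|\+(?P<ins_alt>[A-Za-z]+)"                 # [left]+ALT[right]
    r"|-(?P<del_ref>[A-Za-z]+))"                 # [left]-REF[right]
    r"\[(?P<right>[A-Za-z]*)\]"
)


def parse_emd_variant(token: str) -> dict:
    m = _EMD.fullmatch(token)
    if m is None:
        raise ValueError(f"unrecognized EMD variant: {token!r}")
    if m["ins_alt"] is not None:
        ref, alt = "", m["ins_alt"]
    elif m["del_ref"] is not None:
        ref, alt = m["del_ref"], ""
    else:
        ref, alt = m["ref"], m["alt"]
    return {"left": m["left"], "ref": ref, "alt": alt, "right": m["right"]}


def locate(sequence: str, variant: dict) -> list[int]:
    """Return every index where the variant's REF would start."""
    pattern = variant["left"] + variant["ref"] + variant["right"]
    hits = []
    start = sequence.find(pattern)
    while start != -1:
        hits.append(start + len(variant["left"]))
        start = sequence.find(pattern, start + 1)  # allow overlapping hits
    return hits
```

A result with exactly one index means the context identifies the site uniquely; zero or several hits would indicate that the context window is wrong or too short for that sequence.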
SDiff format
A unidiff-style representation with hunks, analogous to diff output for code:
ATCGATCGATCG
@@1
 ATCG
-G
+A
 TCGATCG
@@2
 ATCG
-G
+A
 TCG
-G
+A
 TCG
- The @@{gen} header marks the generation
- Lines prefixed with a space are context (unchanged)
- Lines prefixed with - are reference (removed)
- Lines prefixed with + are alternative (added)
Variants: fwd_sdiff (forwards), pw_sdiff (pairwise)
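Either side of a generation can be rebuilt by concatenating its context lines with the lines of one prefix. The sketch below assumes the header is exactly `@@{gen}` and that every hunk line keeps its one-character prefix; `split_sdiff` and `render` are hypothetical names, and the sample trajectory is constructed for this example.

```python
def split_sdiff(text: str):
    """Split an SDiff trajectory into (reference, {generation: hunk_lines})."""
    lines = text.split("\n")
    reference, generations, current = lines[0], {}, None
    for line in lines[1:]:
        if line.startswith("@@"):
            current = generations.setdefault(int(line[2:]), [])
        elif line and current is not None:
            current.append(line)
    return reference, generations


def render(hunk_lines: list[str], side: str) -> str:
    """Rebuild one side: side='-' gives the reference, '+' the alternative."""
    return "".join(l[1:] for l in hunk_lines if l[0] in (" ", side))


sample = "\n".join([
    "ATCGATCGATCG",
    "@@1",
    " ATC", "-G", "+A", " ATCGATCG",
    "@@2",
    " ATC", "-G", "+A", " ATC", "-G", "+A", " ATCG",
])
reference, generations = split_sdiff(sample)
```

Rendering the `-` side of any generation should give back the reference, which makes a cheap consistency check on generated output.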
JSON-SDiff variant
A compact JSON encoding of the same hunk structure:
ATCGATCGATCG
@@1
{"h":[{"c":"ATCG","r":"G","a":"A"},{"c":"TCGATCG"}]}
Each hunk in the h array is an object with c (context) and, when the hunk carries a change, r (reference) and a (alternative).
Applying variants to reconstruct sequences
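Decoding the JSON form needs only the standard `json` module. The sketch assumes that within each hunk the context precedes the change, as the key order in the example suggests; `render_json_hunks` is a hypothetical name and the payload is constructed for this example.

```python
import json


def render_json_hunks(payload: str, side: str) -> str:
    """Rebuild one side of a JSON-SDiff generation.

    side is 'r' for the reference sequence or 'a' for the alternative.
    """
    hunks = json.loads(payload)["h"]
    # Context is shared by both sides; 'r' and 'a' may be absent on a hunk.
    return "".join(h.get("c", "") + h.get(side, "") for h in hunks)


payload = '{"h":[{"c":"ATC","r":"G","a":"A"},{"c":"ATCGATCG"}]}'
before = render_json_hunks(payload, "r")
after = render_json_hunks(payload, "a")
```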
To reconstruct a full sequence from VCF or EMD format, apply variants to the reference:
- Start with the reference sequence
- For each generation, apply all variants in order
- Insertions increase sequence length; deletions decrease it
- Center-relative positions must be converted to absolute:
abs_pos = len(ref) // 2 + offset
The pegasus-evals format converters (pegasus_inference/format_converters.py) implement apply_vcf_variants() and apply_emd_variants() for this.
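The steps above can be sketched as follows. This is not the pegasus-evals implementation: `apply_variants` is a hypothetical helper, and it assumes variants arrive already parsed as `(offset, ref_bases, alt_bases)` tuples, that converted positions are 0-based indices, that variants in one call do not overlap, and that an insertion goes after the base at its position.

```python
def apply_variants(ref: str, variants: list[tuple[int, str, str]]) -> str:
    """Apply (offset, ref_bases, alt_bases) variants to ref.

    offset is center-relative. ref_bases == '' marks an insertion after
    offset; alt_bases == '' marks a deletion starting at offset.
    """
    center = len(ref) // 2
    seq = ref
    # Work right to left so indels do not shift positions still to be applied.
    for offset, old, new in sorted(variants, key=lambda v: v[0], reverse=True):
        pos = center + offset
        if not 0 <= pos < len(ref):
            raise ValueError(f"offset {offset} is outside the reference")
        if old == "":
            # Insertion: new bases go after the base at pos.
            seq = seq[: pos + 1] + new + seq[pos + 1 :]
            continue
        if seq[pos : pos + len(old)] != old:
            raise ValueError(f"reference mismatch at offset {offset}")
        seq = seq[:pos] + new + seq[pos + len(old) :]
    return seq
```

Checking the stated reference bases before substituting catches position or indexing mistakes early, which is useful when validating model output.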
Format comparison
| Format | Tokens per trajectory | Strengths | Weaknesses |
|---|---|---|---|
| Raw | High (full sequences) | Lossless, simple | ~90% redundant tokens |
| VCF | Low (mutations only) | Compact, precise | Requires position arithmetic |
| EMD | Medium (mutations + context) | LLM-interpretable context | Larger than VCF |
| SDiff | Medium (diff hunks) | Familiar diff format | More verbose than VCF |