Data Formats Reference¶

Quick reference for the data representations used in the evolutionary diffusion pipeline. The model is trained on evolutionary trajectories — sequences of mutations along phylogenetic branches — encoded in one of several formats.

All four formats below are under active evaluation in an ablation study comparing their effects on model performance. No single format has been settled on yet — the diffusion-language-model repo has parallel training configs for each.

Input: FASTA trajectories¶

All formats start from FASTA files produced by the trajectories repo. Each file contains an ordered series of DNA sequences along a phylogenetic path.

Forwards trajectories (root-to-tip):

>NODE_0000|0|0
ATCGATCGATCG...
>NODE_0012|3|3
ATCAATCGATCG...
>TIP_NAME|5|2
ATCAATCAATCG...

Header format: >{node_name}|{cumulative_hamming}|{branch_hamming}

cumulative_hamming: total mutations from root
branch_hamming: mutations on this branch only
Gaps (-) and ambiguous bases (N) are ignored in distance calculations
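The header layout and the distance rule can be sketched in a few lines of Python. The names `parse_header` and `hamming` are illustrative and are not taken from the trajectories repo. Skipping any column where either sequence has a gap or N is an assumption about what "ignored" means here.

```python
from typing import NamedTuple


class TrajectoryHeader(NamedTuple):
    node_name: str
    cumulative_hamming: int
    branch_hamming: int


def parse_header(line: str) -> TrajectoryHeader:
    """Parse '>{node_name}|{cumulative_hamming}|{branch_hamming}'."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # rsplit from the right so a '|' inside the node name (if that ever
    # happens; an assumption) does not break the two numeric fields.
    name, cumulative, branch = line[1:].strip().rsplit("|", 2)
    return TrajectoryHeader(name, int(cumulative), int(branch))


def hamming(a: str, b: str, ignored: str = "-N") -> int:
    """Count mismatching columns, skipping columns with a gap or N on either side."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        x != y
        for x, y in zip(a.upper(), b.upper())
        if x not in ignored and y not in ignored
    )


header = parse_header(">TIP_NAME|5|2")
distance = hamming("ATCGATCGATCG", "ATCAATCAATCG")  # two mismatching columns
```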

Pairwise trajectories (tip-to-tip):

>TipA|0|0
ATCGATCGATCG...
>TipB|5|5
ATCAATCAATCG...

Each file holds two tip sequences; the second header records the Hamming distance between them. Pairwise data is roughly 9x more abundant than forwards data.

Training format: JSONL¶

All representations are converted to JSONL with a chat-style message structure for model training:

{"messages": [{"role": "user", "content": "evodiff_spike-xs"}, {"role": "assistant", "content": "<formatted trajectory>"}]}

The user content is a dataset label (prompt). The assistant content is the trajectory in one of the formats below.
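As a rough sketch, one training record can be produced with the standard `json` module. The helper name `to_jsonl_line` is hypothetical; only the message structure comes from the example above.

```python
import json


def to_jsonl_line(dataset_label: str, trajectory: str) -> str:
    """Serialize one trajectory as a single JSONL line (no trailing newline)."""
    record = {
        "messages": [
            {"role": "user", "content": dataset_label},
            {"role": "assistant", "content": trajectory},
        ]
    }
    # json.dumps escapes any newlines inside the trajectory, so the record
    # always stays on one physical line.
    return json.dumps(record)


line = to_jsonl_line("evodiff_spike-xs", "0_ATCGATCGATCG|1_ATCAATCGATCG")
```

Multi-line representations such as SDiff survive this step because their newlines are escaped as `\n` inside the JSON string.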

Raw format¶

Full DNA sequences with generation numbers, separated by |:

0_ATCGATCGATCG|1_ATCAATCGATCG|2_ATCAATCAATCG

Format: {generation}_{sequence}|{generation}_{sequence}|...

Simple and lossless, but wasteful: successive sequences typically differ by only a handful of mutations, so ~90% of tokens are redundant copies of the previous sequence.
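A minimal round-trip sketch for the raw format follows. The function names are illustrative, and the code assumes that sequences never contain `|` or `_`.

```python
def encode_raw(pairs):
    """Join (generation, sequence) pairs as '{generation}_{sequence}|...'."""
    return "|".join(f"{gen}_{seq}" for gen, seq in pairs)


def decode_raw(text):
    """Split a raw-format string back into (generation, sequence) pairs."""
    pairs = []
    for field in text.split("|"):
        gen, seq = field.split("_", 1)
        pairs.append((int(gen), seq))
    return pairs


trajectory = [(0, "ATCGATCGATCG"), (1, "ATCAATCGATCG"), (2, "ATCAATCAATCG")]
encoded = encode_raw(trajectory)
```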

Variants: fwd_raw (forwards), pw_raw (pairwise)

VCF format¶

Reference sequence plus compact variant descriptions per generation. Only mutations are encoded, not the full sequence.

ATCGATCGATCG|1:4G>A|2:4G>A,8G>A

Format: {reference}|{gen}:{variants}|{gen}:{variants}|...

A generation identical to the previous one (no new mutations) is written as =:

ATCGATCGATCG|1:4G>A|2:=|3:4G>A,8G>A
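The field layout can be split with plain string operations. This is a sketch rather than the pipeline's parser: `parse_vcf_trajectory` is a hypothetical name, and expanding `=` into a copy of the previous generation's variant list is an assumption based on how the example reads.

```python
def parse_vcf_trajectory(text):
    """Return (reference, [(generation, [variant, ...]), ...])."""
    reference, *fields = text.split("|")
    generations = []
    previous = []  # variant list of the most recent generation
    for field in fields:
        gen, _, body = field.partition(":")
        if body == "=":
            variants = list(previous)  # identical to the previous generation
        else:
            variants = body.split(",") if body else []
        generations.append((int(gen), variants))
        previous = variants
    return reference, generations


reference, generations = parse_vcf_trajectory(
    "ATCGATCGATCG|1:4G>A|2:=|3:4G>A,8G>A"
)
```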

Position convention¶

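Positions are center-relative: position 0 is the center of the reference sequence, negative positions are left of center, and positive positions are right of center. The 12 bp toy examples above (4G>A, 8G>A) are illustrative and read as 1-based absolute positions rather than center-relative offsets.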

For a 1000 bp reference (center at position 500):

Absolute position	Center-relative
100	-400
500	0
700	+200
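The conversion in the table is a single subtraction. The helper names below are illustrative; the center is taken as `ref_len // 2`, matching the 1000 bp example.

```python
def to_center_relative(abs_pos: int, ref_len: int) -> int:
    """Convert an absolute position to an offset from the reference center."""
    return abs_pos - ref_len // 2


def to_absolute(offset: int, ref_len: int) -> int:
    """Convert a center-relative offset back to an absolute position."""
    return ref_len // 2 + offset


# The three rows of the table, for a 1000 bp reference.
rows = [(p, to_center_relative(p, 1000)) for p in (100, 500, 700)]
```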

Variant types¶

Type	Syntax	Example	Meaning
SNP	`posREF>ALT`	`-400C>T`	C→T at position -400
MNP	`posREF>ALT`	`-5TT>GG`	TT→GG starting at -5
Insertion	`pos+ALT`	`-307+TTAC`	Insert TTAC after position -307
Deletion	`pos-REF`	`100-ATG`	Delete ATG starting at position 100

Multiple variants in a generation are comma-separated.
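The four syntaxes can be told apart with one regular expression. This is a sketch under assumptions: bases are uppercase letters, a positive position may or may not carry a leading `+`, and the names `Variant` and `parse_variant` are hypothetical.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    kind: str  # "SNP", "MNP", "insertion" or "deletion"
    pos: int   # center-relative position
    ref: str   # bases removed ("" for an insertion)
    alt: str   # bases added ("" for a deletion)


_VARIANT_RE = re.compile(
    r"^(?P<pos>[+-]?\d+)"
    r"(?:(?P<ref>[A-Z]+)>(?P<alt>[A-Z]+)"
    r"|\+(?P<inserted>[A-Z]+)"
    r"|-(?P<deleted>[A-Z]+))$"
)


def parse_variant(token: str) -> Variant:
    """Classify one variant token from the table above."""
    m = _VARIANT_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognised variant: {token!r}")
    pos = int(m["pos"])
    if m["inserted"] is not None:
        return Variant("insertion", pos, "", m["inserted"])
    if m["deleted"] is not None:
        return Variant("deletion", pos, m["deleted"], "")
    # A one-base substitution is an SNP; anything longer is treated as an MNP.
    kind = "SNP" if len(m["ref"]) == 1 and len(m["alt"]) == 1 else "MNP"
    return Variant(kind, pos, m["ref"], m["alt"])


parsed = [parse_variant(t) for t in "-400C>T,-5TT>GG,-307+TTAC,100-ATG".split(",")]
```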

Variants: fwd_vcf (forwards), pw_vcf (pairwise)

EMD (embedding) format¶

Like VCF, but each mutation includes flanking sequence context on both sides. Context windows are adaptively sized (4–12 bp) to uniquely identify the mutation position.

ATCGATCGATCG|1:[ATCG]G>A[TCGA]|2:[ATCG]G>A[TCGA],[GATC]G>A[ATCG]

Format: {reference}|{gen}:[left_ctx]REF>ALT[right_ctx]|...

Insertions and deletions follow the same pattern:

Type	Example
SNP	`[ATCG]G>A[TCGA]`
Insertion	`[ATCG]+TTAC[TCGA]`
Deletion	`[ATCG]-ATG[TCGA]`

The context makes mutations more interpretable to the language model by providing surrounding sequence, analogous to how code diff tools show context lines.
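Because an EMD token carries no position, a decoder has to find the site by searching for the context. The sketch below shows one way to do that and is not the pipeline's implementation: `locate_emd` is a hypothetical name, positions are 0-based indices into the reference, and a token whose context matches zero or several sites is rejected.

```python
import re

_EMD_RE = re.compile(
    r"^\[(?P<left>[A-Z]*)\]"
    r"(?:(?P<ref>[A-Z]+)>(?P<alt>[A-Z]+)|\+(?P<ins>[A-Z]+)|-(?P<gone>[A-Z]+))"
    r"\[(?P<right>[A-Z]*)\]$"
)


def locate_emd(reference: str, token: str) -> int:
    """Return the 0-based index where the change starts in the reference.

    For an insertion this is the index of the base that follows the
    inserted bases.
    """
    m = _EMD_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognised EMD token: {token!r}")
    removed = m["ref"] or m["gone"] or ""  # bases expected in the reference
    pattern = m["left"] + removed + m["right"]
    # Collect every occurrence, including overlapping ones.
    hits = []
    start = reference.find(pattern)
    while start != -1:
        hits.append(start)
        start = reference.find(pattern, start + 1)
    if len(hits) != 1:
        raise ValueError(f"context matches {len(hits)} sites, expected exactly 1")
    return hits[0] + len(m["left"])


site = locate_emd("ATCGATCGATCG", "[GATC]G>A[ATCG]")
```

In a repetitive reference a short context can match more than one site, which is the situation the adaptive 4–12 bp window is meant to avoid.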

Variants: fwd_emd (forwards), pw_emd (pairwise)

SDiff format¶

A unidiff-style representation with hunks, analogous to diff output for code:

ATCGATCGATCG
@@1
 ATC
-G
+A
 ATCGATCG
@@2
 ATC
-G
+A
 ATC
-G
+A
 ATCG

@@{gen} header marks the generation
Space-prefixed lines are context (unchanged)
- prefixed lines are reference (removed)
+ prefixed lines are alternative (added)
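Reconstructing a generation from hunks means concatenating the context and `+` lines, while the context and `-` lines should reproduce the reference. The sketch below assumes that each generation's hunks cover the whole reference and are written relative to the reference rather than to the previous generation. `apply_sdiff` is a hypothetical name.

```python
def apply_sdiff(text: str) -> dict:
    """Return {generation: reconstructed sequence} for an SDiff trajectory."""
    lines = text.split("\n")
    reference = lines[0]
    old_parts, new_parts = {}, {}
    gen = None
    for line in lines[1:]:
        if line.startswith("@@"):
            gen = int(line[2:])
            old_parts[gen], new_parts[gen] = [], []
        elif line and gen is not None:
            tag, body = line[0], line[1:]
            if tag in " -":  # context and removed bases belong to the reference
                old_parts[gen].append(body)
            if tag in " +":  # context and added bases belong to the new sequence
                new_parts[gen].append(body)
    result = {}
    for g, parts in new_parts.items():
        if "".join(old_parts[g]) != reference:
            raise ValueError(f"generation {g}: hunks do not match the reference")
        result[g] = "".join(parts)
    return result


sdiff = "\n".join([
    "ATCGATCGATCG",
    "@@1", " ATC", "-G", "+A", " ATCGATCG",
    "@@2", " ATC", "-G", "+A", " ATC", "-G", "+A", " ATCG",
])
sequences = apply_sdiff(sdiff)
```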

Variants: fwd_sdiff (forwards), pw_sdiff (pairwise)

JSON-SDiff variant¶

A compact JSON encoding of the same hunk structure:

ATCGATCGATCG
@@1
{"h":[{"c":"ATC","r":"G","a":"A"},{"c":"ATCGATCG"}]}

Each hunk is an object with c (context), r (reference), and a (alternative). A hunk that carries only trailing context omits r and a.
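A sketch of decoding one JSON-SDiff generation line follows. It assumes that within a hunk the context precedes the change, and that missing keys stand for empty strings. `apply_json_hunks` is a hypothetical name.

```python
import json


def apply_json_hunks(payload: str):
    """Return (old, new) sequences rebuilt from one JSON-SDiff line."""
    hunks = json.loads(payload)["h"]
    # Context is shared; "r" belongs only to the old side, "a" only to the new.
    old = "".join(h.get("c", "") + h.get("r", "") for h in hunks)
    new = "".join(h.get("c", "") + h.get("a", "") for h in hunks)
    return old, new


old, new = apply_json_hunks('{"h":[{"c":"ATC","r":"G","a":"A"},{"c":"ATCGATCG"}]}')
```

Comparing `old` with the reference line is a cheap consistency check before `new` is used.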

Applying variants to reconstruct sequences¶

To reconstruct a full sequence from VCF or EMD format, apply variants to the reference:

Start with the reference sequence
For each generation, apply all of its variants in order (a generation marked = reuses the previous generation's sequence)
Insertions increase sequence length; deletions decrease it
Center-relative positions must be converted to absolute: abs_pos = len(ref) // 2 + offset

The pegasus-evals format converters (pegasus_inference/format_converters.py) implement apply_vcf_variants() and apply_emd_variants() for this.
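The steps above can be sketched as follows. This is an independent illustration, not the code in `format_converters.py`. It assumes that center-relative offsets map to 0-based indices via `len(ref) // 2 + offset`, that each generation's variants are relative to the reference, and that an insertion goes after the named base. `apply_variants` is a hypothetical name.

```python
import re

_VARIANT_RE = re.compile(
    r"^(?P<pos>[+-]?\d+)"
    r"(?:(?P<ref>[A-Z]+)>(?P<alt>[A-Z]+)|\+(?P<ins>[A-Z]+)|-(?P<gone>[A-Z]+))$"
)


def apply_variants(reference: str, variants: list) -> str:
    """Apply one generation's variants to the reference."""
    center = len(reference) // 2
    edits = []  # (start, end, replacement) spans in reference coordinates
    for token in variants:
        m = _VARIANT_RE.match(token)
        if m is None:
            raise ValueError(f"unrecognised variant: {token!r}")
        start = center + int(m["pos"])  # assumed 0-based absolute index
        if not 0 <= start < len(reference):
            raise ValueError(f"position out of range: {token!r}")
        if m["ins"] is not None:
            # Insert after the anchor base: an empty span just past it.
            edits.append((start + 1, start + 1, m["ins"]))
        else:
            removed = m["ref"] or m["gone"]
            if reference[start:start + len(removed)] != removed:
                raise ValueError(f"reference mismatch for {token!r}")
            edits.append((start, start + len(removed), m["alt"] or ""))
    seq = reference
    # Work right to left so an edit never shifts the coordinates of the
    # edits that are still to be applied.
    for start, end, replacement in sorted(edits, reverse=True):
        seq = seq[:start] + replacement + seq[end:]
    return seq


# In a 12 bp reference the center is index 6, so indices 3 and 7 are
# offsets -3 and 1.
mutated = apply_variants("ATCGATCGATCG", ["-3G>A", "1G>A"])
```

Applying edits from the right is one simple way to handle length changes from insertions and deletions. Checking the REF bases against the reference catches an off-by-one in the position convention early.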

Format comparison¶

Format	Tokens per trajectory	Strengths	Weaknesses
Raw	High (full sequences)	Lossless, simple	~90% redundant tokens
VCF	Low (mutations only)	Compact, precise	Requires position arithmetic
EMD	Medium (mutations + context)	LLM-interpretable context	Larger than VCF
SDiff	Medium (diff hunks)	Familiar diff format	More verbose than VCF