Data Formats Reference

Quick reference for the data representations used in the evolutionary diffusion pipeline. The model is trained on evolutionary trajectories — sequences of mutations along phylogenetic branches — encoded in one of several formats.

All four formats below are under active evaluation in an ablation study comparing their effects on model performance. No single format has been settled on yet — the diffusion-language-model repo has parallel training configs for each.

Input: FASTA trajectories

All formats start from FASTA files produced by the trajectories repo. Each file holds an ordered series of DNA sequences sampled along one phylogenetic path.

Forwards trajectories (root-to-tip):

>NODE_0000|0|0
ATCGATCGATCG...
>NODE_0012|3|3
ATCAATCGATCG...
>TIP_NAME|5|2
ATCAATCAATCG...

Header format: >{node_name}|{cumulative_hamming}|{branch_hamming}

  • cumulative_hamming: total mutations from root
  • branch_hamming: mutations on this branch only
  • Gaps (-) and ambiguous bases (N) are ignored in distance calculations
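
As an illustration of the header fields and the distance rule above, here is a minimal Python sketch. The names TrajectoryHeader, parse_header, and hamming are hypothetical and are not taken from the trajectories repo. It assumes aligned, upper-case sequences and skips any column where either sequence has a gap or N.

```python
from typing import NamedTuple


class TrajectoryHeader(NamedTuple):
    node_name: str
    cumulative_hamming: int
    branch_hamming: int


def parse_header(line: str) -> TrajectoryHeader:
    """Split '>{node_name}|{cumulative_hamming}|{branch_hamming}' into fields."""
    # rsplit from the right so a '|' inside the node name (an assumption
    # about possible names) would not break the two numeric fields.
    name, cumulative, branch = line.lstrip(">").strip().rsplit("|", 2)
    return TrajectoryHeader(name, int(cumulative), int(branch))


def hamming(a: str, b: str, skip: str = "-N") -> int:
    """Count mismatches, ignoring columns where either base is a gap or N."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b) if x not in skip and y not in skip)


header = parse_header(">NODE_0012|3|3")
# One real mismatch (G vs A); the gap column is skipped.
distance = hamming("ATCGATCGATCG", "ATCAATC-ATCG")
```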

Pairwise trajectories (tip-to-tip):

>TipA|0|0
ATCGATCGATCG...
>TipB|5|5
ATCAATCAATCG...

Each file holds two tip sequences; the second header carries the Hamming distance between them in both fields. Pairwise data is ~9x more abundant than forwards data.

Training format: JSONL

All representations are converted to JSONL with a chat-style message structure for model training:

{"messages": [{"role": "user", "content": "evodiff_spike-xs"}, {"role": "assistant", "content": "<formatted trajectory>"}]}

The user content is a dataset label (prompt). The assistant content is the trajectory in one of the formats below.
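
A record of this shape can be produced with the standard json module. This is a small sketch; to_jsonl_line is a hypothetical helper name, not a function from the pipeline.

```python
import json


def to_jsonl_line(label: str, trajectory: str) -> str:
    """Build one JSONL line: dataset label as user turn, trajectory as assistant turn."""
    record = {
        "messages": [
            {"role": "user", "content": label},
            {"role": "assistant", "content": trajectory},
        ]
    }
    # json.dumps emits a single line, which is what JSONL requires.
    return json.dumps(record)


line = to_jsonl_line("evodiff_spike-xs", "0_ATCGATCGATCG|1_ATCAATCGATCG")
```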

Raw format

Full DNA sequences with generation numbers, separated by |:

0_ATCGATCGATCG|1_ATCAATCGATCG|2_ATCAATCAATCG

Format: {generation}_{sequence}|{generation}_{sequence}|...
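
The raw encoding can be written and read back in a few lines. The sketch below assumes generation numbers run consecutively from 0, as in the example; encode_raw and decode_raw are hypothetical names.

```python
def encode_raw(sequences: list[str]) -> str:
    """Join sequences as '{generation}_{sequence}' fields separated by '|'."""
    return "|".join(f"{gen}_{seq}" for gen, seq in enumerate(sequences))


def decode_raw(text: str) -> list[tuple[int, str]]:
    """Split a raw-format string back into (generation, sequence) pairs."""
    pairs = []
    for field in text.split("|"):
        gen, seq = field.split("_", 1)
        pairs.append((int(gen), seq))
    return pairs


encoded = encode_raw(["ATCGATCGATCG", "ATCAATCGATCG", "ATCAATCAATCG"])
decoded = decode_raw(encoded)
```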

Simple and lossless, but wasteful — successive sequences typically differ by only a handful of mutations, so ~90% of tokens are redundant copying.

Variants: fwd_raw (forwards), pw_raw (pairwise)

VCF format

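Reference sequence plus compact variant descriptions per generation. Only mutations are encoded, not the full sequence. Each generation lists its variants relative to the reference, so a mutation that persists is repeated in later generations (4G>A appears in both generations below).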

ATCGATCGATCG|1:4G>A|2:4G>A,8G>A

Format: {reference}|{gen}:{variants}|{gen}:{variants}|...

A generation identical to the previous one (no new mutations) is written as =:

ATCGATCGATCG|1:4G>A|2:=|3:4G>A,8G>A

Position convention

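Positions are center-relative: position 0 is the center of the reference sequence, negative positions are left of center, and positive positions are right of center. The 12 bp examples above do not follow this convention: the 4 and 8 in 4G>A and 8G>A are the 1-based absolute positions of the two mutated G bases.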

For a 1000 bp reference (center at position 500):

Absolute position    Center-relative
100                  -400
500                  0
700                  +200
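
The table rows follow from one subtraction. A minimal sketch, assuming the center is len(ref) // 2 as in the reconstruction steps later in this document (function names are hypothetical):

```python
def to_center_relative(abs_pos: int, ref_len: int) -> int:
    """Absolute position -> offset from the center of the reference."""
    return abs_pos - ref_len // 2


def to_absolute(offset: int, ref_len: int) -> int:
    """Center-relative offset -> absolute position."""
    return ref_len // 2 + offset


# Rows of the table above for a 1000 bp reference.
offsets = [to_center_relative(p, 1000) for p in (100, 500, 700)]
```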

Variant types

Type        Syntax        Example      Meaning
SNP         posREF>ALT    -400C>T      C→T at position -400
MNP         posREF>ALT    -5TT>GG      TT→GG starting at -5
Insertion   pos+ALT       -307+TTAC    Insert TTAC after position -307
Deletion    pos-REF       100-ATG      Delete ATG at position 100

Multiple variants in a generation are comma-separated.
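
The four syntaxes can be told apart with one regular expression. The sketch below is illustrative only and is not the pegasus-evals parser. Variant, parse_variant, and parse_generation are hypothetical names, and it assumes bases are limited to A, C, G, T and that a positive position may carry an optional leading +.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    kind: str  # "snp", "mnp", "insertion", or "deletion"
    pos: int
    ref: str
    alt: str


_VARIANT_RE = re.compile(
    r"(?P<pos>[+-]?\d+)"
    r"(?:(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)"  # SNP / MNP: posREF>ALT
    r"|\+(?P<ins>[ACGT]+)"                   # insertion: pos+ALT
    r"|-(?P<dele>[ACGT]+))"                  # deletion:  pos-REF
)


def parse_variant(token: str) -> Variant:
    m = _VARIANT_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"unrecognised variant: {token!r}")
    pos = int(m["pos"])
    if m["ins"] is not None:
        return Variant("insertion", pos, "", m["ins"])
    if m["dele"] is not None:
        return Variant("deletion", pos, m["dele"], "")
    kind = "snp" if len(m["ref"]) == 1 and len(m["alt"]) == 1 else "mnp"
    return Variant(kind, pos, m["ref"], m["alt"])


def parse_generation(field: str) -> tuple[int, Optional[list[Variant]]]:
    """Parse '{gen}:{variants}'; returns None for the variants of an '=' generation."""
    gen, _, body = field.partition(":")
    if body == "=":
        return int(gen), None
    return int(gen), [parse_variant(t) for t in body.split(",")]


examples = [parse_variant(t) for t in ("-400C>T", "-5TT>GG", "-307+TTAC", "100-ATG")]
```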

Variants: fwd_vcf (forwards), pw_vcf (pairwise)

EMD (embedding) format

Like VCF, but each mutation includes flanking sequence context on both sides. Context windows are adaptively sized (4–12 bp) to uniquely identify the mutation position.

ATCGATCGATCG|1:[ATCG]G>A[TCGA]|2:[ATCG]G>A[TCGA],[GATC]G>A[ATCG]

Format: {reference}|{gen}:[left_ctx]REF>ALT[right_ctx]|...

Insertions and deletions follow the same pattern:

Type        Example
SNP         [ATCG]G>A[TCGA]
Insertion   [ATCG]+TTAC[TCGA]
Deletion    [ATCG]-ATG[TCGA]

The context makes mutations more interpretable to the language model by providing surrounding sequence, analogous to how code diff tools show context lines.
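
One way to apply an EMD token is to search for left context + REF + right context and require exactly one hit. The sketch below assumes that approach; it is not the apply_emd_variants() implementation from pegasus-evals. apply_emd_token is a hypothetical name, and the reference used here is a made-up, non-repetitive sequence chosen so that the contexts are unique.

```python
import re

_EMD_RE = re.compile(
    r"\[(?P<left>[ACGT]*)\]"
    r"(?:(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)"  # substitution
    r"|\+(?P<ins>[ACGT]+)"                   # insertion
    r"|-(?P<dele>[ACGT]+))"                  # deletion
    r"\[(?P<right>[ACGT]*)\]"
)


def apply_emd_token(sequence: str, token: str) -> str:
    """Apply one '[left]...[right]' mutation located by its flanking context."""
    m = _EMD_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"unrecognised EMD token: {token!r}")
    if m["ins"] is not None:
        ref, alt = "", m["ins"]
    elif m["dele"] is not None:
        ref, alt = m["dele"], ""
    else:
        ref, alt = m["ref"], m["alt"]
    target = m["left"] + ref + m["right"]
    start = sequence.find(target)
    # A second hit (overlapping or not) means the context is ambiguous.
    if start == -1 or sequence.find(target, start + 1) != -1:
        raise ValueError("context does not identify a unique position")
    i = start + len(m["left"])
    return sequence[:i] + alt + sequence[i + len(ref):]


reference = "GGATCCTTAGCA"  # hypothetical example sequence
snp = apply_emd_token(reference, "[ATCC]T>G[TAGC]")
insertion = apply_emd_token(reference, "[ATCC]+AA[TTAG]")
deletion = apply_emd_token(reference, "[GATC]-CT[TAGC]")
```

On a repetitive sequence such as ATCGATCGATCG, short contexts match more than once and the function raises. This is why a real encoder would need to grow the window until the match is unique.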

Variants: fwd_emd (forwards), pw_emd (pairwise)

SDiff format

A unidiff-style representation with hunks, analogous to diff output for code:

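ATCGATCGATCG
@@1
 ATC
-G
+A
 ATCGATCG
@@2
 ATC
-G
+A
 ATC
-G
+A
 ATCG

  • @@{gen} header marks the generation
  • Space-prefixed lines are context (unchanged)
  • - prefixed lines are reference (removed)
  • + prefixed lines are alternative (added)

Within a generation, concatenating the context and - lines reproduces the reference; concatenating the context and + lines gives that generation's sequence.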
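
A hunk list can be replayed by accumulating two strings per generation: the reference side (context and - lines) and the mutated side (context and + lines). The sketch below assumes each generation's hunks cover the whole reference and are expressed relative to it. apply_sdiff is a hypothetical name, and the input string is built for this example so that its reference side matches the first line.

```python
def apply_sdiff(text: str) -> dict[int, str]:
    """Return {generation: reconstructed sequence} from an SDiff block."""
    reference, *rest = text.splitlines()
    old: dict[int, str] = {}  # context + removed lines, should equal the reference
    new: dict[int, str] = {}  # context + added lines, the mutated sequence
    gen = None
    for line in rest:
        if line.startswith("@@"):
            gen = int(line[2:])
            old[gen], new[gen] = "", ""
            continue
        if gen is None or not line:
            continue  # skip blank lines and anything before the first header
        tag, payload = line[0], line[1:]
        if tag not in " -+":
            raise ValueError(f"unexpected line prefix: {line!r}")
        if tag != "+":
            old[gen] += payload
        if tag != "-":
            new[gen] += payload
    for g, seq in old.items():
        if seq != reference:
            raise ValueError(f"generation {g} does not match the reference")
    return new


sdiff_text = "\n".join([
    "ATCGATCGATCG",
    "@@1", " ATC", "-G", "+A", " ATCGATCG",
    "@@2", " ATC", "-G", "+A", " ATC", "-G", "+A", " ATCG",
])
sequences = apply_sdiff(sdiff_text)
```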

Variants: fwd_sdiff (forwards), pw_sdiff (pairwise)

JSON-SDiff variant

A compact JSON encoding of the same hunk structure:

ATCGATCGATCG
@@1
{"h":[{"c":"ATC","r":"G","a":"A"},{"c":"ATCGATCG"}]}

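The h array lists the hunks in order. Each hunk is an object with c (context), r (reference), and a (alternative); a context-only hunk, such as the trailing one above, omits r and a.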
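
Reading one JSON-SDiff line takes only the standard json module. This sketch assumes that within a hunk the context precedes the change, as in the example; apply_json_hunks is a hypothetical name and the input line is constructed for illustration.

```python
import json


def apply_json_hunks(line: str) -> tuple[str, str]:
    """Return (reference side, mutated side) rebuilt from one JSON hunk line."""
    old, new = "", ""
    for hunk in json.loads(line)["h"]:
        context = hunk.get("c", "")
        old += context + hunk.get("r", "")  # missing keys count as empty
        new += context + hunk.get("a", "")
    return old, new


ref_side, alt_side = apply_json_hunks(
    '{"h":[{"c":"ATC","r":"G","a":"A"},{"c":"ATCGATCG"}]}'
)
```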

Applying variants to reconstruct sequences

To reconstruct a full sequence from VCF or EMD format, apply variants to the reference:

  1. Start with the reference sequence
  2. For each generation, apply all variants in order
  3. Insertions increase sequence length; deletions decrease it
  4. Center-relative positions must be converted to absolute: abs_pos = len(ref) // 2 + offset

The pegasus-evals format converters (pegasus_inference/format_converters.py) implement apply_vcf_variants() and apply_emd_variants() for this.
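
For orientation, the steps above can be sketched as follows. This is not the pegasus-evals code, and several choices are assumptions: variants are plain (kind, offset, ref, alt) tuples, offsets refer to reference coordinates with 0-based indexing, each generation is relative to the reference, and variants are applied right to left so that indels do not shift positions still to be processed. Overlapping variants are not handled.

```python
def apply_variants(reference: str, variants) -> str:
    """Apply (kind, offset, ref, alt) tuples with center-relative offsets."""
    center = len(reference) // 2
    seq = reference
    # Rightmost first: edits never move bases to the left of them.
    for kind, offset, ref, alt in sorted(variants, key=lambda v: v[1], reverse=True):
        start = center + offset  # abs_pos = len(ref) // 2 + offset
        if not 0 <= start < len(reference):
            raise ValueError(f"offset {offset} is outside the reference")
        if kind == "insertion":
            start += 1  # inserted bases go after the anchor position
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"reference mismatch at offset {offset}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


def reconstruct(reference: str, generations) -> dict:
    """generations: (gen, variants) pairs; variants is None for an '=' generation."""
    out = {}
    previous = reference
    for gen, variants in generations:
        if variants is not None:
            previous = apply_variants(reference, variants)
        out[gen] = previous  # '=' repeats the previous generation's sequence
    return out


trajectory = reconstruct(
    "ATCGATCGATCG",
    [
        (1, [("snp", -3, "G", "A")]),
        (2, None),
        (3, [("snp", -3, "G", "A"), ("snp", 1, "G", "A")]),
    ],
)
```

With a 12 bp reference the center is 6, so offsets -3 and 1 map to indices 3 and 7, the two mutated G bases of the raw-format example.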

Format comparison

Format   Tokens per trajectory          Strengths                   Weaknesses
Raw      High (full sequences)          Lossless, simple            ~90% redundant tokens
VCF      Low (mutations only)           Compact, precise            Requires position arithmetic
EMD      Medium (mutations + context)   LLM-interpretable context   Larger than VCF
SDiff    Medium (diff hunks)            Familiar diff format        More verbose than VCF