arxiv: 2603.11703 · v2 · submitted 2026-03-12 · 💻 cs.LG

Recognition: no theorem link

EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering

Nicolas Deutschmann , Constance Ferragu , Jonathan D. Ziegler , Shayan Aziznejad , Eli Bixby

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords protein engineeringflow matchingedit operationssequence generationmutational trajectoriesprotein variantsevolutionary modeling

0 comments

The pith

EvoFlows models protein engineering as edit flows between evolutionarily related sequences to generate variants with controllable insertions, deletions and substitutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoFlows as a variable-length sequence-to-sequence model that learns mutational trajectories from pairs of evolutionarily related proteins. It treats the process of turning one sequence into another as a continuous flow of edit operations, so the model can decide both where to change the template and what kind of change to make. This removes the need for pre-specified mutation sites that limit masked or diffusion models. A sympathetic reader would care because current protein language models either generate entire sequences from scratch or cannot naturally handle insertions and deletions, restricting their use in optimization tasks such as improving enzyme activity or stability.

Core claim

EvoFlows learns mutational trajectories between evolutionarily related protein sequences via edit flows, allowing it to perform a controllable number of insertions, deletions, and substitutions on a template sequence while generating variants that remain consistent with natural protein families.

What carries the argument

Edit flows, which represent the learned continuous trajectories of edit operations (insertions, deletions, substitutions) that transform one evolutionarily related sequence into another.

If this is right

The model can generate sequences whose length differs from the template without requiring pre-chosen edit positions.
Generated variants remain statistically consistent with natural families drawn from UniRef and OAS while lying farther from the starting sequence than outputs from leading baselines.
Both the type and the location of each mutation are predicted jointly rather than in separate steps.
The same framework supports optimization tasks that require variable numbers of changes, such as directed evolution campaigns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be paired with structure prediction or physics-based scoring to rank variants for specific functional targets.
If the in-silico consistency metrics correlate with experimental success, the method could reduce the number of sequences that must be synthesized and tested.
Similar edit-flow formulations might apply to other variable-length biological sequences such as RNA or antibody regions when aligned evolutionary data are available.

Load-bearing premise

That mutational trajectories learned from evolutionarily related sequences will produce useful engineered proteins that generalize beyond natural variation.

What would settle it

Synthesizing and assaying EvoFlows-generated variants in the laboratory to measure whether they retain or improve function compared with the template and with variants from autoregressive or masked-language baselines.

Figures

Figures reproduced from arXiv: 2603.11703 by Constance Ferragu, Eli Bixby, Jonathan D. Ziegler, Nicolas Deutschmann, Shayan Aziznejad.

**Figure 1.** Figure 1: Overview of EvoFlows. Edit process on two sequences from a set of homologs. Given the outsized impact of optimization in pre-clinical drug development pipelines (Paul et al., 2010), designing fit-for-purpose models is highly desirable. Before being conditioned on a specific task (Gruver et al., 2023; Widatalla et al., 2024), such a model should first generate function-preserving variants of a protein for u… view at source ↗

**Figure 2.** Figure 2: Precision-recall trade-offs under clock normalization. Precision-recall curves for each mutation type and no-ops on the deterministic benchmark, which shows how clock normalization controls the trade-off between recall and precision. While no-op predictions remain stable, insertion, substitution, and deletion exhibit systematic precision-recall shifts as the clock normalization varies, highlighting its rol… view at source ↗

**Figure 3.** Figure 3: Comparison of EvoFlows and baseline methods across evaluation metrics. Each [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Performance on the deterministic benchmark.: (Left) F1-score for each mutation type as a function of sequence length. Performance remains stable across sequence lengths, indicating that mutation prediction accuracy does not degrade for longer sequences. (Right) Confusion matrix showing the distribution of predicted mutation types versus ground-truth mutations (no-op, insertion, substitution, deletion). The… view at source ↗

**Figure 5.** Figure 5: Per-dataset comparison of EvoFlows and baseline methods. Comparison across multiple datasets (columns) and evaluation metrics (rows). The random transport model produces the most different sequences from the starting sequences as a function of the data. EvoFlows generates variants that remain close to the holdout distribution. Random baseline performs worst across all metrics. Evo-tuning without forced mut… view at source ↗

**Figure 7.** Figure 7: Per-position amino acid frequency heatmaps for all seed protein families. Each row shows one dataset, with EvoFlows-generated sequences (left) and holdout sequences (right). Color intensity indicates per-position amino acid frequency after alignment. Conserved positions appear as points of high intensity, while variable regions show more diffuse patterns. The similar frequency profiles between generated a… view at source ↗

**Figure 6.** Figure 6: Cross-dataset metric trends. Each subplot shows one evaluation metric, with lines connecting per-dataset values for each method: Random pairing, EvoFlows (ours), Evotuning, Evo-tuning with forced mutations, and random baseline. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

read the original abstract

We introduce EvoFlows, a variable-length protein sequence-to-sequence modeling approach designed for protein engineering. Existing protein language models are poorly suited for optimization tasks: autoregressive models require full sequence generation, masked language and discrete diffusion models rely on pre-specified mutation locations, and no existing methods naturally support insertions and deletions relative to a template sequence. EvoFlows learns mutational trajectories between evolutionarily related protein sequences via edit flows, allowing it to perform a controllable number of mutations (insertions, deletions, and substitutions) on a template sequence, predicting not only _which_ mutation to perform, but also _where_ it should occur. Through extensive _in silico_ evaluation on diverse protein families from UniRef and OAS, we show that EvoFlows generates variants that remain consistent with natural protein families while exploring farther from template sequences than leading baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvoFlows introduces edit-based flow-matching to handle insertions, deletions, and substitutions from protein templates, but the abstract provides no metrics or functional validation to support the superiority claims.

read the letter

The main point is that EvoFlows trains flow-matching models on evolutionary sequence pairs to generate variable-length edits on a template, predicting both the mutation type and its position. This directly targets a limitation in autoregressive, masked, and diffusion protein models that either regenerate everything or require pre-specified change locations. The approach is presented as new for controllable protein engineering tasks. It does a reasonable job framing why existing methods fall short for optimization and shows in silico results on UniRef and OAS families where the outputs stay consistent with natural variation while moving farther from the starting sequence than baselines. That framing and the edit-flow idea are the clearest contributions. The soft spots are the evaluation details. The abstract mentions extensive in silico tests and superiority but supplies no numbers, no baseline specifics, no statistical tests, and no description of how consistency or distance were measured. Without those, the central claim cannot be checked. The stress-test note is on target: in silico family likelihood and distance metrics do not demonstrate that the variants would be functional or stable outside natural sequences, and there is no wet-lab or orthogonal assay data to close that gap. The assumption that evolutionary trajectories will generalize to useful engineering edits remains untested here. This paper is for computational biologists and ML researchers working on generative models for protein sequences. A reader interested in new ways to model edits would find the method description worth examining. It deserves peer review because the core idea addresses a real modeling gap and engages the literature on prior limitations, even though the current evidence is preliminary and would need stronger results and reproducibility checks from referees.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EvoFlows, a variable-length sequence-to-sequence model for protein engineering based on edit flows that learns mutational trajectories (insertions, deletions, substitutions) between evolutionarily related sequences from datasets like UniRef and OAS. Unlike autoregressive PLMs, masked LMs, or discrete diffusion models, it supports controllable edits without pre-specifying locations. The central claim is that extensive in silico evaluations demonstrate generated variants remain consistent with natural protein families while achieving greater distance from template sequences than leading baselines.

Significance. If the in silico consistency and distance metrics hold and correlate with functional properties, EvoFlows could address key limitations in existing protein language models for optimization tasks by enabling natural handling of indels and mutation placement. This would represent a methodological advance in controllable sequence generation for directed evolution. The paper's strength lies in its focus on evolutionary edit trajectories, but the lack of wet-lab validation or orthogonal functional assays limits claims about engineering utility beyond natural variation.

major comments (2)

[Evaluation section] Evaluation section (and abstract): the claim of 'extensive in silico evaluation' and superiority over baselines is load-bearing for the central result, yet no specific quantitative metrics (e.g., mean edit distances, family likelihood or MSA scores, R² values, or statistical tests such as p-values or confidence intervals) are reported. Without these, including details on baseline implementations, data splits, and exclusion rules, the assertion that EvoFlows explores farther while remaining consistent cannot be assessed.
[§3] §3 (model formulation): the edit-flow trajectories are learned from evolutionary pairs, but the manuscript does not address whether the resulting variants generalize to functional improvements outside observed natural variation. The chosen consistency metrics (family likelihood) and distance measures may simply recover non-functional sequences at higher edit distance; a concrete test or ablation on held-out functional data would strengthen this link.

minor comments (2)

[Abstract] Abstract: 'leading baselines' are referenced without naming the specific methods (e.g., autoregressive, diffusion, or masked models) or their implementations; add this for reproducibility.
[§2] Notation in the edit-flow equations: clarify how variable-length handling is achieved in the flow-matching objective to avoid ambiguity with standard diffusion formulations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and indicate the revisions that will be incorporated in the next version.

read point-by-point responses

Referee: [Evaluation section] Evaluation section (and abstract): the claim of 'extensive in silico evaluation' and superiority over baselines is load-bearing for the central result, yet no specific quantitative metrics (e.g., mean edit distances, family likelihood or MSA scores, R² values, or statistical tests such as p-values or confidence intervals) are reported. Without these, including details on baseline implementations, data splits, and exclusion rules, the assertion that EvoFlows explores farther while remaining consistent cannot be assessed.

Authors: We agree that explicit numerical reporting is necessary to substantiate the claims. In the revised manuscript we will add a new results table that reports mean edit distances (with standard deviations and 95% confidence intervals), average family likelihood under an independent MSA model, and MSA consistency scores for EvoFlows versus all baselines. We will also include p-values from paired Wilcoxon signed-rank tests and report R² values where regression analyses appear. The methods section will be expanded with complete baseline implementation details (hyperparameters, adaptation for variable-length generation), the precise train/test splits (UniRef cluster-based partitioning to prevent leakage), and all sequence exclusion rules (length, identity, and quality filters). These additions will make the quantitative comparisons fully reproducible and assessable. revision: yes
Referee: [§3] §3 (model formulation): the edit-flow trajectories are learned from evolutionary pairs, but the manuscript does not address whether the resulting variants generalize to functional improvements outside observed natural variation. The chosen consistency metrics (family likelihood) and distance measures may simply recover non-functional sequences at higher edit distance; a concrete test or ablation on held-out functional data would strengthen this link.

Authors: The manuscript uses consistency with natural evolutionary families as a standard computational proxy for plausibility, which is the appropriate scope for an in silico method. We acknowledge that an explicit link to functional outcomes would be valuable. In revision we will add an ablation study on held-out functional benchmarks (e.g., ProteinGym fitness datasets). Variants will be generated at controlled edit distances and evaluated for correlation between our consistency metrics and predicted fitness; we will report whether higher-distance EvoFlows sequences maintain or improve functional scores relative to baselines. This directly tests whether the distance-consistency tradeoff recovers non-functional sequences. revision: partial

Circularity Check

0 steps flagged

No circularity: new edit-flow model trained on evolutionary pairs with external in silico benchmarks

full rationale

The paper defines EvoFlows as a sequence-to-sequence edit-flow model that learns mutational trajectories directly from pairs of evolutionarily related sequences. Generation proceeds by applying learned edit operations to a template, with the number and type of edits controlled at inference. Evaluation metrics (family consistency via likelihood or MSA scores on UniRef/OAS, edit distance from template) are computed post-generation against held-out or external sequence sets and compared to independent baselines. No equation reduces a claimed prediction to a fitted parameter by algebraic identity, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The central claim therefore rests on empirical comparison rather than definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are detailed. Assumes evolutionary relatedness provides useful training signal for engineering edits.

axioms (1)

domain assumption Evolutionary relationships between protein sequences provide a reliable source of mutational trajectories for learning edit operations
Core premise for training on related sequences from UniRef and OAS

pith-pipeline@v0.9.0 · 5449 in / 1213 out tokens · 37300 ms · 2026-05-15T12:51:04.486857+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tree-Conditioned Edit Flows for Ancestral Sequence Reconstruction
q-bio.QM 2026-05 unverdicted novelty 6.0

A new tree-conditioned edit-flow model for ancestral sequence reconstruction achieves reasonable accuracy on substitution-only evolved sequences and superior localization of changes on natural indel-rich sequences.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper

[1]

Information on EC 2.6.1.51 - serine-pyruvate transaminase, 2025

https://doi.org/10.1093/bioinformatics/btac020 BRENDA Enzyme Database. Information on EC 2.6.1.51 - serine-pyruvate transaminase, 2025. Chen, A., Stanton, S. D., Alberstein, R. G., Watkins, A. M., Bonneau, R., Gligorijevic, V., Cho, K., and Frey, N. C. LLMs Are Highly-Constrained Biophysical Sequence Optimizers. NeurIPS 2024 Workshop on AI for New Drug Mo...

work page doi:10.1093/bioinformatics/btac020 2025
[2]

https://openreview.net/forum?id=Lm8T39vLDTE Jr, T. F. T., and Bepler, T. Understanding protein function with a multimodal retrieval-aug- mented foundation model. The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025. https://openreview.net/forum?id=fKerD2AQai Koudelakova, T., Bidmanova, S., Dvorak, P., Pavelka, A., Chaloupkova,...

work page doi:10.1002/biot.201100486 2025
[3]

https://doi.org/10.1093/bioinformatics/btac353 Leslie, C., Eskin, E., and Noble, W. S. The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput, 564–575, 2002. Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., and others. Evolutionary-scale prediction of atomic-level pr...

work page doi:10.1093/bioinformatics/btac353 2002
[4]

FGF2 - Fibroblast growth factor 2 - Gallus gallus (Chicken), 2025

https://doi.org/10.1093/bioinformatics/btm098 UniProt Consortium. FGF2 - Fibroblast growth factor 2 - Gallus gallus (Chicken), 2025. Verkuil, R., Kabeli, O., Du, Y., Wicky, B. I. M., Milles, L. F., Dauparas, J., Baker, D., Ovchinnikov, S., Sercu, T., and Rives, A. Language models generalize beyond natural proteins. Biorxiv, 2022. https://doi.org/10.1101/2...

work page doi:10.1093/bioinformatics/btm098 2025
[5]

These frequencies reflect the natural abundance of amino acids in protein databases and have been empirically validated across diverse protein families

as biologically informed priors. These frequencies reflect the natural abundance of amino acids in protein databases and have been empirically validated across diverse protein families. The smoothed probability for amino acid 𝑖 is calculated as: 𝑝𝑎,𝛼 = 𝑥𝑎 + 𝛼𝜇𝑎 𝑁 + 𝛼 ⋅ 𝑑. (24) where 𝑥𝑎 is the observed count of amino acid 𝑎, 𝑁 is the total number of observ...

work page
[6]

Ensures biologically plausible probability estimates even with limited data

work page
[7]

Weights pseudo-counts according to amino acid abundance in natural proteins rather than treating all amino acids equally

work page
[8]

Provides a Bayesian interpretation with Dirichlet priors informed by empirical protein evolution 16 Foundation Models for Science: Real-World Impact and Science-First Design, ICLR 2026

work page 2026
[9]

Reduces the variance of frequency estimates while introducing minimal bias, as the smoothed probabilities converge to maximum likelihood estimates as dataset size increases This approach is particularly valuable when comparing generated sequences to natural sequences, as it ensures that distance measurements remain finite and meaningful even when datasets...

work page 2012
[10]

It is efficient to compute using sparse representations, avoiding the need to explicitly enumerate all |𝒜︀|𝑘 possible 𝑘-mers

work page
[11]

It makes no assumptions about the data distribution, unlike learned embeddings

work page
[12]

It captures local sequence composition

work page
[13]

up” and “down

Its simplicity ensures robustness when evaluating artificial sequences (Kucera et al., 2022). C Protein Types and Datasets We use the following seed proteins to construct homolog datasets via iterative profile search. Anti-SARS-CoV-2 VHH (Ty1) Ty1 is an alpaca-derived single-domain antibody (nanobody) that targets the receptor- binding domain (RBD) of the...

work page arXiv 2022