Two-Stage Fine-Tuning for Protein Sequence Generation with Targeted Amino-Acid Composition

Anna María Díaz-Rovira; Bertran Miquel-Oliver; Isaac Filella-Merce; Rubén Muñoz-Tafalla; Víctor Guallar; Violeta Basten-Romero

arxiv: 2606.27939 · v1 · pith:DKBEQ3MRnew · submitted 2026-06-26 · 💻 cs.LG · cs.AI · q-bio.BM · q-bio.GN

Pith reviewed 2026-06-29 05:17 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.BM · q-bio.GN

keywords: protein sequence generation, amino-acid composition, fine-tuning, reinforcement learning, constrained generation, protein language models, synthetic protein design

The pith

A two-stage fine-tuning process generates protein sequences matching target amino-acid compositions while preserving sequence quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Protein language models need steering toward explicit design targets such as specific amino-acid composition profiles, which matter for applications like synthetic feed proteins whose nutritional value depends directly on that composition. The paper proposes a pipeline that first applies domain-adaptive fine-tuning on an in-domain protein dataset to shift the average composition toward the target. It then adds an iterative reinforcement-learning stage that uses a composition reward anchored to the frozen fine-tuned model to enforce tighter constraints that fine-tuning alone cannot meet. Experiments on two target compositions show the full pipeline achieves the alignment. Separate evaluations isolate each stage's contribution and confirm that sequence quality and diversity remain intact.

Core claim

The paper claims that domain-adaptive fine-tuning on relevant protein data brings generated sequences' average amino-acid composition close to a chosen target, while the subsequent reinforcement-learning stage with a composition-based reward enforces specific sequence-level constraints that fine-tuning cannot satisfy, and that the combined process achieves the target alignment without degrading sequence quality or diversity.
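
The core claim separates the pool-average composition from per-sequence composition. A minimal sketch of that separation follows; the function names and the toy two-residue target are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> list[float]:
    """Fraction of each standard amino acid in a single sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def pool_mean(seqs: list[str]) -> list[float]:
    """Average the per-sequence compositions over a pool of sequences."""
    comps = [composition(s) for s in seqs]
    return [sum(col) / len(comps) for col in zip(*comps)]


def max_abs_dev(p: list[float], q: list[float]) -> float:
    """Largest absolute per-residue gap between two compositions."""
    return max(abs(a - b) for a, b in zip(p, q))


# Hypothetical toy target: half alanine, half glycine.
target = [0.5 if a in "AG" else 0.0 for a in AMINO_ACIDS]
pool = ["AAAA", "GGGG"]

pool_gap = max_abs_dev(pool_mean(pool), target)
per_seq_gaps = [max_abs_dev(composition(s), target) for s in pool]
```

In this toy pool the average matches the target exactly, yet every individual sequence misses it by 0.5. That is the kind of gap the paper attributes to fine-tuning alone and assigns to the RL stage.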

What carries the argument

The two-stage pipeline of domain-adaptive fine-tuning followed by iterative reward-weighted fine-tuning via reinforcement learning anchored to the fine-tuned model as a frozen reference.
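
A rough sketch of how reward-weighted fine-tuning is commonly set up, in the spirit of reward-weighted regression. Everything here is an assumption for illustration: the exponential weighting, the temperature `beta`, and especially the log-ratio penalty used to express anchoring to a frozen reference. The paper's actual reward and anchoring mechanism are not visible in this summary.

```python
import math


def reward_weights(rewards: list[float], beta: float) -> list[float]:
    """Normalized weights w_i proportional to exp(beta * r_i), computed stably."""
    m = max(rewards)
    exps = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]


def anchored_reward(comp_reward: float, logp_policy: float,
                    logp_ref: float, lam: float) -> float:
    """Hypothetical shaped reward: composition reward minus a penalty on the
    log-likelihood ratio between the current policy and the frozen reference."""
    return comp_reward - lam * (logp_policy - logp_ref)


def weighted_nll(nlls: list[float], weights: list[float]) -> float:
    """Reward-weighted negative log-likelihood over sampled sequences."""
    return sum(w * n for w, n in zip(weights, nlls))


# One illustrative iteration on three sampled sequences (made-up numbers).
rewards = [0.2, 0.8, 0.5]
weights = reward_weights(rewards, beta=5.0)
loss = weighted_nll([2.0, 1.0, 1.5], weights)
```

Sequences with higher reward receive more weight in the next supervised update, and setting `beta` to zero recovers plain, unweighted fine-tuning.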

If this is right

Fine-tuning alone moves the average amino-acid composition near the target.
Reinforcement learning supplies the additional precision needed for exact constraints.
The full pipeline reaches the target composition without reducing sequence quality.
Design choices for the composition reward can be tested against baselines and ablations.
The separate effects of the fine-tuning and reinforcement-learning stages can be measured.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged approach could be tested on other sequence constraints such as secondary-structure targets or functional motifs.
Anchoring the reinforcement-learning stage to a frozen fine-tuned model may generalize to multi-objective protein design tasks.
The method could reduce trial-and-error in applications where amino-acid balance directly controls a measurable property such as stability or binding.
Scaling the pipeline to larger protein language models would test whether the same separation of average shift and precise enforcement still holds.

Load-bearing premise

The composition reward used in the reinforcement-learning stage produces sequences that meet the exact target without introducing unmeasured losses in diversity or plausibility.

What would settle it

A direct comparison showing that sequences after the reinforcement-learning stage have markedly lower diversity or plausibility scores than sequences from the fine-tuning stage alone would falsify the no-degradation claim.
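
A comparison of this kind could be sketched with two simple diversity proxies. These proxies (unique fraction and mean pairwise similarity from Python's `difflib`) are stand-ins chosen for illustration; they are not the diversity or plausibility metrics used in the paper, and the two pools below are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def unique_fraction(seqs: list[str]) -> float:
    """Share of distinct sequences in a pool."""
    return len(set(seqs)) / len(seqs)


def mean_pairwise_similarity(seqs: list[str]) -> float:
    """Average SequenceMatcher ratio over all unordered pairs (1.0 = identical)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(SequenceMatcher(None, a, b, autojunk=False).ratio()
                for a, b in pairs)
    return total / len(pairs)


# Hypothetical pools: one sampled after fine-tuning only, one after RL.
ft_pool = ["MKTAYIAK", "GSHMLEDP", "MSTNPKPQ"]
rl_pool = ["MKTAYIAK", "MKTAYIAK", "MKTAYIAR"]

ft_similarity = mean_pairwise_similarity(ft_pool)
rl_similarity = mean_pairwise_similarity(rl_pool)
```

A markedly higher similarity or lower unique fraction in the RL pool would point toward collapsed diversity, which is the outcome that would count against the no-degradation claim.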

Figures

Figures reproduced from arXiv: 2606.27939 by Anna María Díaz-Rovira, Bertran Miquel-Oliver, Isaac Filella-Merce, Rubén Muñoz-Tafalla, Víctor Guallar, Violeta Basten-Romero.

**Figure 1.** Training dynamics on qA. Each curve is the median across the top six seeds per composition term variant, with the IQR shaded. (Top) Mean JSD against the target across iterations. (Bottom) Composition score evaluated at a fixed reference temperature βref=20 using the differentiated composition term for all four variants. [Truncated body text captured with the caption: "… different model family). Mean NetSolP solubility is 0.582±0.030 on qA and 0.628±…"]
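
Figure 1 tracks mean JSD against the target and summarizes seeds by median and IQR. A minimal sketch of both quantities follows, assuming a base-2 logarithm (so JSD lies in [0, 1]) and assuming the mean is taken over per-sequence compositions; the paper's exact conventions are not visible in this summary.

```python
import math
import statistics


def kl(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; zero-probability terms of p are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_jsd(compositions: list[list[float]], target: list[float]) -> float:
    """Average JSD of each sequence's composition against the target."""
    return statistics.fmean(jsd(c, target) for c in compositions)


def median_iqr(values: list[float]) -> tuple[float, tuple[float, float]]:
    """Median and interquartile range, e.g. across the selected seeds."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), (q1, q3)
```

Because the mixture `m` is positive wherever `p` or `q` is positive, the divergence stays finite even when a residue is absent from one of the two compositions.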

**Figure 2.** Per-residue calibration on qA (residues anonymized as aa1, …, aa20, sorted by descending q). Counts are pool-mean frequencies p_i rescaled to a common reference length L=292 AA (the rounded mean sequence length of the best-RL pool). (Top) Target counts (blue) vs. best-RL counts (turquoise); Domain-adaptive FT and base ProtGPT2 are overlaid as dashed and dotted lines. (Middle) Signed count residual (ob… [caption truncated in extraction]
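
The rescaling described in the Figure 2 caption can be sketched as follows. The reference length of 292 comes from the caption; reading the truncated "signed count residual" as observed minus target is an assumption, and the three-residue vectors are made up for illustration.

```python
REF_LEN = 292  # reference length L from the Figure 2 caption


def to_counts(freqs: list[float], ref_len: int = REF_LEN) -> list[float]:
    """Rescale pool-mean frequencies p_i to expected counts at a common length."""
    return [f * ref_len for f in freqs]


def signed_residual(observed: list[float], target: list[float],
                    ref_len: int = REF_LEN) -> list[float]:
    """Assumed definition: (observed - target) frequency, expressed in counts."""
    return [(o - t) * ref_len for o, t in zip(observed, target)]


def order_by_target(target: list[float]) -> list[int]:
    """Residue indices sorted by descending target frequency q."""
    return sorted(range(len(target)), key=lambda i: -target[i])


# Hypothetical three-residue example.
q = [0.25, 0.5, 0.25]
p = [0.5, 0.25, 0.25]
residual = signed_residual(p, q, ref_len=100)
order = order_by_target(q)
```

Expressing residuals in counts at a fixed length makes the per-residue error comparable across pools whose sequences differ in mean length.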

**Figure 3.** Per-variant means with seed-level 95% bootstrap confidence intervals on JSD, composition score (re-scored at fixed βref=20), and pool tolerance (qA, n=30 seeds per variant). Dotted line: base ProtGPT2; dashed line: domain-adaptive FT (no RL). [Plot residue: categories differentiated, symmetric, cosine, global-deviation; axis label "Composition score @ β=20".]
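
Figure 3 reports seed-level 95% bootstrap confidence intervals. A minimal sketch of a percentile bootstrap over seeds follows; the resample count, the percentile method, and the example scores are assumptions, since the paper's exact procedure is not visible in this summary.

```python
import random
import statistics


def bootstrap_ci(values: list[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Mean of per-seed values with a percentile bootstrap (1 - alpha) interval."""
    rng = random.Random(seed)
    n = len(values)
    # Resample seeds with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(values, k=n))
                   for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(values), lo, hi


# Hypothetical per-seed composition scores for one variant.
scores = [0.61, 0.58, 0.66, 0.70, 0.55, 0.63, 0.59, 0.68]
mean, lo, hi = bootstrap_ci(scores)
```

Resampling at the seed level, rather than at the sequence level, treats each training run as the unit of variation, which matches how the caption describes the intervals.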

**Figure 4.** Seed variance per composition term variant on the composition score (left, at βref=20) and the tolerance count N±30 (right) on qA, n=30 seeds per variant. Boxes summarize the seed distribution; jittered points show individual seeds.

**Figure 5.** Summarizes the per-iteration composition score (re-scored at fixed βref = 20) of sequences sampled by the two decoders, aggregated across the top 6 differentiated composition seeds on qA (matching the seed selection used in … [caption truncated in extraction]

Original abstract

Protein language models are standard priors for biological sequence generation, but steering them toward explicit distributional design targets remains largely unexplored. We study a constrained protein generation problem in which sequences must match a desired amino-acid (AA) composition profile while preserving plausible sequence statistics and diversity. The motivating application is synthetic feed protein design, where the AA composition of dietary proteins directly determines their nutritional value. We propose a two-stage pipeline in which domain-adaptive fine-tuning (FT) on an in-domain protein dataset is followed by iterative reward-weighted FT via reinforcement learning (RL) anchored against the FT model as a frozen reference. We evaluate the pipeline on two AA compositions and find that FT brings the average composition close to the target, while the subsequent RL enforces specific sequence constraints that FT alone cannot satisfy. We additionally evaluate the design choices of the proposed composition reward term against two baselines and an ablated variant, isolate the contribution of each training stage, and verify that AA composition alignment is achieved without degrading sequence quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The two-stage FT-then-RL pipeline gets AA composition targets right on the tested cases while keeping sequence quality intact, but the work stays narrow and incremental.

The key point is that domain-adaptive fine-tuning moves the average amino-acid composition toward the target, and the anchored RL stage then enforces the per-sequence constraints that fine-tuning alone leaves unsatisfied. The ablations isolate each stage's contribution and compare the reward term to baselines, which makes the empirical case clearer than a single end-to-end run would.

The paper does a decent job documenting that composition alignment does not come at the expense of diversity or plausibility metrics. The stress-test note indicates the full manuscript supplies the before-and-after statistics and reward variants, so the central claim holds up on its own terms rather than relying on circular fitting.

The main limitation is scope. Only two compositions are shown, both tied to the synthetic feed-protein use case. That keeps the result practical for that niche but leaves open how well the same recipe would transfer to other constraints or larger protein families. No broader benchmarks appear, so the advance is best read as a targeted engineering fix rather than a general method for steering protein LMs.

This is useful for groups already working on constrained sequence design in applied biology settings. A reader who needs explicit composition control and is willing to run the two-stage process would pick up concrete implementation details. It is not essential reading for people focused on scaling laws or new architectures.

I would send it to peer review. The experiments are grounded enough that referees can check the numbers and ablations directly, even if revisions are needed to tighten the claims around generalizability.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a two-stage pipeline for generating protein sequences matching targeted amino-acid compositions: domain-adaptive fine-tuning (FT) on an in-domain dataset followed by iterative reward-weighted fine-tuning via reinforcement learning (RL) anchored to the frozen FT model. On two evaluated compositions, FT aligns average composition to the target while RL enforces per-sequence constraints that FT alone cannot meet; ablations isolate stage contributions, compare the composition reward against baselines, and confirm that alignment occurs without degrading sequence quality, diversity, or plausibility.

Significance. If the reported metrics and ablations hold, the work supplies a practical, empirically validated method for compositional control in protein language model generation. This is relevant to synthetic biology applications such as nutritional feed protein design. The explicit separation of average versus per-sequence effects and the verification against quality degradation constitute a clear incremental contribution to constrained sequence generation.

Minor comments (3)

[Abstract] The summary of results is entirely qualitative; a single sentence with the key quantitative improvements (e.g., composition error reduction, diversity/perplexity deltas) would allow readers to gauge effect sizes immediately.
[Evaluation] §4 (or equivalent evaluation section): confirm that all reported means are accompanied by standard deviations or confidence intervals across the multiple runs or sequences; the current description of “before/after statistics” should explicitly state the number of sequences sampled per condition.
[Notation] The reward function definition should be given a numbered equation so that later ablations can refer to it unambiguously (e.g., “the full reward in Eq. (3) versus the ablated variant without the KL term”).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical two-stage training pipeline (domain-adaptive FT followed by RL) evaluated on held-out AA compositions with explicit ablations, reward variants, diversity/perplexity/plausibility metrics, and stage-isolation experiments. No mathematical derivations, equations, or uniqueness theorems are present that reduce any claimed result to fitted parameters or self-citations by construction. All load-bearing claims rest on external experimental outcomes rather than internal redefinitions or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The reward term and RL anchoring are treated as design choices whose effectiveness is asserted empirically.

pith-pipeline@v0.9.1-grok · 5753 in / 1014 out tokens · 22598 ms · 2026-06-29T05:17:28.906565+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages · 1 internal anchor

[1] Cao, H., Torres, M. D. T., Zhang, J., Gao, Z., Wu, F., Gu, C., Leskovec, J., Choi, Y., de la Fuente-Nunez, C., Chen, G., and Heng, P.-A. A deep reinforcement learning platform for antibiotic discovery. bioRxiv, 2025. doi: 10.1101/2025.09.23.678086. Preprint.

[2] Emmert, J. L. and Baker, D. H. Use of the ideal protein concept for precision formulation of amino acid levels in broiler diets. Journal of Applied Poultry Research, 6(4):462–470, 1997. doi: 10.1093/japr/6.4.462.

[3] Ferruz, N., Schmidt, S., and Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13:4348, 2022.

[4] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pp. 8342–8360. Association for Computational Linguistics, 2020.

[5] Hesslow, D., Zanichelli, N., Notin, P., Poli, I., and Marks, D. RITA: A study on scaling up generative protein sequence models. arXiv preprint arXiv:2205.05789, 2022.

[6] Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N., and Madani, A. ProGen2: Exploring the boundaries of protein language models. Cell Systems, 14(11):968–978.e3, 2023. doi: 10.1016/j.cels.2023.10.002.

[7] Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

[8] Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 745–750. ACM, 2007.

[9] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.

[10] Stocco, F., Artigues-Lleixà, M., Hunklinger, A., Widatalla, T., Güell, M., and Ferruz, N. Guiding generative protein language models with reinforcement learning. arXiv preprint arXiv:2412.12979, 2024. Preprint.

[11] Subramanian, J., Sujit, S., Irtisam, N., Sain, U., Islam, R., Nowrouzezahrai, D., and Ebrahimi Kahou, S. Reinforcement learning for sequence design leveraging protein language models. arXiv preprint arXiv:2407.03154, 2024. Preprint.

[12] The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1):D523–D531, 2023. doi: 10.1093/nar/gkac1052.

[13] Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Nielsen, H., and Winther, O. NetSolP: Predicting protein solubility in Escherichia coli using language models. Bioinformatics, 38(4):941–946, 2022. doi: 10.1093/bioinformatics/btab801.

[14] Widatalla, T., Rafailov, R., and Hie, B. Aligning protein generative models with experimental fitness via direct preference optimization. bioRxiv, 2024. doi: 10.1101/2024.05.20.595026. Preprint.

Identifiers captured without their reference entries: doi: 10.3390/ani12070935; doi: 10.1101/2024.05.03.592223 (Preprint); fragment "11.14.623630".

A. Reproducibility details

This appendix consolidates the practical settings used to produce every number in the main text. All runs used a single NVIDIA H100 GPU and a single shared conda environment. Dataset constructi...