
arxiv: 2605.08659 · v3 · pith:VR75DMQ7new · submitted 2026-05-09 · 💻 cs.CE · q-bio.BM

Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization

Pith reviewed 2026-05-20 23:27 UTC · model grok-4.3

classification 💻 cs.CE q-bio.BM
keywords Supergroup Relative Policy Optimization · biomolecular generators · utility diversity tradeoff · reinforcement learning · de novo molecular design · protein design · Pareto frontier

The pith

Supergroup Relative Policy Optimization uses set-level diversity rewards to expand the utility-diversity Pareto frontier in biomolecular generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Biomolecular generators often improve utility for specific tasks at the cost of diversity, concentrating on narrow families of molecules or proteins. The paper presents Supergroup Relative Policy Optimization as a way to incorporate set-level diversity directly into the reward signal during optimization. For each condition, it samples multiple candidate sets as a supergroup, evaluates their relative diversity, and allocates the diversity reward back to individual samples through leave-one-out contributions that are added to their utility scores. The framework is decoupled from any particular generator, utility reward, or diversity metric, and is tested on small-molecule and protein design tasks. Across decoding sweeps, the reported results show better frontier-level utility-diversity metrics than pretrained generators, GRPO, and memory-assisted GRPO.

Core claim

Supergroup Relative Policy Optimization constructs rewards from set-level diversity by sampling supergroups of candidate sets, comparing their diversity under the same condition, and redistributing the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility. This leads to expanded utility-diversity Pareto frontiers and superior frontier-level metrics in de novo small-molecule design, pocket-based small-molecule design, and de novo protein design.
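
The reward construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`set_diversity`, `sgrpo_style_advantages`), the weight `lam`, the use of mean pairwise distance as the diversity score, and the additive way the leave-one-out contribution is folded into each rollout's reward are all assumptions made for the example.

```python
from itertools import combinations
from statistics import mean


def set_diversity(items, dist):
    """Mean pairwise distance of a set (assumed metric); 0.0 for fewer than two items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return mean(dist(a, b) for a, b in pairs)


def sgrpo_style_advantages(supergroup, utility, dist, lam=1.0):
    """Hypothetical sketch. `supergroup` is a list of groups, each a list of
    rollouts sampled for the same condition. Returns centered rewards per rollout."""
    group_div = [set_diversity(g, dist) for g in supergroup]
    rewards = []
    for gi, group in enumerate(supergroup):
        # Group-relative diversity: this group's score minus the mean of the other groups.
        others = [d for j, d in enumerate(group_div) if j != gi]
        rel = group_div[gi] - (mean(others) if others else 0.0)
        # Leave-one-out contribution: how much diversity drops when rollout i is removed.
        contrib = [
            group_div[gi] - set_diversity(group[:i] + group[i + 1:], dist)
            for i in range(len(group))
        ]
        c_bar = mean(contrib)
        # Assumed composition: utility plus weighted (group signal + centered contribution).
        rewards.append(
            [utility(x) + lam * (rel + c - c_bar) for x, c in zip(group, contrib)]
        )
    # Center the composed rewards over the whole supergroup.
    baseline = mean(r for row in rewards for r in row)
    return [[r - baseline for r in row] for row in rewards]


# Toy usage: rollouts are numbers, distance is absolute difference, utility is flat.
adv = sgrpo_style_advantages(
    [[0.0, 0.0, 0.0], [0.0, 1.0, 2.0]],
    utility=lambda x: 0.0,
    dist=lambda a, b: abs(a - b),
)
```

In this toy case every rollout in the spread-out group receives a higher centered reward than any rollout in the collapsed group, and within the spread-out group the two endpoints score above the middle point, because removing an endpoint reduces the set's diversity while removing the middle point increases it.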

What carries the argument

Supergroup Relative Policy Optimization (SGRPO), which samples supergroups of candidate sets and redistributes diversity rewards using leave-one-out contributions to balance utility and diversity in policy optimization.

If this is right

  • Across decoding sweeps, SGRPO achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO.
  • Direct set-level diversity rewards remain effective with small groups.
  • SGRPO helps preserve broader generation-distribution coverage during post-training.
  • The method can be instantiated with both GRPO and Coupled-GRPO across autoregressive and discrete diffusion generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might reduce reliance on memory-assisted techniques for maintaining diversity in generative models.
  • Applying SGRPO to other domains like image or text generation could test if set-level diversity rewards generalize beyond biomolecular tasks.
  • Further work could explore how the size of the supergroup affects the stability of the diversity signal.

Load-bearing premise

That leave-one-out diversity contributions provide an unbiased and effective redistribution of the group-level diversity reward to individual rollouts without distorting the overall optimization signal.

What would settle it

A decoding sweep on one of the biomolecular design tasks where SGRPO does not expand the utility-diversity Pareto frontier beyond the baselines.
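
One way to make this test concrete is to compare the non-dominated operating points of two methods. The sketch below is an illustration under stated assumptions: each decoding setting is reduced to a (utility, diversity) pair where larger is better on both axes, and frontier quality is summarized by a 2D hypervolume against a reference point. The paper's own frontier-level metrics are not specified in this summary, so `hypervolume_2d` is a stand-in, and the numbers in the usage lines are made up.

```python
def non_dominated(points):
    """Return the points not dominated by any other point, sorted by utility.
    A point dominates another if it is at least as good on both axes and
    strictly better on at least one."""
    frontier = []
    for i, (u, d) in enumerate(points):
        dominated = any(
            (u2 >= u and d2 >= d) and (u2 > u or d2 > d)
            for j, (u2, d2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((u, d))
    return sorted(frontier)


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the frontier relative to `ref`.
    Assumes every point is at least `ref` on both axes."""
    area = 0.0
    prev_u = ref[0]
    # Frontier is sorted by increasing utility, so diversity is non-increasing.
    for u, d in non_dominated(points):
        area += (u - prev_u) * (d - ref[1])
        prev_u = u
    return area


# Illustrative (made-up) operating points for a baseline and a candidate method.
baseline = [(0.5, 0.9), (0.7, 0.6), (0.6, 0.5)]
candidate = [(0.6, 0.9), (0.8, 0.7)]
frontier_expanded = hypervolume_2d(candidate) > hypervolume_2d(baseline)
```

Under this criterion, a sweep in which the candidate's hypervolume does not exceed the baseline's would count as evidence against frontier expansion.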

Figures

Figures reproduced from arXiv: 2605.08659 by Bin Feng, Hao Li, He Cao, Shenghua Gao, Xiangru Tang, Xinwu Ye, Yu Li, Zijing Liu.

Figure 1
Figure 1: Overview of SGRPO. For each condition, SGRPO samples a same-condition supergroup, computes rollout-level utility and group-level diversity, compares groups by leave-one-out group-relative diversity, redistributes the diversity signal within each group according to leave-one-out set contributions, centers the composed rewards over the supergroup, and updates the policy with a PPO-style objective and KL reg…
Figure 2
Figure 2: Utility–diversity operating points for de novo small-molecule design, pocket-based small-molecule design, and de novo protein design. Each marker corresponds to one decoding setting and reports the mean utility and diversity over five independent runs; error bars show 95% confidence intervals on both axes. Dashed lines trace the method-specific non-dominated subsets.
Figure 3
Figure 3: Ablation study on GenMol-based de novo small-molecule generation. Each point reports the mean over five independent sweeps, and error bars indicate 95% confidence intervals for utility and diversity. Removing the diversity term yields Coupled-GRPO, while removing leave-one-out group credit weakens set-aware credit assignment.
We isolate two components of SGRPO on the GenMol-based de novo small-molecule de…
Figure 5
Figure 5: Analysis of diversity-estimator efficiency and group-reward weighting on the GenMol-based…
Figure 4
Figure 4: Distribution dynamics during ProGen2 post-training. SGRPO explores multiple clusters early and preserves them, whereas GRPO contracts and Memory-assisted GRPO drifts toward a narrow distant region.
We study two practical sensitivities of SGRPO on the GenMol-based de novo small-molecule design task: the efficiency of the group-diversity estimator and the choice of group-reward weight λ. In both cases, we tra…
Original abstract

Biomolecular generators are often adapted with reward feedback to improve task-specific utility, but pushing utility alone can concentrate generation on a narrow family of candidates. Maintaining diversity is difficult because sample diversity is a set-level property. We introduce Supergroup Relative Policy Optimization (SGRPO), a flexible GRPO-style framework that directly constructs rewards from set-level diversity. For each condition, SGRPO samples a supergroup of candidate sets, compares their diversity under the same condition, and redistributes the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility. This design decouples SGRPO from a particular generator, utility reward, or diversity metric, and allows instantiation with different GRPO-style approaches. We evaluate SGRPO on de novo small-molecule design, pocket-based small-molecule design, and de novo protein design, instantiating it with both GRPO and Coupled-GRPO across autoregressive and discrete diffusion generators. Across decoding sweeps, SGRPO expands the utility-diversity Pareto frontier and achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO when applicable. Our analyses further show that direct set-level diversity rewards remain effective with small groups and help preserve broader generation-distribution coverage during post-training. The code is available at https://github.com/IDEA-XL/SGRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Supergroup Relative Policy Optimization (SGRPO), a GRPO-style framework that constructs set-level diversity rewards by sampling supergroups of K candidate sets per condition, computing a group diversity score D, and redistributing it to individual rollouts via leave-one-out contributions D_{-i} before combining with rollout-level utility. It instantiates the approach with GRPO and Coupled-GRPO on autoregressive and discrete diffusion generators, and evaluates on de novo small-molecule design, pocket-based small-molecule design, and de novo protein design. The central claim is that SGRPO expands the utility-diversity Pareto frontier and achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO across decoding sweeps, while remaining effective with small groups and preserving broader distribution coverage.

Significance. If the reported Pareto expansions hold after verification that the leave-one-out redistribution does not introduce gradient artifacts, the work would be significant for multi-objective RL fine-tuning of biomolecular generators. It offers a decoupled, metric-agnostic way to handle set-level properties like diversity, with code release supporting reproducibility and extension to other generators or tasks in computational chemistry and biology.

major comments (1)
  1. [Section 3.2] The leave-one-out redistribution assigns D_{-i} values whose sum is fixed relative to D, inducing linear dependence and correlation among per-rollout advantages within each supergroup. The GRPO-style loss (Eqs. 4-5) applies these advantages without covariance correction; this risks distorting the policy gradient signal and could artifactually inflate part of the reported frontier gains, particularly at the small supergroup sizes noted as effective. A covariance analysis or a control experiment with independent per-rollout diversity sampling is needed to confirm that the improvements reflect genuine utility-diversity trade-offs.
minor comments (2)
  1. [Evaluation sections] The manuscript should provide explicit formulas and implementation details for the diversity metrics used in the supergroup comparisons, along with any statistical tests (e.g., confidence intervals or significance levels) supporting the Pareto frontier comparisons.
  2. [Results] Clarify the exact range of decoding parameters in the sweeps and any controls for confounding factors such as sampling temperature or generator-specific biases when claiming superiority over baselines.
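
For the confidence intervals requested in the first minor comment, a common choice with five runs per decoding setting is a Student-t interval on the mean. The sketch below assumes that convention; the summary does not say how the paper's 95% intervals were computed, and `mean_ci95` and the example run values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by degrees of freedom
# (standard table values; only small sample sizes are covered here).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def mean_ci95(values):
    """Return (mean, half_width) of a 95% t-interval for the mean of `values`."""
    n = len(values)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only tabulates 2 to 6 observations")
    half_width = T_CRIT_95[df] * stdev(values) / sqrt(n)
    return mean(values), half_width


# Made-up utility scores from five independent runs of one decoding setting.
m, half = mean_ci95([0.70, 0.72, 0.71, 0.69, 0.73])
```

Reporting `m ± half` on both the utility and diversity axes would give the kind of two-dimensional error bars described in the Figure 2 caption.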

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major concern regarding the leave-one-out redistribution and potential gradient effects below.

Point-by-point responses
  1. Referee: [Section 3.2] The leave-one-out redistribution assigns D_{-i} values whose sum is fixed relative to D, inducing linear dependence and correlation among per-rollout advantages within each supergroup. The GRPO-style loss (Eqs. 4-5) applies these advantages without covariance correction; this risks distorting the policy gradient signal and could artifactually inflate part of the reported frontier gains, particularly at the small supergroup sizes noted as effective. A covariance analysis or a control experiment with independent per-rollout diversity sampling is needed to confirm that the improvements reflect genuine utility-diversity trade-offs.

    Authors: We acknowledge that leave-one-out redistribution induces linear dependence among the D_{-i} values by construction, since their sum is fixed relative to the group diversity D. This dependence is intentional: diversity is a set-level property, and the redistribution attributes each rollout's marginal contribution while preserving the total group reward. The GRPO loss normalizes advantages within each supergroup, which accounts for relative comparisons and reduces the impact of within-group correlations on the policy gradient. Our empirical results demonstrate consistent Pareto frontier expansion across multiple generators, tasks, and supergroup sizes (including small K), alongside preserved distribution coverage, which would be unlikely if the gains were primarily artifacts. An independent per-rollout diversity sampling control would not evaluate joint set diversity and thus would not test the core claim. We will add a short discussion of advantage correlations and their effect on the gradient in the revised manuscript.

    revision: partial
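
The dependence discussed in this exchange can be made concrete with a standard identity. As a worked illustration only, assume n rewards r_1, ..., r_n that are i.i.d. with variance σ². This is an idealization (leave-one-out diversity terms are not independent) and it is not a derivation taken from the paper. Mean-centering alone then gives:

```latex
A_i = r_i - \bar r, \qquad \bar r = \frac{1}{n}\sum_{j=1}^{n} r_j, \qquad \sum_{i=1}^{n} A_i = 0,

\operatorname{Var}(A_i) = \sigma^2\left(1 - \frac{1}{n}\right), \qquad
\operatorname{Cov}(A_i, A_j) = -\frac{\sigma^2}{n} \quad (i \neq j),

\operatorname{Corr}(A_i, A_j) = \frac{-\sigma^2/n}{\sigma^2 (1 - 1/n)} = -\frac{1}{n-1}.
```

Under these assumptions the pairwise correlation is strongest for small n; for n = 4 it is −1/3. Any mean-centered group baseline has this property, so a covariance analysis of the kind the referee requests would plausibly need to separate this baseline effect from any additional dependence introduced by the leave-one-out terms.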

Circularity Check

0 steps flagged

No significant circularity; SGRPO reward construction is definitional and claims rest on empirical evaluation

Full rationale

The paper introduces SGRPO by explicitly defining a supergroup sampling process, computing a group-level diversity score D, redistributing via leave-one-out contributions D_{-i} to individual rollouts, and combining with rollout utility rewards before applying a GRPO-style loss. This is a direct construction of the optimization objective from external set-level metrics, not a derivation that reduces by the paper's equations to a quantity fitted from target utility data or to prior self-citations. Performance claims of Pareto frontier expansion are supported by empirical decoding sweeps on small-molecule and protein design tasks, with comparisons to pretrained generators, GRPO, and memory-assisted baselines. No self-definitional reductions, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the leave-one-out redistribution is presented as an explicit design choice whose effects are analyzed separately rather than assumed to be unbiased by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on the assumption that set-level diversity is a meaningful and redistributable property; it introduces hyperparameters for supergroup size and diversity metric choice but no new physical entities.

free parameters (2)
  • supergroup size
    Hyperparameter controlling how many candidate sets are sampled per condition for diversity comparison.
  • diversity metric
    Choice of set-level diversity function whose definition affects reward computation.
axioms (1)
  • domain assumption: Leave-one-out contributions accurately apportion the group diversity reward to individual samples without bias
    Invoked when redistributing the supergroup diversity reward to individual rollouts.
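
To show why the diversity metric counts as a free parameter, the sketch below scores the same candidate set with two different set-level functions. Both are illustrative choices rather than the metrics used in the paper: `tanimoto_distance` treats each candidate as a set of feature identifiers (for example, fingerprint bits), and `unique_fraction` only counts exact duplicates. All function names and the toy feature sets are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two feature sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(items, dist):
    """Average distance over all unordered pairs; 0.0 for fewer than two items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def unique_fraction(items):
    """Share of distinct items in the set; 0.0 for an empty set."""
    return len(set(items)) / len(items) if items else 0.0


# Toy candidate set: two identical feature sets and one partially overlapping set.
candidates = [frozenset({1, 2, 3}), frozenset({1, 2, 3}), frozenset({3, 4})]
pairwise_score = mean_pairwise_distance(candidates, tanimoto_distance)
uniqueness_score = unique_fraction(candidates)
```

The two functions return different scores for the same set (0.5 and about 0.67 here), so the diversity reward, and with it the leave-one-out contributions, depends on which definition is plugged in.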

pith-pipeline@v0.9.0 · 5797 in / 1268 out tokens · 38391 ms · 2026-05-20T23:27:26.437099+00:00 · methodology


