Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization

Bin Feng; Hao Li; He Cao; Shenghua Gao; Xiangru Tang; Xinwu Ye; Yu Li; Zijing Liu

arxiv: 2605.08659 · v3 · pith:VR75DMQ7new · submitted 2026-05-09 · 💻 cs.CE · q-bio.BM


Xinwu Ye, He Cao, Hao Li, Bin Feng, Zijing Liu, Xiangru Tang, Yu Li, Shenghua Gao

Pith reviewed 2026-05-20 23:27 UTC · model grok-4.3


keywords: Supergroup Relative Policy Optimization · biomolecular generators · utility diversity tradeoff · reinforcement learning · de novo molecular design · protein design · Pareto frontier


The pith

Supergroup Relative Policy Optimization uses set-level diversity rewards to expand the utility-diversity Pareto frontier in biomolecular generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Biomolecular generators often improve utility for specific tasks at the cost of diversity, concentrating on narrow families of molecules or proteins. The paper presents Supergroup Relative Policy Optimization as a way to directly incorporate set-level diversity into the reward signal during optimization. It does this by sampling multiple candidate sets as a supergroup for each condition, evaluating their relative diversity, and then allocating the diversity reward back to individual samples using leave-one-out contributions added to their utility scores. This framework is decoupled from particular generators or metrics and is tested on small molecule and protein design tasks. The results indicate better performance on the combined utility-diversity metrics compared to standard policy optimization methods.

Core claim

Supergroup Relative Policy Optimization constructs rewards from set-level diversity by sampling supergroups of candidate sets, comparing their diversity under the same condition, and redistributing the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility. This leads to expanded utility-diversity Pareto frontiers and superior frontier-level metrics in de novo small-molecule design, pocket-based small-molecule design, and de novo protein design.
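The leave-one-out step can be illustrated with a minimal Python sketch. It uses the contribution formula quoted from the paper further down this page, c_{m,i} = D(G_m) - D(G_m ∖ {x_{m,i}}). Everything else is an assumption made for illustration: the diversity function D is taken to be mean pairwise distance, the distance is a normalized Hamming distance on equal-length strings, and the names `mean_pairwise_distance`, `leave_one_out_contributions`, and `hamming` are hypothetical rather than taken from the SGRPO code.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def mean_pairwise_distance(items: Sequence[str],
                           dist: Callable[[str, str], float]) -> float:
    """Assumed set-level diversity D: average distance over unordered pairs."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def leave_one_out_contributions(group: List[str],
                                dist: Callable[[str, str], float]) -> List[float]:
    """Return c_i = D(G) - D(G without x_i) for each rollout x_i in the group."""
    full = mean_pairwise_distance(group, dist)
    contributions = []
    for i in range(len(group)):
        rest = group[:i] + group[i + 1:]
        contributions.append(full - mean_pairwise_distance(rest, dist))
    return contributions


def hamming(a: str, b: str) -> float:
    """Normalized Hamming distance; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Toy group: two near-duplicates and one outlier (made-up sequences).
group = ["AAAA", "AAAT", "GGCC"]
contribs = leave_one_out_contributions(group, hamming)
print(contribs)
```

In this toy group the outlier receives the largest contribution, because removing it lowers the group's diversity the most, while the two near-duplicates receive negative contributions.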

What carries the argument

Supergroup Relative Policy Optimization (SGRPO), which samples supergroups of candidate sets and redistributes diversity rewards using leave-one-out contributions to balance utility and diversity in policy optimization.

If this is right

Across decoding sweeps, SGRPO achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO.
Direct set-level diversity rewards remain effective with small groups.
SGRPO helps preserve broader generation-distribution coverage during post-training.
The method can be instantiated with both GRPO and Coupled-GRPO across autoregressive and discrete diffusion generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might reduce reliance on memory-assisted techniques for maintaining diversity in generative models.
Applying SGRPO to other domains like image or text generation could test if set-level diversity rewards generalize beyond biomolecular tasks.
Further work could explore how the size of the supergroup affects the stability of the diversity signal.

Load-bearing premise

That leave-one-out diversity contributions provide an unbiased and effective redistribution of the group-level diversity reward to individual rollouts without distorting the overall optimization signal.

What would settle it

A decoding sweep on one of the biomolecular design tasks where SGRPO does not expand the utility-diversity Pareto frontier beyond the baselines.
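One way to make this test concrete is to compare non-dominated subsets of (utility, diversity) operating points. The sketch below is a hedged illustration only: the expansion criterion in `expands_frontier` is an assumption, not the paper's frontier-level metric, and the operating points are made-up numbers.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (utility, diversity), both higher-is-better


def weakly_dominates(q: Point, p: Point) -> bool:
    """q is at least as good as p on both axes."""
    return q[0] >= p[0] and q[1] >= p[1]


def non_dominated(points: List[Point]) -> List[Point]:
    """Keep points that no other point strictly improves upon."""
    front = [
        p for p in points
        if not any(weakly_dominates(q, p) and q != p for q in points)
    ]
    return sorted(set(front))


def expands_frontier(candidate: List[Point], baseline: List[Point]) -> bool:
    """Assumed criterion: some candidate frontier point is not weakly
    dominated by any baseline point."""
    return any(
        not any(weakly_dominates(b, c) for b in baseline)
        for c in non_dominated(candidate)
    )


# Made-up operating points, one per decoding setting.
baseline = [(0.60, 0.80), (0.70, 0.60), (0.65, 0.55)]
candidate = [(0.72, 0.78), (0.68, 0.85)]

print(non_dominated(baseline))
print(expands_frontier(candidate, baseline))
```

Under this criterion, a sweep in which `expands_frontier` returns `False` for SGRPO against every baseline would be the kind of result described above.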

Figures

Figures reproduced from arXiv: 2605.08659 by Bin Feng, Hao Li, He Cao, Shenghua Gao, Xiangru Tang, Xinwu Ye, Yu Li, Zijing Liu.

**Figure 1.** Overview of SGRPO. For each condition, SGRPO samples a same-condition supergroup, computes rollout-level utility and group-level diversity, compares groups by leave-one-out group-relative diversity, redistributes the diversity signal within each group according to leave-one-out set contributions, centers the composed rewards over the supergroup, and updates the policy with a PPO-style objective and KL reg…

**Figure 2.** Utility–diversity operating points for de novo small-molecule design, pocket-based small-molecule design, and de novo protein design. Each marker corresponds to one decoding setting and reports the mean utility and diversity over five independent runs; error bars show 95% confidence intervals on both axes. Dashed lines trace the method-specific non-dominated subsets.

**Figure 3.** Ablation study on GenMol-based de novo small-molecule generation. Each point reports the mean over five independent sweeps, and error bars indicate 95% confidence intervals for utility and diversity. Removing the diversity term yields Coupled-GRPO, while removing leave-one-out group credit weakens set-aware credit assignment. We isolate two components of SGRPO on the GenMol-based de novo small-molecule de…

**Figure 5.** Analysis of diversity-estimator efficiency and group-reward weighting on the GenMol-based …

**Figure 4.** Distribution dynamics during ProGen2 post-training. SGRPO explores multiple clusters early and preserves them, whereas GRPO contracts and Memory-assisted GRPO drifts toward a narrow distant region. We study two practical sensitivities of SGRPO on the GenMol-based de novo small-molecule design task: the efficiency of the group-diversity estimator and the choice of group-reward weight λ. In both cases, we tra…

Original abstract

Biomolecular generators are often adapted with reward feedback to improve task-specific utility, but pushing utility alone can concentrate generation on a narrow family of candidates. Maintaining diversity is difficult because sample diversity is a set-level property. We introduce Supergroup Relative Policy Optimization (SGRPO), a flexible GRPO-style framework that directly constructs rewards from set-level diversity. For each condition, SGRPO samples a supergroup of candidate sets, compares their diversity under the same condition, and redistributes the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility. This design decouples SGRPO from a particular generator, utility reward, or diversity metric, and allows instantiation with different GRPO-style approaches. We evaluate SGRPO on de novo small-molecule design, pocket-based small-molecule design, and de novo protein design, instantiating it with both GRPO and Coupled-GRPO across autoregressive and discrete diffusion generators. Across decoding sweeps, SGRPO expands the utility-diversity Pareto frontier and achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO when applicable. Our analyses further show that direct set-level diversity rewards remain effective with small groups and help preserve broader generation-distribution coverage during post-training. The code is available at https://github.com/IDEA-XL/SGRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SGRPO adds a supergroup sampling step and leave-one-out redistribution to GRPO-style training so that set-level diversity can be optimized directly alongside utility in biomolecular generators.


The core move is sampling a supergroup of candidate sets for each condition, scoring diversity at the group level, then handing each rollout its leave-one-out contribution before adding the usual utility term. This keeps the method decoupled from any one generator or diversity metric and lets it plug into existing GRPO or Coupled-GRPO losses. They run it on de novo small-molecule design, pocket-based design, and de novo protein design, using both autoregressive and discrete diffusion backbones, and report that the utility-diversity frontier moves outward relative to pretrained models and the plain GRPO baselines. Public code is a plus for anyone who wants to test the same setup on their own task or generator.
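The "add the usual utility term" step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the page only says that the composed rewards are centered over the supergroup and that a group-reward weight λ exists, so the additive form `utility + lam * diversity_credit`, the default value of `lam`, and the function name `centered_advantages` are all hypothetical. The per-rollout diversity credits are taken as given inputs.

```python
from statistics import mean
from typing import List


def centered_advantages(utilities: List[List[float]],
                        diversity_credits: List[List[float]],
                        lam: float = 0.5) -> List[List[float]]:
    """Compose per-rollout rewards and center them over the whole supergroup.

    utilities[m][i] and diversity_credits[m][i] refer to rollout i of group m.
    The additive composition with weight lam is an assumption.
    """
    rewards = [
        [u + lam * d for u, d in zip(us, ds)]
        for us, ds in zip(utilities, diversity_credits)
    ]
    # One shared baseline for every rollout sampled under the same condition.
    baseline = mean(r for group in rewards for r in group)
    return [[r - baseline for r in group] for group in rewards]


# Made-up numbers: a supergroup of two groups with two rollouts each.
utilities = [[1.0, 0.0], [0.5, 0.5]]
credits = [[0.0, 0.2], [0.4, -0.4]]
adv = centered_advantages(utilities, credits, lam=0.5)
print(adv)
```

Because the baseline is shared across the supergroup, the advantages sum to zero over all rollouts for a condition. Whether the actual method also rescales by a standard deviation is not stated on this page.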

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Supergroup Relative Policy Optimization (SGRPO), a GRPO-style framework that constructs set-level diversity rewards by sampling supergroups of K candidate sets per condition, computing a group diversity score D, and redistributing it to individual rollouts via leave-one-out contributions D_{-i} before combining with rollout-level utility. It instantiates the approach with GRPO and Coupled-GRPO on autoregressive and discrete diffusion generators, and evaluates on de novo small-molecule design, pocket-based small-molecule design, and de novo protein design. The central claim is that SGRPO expands the utility-diversity Pareto frontier and achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO across decoding sweeps, while remaining effective with small groups and preserving broader distribution coverage.

Significance. If the reported Pareto expansions hold after verification that the leave-one-out redistribution does not introduce gradient artifacts, the work would be significant for multi-objective RL fine-tuning of biomolecular generators. It offers a decoupled, metric-agnostic way to handle set-level properties like diversity, with code release supporting reproducibility and extension to other generators or tasks in computational chemistry and biology.

major comments (1)

[Section 3.2] The leave-one-out redistribution assigns D_{-i} values whose sum is fixed relative to D, inducing linear dependence and correlation among per-rollout advantages within each supergroup. The GRPO-style loss (Eqs. 4-5) applies these advantages without covariance correction; this risks distorting the policy gradient signal and could artificially inflate part of the reported frontier gains, particularly at the small supergroup sizes noted as effective. A covariance analysis or control experiment with independent per-rollout diversity sampling is needed to confirm the improvements reflect genuine utility-diversity trade-offs.

minor comments (2)

[Evaluation sections] The manuscript should provide explicit formulas and implementation details for the diversity metrics used in the supergroup comparisons, along with any statistical tests (e.g., confidence intervals or significance levels) supporting the Pareto frontier comparisons.
[Results] Clarify the exact range of decoding parameters in the sweeps and any controls for confounding factors such as sampling temperature or generator-specific biases when claiming superiority over baselines.
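For the confidence intervals requested above, a sketch of one standard construction is given here. It assumes a two-sided Student-t interval over five independent runs, which matches the run count in the figure captions but is not confirmed as the paper's procedure. The function name `ci95_five_runs` and the utility values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def ci95_five_runs(values: List[float]) -> Tuple[float, float]:
    """Return (mean, half-width) of a 95% t-interval for exactly five runs."""
    if len(values) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    center = mean(values)
    half_width = T_CRIT_DF4 * stdev(values) / sqrt(len(values))
    return center, half_width


# Made-up utility scores from five runs at one decoding setting.
runs = [0.70, 0.72, 0.68, 0.71, 0.69]
center, half_width = ci95_five_runs(runs)
print(f"{center:.3f} +/- {half_width:.3f}")
```

With only five runs the t critical value is noticeably larger than the normal-approximation value of 1.96, so reporting which construction was used affects how wide the error bars appear.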

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major concern regarding the leave-one-out redistribution and potential gradient effects below.


Referee: [Section 3.2] The leave-one-out redistribution assigns D_{-i} values whose sum is fixed relative to D, inducing linear dependence and correlation among per-rollout advantages within each supergroup. The GRPO-style loss (Eqs. 4-5) applies these advantages without covariance correction; this risks distorting the policy gradient signal and could artificially inflate part of the reported frontier gains, particularly at the small supergroup sizes noted as effective. A covariance analysis or control experiment with independent per-rollout diversity sampling is needed to confirm the improvements reflect genuine utility-diversity trade-offs.

Authors: We acknowledge that leave-one-out redistribution induces linear dependence among the D_{-i} values by construction, since their sum is fixed relative to the group diversity D. This dependence is intentional: diversity is a set-level property, and the redistribution attributes each rollout's marginal contribution while preserving the total group reward. The GRPO loss normalizes advantages within each supergroup, which accounts for relative comparisons and reduces the impact of within-group correlations on the policy gradient. Our empirical results demonstrate consistent Pareto frontier expansion across multiple generators, tasks, and supergroup sizes (including small K), alongside preserved distribution coverage, which would be unlikely if the gains were primarily artifacts. An independent per-rollout diversity sampling control would not evaluate joint set diversity and thus would not test the core claim. We will add a short discussion of advantage correlations and their effect on the gradient in the revised manuscript.
revision: partial
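The linear dependence discussed in this exchange can be worked out explicitly in one special case. The derivation below assumes, purely for illustration, that D is the mean pairwise distance over a group G of n ≥ 3 rollouts with pairwise distances d_{jk}; the paper's actual diversity metric is not specified on this page, and other metrics need not satisfy the same identity. The contribution c_i = D(G) - D(G ∖ {x_i}) follows the formula quoted from the paper.

```latex
% Assumption: D is mean pairwise distance over n >= 3 rollouts.
\[
D(G) = \frac{2}{n(n-1)} \sum_{j<k} d_{jk},
\qquad
D(G \setminus \{x_i\}) = \frac{2}{(n-1)(n-2)}
  \sum_{\substack{j<k \\ j,k \neq i}} d_{jk}.
\]
% Each pair (j,k) survives in exactly n-2 of the n leave-one-out sets.
\[
\sum_{i=1}^{n} D(G \setminus \{x_i\})
  = \frac{2}{(n-1)(n-2)} \,(n-2) \sum_{j<k} d_{jk}
  = \frac{2}{n-1} \sum_{j<k} d_{jk}
  = n\, D(G).
\]
\[
\sum_{i=1}^{n} c_i
  = \sum_{i=1}^{n} \bigl[ D(G) - D(G \setminus \{x_i\}) \bigr]
  = n\, D(G) - n\, D(G) = 0.
\]
```

Under this assumed metric the contributions within a group always sum to zero, so any n - 1 of them determine the last one. That is the kind of linear constraint the referee describes, shown here for one metric only.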

Circularity Check

0 steps flagged

No significant circularity; SGRPO reward construction is definitional and claims rest on empirical evaluation


The paper introduces SGRPO by explicitly defining a supergroup sampling process, computing a group-level diversity score D, redistributing via leave-one-out contributions D_{-i} to individual rollouts, and combining with rollout utility rewards before applying a GRPO-style loss. This is a direct construction of the optimization objective from external set-level metrics, not a derivation that reduces by the paper's equations to a quantity fitted from target utility data or to prior self-citations. Performance claims of Pareto frontier expansion are supported by empirical decoding sweeps on small-molecule and protein design tasks, with comparisons to pretrained generators, GRPO, and memory-assisted baselines. No self-definitional reductions, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the leave-one-out redistribution is presented as an explicit design choice whose effects are analyzed separately rather than assumed to be unbiased by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on the assumption that set-level diversity is a meaningful and redistributable property; it introduces hyperparameters for supergroup size and diversity metric choice but no new physical entities.

free parameters (2)

supergroup size
Hyperparameter controlling how many candidate sets are sampled per condition for diversity comparison.
diversity metric
Choice of set-level diversity function whose definition affects reward computation.
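To show what a "set-level diversity function" could look like, here is a small sketch of one common choice in cheminformatics: mean pairwise Tanimoto distance over molecular fingerprints. This page does not say which metric the paper uses, so treat this as an assumed example. Fingerprints are represented as sets of on-bit indices, and the function names and toy fingerprints are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def mean_pairwise_tanimoto_diversity(fps: List[FrozenSet[int]]) -> float:
    """Assumed set-level diversity: average Tanimoto distance over all pairs."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints as sets of on-bit indices (made-up values).
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
diversity = mean_pairwise_tanimoto_diversity(fps)
print(round(diversity, 4))
```

Swapping in a different distance changes the reward, which is why the ledger lists the metric as a free parameter.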

axioms (1)

domain assumption: Leave-one-out contributions accurately apportion group diversity reward to individual samples without bias
Invoked when redistributing the supergroup diversity reward to individual rollouts.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

SGRPO samples a supergroup of candidate sets, compares their diversity under the same condition, and redistributes the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

For each rollout x_{m,i} ∈ G_m, we first compute its leave-one-out contribution c_{m,i} = D(G_m) - D(G_m ∖ {x_{m,i}})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[2]

Model-based reinforcement learning for biological sequence design

Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In International conference on learning representations, 2019

work page 2019
[3]

Molgpt: molecular generation using a transformer-decoder model.Journal of chemical information and modeling, 62(9): 2064–2076, 2021

Viraj Bagal, Rishal Aggarwal, PK Vinod, and U Deva Priyakumar. Molgpt: molecular generation using a transformer-decoder model.Journal of chemical information and modeling, 62(9): 2064–2076, 2021

work page 2064
[4]

Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?Journal of cheminformatics, 7(1):20, 2015

Dávid Bajusz, Anita Rácz, and Károly Héberger. Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?Journal of cheminformatics, 7(1):20, 2015

work page 2015
[5]

Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

work page 2012
[6]

Esben Jannik Bjerrum, Christian Margreitter, Thomas Blaschke, Simona Kolarova, and Raquel López-Ríos de Castro. Faster and more diverse de novo molecular optimization with double- loop reinforcement learning using augmented smiles.Journal of Computer-Aided Molecular Design, 37(8):373–394, 2023

work page 2023
[7]

Memory-assisted reinforcement learning for diverse molecular de novo design.Journal of cheminformatics, 12 (1):68, 2020

Thomas Blaschke, Ola Engkvist, Jürgen Bajorath, and Hongming Chen. Memory-assisted reinforcement learning for diverse molecular de novo design.Journal of cheminformatics, 12 (1):68, 2020

work page 2020
[8]

Design by adaptive sampling.arXiv preprint arXiv:1810.03714, 2018

David H Brookes and Jennifer Listgarten. Design by adaptive sampling.arXiv preprint arXiv:1810.03714, 2018

work page arXiv 2018
[9]

Transformer-based protein generation with regularized latent space optimization.Nature Machine Intelligence, 4(10):840–851, 2022

Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin Givechian, Dhananjay Bhaskar, and Smita Krishnaswamy. Transformer-based protein generation with regularized latent space optimization.Nature Machine Intelligence, 4(10):840–851, 2022

work page 2022
[10]

Curiosity as a self- supervised method to improve exploration in de novo drug design

Mohamed-Amine Chadi, Hajar Mousannif, and Ahmed Aamouche. Curiosity as a self- supervised method to improve exploration in de novo drug design. In2023 International Conference on Information Technology Research and Innovation (ICITRI), pages 151–156. IEEE, 2023

work page 2023
[11]

Decomposed direct preference optimization for structure-based drug design,

Xiwei Cheng, Xiangxin Zhou, Yuwei Yang, Yu Bao, and Quanquan Gu. Decomposed direct preference optimization for structure-based drug design.arXiv preprint arXiv:2407.13981, 2024

work page arXiv 2024
[12]

Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

work page 2022
[14]

Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding.arXiv preprint arXiv:2410.17173, 2024

Yasha Ektefaie, Olivia Viessmann, Siddharth Narayanan, Drew Dresser, J Mark Kim, and Armen Mkrtchyan. Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding.arXiv preprint arXiv:2410.17173, 2024

work page arXiv 2024
[15]

Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.Journal of cheminfor- matics, 1(1):8, 2009

Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.Journal of cheminfor- matics, 1(1):8, 2009

work page 2009
[16]

Libinvent: reaction-based generative scaffold decoration for in silico library design.Journal of Chemical Information and Modeling, 62(9):2046–2063, 2021

Vendy Fialková, Jiaxi Zhao, Kostas Papadopoulos, Ola Engkvist, Esben Jannik Bjerrum, Thierry Kogej, and Atanas Patronov. Libinvent: reaction-based generative scaffold decoration for in silico library design.Journal of Chemical Information and Modeling, 62(9):2046–2063, 2021. 11

work page 2046
[17]

Three-dimensional convolutional neural networks and a cross- docked data set for structure-based drug design.Journal of chemical information and modeling, 60(9):4200–4215, 2020

Paul G Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B Iovanisci, Ian Snyder, and David R Koes. Three-dimensional convolutional neural networks and a cross- docked data set for structure-based drug design.Journal of chemical information and modeling, 60(9):4200–4215, 2020

work page 2020
[18]

Search- ing for high-value molecules using reinforcement learning and transformers.arXiv preprint arXiv:2310.02902, 2023

Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Search- ing for high-value molecules using reinforcement learning and transformers.arXiv preprint arXiv:2310.02902, 2023

work page arXiv 2023
[19]

Automatic chemical design using a data-driven continuous representation of molecules.ACS central science, 4(2):268–276, 2018

Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules.ACS central science, 4(2):268–276, 2018

work page 2018
[20]

Diffu- coder: Understanding and improving masked diffusion mod- els for code generation.arXiv preprint arXiv:2506.20639,

Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

work page arXiv 2025
[21]

Constrained bayesian optimization for automatic chemical design using variational autoencoders.Chemical science, 11(2):577–586, 2020

Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained bayesian optimization for automatic chemical design using variational autoencoders.Chemical science, 11(2):577–586, 2020

work page 2020
[22]

Utilizing reinforcement learning for de novo drug design.Machine Learning, 113(7):4811–4843, 2024

Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, and Morteza Haghir Chehreghani. Utilizing reinforcement learning for de novo drug design.Machine Learning, 113(7):4811–4843, 2024

work page 2024
[23]

Protein–sol: a web tool for predicting protein solubility from sequence.Bioinformatics, 33(19):3098–3100, 2017

Max Hebditch, M Alejandro Carballo-Amador, Spyros Charonis, Robin Curtis, and Jim War- wicker. Protein–sol: a web tool for predicting protein solubility from sequence.Bioinformatics, 33(19):3098–3100, 2017

work page 2017
[24]

Learning inverse folding from millions of predicted structures

Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. InInternational conference on machine learning, pages 8946–8970. PMLR, 2022

work page 2022
[25]

Hamiltonian diver- sity: effectively measuring molecular diversity by shortest hamiltonian circuits.Journal of Cheminformatics, 16(1):94, 2024

Xiuyuan Hu, Guoqing Liu, Quanming Yao, Yang Zhao, and Hao Zhang. Hamiltonian diver- sity: effectively measuring molecular diversity by shortest hamiltonian circuits.Journal of Cheminformatics, 16(1):94, 2024

work page 2024
[26]

Can llms generate diverse molecules? towards alignment with structural diversity.arXiv preprint arXiv:2410.03138, 2024

Hyosoon Jang, Yunhui Jang, Jaehyung Kim, and Sungsoo Ahn. Can llms generate diverse molecules? towards alignment with structural diversity.arXiv preprint arXiv:2410.03138, 2024

work page arXiv 2024
[27]

Multi-objective molecule generation using interpretable substructures

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Multi-objective molecule generation using interpretable substructures. InInternational conference on machine learning, pages 4849–4859. PMLR, 2020

work page 2020
[28]

Any-property-conditional molecule generation with self-criticism using spanning trees.arXiv preprint arXiv:2407.09357, 2024

Alexia Jolicoeur-Martineau, Aristide Baratin, Kisoo Kwon, Boris Knyazev, and Yan Zhang. Any-property-conditional molecule generation with self-criticism using spanning trees.arXiv preprint arXiv:2407.09357, 2024

work page arXiv 2024
[29]

Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks.Nature Machine Intelligence, 2(5):254–265, 2020

Panagiotis-Christos Kotsias, Josep Arús-Pous, Hongming Chen, Ola Engkvist, Christian Tyr- chan, and Esben Jannik Bjerrum. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks.Nature Machine Intelligence, 2(5):254–265, 2020

work page 2020
[30]

Genmol: A drug discovery generalist with discrete diffusion.arXiv preprint arXiv:2501.06158, 2025

Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, and Arash Vahdat. Genmol: A drug discovery generalist with discrete diffusion.arXiv preprint arXiv:2501.06158, 2025

work page arXiv 2025
[31]

CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, and Li Liu. Cagenmol: Condition-aware diffusion language model for goal-directed molecular generation.arXiv preprint arXiv:2604.11483, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Molecular generative model based on conditional variational autoencoder for de novo molecular design.Journal of cheminformatics, 10(1):31, 2018

Jaechang Lim, Seongok Ryu, Jin Woo Kim, and Woo Youn Kim. Molecular generative model based on conditional variational autoencoder for de novo molecular design.Journal of cheminformatics, 10(1):31, 2018. 12

work page 2018
[33]

Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

work page 2023
[34]

Xuhan Liu, Kai Ye, Herman WT Van Vlijmen, Adriaan P IJzerman, and Gerard JP Van Westen. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine a2a receptor.Journal of cheminformatics, 11(1):35, 2019

work page 2019
[35]

Drugex v3: scaffold-constrained drug design with graph transformer-based reinforcement learning.Journal of Cheminformatics, 15(1):24, 2023

Xuhan Liu, Kai Ye, Herman WT van Vlijmen, Adriaan P IJzerman, and Gerard JP van Westen. Drugex v3: scaffold-constrained drug design with graph transformer-based reinforcement learning.Journal of Cheminformatics, 15(1):24, 2023

work page 2023
[36]

Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey V oronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

work page 2024
[37]

Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

work page 2023
[38]

Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan SC Lim, and Prudencio Tossou. Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

work page 2024
[39]

Molecular de-novo design through deep reinforcement learning.Journal of cheminformatics, 9(1):48, 2017

Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning.Journal of cheminformatics, 9(1):48, 2017

work page 2017
[40]

Jinyeong Park, Jaegyoon Ahn, Jonghwan Choi, and Jibum Kim. Mol-air: Molecular reinforce- ment learning with adaptive intrinsic rewards for goal-directed molecular generation.Journal of Chemical Information and Modeling, 65(5):2283–2296, 2025

work page 2025
[41]

Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization

Ryan Park, Darren J Hsu, C Brian Roland, Maria Korshunova, Chen Tessler, Shie Mannor, Olivia Viessmann, and Bruno Trentini. Improving inverse folding for peptide design with diversity-regularized direct preference optimization.arXiv preprint arXiv:2410.19471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Diversity oriented deep reinforcement learning for targeted molecule generation.Journal of cheminformatics, 13(1):21, 2021

Tiago Pereira, Maryam Abbasi, Bernardete Ribeiro, and Joel P Arrais. Diversity oriented deep reinforcement learning for targeted molecule generation.Journal of cheminformatics, 13(1):21, 2021

work page 2021
[43]

Deep reinforcement learning for de novo drug design.Science advances, 4(7):eaap7885, 2018

Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design.Science advances, 4(7):eaap7885, 2018

work page 2018
[44]

Temberture: advancing protein ther- mostability prediction with deep learning and attention mechanisms.Bioinformatics Advances, 4(1):vbae103, 2024

Chiara Rodella, Symela Lazaridi, and Thomas Lemmin. Temberture: advancing protein ther- mostability prediction with deep learning and attention mechanisms.Bioinformatics Advances, 4(1):vbae103, 2024

work page 2024
[45]

Silvr: guided diffusion for molecule generation

Nicholas T Runcie and Antonia SJS Mey. Silvr: guided diffusion for molecule generation. Journal of chemical information and modeling, 63(19):5996–6005, 2023

[46] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[47] Tristan SW Stevens, Oisín Nolan, Jean-Luc Robert, and Ruud JG Van Sloun. Sequential posterior sampling with diffusion models. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

[48] Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, and Morteza Haghir Chehreghani. Diversity-aware reinforcement learning for de novo drug design. arXiv preprint arXiv:2410.10431, 2024.

[49] Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, and Morteza Haghir Chehreghani. Diverse mini-batch selection in reinforcement learning for efficient chemical exploration in de novo drug design. arXiv preprint arXiv:2506.21158, 2025.

[50] Xiangru Tang, Xinwu Ye, Fang Wu, Yimeng Liu, Anna Su, Antonia Panescu, Guanlue Li, Daniel Shao, Dong Xu, and Mark Gerstein. Bc-design: A biochemistry-aware framework for inverse protein design. bioRxiv, 2025. doi: 10.1101/2024.10.28.620755. URL https://www.biorxiv.org/content/early/2025/11/24/2024.10.28.620755.

[51] Morgan Thomas, Noel M O’Boyle, Andreas Bender, and Chris De Graaf. Augmented hill-climb increases reinforcement learning efficiency for language-based de novo molecule generation. Journal of cheminformatics, 14(1):68, 2022.

[52] Austin Tripp, Erik Daxberger, and José Miguel Hernández-Lobato. Sample-efficient optimization in the latent space of deep generative models via weighted retraining. Advances in Neural Information Processing Systems, 33:11259–11272, 2020.

[53] Oleg Trott and Arthur J Olson. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):455–461, 2010.

[54] Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, and Ge Liu. Proteinzero: Self-improving protein generation via online reinforcement learning. arXiv preprint arXiv:2506.07459, 2025.

[55] Talal Widatalla, Rafael Rafailov, and Brian Hie. Aligning protein generative models with experimental fitness via direct preference optimization. bioRxiv, pages 2024–05, 2024.

[56] Junhao Xiong, Ishan Gaur, Maria Lukarska, Hunter Nisonoff, Luke M Oltrogge, David F Savage, and Jennifer Listgarten. Proteinguide: On-the-fly property guidance for protein sequence generative models. arXiv preprint arXiv:2505.04823, 2025.

[57] Soojung Yang, Doyeong Hwang, Seul Lee, Seongok Ryu, and Sung Ju Hwang. Hit and lead discovery with explorative rl and fragment-based molecule generation. Advances in Neural Information Processing Systems, 34:7924–7936, 2021.

[58] Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Yingcheng Wu, et al. Latentchem: From textual cot to latent thinking in chemical reasoning. arXiv preprint arXiv:2602.07075, 2026.

[59] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. Advances in neural information processing systems, 31, 2018.

[60] Zhenpeng Zhou, Steven Kearnes, Li Li, Richard N Zare, and Patrick Riley. Optimization of molecules via deep reinforcement learning. Scientific reports, 9(1):10752, 2019.

[61] Xingzheng Zhu, Zhihong Zhao, and Fei Zhu. Scaffold-driven molecular generation via reinforced rnn with centroid distance evaluation. Expert Systems with Applications, 292:128606, 2025.

A Full Training Procedure of SGRPO

This appendix provides the full training procedure of Supergroup Relative Policy Optimization (SGRPO), corresponding to Section 4 in ...


[2] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In International conference on learning representations, 2019.

[3] Viraj Bagal, Rishal Aggarwal, PK Vinod, and U Deva Priyakumar. Molgpt: molecular generation using a transformer-decoder model. Journal of chemical information and modeling, 62(9):2064–2076, 2021.

[4] Dávid Bajusz, Anita Rácz, and Károly Héberger. Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of cheminformatics, 7(1):20, 2015.

[5] G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90–98, 2012.

[6] Esben Jannik Bjerrum, Christian Margreitter, Thomas Blaschke, Simona Kolarova, and Raquel López-Ríos de Castro. Faster and more diverse de novo molecular optimization with double-loop reinforcement learning using augmented smiles. Journal of Computer-Aided Molecular Design, 37(8):373–394, 2023.

[7] Thomas Blaschke, Ola Engkvist, Jürgen Bajorath, and Hongming Chen. Memory-assisted reinforcement learning for diverse molecular de novo design. Journal of cheminformatics, 12(1):68, 2020.

[8] David H Brookes and Jennifer Listgarten. Design by adaptive sampling. arXiv preprint arXiv:1810.03714, 2018.

[9] Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin Givechian, Dhananjay Bhaskar, and Smita Krishnaswamy. Transformer-based protein generation with regularized latent space optimization. Nature Machine Intelligence, 4(10):840–851, 2022.

[10] Mohamed-Amine Chadi, Hajar Mousannif, and Ahmed Aamouche. Curiosity as a self-supervised method to improve exploration in de novo drug design. In 2023 International Conference on Information Technology Research and Innovation (ICITRI), pages 151–156. IEEE, 2023.

[11] Xiwei Cheng, Xiangxin Zhou, Yuwei Yang, Yu Bao, and Quanquan Gu. Decomposed direct preference optimization for structure-based drug design. arXiv preprint arXiv:2407.13981, 2024.

[12] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.

[14] Yasha Ektefaie, Olivia Viessmann, Siddharth Narayanan, Drew Dresser, J Mark Kim, and Armen Mkrtchyan. Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding. arXiv preprint arXiv:2410.17173, 2024.

[15] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of cheminformatics, 1(1):8, 2009.

[16] Vendy Fialková, Jiaxi Zhao, Kostas Papadopoulos, Ola Engkvist, Esben Jannik Bjerrum, Thierry Kogej, and Atanas Patronov. Libinvent: reaction-based generative scaffold decoration for in silico library design. Journal of Chemical Information and Modeling, 62(9):2046–2063, 2021.

[17] Paul G Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B Iovanisci, Ian Snyder, and David R Koes. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. Journal of chemical information and modeling, 60(9):4200–4215, 2020.

[18] Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Searching for high-value molecules using reinforcement learning and transformers. arXiv preprint arXiv:2310.02902, 2023.

[19] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.

[20] Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639, 2025.

[21] Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained bayesian optimization for automatic chemical design using variational autoencoders. Chemical science, 11(2):577–586, 2020.

[22] Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, and Morteza Haghir Chehreghani. Utilizing reinforcement learning for de novo drug design. Machine Learning, 113(7):4811–4843, 2024.

[23] Max Hebditch, M Alejandro Carballo-Amador, Spyros Charonis, Robin Curtis, and Jim Warwicker. Protein–sol: a web tool for predicting protein solubility from sequence. Bioinformatics, 33(19):3098–3100, 2017.

[24] Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. In International conference on machine learning, pages 8946–8970. PMLR, 2022.

[25] Xiuyuan Hu, Guoqing Liu, Quanming Yao, Yang Zhao, and Hao Zhang. Hamiltonian diversity: effectively measuring molecular diversity by shortest hamiltonian circuits. Journal of Cheminformatics, 16(1):94, 2024.

[26] Hyosoon Jang, Yunhui Jang, Jaehyung Kim, and Sungsoo Ahn. Can llms generate diverse molecules? towards alignment with structural diversity. arXiv preprint arXiv:2410.03138, 2024.

[27] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Multi-objective molecule generation using interpretable substructures. In International conference on machine learning, pages 4849–4859. PMLR, 2020.

[28] Alexia Jolicoeur-Martineau, Aristide Baratin, Kisoo Kwon, Boris Knyazev, and Yan Zhang. Any-property-conditional molecule generation with self-criticism using spanning trees. arXiv preprint arXiv:2407.09357, 2024.

[29] Panagiotis-Christos Kotsias, Josep Arús-Pous, Hongming Chen, Ola Engkvist, Christian Tyrchan, and Esben Jannik Bjerrum. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks. Nature Machine Intelligence, 2(5):254–265, 2020.

[30] Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, and Arash Vahdat. Genmol: A drug discovery generalist with discrete diffusion. arXiv preprint arXiv:2501.06158, 2025.

[31] Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, and Li Liu. Cagenmol: Condition-aware diffusion language model for goal-directed molecular generation. arXiv preprint arXiv:2604.11483, 2026.

[32] Jaechang Lim, Seongok Ryu, Jin Woo Kim, and Woo Youn Kim. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of cheminformatics, 10(1):31, 2018.

[33] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.

[34] Xuhan Liu, Kai Ye, Herman WT Van Vlijmen, Adriaan P IJzerman, and Gerard JP Van Westen. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine a2a receptor. Journal of cheminformatics, 11(1):35, 2019.

[35] Xuhan Liu, Kai Ye, Herman WT van Vlijmen, Adriaan P IJzerman, and Gerard JP van Westen. Drugex v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. Journal of Cheminformatics, 15(1):24, 2023.

[36] Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design. Journal of Cheminformatics, 16(1):20, 2024.

[37] Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. Progen2: exploring the boundaries of protein language models. Cell systems, 14(11):968–978, 2023.

[38] Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan SC Lim, and Prudencio Tossou. Gotta be safe: a new framework for molecular design. Digital Discovery, 3(4):796–804, 2024.
