
arxiv: 2603.18256 · v2 · submitted 2026-03-18 · 💻 cs.LG · cs.AI

MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasonning Models

Pith reviewed 2026-05-15 09:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords de novo molecular generation · reasoning LLMs · molecular verifier · docking scores · GRPO · multi-objective optimization · diversity metric · reinforcement learning

The pith

MolRGen supplies a real-time verifier so reasoning LLMs can generate molecules from scratch using docking and property rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new training and evaluation setting called MolRGen for reasoning large language models on de novo molecular generation. It supplies roughly 4,500 protein targets that produce 50,000 multi-objective prompts, each scored by a verifier that calculates docking results and properties such as QED, synthetic accessibility, and logP without any reference molecule. Models propose structures directly, receive immediate rewards, and can be improved through reinforcement learning. The authors benchmark several open-source LLMs and then fine-tune a 128B model with GRPO to demonstrate measurable gains on the benchmark while documenting a resulting loss in molecular diversity. This setup creates a scalable testbed where verifiable outcomes guide step-by-step reasoning toward novel compounds.

Core claim

MolRGen is a benchmark and molecular verifier containing approximately 4,500 protein-pocket targets that yield 50k multi-objective optimization prompts. The verifier computes docking scores together with molecular properties at generation time, enabling training and evaluation of reasoning LLMs on molecules proposed entirely from scratch. Benchmarking of general and chemistry-specialized models reveals performance differences, and fine-tuning a 128B LLM via GRPO improves scores but reduces molecular diversity, a diversity-exploitation trade-off. The framework supports study of verifier-based reasoning and reinforcement learning in molecular design.

What carries the argument

The MolRGen molecular verifier evaluates each generated molecule in real time by running docking simulations and calculating property scores, which supplies rewards for reinforcement learning without reference structures.
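
How verifier rewards could drive a GRPO-style update can be sketched as follows. This is a minimal illustration of the group-relative advantage in the standard GRPO formulation, where each completion's reward is centered on the mean of its group and scaled by the group's standard deviation. It does not reproduce any loss modifications the paper adopts; the function name, the use of the population standard deviation, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward on the group mean and scale by the group std.

    `rewards` holds the verifier rewards of the completions sampled for
    one prompt. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier rewards for a group of 4 completions of one prompt.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions scoring above the group average receive a positive advantage and those below receive a negative one, so no separate value model is needed to rank samples for the same prompt.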

If this is right

  • Reasoning LLMs can be trained to optimize multiple objectives at once through immediate verifier feedback during generation.
  • A diversity-aware top-k metric quantifies whether high-scoring outputs come from structurally varied molecules.
  • GRPO fine-tuning on the verifier improves benchmark scores for a 128B model.
  • The observed diversity-exploitation trade-off appears when models focus on maximizing verifier rewards.
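
One way a diversity-aware top-k score could work is sketched below. The paper's exact definition is not given on this page, so the greedy selection rule, the zero padding for unfilled slots, and the representation of fingerprints as sets of on-bits are all assumptions of this sketch; only the idea of combining a similarity threshold with a top-k score comes from the source.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def diversity_aware_top_k(candidates, k, sim_threshold):
    """Greedy sketch: mean score of up to k mutually dissimilar molecules.

    candidates: list of (score, fingerprint_set) pairs.
    A molecule is kept only if its similarity to every molecule already
    kept is below sim_threshold. Unfilled slots count as 0 (assumption).
    """
    kept = []
    for score, fp in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(tanimoto(fp, other) < sim_threshold for _, other in kept):
            kept.append((score, fp))
        if len(kept) == k:
            break
    return sum(score for score, _ in kept) / k


# Hypothetical scored molecules; the first two share most fingerprint bits.
candidates = [
    (0.90, {1, 2, 3, 4}),
    (0.85, {1, 2, 3, 5}),
    (0.60, {7, 8, 9}),
    (0.40, {10, 11}),
]
strict = diversity_aware_top_k(candidates, k=2, sim_threshold=0.5)
loose = diversity_aware_top_k(candidates, k=2, sim_threshold=1.0)
print(strict, loose)
```

With the stricter threshold the near-duplicate second molecule is skipped in favor of a structurally different one, so the score drops relative to a plain top-k. That gap is the kind of signal such a metric is meant to expose.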

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The verifier approach could be applied to other design domains where outcomes are computable but difficult to specify in natural language alone.
  • Expanding the set of protein targets would allow tests of whether the same reasoning patterns generalize across unrelated biological systems.
  • Closing the loop with periodic laboratory measurements of top-scoring molecules would show how well the computational rewards predict actual success.

Load-bearing premise

Docking scores and computed molecular properties provide a reliable proxy for real-world binding affinity and synthesizability when molecules are generated without any reference compounds.

What would settle it

An experiment that synthesizes and tests the binding affinity of a set of molecules proposed by the fine-tuned model versus those from the base model would directly test whether the reported performance gains hold in the laboratory.

Figures

Figures reproduced from arXiv: 2603.18256 by Ismail Ben Ayed, Maxime Darrin, Pablo Piantanida, Philippe Formont.

Figure 2
Figure 2: Diversity-aware top-k score. Evaluation of the diversity-aware top-k score (y-axis) against varying similarity thresholds (x-axis) between candidate clusters.
Figure 3
Figure 3: Property prediction performance. Accuracy of the LLMs on classification tasks (left), and normalized Spearman correlation on regression tasks (right).
Figure 4
Figure 4: Overview of the target proteins. (a) Function of the proteins extracted from the PDB. The dataset comprises 21 molecular functions with at least 10 targets, with kinases the most common (30%). (b) Annotation score of the proteins on UniProt (from 1 to 5). The vast majority of the target proteins are high-quality proteins with strong evidence of their existence.
Figure 5
Figure 5: Task sizes in the molecular property prediction objectives. The vast majority of tasks are regression tasks, and the largest benchmark used is the TDC benchmark. Axis: Count. Legend (origin): novartis, tdcommons, polaris, asap-discovery, biogen.
Figure 6
Figure 6: Scaffold occurrence in the various benchmarks. Occurrences of the most frequent Murcko scaffolds (of at least 6 atoms) in each benchmark, illustrating the chemical diversity across tasks.
Figure 7
Figure 7: Overview of the molecular reaction dataset generation pipeline. Iterative stochastic process of synthesis generation: initialization with seed reactions, relaxed filtering for early steps, property filtering for later steps, probabilistic product selection, and chain extension up to 5 reaction steps.
Figure 9
Figure 9: Frequency and chemical diversity of reaction templates.
Figure 10
Figure 10: Training curves of RL-Mistral. Evolution of the average reward (left) and average completion length (right) during training.
Figure 11
Figure 11: Validity of the generated completions. Generations can be invalid due to no answer being generated in the expected format, no SMILES being parsed in the answer, no valid SMILES, or multiple SMILES being proposed.
Figure 12
Figure 12: Uniqueness and diversity evolution with the number of rollouts. We display the uniqueness (left) and diversity (right) of the generated molecules with respect to the number of rollouts. The figure at the center displays the average number of prompts a given molecule appears in.
Figure 13
Figure 13: Evolution of the top-k score with the number of rollouts. Evolution of the top-k score as we sample more molecules per prompt. The x-axis represents the number of rollouts divided by the value of k.
Figure 14
Figure 14: Diversity-aware top-k score for different fingerprints. We display the diversity-aware metric when the similarity between molecules is based on: ECFP, MACCS, Gobbi2d, MACCS, and Avalon fingerprints.
Figure 15
Figure 16
Figure 16: Ether0 refusal. Representative examples where Ether0 refuses to generate a molecule, interpreting property-optimization or prediction instructions as requests to produce harmful substances.
Original abstract

Recent reasoning-based large language models have shown strong performance on tasks with verifiable outcomes, but their use in de novo molecular generation remains limited by the lack of training environments where rewards can be computed without reference molecules. We introduce MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. MolRGen contains approximately 4,500 protein-pocket targets, resulting in 50k multi-objective optimization prompts combining docking scores with molecular properties such as QED, synthetic accessibility, logP, and physicochemical descriptors. Unlike caption-based generation or molecule-editing benchmarks, MolRGen evaluates molecules proposed from scratch by computing rewards at generation time. We benchmark general-purpose and chemistry-specialized open-source LLMs and introduce a diversity-aware top-k metric to measure whether models can generate a diverse set of high-scoring molecules. Finally, we use the verifier to fine-tune a 128B LLM with GRPO, showing improved performance, at the cost of a diversity-exploitation trade-off. MolRGen provides a scalable testbed for studying verifier-based reasoning and reinforcement learning in molecular design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. It features approximately 4,500 protein-pocket targets leading to 50k multi-objective prompts involving docking scores, QED, SA, logP, and descriptors. The work benchmarks LLMs, proposes a diversity-aware top-k metric, and shows that fine-tuning a 128B LLM with GRPO using the verifier improves performance, albeit with a diversity-exploitation trade-off.

Significance. If the reported improvements hold and the verifier provides a meaningful proxy, this establishes a valuable testbed for verifier-based RL in molecular design, potentially accelerating the application of reasoning models to chemistry by providing verifiable rewards without reference structures. The introduction of the diversity metric is a positive step toward balanced generation.

major comments (2)
  1. [Fine-tuning results] The claim that GRPO fine-tuning leads to improved performance is central but lacks specific quantitative evidence such as pre- and post-fine-tuning scores on docking, QED, or the top-k metric, as well as details on the number of training steps or reward curves; this undermines assessment of the practical utility.
  2. [Verifier and benchmark construction] The multi-objective reward computation is described at a high level; the paper should specify the exact aggregation method (e.g., weighted sum, Pareto optimization) and any validation against known molecular datasets to ensure the scores are not arbitrary.
minor comments (2)
  1. The title contains 'Reasonning' which is likely a misspelling of 'Reasoning'.
  2. [Abstract] The abstract could benefit from a brief mention of the scale of the benchmark (e.g., number of molecules generated or evaluation protocol) for better context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide the requested details and clarifications.

Point-by-point responses
  1. Referee: [Fine-tuning results] The claim that GRPO fine-tuning leads to improved performance is central but lacks specific quantitative evidence such as pre- and post-fine-tuning scores on docking, QED, or the top-k metric, as well as details on the number of training steps or reward curves; this undermines assessment of the practical utility.

    Authors: We agree that quantitative details are necessary to substantiate the fine-tuning claims. In the revised version, we will add a dedicated results table reporting pre- and post-GRPO scores on docking, QED, SA, logP, and the diversity-aware top-k metric. We will also include the number of training steps, training reward curves, and any relevant hyperparameters to allow full assessment of the improvements and the noted diversity-exploitation trade-off. revision: yes

  2. Referee: [Verifier and benchmark construction] The multi-objective reward computation is described at a high level; the paper should specify the exact aggregation method (e.g., weighted sum, Pareto optimization) and any validation against known molecular datasets to ensure the scores are not arbitrary.

    Authors: We acknowledge that the aggregation method requires explicit specification. In the revision, we will detail that the multi-objective reward is computed as a weighted sum of normalized individual scores (docking, QED, SA, logP, and physicochemical descriptors) with weights chosen to balance the objectives. We will also add a validation section comparing the verifier outputs against established datasets (e.g., known active compounds from PDBbind or ChEMBL) to demonstrate that the scores align with expected trends and are not arbitrary. revision: yes
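
The weighted-sum aggregation described in the simulated response above can be sketched as follows. This is an illustration only: the min-max normalization, the bounds, the direction flags, and the weights are hypothetical values chosen for the example, not the paper's settings. The SA range of 1 to 10 and the QED range of 0 to 1 follow the usual definitions of those scores, while the docking bounds are arbitrary.

```python
def normalize(value, low, high, higher_is_better=True):
    """Clip value to [low, high] and rescale it to [0, 1]."""
    clipped = min(max(value, low), high)
    scaled = (clipped - low) / (high - low)
    return scaled if higher_is_better else 1.0 - scaled


# Hypothetical objectives: name -> (low, high, higher_is_better, weight).
OBJECTIVES = {
    "docking": (-12.0, 0.0, False, 0.5),  # more negative score is better
    "qed": (0.0, 1.0, True, 0.3),         # drug-likeness in [0, 1]
    "sa": (1.0, 10.0, False, 0.2),        # lower SA score is easier to make
}


def weighted_reward(raw_scores, objectives=OBJECTIVES):
    """Weighted sum of normalized scores, divided by the total weight."""
    total_weight = sum(weight for *_, weight in objectives.values())
    total = 0.0
    for name, (low, high, better, weight) in objectives.items():
        total += weight * normalize(raw_scores[name], low, high, better)
    return total / total_weight


# Hypothetical raw verifier outputs for one generated molecule.
reward = weighted_reward({"docking": -9.0, "qed": 0.6, "sa": 2.8})
print(reward)
```

Dividing by the total weight keeps the reward in [0, 1] whatever weights are chosen, which makes rewards comparable across prompts that combine different numbers of objectives. A Pareto-based scheme, the alternative the referee mentions, would instead rank molecules without collapsing the objectives into one number.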

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces a new benchmark (MolRGen) and verifier that defines multi-objective rewards from docking scores, QED, SA, logP and descriptors computed at generation time without reference molecules. It then benchmarks LLMs on this setting and applies GRPO fine-tuning to maximize the same verifier signal, reporting the resulting performance lift and diversity trade-off. This is an expected empirical outcome of closed-loop optimization on a self-defined reward rather than a claimed first-principles derivation that reduces to its inputs by construction. No load-bearing step matches any enumerated circularity pattern; the work is self-contained with newly introduced data, prompts and evaluation metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark relies on established computational chemistry methods for docking and property calculation, with no new free parameters or entities introduced in the abstract.

axioms (1)
  • domain assumption Docking scores and molecular properties like QED, synthetic accessibility, and logP can be reliably computed for any proposed molecule
    This underpins the reward computation in the verifier at generation time.

pith-pipeline@v0.9.0 · 5506 in / 1243 out tokens · 44005 ms · 2026-05-15T09:25:30.404024+00:00 · methodology


