arxiv: 2606.03660 · v2 · pith:OW74HKNXnew · submitted 2026-06-02 · 💻 cs.AI

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Pith reviewed 2026-06-28 09:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords chemical reasoning · LLM evaluation · process supervision · verifiable benchmark · chemistry tasks · reasoning traces · model diagnostics

The pith

Large language models often output correct chemistry answers while their reasoning steps violate chemical logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChemCoTBench-V2, which evaluates LLMs on chemistry tasks by their structured reasoning as well as their final answers. Models must follow templates that expose intermediate steps, and those steps are verified with deterministic chemistry rules rather than LLM judges. This setup reveals a gap: a model can reach the right answer, or follow the required format, and still fail the chemical-step checks. A sympathetic reader would care because an LLM used as a chemistry assistant needs a trustworthy reasoning path, not just a lucky final output. The benchmark spans multiple tasks and provides auditable, low-cost evaluation.

Core claim

ChemCoTBench-V2 requires models to expose key intermediate steps in expert-designed templates for molecular understanding, editing, optimization, and reaction prediction tasks. These steps are checked with deterministic chemistry rules and reference traces. Experiments on frontier models show a persistent gap between final-answer success and structured-reasoning-state consistency, with models often following format while failing chemical checks or answering correctly with weak reasoning.

What carries the argument

ChemCoTBench-V2 benchmark that uses rule-verifiable templates and deterministic chemistry rules to audit intermediate reasoning states instead of relying on LLM judges.

If this is right

  • The benchmark enables identification of the exact step where a reasoning trace violates chemical logic.
  • Evaluation of LLMs becomes scalable and consistent without costly human annotation or inconsistent LLM judges.
  • Three separate signals are reported: final-answer correctness, template adherence, and step-wise verifier correctness.
  • Open-ended molecular optimization is assessed with oracle-verifiable state constraints.
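
The three signals and the "exact step" diagnosis listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Trace` record, the slot names, and the toy equality checks are hypothetical and are not the benchmark's actual schema or rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trace:
    """Hypothetical record for one model response."""
    final_answer: str
    reference_answer: str
    steps: dict[str, str]  # template slot -> value the model committed to


# A check pairs a template slot with a deterministic predicate on its value.
Check = tuple[str, Callable[[str], bool]]


def template_ok(trace: Trace, required_slots: list[str]) -> bool:
    """Template adherence: every required slot is present and non-empty."""
    return all(trace.steps.get(slot, "").strip() for slot in required_slots)


def first_violation(trace: Trace, checks: list[Check]) -> Optional[str]:
    """Return the first slot (in template order) whose check fails, else None."""
    for slot, check in checks:
        if not check(trace.steps.get(slot, "")):
            return slot
    return None


def evaluate(trace: Trace, checks: list[Check]) -> dict:
    """Report the three signals separately, plus the first failing step."""
    bad_slot = first_violation(trace, checks)
    return {
        "final_correct": trace.final_answer == trace.reference_answer,
        "template_ok": template_ok(trace, [slot for slot, _ in checks]),
        "steps_ok": bad_slot is None,
        "first_violation": bad_slot,
    }


# Toy case (illustrative values): dehydration of ethanol to ethene.
checks = [
    ("reactant_formula", lambda v: v == "C2H6O"),
    ("leaving_group", lambda v: v == "H2O"),
    ("product_formula", lambda v: v == "C2H4"),
]
trace = Trace(
    final_answer="C=C",
    reference_answer="C=C",
    steps={
        "reactant_formula": "C2H6O",
        "leaving_group": "HCl",  # wrong intermediate commitment
        "product_formula": "C2H4",
    },
)
result = evaluate(trace, checks)
print(result)
```

In this toy case the final answer matches and every slot is filled, yet the step-wise signal fails at `leaving_group`, which is the kind of dissociation the benchmark is designed to surface.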

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be extended to create similar rule-based verifiers for reasoning in other domains like physics or mathematics.
  • Training LLMs with rewards based on step-wise verifier scores might improve chemical reasoning beyond just answer accuracy.
  • The identified gap suggests current models may produce unreliable outputs in real-world chemical design workflows where step validity is crucial.

Load-bearing premise

The expert-designed templates and deterministic chemistry rules used for verification accurately capture valid chemical reasoning processes without bias, false positives, or omission of alternative valid reasoning paths.

What would settle it

Finding that step-wise verifier correctness rates closely match final-answer success rates across multiple frontier models would falsify the claimed persistent gap.
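
One way to make this test concrete is a paired comparison over the same samples. The sketch below is an assumption, not a procedure described in the paper: it computes both rates, their difference, and an exact McNemar (sign) test on the discordant pairs. The input pairs are invented toy values.

```python
from math import comb


def paired_gap(pairs):
    """pairs: list of (final_correct, steps_ok) booleans, one per sample."""
    n = len(pairs)
    final_rate = sum(1 for f, _ in pairs if f) / n
    step_rate = sum(1 for _, s in pairs if s) / n
    # Discordant pairs: correct answer with failed steps (b), and the reverse (c).
    b = sum(1 for f, s in pairs if f and not s)
    c = sum(1 for f, s in pairs if s and not f)
    m = b + c
    if m == 0:
        p_value = 1.0  # no discordant pairs, so no evidence of a gap
    else:
        # Two-sided exact binomial test with success probability 1/2.
        tail = sum(comb(m, k) for k in range(min(b, c) + 1)) / 2 ** m
        p_value = min(1.0, 2 * tail)
    return {
        "final_rate": final_rate,
        "step_rate": step_rate,
        "gap": final_rate - step_rate,
        "discordant": (b, c),
        "p_value": p_value,
    }


# Toy input: 6 right-answer/failed-steps, 3 fully correct, 1 fully wrong.
toy = [(True, False)] * 6 + [(True, True)] * 3 + [(False, False)]
summary = paired_gap(toy)
print(summary)
```

A gap near zero with a large p-value across several frontier models would point toward the falsifying outcome; a large gap dominated by right-answer, failed-step pairs would support the paper's claim.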

Figures

Figures reproduced from arXiv: 2606.03660 by Gongbo Zhang, Hao Li, He Cao, Hongyu Guo, Li Yuan.

Figure 1: ChemCoTBench-V2 evaluates structured, verifier-addressable chemical reasoning traces beyond final answers with three signals: Layer 1 outcome correctness, Layer 2 template adherence, and Layer 3 step-wise validity under deterministic task-specific checks.

Figure 2: Unified framework for reference construction and evaluation.

Figure 3: Sample-level diagnostic case from the Qwen3.5 Plus forward-reaction evaluation. The figure shows the …

Figure 4: Active sample counts for the 31 fine-grained chemical tasks. The distribution is uniform within each task …

Figure 5: Reduction from the initial construction pool to …

Figure 6: Reaction-class composition of the molecule-editing active set. Each edit type contains 300 samples.
Original abstract

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ChemCoTBench-V2, a benchmark spanning molecular understanding, editing, optimization, and reaction prediction (5,620 samples across 18 tasks). Models must output structured reasoning traces in expert-designed templates; these traces are checked for template adherence and step-wise correctness using deterministic chemistry rules and reference traces (rather than LLM judges). Open-ended optimization uses oracle-verifiable state constraints. Experiments on frontier models show a persistent gap: high final-answer accuracy often co-occurs with failures on chemical-step checks or weak supporting reasoning, while format adherence does not guarantee verifier correctness. The benchmark aims to enable fine-grained, auditable diagnosis of where reasoning first violates chemical logic.

Significance. If the verifier rules and templates are faithful, the work supplies a scalable, low-cost, and auditable process-level evaluation method for chemical reasoning that avoids the inconsistencies of LLM judges. It directly quantifies the dissociation between answer correctness and reasoning-state consistency, which is a practically important failure mode for chemistry assistants. The use of external deterministic rules and reference traces (independent of the evaluated models) is a clear methodological strength that reduces circularity.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Benchmark Design): The central claim of a persistent gap between final-answer success and step-wise verifier correctness is load-bearing on the assumption that the expert-designed templates and deterministic chemistry rules faithfully capture valid chemical reasoning. The manuscript provides no reported inter-expert validation, coverage analysis of alternative valid reasoning paths, or measured false-positive rate of the verifier against human chemists; without this, flagged failures could reflect template mismatch rather than defective reasoning.
  2. [§4 and Table 2] §4 (Experiments) and Table 2 (model results): The reported gap is quantified only via aggregate signals (final-answer correctness, template adherence, step-wise verifier correctness). No per-task breakdown or statistical test is described that isolates whether the gap persists after controlling for task difficulty or template strictness; this weakens the claim that the gap is a general property of frontier models rather than an artifact of specific template choices.
minor comments (2)
  1. [Abstract] The abstract states 5,620 samples but does not clarify whether this count includes only unique tasks or multiple templates per task; a clarifying sentence would aid reproducibility.
  2. [§3] Notation for the three reported signals (final-answer correctness, template adherence, step-wise verifier correctness) is introduced without an explicit equation or table defining how each is computed from the verifier output.
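
The explicit definitions this comment asks for might take roughly the following form. This is an assumed formalization, not one taken from the paper: $N$ is the number of samples, $\hat{y}_i$ and $y_i$ are the predicted and reference answers, $K_i$ is the number of verified steps in sample $i$, and $v_{i,k}$ is the verifier's pass/fail verdict on step $k$.

```latex
\begin{align*}
\text{FinalAcc} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\hat{y}_i = y_i\right] \\
\text{TemplateAdh} &= \frac{1}{N}\sum_{i=1}^{N}
  \mathbf{1}\!\left[\text{response } i \text{ fills every required template slot}\right] \\
\text{StepAcc} &= \frac{1}{N}\sum_{i=1}^{N} \frac{1}{K_i}\sum_{k=1}^{K_i} v_{i,k},
  \qquad v_{i,k} \in \{0, 1\}
\end{align*}
```

The paper may aggregate differently, for example by counting a sample as step-correct only when all $K_i$ steps pass; stating which convention is used is exactly what the comment requests.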

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below and propose revisions where the concerns identify areas for strengthening the manuscript.

point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Design): The central claim of a persistent gap between final-answer success and step-wise verifier correctness is load-bearing on the assumption that the expert-designed templates and deterministic chemistry rules faithfully capture valid chemical reasoning. The manuscript provides no reported inter-expert validation, coverage analysis of alternative valid reasoning paths, or measured false-positive rate of the verifier against human chemists; without this, flagged failures could reflect template mismatch rather than defective reasoning.

    Authors: We acknowledge that the fidelity of our templates and rules to expert chemical reasoning is central to the benchmark's validity. The templates were designed by chemists with domain expertise and refined through iterative review to align with standard chemical logic (e.g., ensuring correct atom mapping and valence satisfaction in reaction steps). The deterministic rules implement verifiable checks such as molecular formula consistency and reaction balance, which are independent of any model. However, the original manuscript does not include a formal inter-expert validation study, coverage of alternative reasoning paths, or a quantified false-positive rate against human judgments. In the revised version, we will expand §3 to detail the template design and refinement process by experts, and add a limitations section explicitly noting the lack of quantitative human validation metrics. This will clarify the assumptions while maintaining the benchmark's focus on rule-based verifiability. revision: yes

  2. Referee: [§4 and Table 2] §4 (Experiments) and Table 2 (model results): The reported gap is quantified only via aggregate signals (final-answer correctness, template adherence, step-wise verifier correctness). No per-task breakdown or statistical test is described that isolates whether the gap persists after controlling for task difficulty or template strictness; this weakens the claim that the gap is a general property of frontier models rather than an artifact of specific template choices.

    Authors: The aggregate results in §4 and Table 2 demonstrate the gap across a diverse set of 18 tasks and multiple models, suggesting it is not isolated to particular cases. Nevertheless, we agree that per-task analysis and controls for difficulty would provide stronger evidence. In the revision, we will include a supplementary per-task breakdown of the three metrics and add a short discussion or statistical summary (e.g., noting consistent patterns across task categories) to address whether the gap holds after accounting for task-specific factors. This will better support the generality of the finding. revision: yes
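
The per-task breakdown proposed in the responses above could be computed along the following lines. This is a minimal sketch under stated assumptions: the record layout and the task names are hypothetical, and the values are invented for illustration.

```python
from collections import defaultdict


def per_task_gap(records):
    """records: iterable of (task, final_correct, steps_ok) tuples."""
    counts = defaultdict(lambda: [0, 0, 0])  # samples, final hits, step hits
    for task, final_ok, steps_ok in records:
        row = counts[task]
        row[0] += 1
        row[1] += bool(final_ok)
        row[2] += bool(steps_ok)
    return {
        task: {
            "n": n,
            "final_rate": final_hits / n,
            "step_rate": step_hits / n,
            "gap": (final_hits - step_hits) / n,
        }
        for task, (n, final_hits, step_hits) in sorted(counts.items())
    }


# Toy records with hypothetical task names.
records = [
    ("forward_reaction", True, False),
    ("forward_reaction", True, True),
    ("forward_reaction", False, False),
    ("forward_reaction", True, False),
    ("molecule_editing", True, True),
    ("molecule_editing", True, True),
]
breakdown = per_task_gap(records)
for task, row in breakdown.items():
    print(task, row)
```

Reporting the gap per task makes it visible whether the aggregate figure is driven by a few strict templates or holds across task categories.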

standing simulated objections not resolved
  • Providing a measured false-positive rate of the verifier against human chemists would require a new inter-expert annotation study, which we cannot complete within the scope of this revision.

Circularity Check

0 steps flagged

No circularity: evaluation rests on external deterministic rules independent of models

full rationale

The paper introduces ChemCoTBench-V2 using expert-designed templates checked by deterministic chemistry rules and reference traces (or oracle-verifiable constraints) that are fixed and external to the evaluated LLMs. The reported gap between final-answer correctness and step-wise verifier correctness is computed directly against these independent verifiers rather than any fitted parameter, self-citation chain, or self-definitional loop. No equations, ansatzes, or uniqueness theorems are invoked that reduce the central claim to the paper's own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution rests on domain assumptions about the reliability of expert-designed templates and deterministic chemistry rules rather than new mathematical axioms, fitted parameters, or invented entities.

axioms (2)
  • domain assumption Expert-designed templates can elicit structured chemical reasoning from LLMs that exposes verifiable intermediate commitments.
    The evaluation framework depends on models following these templates to expose intermediates for checking.
  • domain assumption Deterministic chemistry rules can be formulated and applied to verify step correctness reliably across the covered tasks.
    Central to the verifier-addressable and auditable evaluation without LLM judges.
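
To make the second assumption tangible, here is a minimal sketch of one deterministic rule of the kind the review mentions, an atom-balance check on a reaction. It is not the paper's verifier: the parser is an assumption that only handles flat formulas such as `C2H6O` (no brackets, charges, or isotopes), and a real pipeline would more plausibly rely on a cheminformatics toolkit such as RDKit.

```python
import re
from collections import Counter

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat molecular formula; raise on anything unparseable."""
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            break  # skipped characters mean the formula is malformed
        element, digits = match.groups()
        counts[element] += int(digits) if digits else 1
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"unparseable formula: {formula!r}")
    return counts


def is_balanced(reactants: list[str], products: list[str]) -> bool:
    """True when both sides contain the same atoms in the same amounts."""
    left = sum((parse_formula(f) for f in reactants), Counter())
    right = sum((parse_formula(f) for f in products), Counter())
    return left == right


# Fischer esterification: acetic acid + ethanol -> ethyl acetate + water.
print(is_balanced(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"]))
# The same step with the water by-product omitted fails the check.
print(is_balanced(["C2H4O2", "C2H6O"], ["C4H8O2"]))
```

Stoichiometric coefficients are expressed here by repeating a formula in the list. A rule like this gives the same verdict on every run, which is what makes the evaluation auditable, but it says nothing about whether a balanced step is mechanistically plausible.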

pith-pipeline@v0.9.1-grok · 5790 in / 1391 out tokens · 35062 ms · 2026-06-28T09:39:59.960031+00:00 · methodology

