Canonical reference

Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations

Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, Yu Li · 2025 · arXiv 2505.21318

Canonical reference. 80% of citing Pith papers cite this work as background.

10 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 10 citing papers

citation-role summary

background 4 dataset 1

citation-polarity summary

background 4 use dataset 1

representative citing papers

oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

cs.AI · 2025-10-09 · unverdicted · novelty 8.0

oMeBench and oMeS provide the first large-scale expert-annotated benchmark and dynamic scoring method for assessing LLM performance on organic mechanism elucidation and multi-step reasoning.

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.

FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.

ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

q-bio.QM · 2026-04-07 · unverdicted · novelty 7.0

ToxReason is an AOP-grounded benchmark that evaluates LLMs on mechanistic organ-level toxicity reasoning from molecular initiating events to adverse outcomes, showing that high predictive accuracy does not guarantee faithful biological explanations.

MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.

Bolek: A Multimodal Language Model for Molecular Reasoning

cs.LG · 2026-05-04 · unverdicted · novelty 5.0

Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

cs.AI · 2026-04-02 · unverdicted · novelty 5.0 · 2 refs

MolClaw deploys a hierarchical skill architecture to reach state-of-the-art results on a new benchmark of multi-step drug discovery tasks.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

cs.CV · 2026-04-20

citing papers explorer

Showing 8 of 8 citing papers after filters.

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models cs.AI · 2026-06-02 · unverdicted · none · ref 60
ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.
FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 6
FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 15
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway q-bio.QM · 2026-04-07 · unverdicted · none · ref 3
ToxReason is an AOP-grounded benchmark that evaluates LLMs on mechanistic organ-level toxicity reasoning from molecular initiating events to adverse outcomes, showing that high predictive accuracy does not guarantee faithful biological explanations.
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification cs.AI · 2026-05-12 · unverdicted · none · ref 12
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
Bolek: A Multimodal Language Model for Molecular Reasoning cs.LG · 2026-05-04 · unverdicted · none · ref 14
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization cs.AI · 2026-04-02 · unverdicted · none · ref 36 · 2 links
MolClaw deploys a hierarchical skill architecture to reach state-of-the-art results on a new benchmark of multi-step drug discovery tasks.
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation cs.CV · 2026-04-20 · unreviewed · ref 24

Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer