oMeBench and oMeS provide the first large-scale expert-annotated benchmark and dynamic scoring method for assessing LLM performance on organic mechanism elucidation and multi-step reasoning.
Canonical reference
Beyond Chemical QA: Evaluating LLM’s Chemical Reasoning with Modular Chemical Operations
Canonical reference. 80% of citing Pith papers cite this work as background.
representative citing papers
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
- From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.
- FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization
FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
- Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
- ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway
ToxReason is an AOP-grounded benchmark that evaluates LLMs on mechanistic organ-level toxicity reasoning from molecular initiating events to adverse outcomes, showing that high predictive accuracy does not guarantee faithful biological explanations.
- MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
- Bolek: A Multimodal Language Model for Molecular Reasoning
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
- MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
MolClaw deploys a hierarchical skill architecture to reach state-of-the-art results on a new benchmark of multi-step drug discovery tasks.
- OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation