Canonical reference
Beyond Chemical QA: Evaluating LLM’s Chemical Reasoning with Modular Chemical Operations
Canonical reference. 80% of citing Pith papers cite this work as background.
citing papers explorer
- oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
  oMeBench and oMeS provide the first large-scale expert-annotated benchmark and dynamic scoring method for assessing LLM performance on organic mechanism elucidation and multi-step reasoning.
- FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization
  FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
- Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
  LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
- ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway
  ToxReason is an AOP-grounded benchmark that evaluates LLMs on mechanistic organ-level toxicity reasoning from molecular initiating events to adverse outcomes, showing that high predictive accuracy does not guarantee faithful biological explanations.
- MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
  MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
- Bolek: A Multimodal Language Model for Molecular Reasoning
  Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
- MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
  MolClaw deploys a hierarchical skill architecture to reach state-of-the-art results on a new benchmark of multi-step drug discovery tasks.
- A Survey of Reinforcement Learning for Large Reasoning Models
  A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
- OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation