oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

2025 · cs.AI · arXiv 2510.07731

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise on chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using a prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

cs.AI · 2026-05-17 · unverdicted · novelty 5.0

ChemVA framework uses hybrid-granularity visual anchors and entity-name alignment to improve LLM performance on chemical reaction diagrams by ~20 points, reaching 92% structural accuracy on the new OCRD-Bench dataset.

citing papers explorer

Showing 2 of 2 citing papers.

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
cs.AI · 2026-04-15 · unverdicted · none · ref 50 · internal anchor

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding
cs.AI · 2026-05-17 · unverdicted · none · ref 43 · internal anchor
