Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
Are large language models superhuman chemists?arXiv preprint arXiv:2404.01475
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
background 2polarities
background 2representative citing papers
ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.
Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
Sci-PRM is a tool-aware process reward model trained on the SCIPRM70K dataset to provide fine-grained supervision for scientific reasoning and shown to boost foundation models via Best-of-N selection and RL.
Hackathon submissions indicate LLMs are moving from general assistants toward composable multi-agent systems for structuring scientific knowledge and automating tasks in materials science and chemistry.
citing papers explorer
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.