When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Alex Aliper; Alex Zhavoronkov; Bogdan Zagribelnyy; Ivan Ilin; Maksim Kuznetsov; Mathieu Reymond; Mikolaj Mizera; Nikita Bondarev; Rim Shayakhmetov; Roman Schutski

arXiv 2602.03554 v2 pith:B2DP64MX submitted 2026-02-03 cs.LG cs.AI cs.CE cs.CL

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

classification cs.LG cs.AI cs.CE cs.CL

keywords llms planning retrosynthesis synthesis benchmarks novel plausibility single

abstract

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and on Top-K accuracy against a single ground truth, neither of which captures the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.
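
To make the contrast in the abstract concrete, the following Python sketch compares Top-K exact match against a single ground truth with a plausibility-based score over the same predictions. It is a minimal illustration under stated assumptions, not the paper's benchmark or the ChemCensor metric: the `is_plausible` callable is a hypothetical stand-in, the toy whitelist is invented for the example, and reactant strings are assumed to be canonical SMILES already.

```python
from typing import Callable, FrozenSet, Sequence


def normalize(reactants: str) -> FrozenSet[str]:
    """Treat a '.'-separated reactant string as an unordered set.

    Assumption: each component is already a canonical SMILES string; a real
    pipeline would canonicalize with a cheminformatics toolkit first.
    """
    return frozenset(part for part in reactants.split(".") if part)


def top_k_exact_match(predictions: Sequence[str], ground_truth: str, k: int) -> bool:
    """True if any of the first k predictions equals the single recorded answer."""
    target = normalize(ground_truth)
    return any(normalize(p) == target for p in predictions[:k])


def top_k_plausible_fraction(
    predictions: Sequence[str],
    is_plausible: Callable[[str], bool],
    k: int,
) -> float:
    """Fraction of the first k predictions accepted by a plausibility check."""
    top = predictions[:k]
    if not top:
        return 0.0
    return sum(1 for p in top if is_plausible(p)) / len(top)


# Toy example: two disconnections of the same amide (N-benzylacetamide).
ground_truth = "CC(=O)Cl.NCc1ccccc1"      # acyl chloride + amine (the recorded route)
predictions = [
    "CC(=O)O.NCc1ccccc1",                 # carboxylic acid + amine (amide coupling)
    "NCc1ccccc1.CC(=O)Cl",                # recorded route, reactants reordered
]

# Hypothetical plausibility check: a hand-written whitelist, NOT ChemCensor.
TOY_PLAUSIBLE = {normalize(predictions[0]), normalize(ground_truth)}


def toy_is_plausible(reactants: str) -> bool:
    return normalize(reactants) in TOY_PLAUSIBLE


top1_exact = top_k_exact_match(predictions, ground_truth, k=1)
top2_exact = top_k_exact_match(predictions, ground_truth, k=2)
top1_plausible = top_k_plausible_fraction(predictions, toy_is_plausible, k=1)
```

In this toy case the first prediction fails Top-1 exact match because it differs from the recorded route, yet the plausibility check accepts it, which is the gap between single-answer scoring and plausibility scoring that the abstract describes.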

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

URSA: Chemistry-Aware Benchmark for Utilitarian Retrosynthesis Assessment
cs.LG 2026-07 accept novelty 6.0

Specialized retrosynthesis models outperform LLMs on chemically plausible multi-step routes when scored by the new URSA Solv-2 protocol using ChemCensor.