A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains

Jacovi, Alon, Bitton, Yonatan, Bohnet, Bernd, Herzig, Jonathan, Honovich, Or, Tseng, Michael · 2024 · DOI 10.18653/v1/2024.acl-long.254

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open at publisher browse 3 citing papers

representative citing papers

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

Operadic consistency is a new per-question signal that correlates strongly with accuracy (r 0.86-0.94) across four multi-hop QA datasets and improves selective prediction over CoT-SC baselines.

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.

CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage

cs.CL · 2026-06-19 · unverdicted · novelty 6.0

Introduces CulMind benchmark, CulMind-R reasoning subset, and ReaScore metric to evaluate MLLMs on Chinese cultural heritage multimodal understanding and reasoning quality.

citing papers explorer

Showing 1 of 1 citing paper after filters.

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs cs.LG · 2026-06-26 · unverdicted · none · ref 66
Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.

A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains

fields

years

verdicts

representative citing papers

citing papers explorer