A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Operadic consistency: a label-free signal for compositional reasoning failures in LLMs
  Operadic consistency is a new per-question signal that correlates strongly with accuracy (r = 0.86-0.94) across four multi-hop QA datasets and improves selective prediction over CoT-SC baselines.
- CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage
  Introduces the CulMind benchmark, the CulMind-R reasoning subset, and the ReaScore metric to evaluate MLLMs on multimodal understanding and reasoning quality for Chinese cultural heritage.