Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

2026 · cs.CL · arXiv 2602.09805

1 Pith paper cites this work.


abstract

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by the declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate a shared set of 14 open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic, and 11 additional models on CogniLoad alone, which enables a fine-grained analysis of three reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, and are more robust than accuracy rankings. The decomposition also separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and a reporting template that help explain why an LLM is inefficient at reasoning.
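
The abstract does not state the decomposition formula, so the following Python sketch is only one plausible reading of it. It assumes that token efficiency means accuracy per generated token and that a run which does not complete counts as incorrect. Under those assumptions, accuracy factors exactly into completion rate times conditional correctness given completion, and dividing by mean generated length gives the efficiency. The names `Run` and `decompose` are hypothetical and do not come from the paper or its released artifact.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Run:
    """One model attempt on one instance (hypothetical record layout)."""
    completed: bool  # a final answer was produced within the token budget
    correct: bool    # the final answer matched the reference
    tokens: int      # number of generated tokens


def decompose(runs: Sequence[Run]) -> Dict[str, float]:
    """Split accuracy-per-token into three observables.

    Assumed identity (not taken from the paper):
        efficiency = completion_rate * cond_correct / mean_tokens
    """
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")

    completed = [r for r in runs if r.completed]
    completion_rate = len(completed) / n
    # Correctness is conditioned on completion; defined as 0 if nothing completed.
    cond_correct = (
        sum(r.correct for r in completed) / len(completed) if completed else 0.0
    )
    mean_tokens = sum(r.tokens for r in runs) / n
    if mean_tokens <= 0:
        raise ValueError("mean generated length must be positive")

    # Incomplete runs are counted as incorrect (assumption).
    accuracy = sum(r.completed and r.correct for r in runs) / n
    return {
        "completion_rate": completion_rate,
        "cond_correct": cond_correct,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "efficiency": accuracy / mean_tokens,
    }
```

Reporting the three factors side by side is what lets two models with the same accuracy-per-token be told apart: a low completion rate points to truncation, a low conditional correctness to faulty reasoning, and a high mean length to verbosity.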

representative citing papers

Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

Curating concise data for VLMs induces brevity, delivering 35x lower Cost-of-Pass at near-identical accuracy and higher matched-length accuracy than uncurated baselines.

