Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs
Pith reviewed 2026-05-21 13:23 UTC · model grok-4.3
The pith
A trace-optional protocol decomposes LLM token efficiency using completion rate, conditional correctness, and generated length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that token efficiency can be exactly decomposed into completion rate, conditional correctness given completion, and generated length. When task workload metadata is available, generated length is normalized to separate verbalization overhead from workload-dependent scaling. This allows distinguishing logic-limited, context-limited, and verbosity-limited failure modes in reasoning tasks that appear the same under accuracy-per-token metrics.
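The decomposition can be illustrated with a minimal sketch. It assumes that token efficiency is the success rate divided by mean generated length, and that a run which does not complete counts as a failure. The `Run` record, the `decompose` function, and the toy runs are hypothetical names and values chosen for illustration, not the paper's released artifact.

```python
import math
from dataclasses import dataclass

@dataclass
class Run:
    completed: bool  # assumption: generation finished within the token budget
    correct: bool    # final answer judged correct
    tokens: int      # generated length in tokens

def decompose(runs):
    """Return (completion rate, conditional correctness, mean length, efficiency)."""
    n = len(runs)
    completed = [r for r in runs if r.completed]
    r_ctx = len(completed) / n
    # Conditional correctness is only defined when at least one run completed.
    r_logic = sum(r.correct for r in completed) / len(completed) if completed else 0.0
    mean_tokens = sum(r.tokens for r in runs) / n
    efficiency = r_ctx * r_logic / mean_tokens
    return r_ctx, r_logic, mean_tokens, efficiency

# Toy runs (illustrative placeholders, not data from the paper).
runs = [
    Run(True, True, 400),
    Run(True, False, 600),
    Run(False, False, 2000),  # truncated run
    Run(True, True, 1000),
]
r_ctx, r_logic, mean_tokens, e0 = decompose(runs)

# The product form matches success rate per token, and the log form is additive.
success_rate = sum(r.completed and r.correct for r in runs) / len(runs)
assert math.isclose(e0, success_rate / mean_tokens)
assert math.isclose(
    math.log(e0),
    math.log(r_ctx) + math.log(r_logic) - math.log(mean_tokens),
)
```

Because the three factors multiply, their logarithms add, which is what lets each factor be reported as a separate contribution to inefficiency.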
What carries the argument
The trace-optional evaluation protocol that decomposes token efficiency from three observables available even for closed models.
If this is right
- Efficiency and overhead rankings stay stable across benchmark pairs, and more robustly so than accuracy rankings.
- The decomposition identifies specific failure modes: logic-limited, context-limited (truncation), and verbosity-limited.
- Fine-grained analysis of reasoning-task difficulty factors (task length, intrinsic difficulty, and distractor density) becomes possible.
- The released evaluation artifact and reporting template help explain why a given LLM is inefficient at reasoning.
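One hypothetical way to read the failure modes off the decomposition is to compare the size of each log-scale penalty. The sketch below is an illustration under stated assumptions, not the paper's diagnostic rule: `ref_tokens` is an assumed reference length (for example, a workload scale), and the three toy profiles are made-up values chosen so that accuracy-per-token is identical.

```python
import math

def log_penalties(r_ctx, r_logic, mean_tokens, ref_tokens):
    """Per-factor penalties on a log scale (all rates must be > 0)."""
    return {
        "context-limited": -math.log(r_ctx),      # low completion rate (truncation)
        "logic-limited": -math.log(r_logic),      # low correctness given completion
        "verbosity-limited": math.log(mean_tokens / ref_tokens),  # length above reference
    }

def dominant_mode(r_ctx, r_logic, mean_tokens, ref_tokens):
    """Name of the largest penalty: a heuristic label, not a formal test."""
    penalties = log_penalties(r_ctx, r_logic, mean_tokens, ref_tokens)
    return max(penalties, key=penalties.get)

# Three made-up profiles with the same accuracy-per-token (3e-4).
profiles = {
    "model_a": (1.0, 0.3, 1000),
    "model_b": (0.3, 1.0, 1000),
    "model_c": (1.0, 0.9, 3000),
}
REF_TOKENS = 1000  # hypothetical reference length

for name, (r_ctx, r_logic, mean_tokens) in profiles.items():
    per_token = r_ctx * r_logic / mean_tokens
    print(name, round(per_token, 6), dominant_mode(r_ctx, r_logic, mean_tokens, REF_TOKENS))
```

The design choice here is deliberately simple: taking the maximum penalty gives a single label per model, whereas a fuller report would show all three terms side by side.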
Where Pith is reading between the lines
- Applying this protocol during model training could help optimize for lower overhead in reasoning.
- Similar decompositions might extend to other domains like code generation or multi-step planning.
- Developers could use the workload scale to create more balanced test sets that penalize verbosity.
Load-bearing premise
That completion rate, conditional correctness, and generated length plus a stable solver-derived workload scale are enough to exactly decompose token efficiency.
What would settle it
Evaluating the protocol on a held-out set of models and finding that the efficiency rankings change significantly under different perturbations to the reference pool would falsify the stability of the workload scale.
Original abstract
As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a trace-optional evaluation protocol that decomposes token efficiency of LLMs on reasoning tasks using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available it normalizes generated length by declared task-implied work to separate verbalization overhead from workload-dependent scaling; otherwise it defines an auditable solver-derived workload scale whose stability is evaluated under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. Experiments cover 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic plus 11 additional models on CogniLoad, showing that efficiency and overhead rankings are more stable across benchmark pairs than accuracy rankings and that the decomposition distinguishes logic-limited, context-limited, and verbosity-limited failure modes.
Significance. If the decomposition is shown to be exact and free of residual model-specific confounds, the protocol supplies a practical, trace-optional lens for diagnosing why LLMs expend tokens on reasoning tasks. The separation of overhead from workload scaling, the stability of efficiency rankings relative to accuracy, and the public release of an evaluation artifact and reporting template constitute concrete strengths that could improve comparative evaluation and targeted optimization of reasoning models.
major comments (2)
- [§3] §3 (protocol and decomposition): the central claim that the three observables together with the solver-derived workload scale 'exactly decompose' token efficiency is load-bearing. The reported perturbation tests establish stability of the scale under leave-self-out, leave-top-k, and held-out-reference-pool conditions, yet do not test whether the scale remains orthogonal to LLM reasoning trajectories that diverge from the solver (e.g., on high-distractor-density instances). Without such a test the decomposition risks being approximate rather than exact.
- [§4–5] §4–5 (empirical results): the claim that efficiency rankings are 'more robustly' stable than accuracy rankings across benchmark pairs is central to the practical value. The manuscript should report quantitative rank-correlation or Kendall-τ values comparing efficiency versus accuracy stability; qualitative statements alone leave the magnitude of the improvement unclear.
minor comments (2)
- [§3] Notation for the normalized length (generated length divided by workload scale) should be introduced once with an explicit equation and then used consistently; current usage mixes verbal descriptions with symbols.
- [§5] The fine-grained difficulty-factor analysis (task length, intrinsic difficulty, distractor density) on CogniLoad is mentioned, but its precise operationalization, and whether it employs the efficiency decomposition or separate metrics, are not fully specified.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating planned revisions where appropriate. The comments help clarify the scope of our claims and strengthen the presentation of the evaluation protocol.
-
Referee: [§3] §3 (protocol and decomposition): the central claim that the three observables together with the solver-derived workload scale 'exactly decompose' token efficiency is load-bearing. The reported perturbation tests establish stability of the scale under leave-self-out, leave-top-k, and held-out-reference-pool conditions, yet do not test whether the scale remains orthogonal to LLM reasoning trajectories that diverge from the solver (e.g., on high-distractor-density instances). Without such a test the decomposition risks being approximate rather than exact.
Authors: We thank the referee for this observation. By construction, the decomposition expresses token efficiency as the product of completion rate, conditional correctness given completion, and the reciprocal of normalized length (generated length divided by the workload scale). When the scale is solver-derived, it is defined independently of any LLM trajectory and the decomposition holds exactly with respect to that reference scale. The reported perturbation tests (leave-self-out, leave-top-k, held-out-reference-pool) establish robustness of the scale to changes in the reference solver pool. We acknowledge that these tests do not directly measure orthogonality to LLM-specific paths that diverge from the solver, for example on high-distractor-density instances. Because the protocol is intentionally trace-optional, a complete orthogonality check is not feasible for closed models. In the revision we will (i) explicitly qualify the 'exact' claim as holding with respect to the chosen workload scale, (ii) add a limitations paragraph discussing possible residual model-specific confounds, and (iii) include an auxiliary analysis on the open-weight models for which traces are available, reporting the correlation between the solver-derived scale and LLM-generated lengths on instances where reasoning paths visibly diverge. revision: partial
-
Referee: [§4–5] §4–5 (empirical results): the claim that efficiency rankings are 'more robustly' stable than accuracy rankings across benchmark pairs is central to the practical value. The manuscript should report quantitative rank-correlation or Kendall-τ values comparing efficiency versus accuracy stability; qualitative statements alone leave the magnitude of the improvement unclear.
Authors: We agree that quantitative rank-stability metrics would make the central empirical claim more precise and easier to evaluate. The current manuscript supports the claim through direct cross-benchmark comparisons of ranking orderings, but does not report correlation coefficients. In the revised version we will compute and report Kendall-τ values for the stability of accuracy rankings, efficiency rankings, and overhead rankings across all benchmark pairs. These numbers will be added to the relevant tables and discussed in §4–5, allowing readers to assess the magnitude of the robustness improvement directly. revision: yes
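The rank-stability comparison requested here could be computed along the following lines. This is a minimal sketch assuming tie-free scores and the tau-a variant of Kendall's coefficient; `kendall_tau`, `mean_pairwise_tau`, and the benchmark keys are hypothetical names, and the numbers are placeholders rather than results from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def mean_pairwise_tau(scores_by_benchmark):
    """Average tau over all benchmark pairs; each list is indexed by model."""
    names = sorted(scores_by_benchmark)
    taus = [
        kendall_tau(scores_by_benchmark[a], scores_by_benchmark[b])
        for a, b in combinations(names, 2)
    ]
    return sum(taus) / len(taus)

# Placeholder scores for three models on three benchmarks (not paper results).
toy_metric = {
    "bench_a": [0.1, 0.2, 0.3],
    "bench_b": [0.2, 0.3, 0.5],
    "bench_c": [0.3, 0.2, 0.1],
}
print(mean_pairwise_tau(toy_metric))
```

Running the same function once on accuracy scores and once on efficiency or overhead scores would give the two numbers the referee asks to compare.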
Circularity Check
Decomposition protocol remains self-contained with independent stability validation
full rationale
The paper defines its token-efficiency decomposition directly from three observables (completion rate, conditional correctness given completion, and generated length) that are measurable even for closed models, then normalizes by a solver-derived workload scale only when instance metadata is absent. Stability of that scale is checked via explicit leave-self-out, leave-top-k, and held-out-reference-pool perturbations on four distinct benchmarks (CogniLoad, GSM8K, ProofWriter, ZebraLogic). These perturbation tests constitute external, data-driven checks rather than any reduction of the target efficiency metric to a fitted parameter or self-citation. No equation is shown to be equivalent to its own inputs by construction, and the central claim of separating verbalization overhead from workload scaling rests on observable quantities plus auditable perturbations, not on a self-referential loop.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The three observables (completion rate, conditional correctness given completion, generated length) exactly decompose token efficiency.
- domain assumption The solver-derived workload scale remains stable under leave-self-out, leave-top-k, and held-out-reference-pool perturbations.
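A leave-self-out check of the second assumption might look like the sketch below. The page does not give the paper's definition of the solver-derived scale, so this sketch assumes one: the per-instance median length of correct solutions across a reference pool of models. All names and numbers are hypothetical.

```python
from statistics import median

def workload_scale(lengths_by_model, exclude=()):
    """Assumed scale: per-instance median token length over the reference pool."""
    per_instance = {}
    for model, lengths in lengths_by_model.items():
        if model in exclude:
            continue
        for instance, tokens in lengths.items():
            per_instance.setdefault(instance, []).append(tokens)
    return {instance: median(values) for instance, values in per_instance.items()}

def mean_overhead(model, lengths_by_model, scale):
    """Mean of generated length divided by the workload scale."""
    ratios = [
        tokens / scale[instance]
        for instance, tokens in lengths_by_model[model].items()
        if instance in scale
    ]
    return sum(ratios) / len(ratios)

def leave_self_out_overheads(lengths_by_model):
    """Score each model against a scale built without that model."""
    return {
        model: mean_overhead(
            model, lengths_by_model, workload_scale(lengths_by_model, exclude={model})
        )
        for model in lengths_by_model
    }

# Placeholder pool: tokens of correct solutions per model and instance.
pool = {
    "m1": {"q1": 100, "q2": 200},
    "m2": {"q1": 120, "q2": 260},
    "m3": {"q1": 300, "q2": 600},
}
full_scale = workload_scale(pool)
full = {m: mean_overhead(m, pool, full_scale) for m in pool}
lso = leave_self_out_overheads(pool)

# Stability here means the overhead ranking is unchanged by the perturbation.
print(sorted(full, key=full.get), sorted(lso, key=lso.get))
```

Leave-top-k and held-out-reference-pool variants would follow the same pattern, changing only which models are passed in `exclude`.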
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We define token efficiency E0 as the ratio of success probability to expected token usage... log E0 = log r_ctx + log r_logic − log E[T]
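The quoted identity follows from the chain rule for probabilities. The derivation below assumes that $r_{\mathrm{ctx}}$ denotes the completion rate, that $r_{\mathrm{logic}}$ denotes correctness conditional on completion, and that a generation which does not complete cannot succeed. These readings are inferred from the symbol names and are not confirmed by the quoted passage.

```latex
\begin{align*}
E_0 &= \frac{\Pr(\text{success})}{\mathbb{E}[T]}
     = \frac{\Pr(\text{complete})\,\Pr(\text{correct}\mid\text{complete})}{\mathbb{E}[T]}
     = \frac{r_{\mathrm{ctx}}\, r_{\mathrm{logic}}}{\mathbb{E}[T]}, \\
\log E_0 &= \log r_{\mathrm{ctx}} + \log r_{\mathrm{logic}} - \log \mathbb{E}[T].
\end{align*}
```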
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Split on whitespace and punctuation (preserving hyphens within words)
-
[2]
Filter to alphanumeric tokens (including hyphenated compounds).
This produces analysis tokens $x_1, \ldots, x_{T^{(w)}}$. We convert analysis-token counts to model-token counts by scaling with the observed ratio $T / T^{(w)}$.
B.2. Grounded-Span Detection
Ontology extraction. For each instance $I$, we extract $P(I)$, the set of (category, value) pairs from generator metadat...
-
[3]
Find all occurrences of value $v$ in the trace (with alias matching for spelling variants)
-
[4]
For each occurrence at position $j$, check if any anchor token from $A_c$ appears in $[j-w, j+w]$
-
[5]
If anchored, mark positions $[j-w, j+w]$ as grounded.
We use window size $w = 6$ throughout.
Anchor word sets by category. The following anchor tokens are used to validate grounded mentions:
- location: located, at, in, location, move, moved, go, went
- clothes shirt: shirt
- clothes pant: pant, pants
- clothes hat: hat
- clothes socks: sock, socks
- clothes gloves: glove, glov...
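The tokenization and grounded-span steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it lowercases text, handles ASCII only, matches single-token values, and omits the alias matching for spelling variants. The function names and the example trace are hypothetical; the `location` anchor set is copied from the list above.

```python
import re

def analysis_tokens(text):
    """Alphanumeric tokens, keeping hyphens inside words (lowercasing is an assumption)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def grounded_mask(tokens, pairs, anchors, w=6):
    """Mark positions [j-w, j+w] around each anchored occurrence of a value.

    pairs:   iterable of (category, value) with single-token values
    anchors: {category: set of anchor tokens}
    """
    mask = [False] * len(tokens)
    for category, value in pairs:
        anchor_set = anchors.get(category, set())
        for j, token in enumerate(tokens):
            if token != value:
                continue
            lo, hi = max(0, j - w), min(len(tokens), j + w + 1)
            # The mention counts as grounded only if an anchor sits in the window.
            if any(t in anchor_set for t in tokens[lo:hi]):
                for k in range(lo, hi):
                    mask[k] = True
    return mask

ANCHORS = {"location": {"located", "at", "in", "location", "move", "moved", "go", "went"}}
trace_tokens = analysis_tokens("Alice moved to the kitchen")
pairs = [("location", "kitchen")]

print(grounded_mask(trace_tokens, pairs, ANCHORS, w=6))
print(grounded_mask(trace_tokens, pairs, ANCHORS, w=2))
```

With the window of 6 the anchor "moved" falls inside the window and the span is marked; with a window of 2 it does not, which shows how the window size controls what counts as a grounded mention.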
-
[6]
Split trace into lines
-
[7]
Normalize each line: lowercase, collapse whitespace, strip punctuation
-
[8]
Identify lines that appear more than once
-
[9]
Mark all tokens in repeated lines as repetitive.
Repeated n-gram mask
-
[10]
Extract all n-grams from the analysis token sequence (default n = 8)
-
[11]
Identify n-grams appearing more than once
-
[12]
Mark all tokens participating in repeated n-grams as repetitive.
The repetition mask is the union of these two signals.
B.4. Prompt-Copy Detection
-
[13]
Tokenize the input prompt using the same analysis tokenization
-
[14]
Extract all n-grams from the prompt (default n = 12)
-
[15]
For each trace token, check if it participates in an n-gram that appears in the prompt
-
[16]
talking about the right things
Mark such tokens as prompt-copied.
B.5. Signal Computation
The signal mask is $S_j = G_j \land \lnot R_j \land \lnot C_j$, where $G_j$ is grounded, $R_j$ is repetitive, and $C_j$ is prompt-copied.
Per-instance signal fraction: $\sigma(I) = \frac{\sum_{j=1}^{T^{(w)}} S_j}{T^{(w)}}$.
Model-token signal count: $T_{\mathrm{sig}}(I) = T(I) \cdot \sigma(I)$.
B.6. Auxiliary Grounding Diagnostics
In addition to the decomposition metrics, we report...
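The repeated n-gram mask, the prompt-copy mask, and the signal computation can be sketched together. This is a minimal illustration with hypothetical function names. It covers only the n-gram half of the repetition mask (the repeated-line half needs a line-to-token mapping that is not shown here) and takes the grounded mask as a given input.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams as tuples (empty if the sequence is shorter than n)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeated_ngram_mask(tokens, n=8):
    """Mark every token that takes part in an n-gram occurring more than once."""
    grams = ngrams(tokens, n)
    counts = Counter(grams)
    mask = [False] * len(tokens)
    for i, gram in enumerate(grams):
        if counts[gram] > 1:
            for k in range(i, i + n):
                mask[k] = True
    return mask

def prompt_copy_mask(tokens, prompt_tokens, n=12):
    """Mark every trace token that takes part in an n-gram also found in the prompt."""
    prompt_grams = set(ngrams(prompt_tokens, n))
    mask = [False] * len(tokens)
    for i, gram in enumerate(ngrams(tokens, n)):
        if gram in prompt_grams:
            for k in range(i, i + n):
                mask[k] = True
    return mask

def signal_fraction(grounded, repetitive, copied):
    """sigma(I): share of positions that are grounded, not repetitive, not copied."""
    signal = [g and not r and not c for g, r, c in zip(grounded, repetitive, copied)]
    return sum(signal) / len(signal) if signal else 0.0

def signal_token_count(model_tokens, sigma):
    """T_sig(I) = T(I) * sigma(I), converting back to model-token units."""
    return model_tokens * sigma

# Tiny examples with small n so the behaviour is easy to follow.
rep = repeated_ngram_mask("a b c a b c d".split(), n=3)
cop = prompt_copy_mask("x y z q".split(), "y z".split(), n=2)
sigma = signal_fraction(
    [True, True, True, False],
    [False, True, False, False],
    [False, False, True, False],
)
print(rep, cop, sigma, signal_token_count(200, sigma))
```

Keeping each mask as a plain boolean list makes the final conjunction a one-line comprehension that mirrors the formula for the signal mask directly.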