Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

Ali Ramezani-Kebrya; Arnoldo Frigessi; Benjamin Ricaud; Daniel Kaiser

arXiv: 2602.09805 · v2 · pith:FTWCIISVnew · submitted 2026-02-10 · 💻 cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-21 13:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords LLM evaluation · reasoning efficiency · token usage · decomposition · completion rate · model benchmarking · failure mode analysis


The pith

A trace-optional protocol decomposes LLM token efficiency using completion rate, conditional correctness, and generated length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes an evaluation method that breaks down how LLMs use tokens for reasoning without needing access to internal model states. It uses three simple observables: how often the model completes an answer, whether that answer is correct, and how long the output is. This separation helps identify whether extra tokens improve accuracy on difficult cases or just add unnecessary words. The method also normalizes for task workload when possible and tests the stability of its workload scale. Results show that efficiency measures are more consistent across different reasoning benchmarks than pure accuracy scores.

Core claim

The authors show that token efficiency can be exactly decomposed into completion rate, conditional correctness given completion, and generated length. When task workload metadata is available, generated length is normalized to separate verbalization overhead from workload-dependent scaling. This allows distinguishing logic-limited, context-limited, and verbosity-limited failure modes in reasoning tasks that appear the same under accuracy-per-token metrics.
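The decomposition can be checked numerically. The following is a minimal sketch, assuming that r_ctx denotes the completion rate, r_logic the conditional correctness given completion, and E[T] the mean generated length over all instances (this pairing of symbols and observables is inferred from the Figure 1 caption and the quoted definition of E0, not confirmed by the paper's code). The names `Run` and `decompose` and the sample data are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Run:
    completed: bool  # assumed: a final answer was produced within the token budget
    correct: bool    # final answer is correct (only counted if completed)
    tokens: int      # number of generated output tokens


def decompose(runs):
    """Return E0 and its three factors for a list of per-instance runs."""
    n = len(runs)
    n_completed = sum(r.completed for r in runs)
    n_success = sum(r.completed and r.correct for r in runs)
    if n_completed == 0 or n_success == 0:
        raise ValueError("log-decomposition is undefined without any success")
    r_ctx = n_completed / n              # completion rate
    r_logic = n_success / n_completed    # correctness given completion
    mean_tokens = sum(r.tokens for r in runs) / n
    e0 = (n_success / n) / mean_tokens   # success probability per expected token
    return {"E0": e0, "r_ctx": r_ctx, "r_logic": r_logic, "mean_tokens": mean_tokens}


# Made-up data, for illustration only.
sample = [Run(True, True, 100), Run(True, False, 200),
          Run(False, False, 300), Run(True, True, 200)]
parts = decompose(sample)
lhs = math.log(parts["E0"])
rhs = math.log(parts["r_ctx"]) + math.log(parts["r_logic"]) - math.log(parts["mean_tokens"])
print(parts, math.isclose(lhs, rhs))
```

The identity holds because the success probability factors as P(completed) times P(correct given completed), so the log of E0 splits into the three terms without a residual.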

What carries the argument

The trace-optional evaluation protocol that decomposes token efficiency from three observables available even for closed models.

If this is right

Efficiency and overhead rankings stay stable across benchmark pairs, unlike accuracy rankings.
The decomposition identifies specific failure modes: logic-limited, context-limited (truncation), and verbosity-limited.
Analysis of task difficulty factors like length, intrinsic difficulty, and distractor density becomes possible.
The released evaluation artifact and reporting template help explain why an LLM is inefficient at reasoning.
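To illustrate how the three failure modes can be told apart, here is a hedged sketch that compares a model against a reference model and reports which log-gap term is most negative. The "largest deficit wins" rule, the function names, and the numbers are assumptions made for illustration; the paper may assign labels differently, and this sketch uses raw mean length rather than the workload-normalized verbosity terms.

```python
import math

# Keys are this sketch's own names; labels follow the three failure modes listed above.
LABELS = {
    "context": "context-limited",
    "logic": "logic-limited",
    "verbosity": "verbosity-limited",
}


def efficiency(stats):
    """E0 = completion rate * conditional correctness / mean generated length."""
    return stats["r_ctx"] * stats["r_logic"] / stats["mean_tokens"]


def log_gap_terms(model, reference):
    """Per-term contribution to log E0(model) - log E0(reference)."""
    return {
        "context": math.log(model["r_ctx"] / reference["r_ctx"]),
        "logic": math.log(model["r_logic"] / reference["r_logic"]),
        # Longer outputs lower efficiency, hence the minus sign.
        "verbosity": -math.log(model["mean_tokens"] / reference["mean_tokens"]),
    }


def dominant_limit(model, reference):
    """Heuristic label: the term with the largest deficit relative to the reference."""
    terms = log_gap_terms(model, reference)
    worst = min(terms, key=terms.get)
    return LABELS[worst] if terms[worst] < 0 else "no deficit"


# Made-up statistics, for illustration only.
reference = {"r_ctx": 0.99, "r_logic": 0.90, "mean_tokens": 1000.0}
verbose = {"r_ctx": 0.98, "r_logic": 0.85, "mean_tokens": 4000.0}
print(log_gap_terms(verbose, reference), dominant_limit(verbose, reference))
```

Because the three terms sum to the total log-efficiency gap, two models with the same accuracy-per-token can still receive different labels.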

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying this protocol during model training could help optimize for lower overhead in reasoning.
Similar decompositions might extend to other domains like code generation or multi-step planning.
Developers could use the workload scale to create more balanced test sets that penalize verbosity.

Load-bearing premise

That completion rate, conditional correctness, and generated length plus a stable solver-derived workload scale are enough to exactly decompose token efficiency.

What would settle it

Evaluating the protocol on a held-out set of models and finding that the efficiency rankings change significantly under different perturbations to the reference pool would falsify the stability of the workload scale.

Figures

Figures reproduced from arXiv: 2602.09805 by Ali Ramezani-Kebrya, Arnoldo Frigessi, Benjamin Ricaud, Daniel Kaiser.

**Figure 1.** Where do models waste tokens? Relative to o3, we decompose the efficiency gap $\Delta \log E_0$ into token budget truncation robustness ($\Delta \log r_{\mathrm{ctx}}$), logic robustness ($\Delta \log r_{\mathrm{logic}}$), and workload-normalized verbosity ($-\Delta \log \overline{\mathrm{VO}} - \Delta \log \kappa$).

Notation table (Symbol | Definition | Data):
$I$ | Benchmark instance, $I \sim D$ | -
Succ | Success indicator (Succ = 1 iff final answer correct) | O
$T$ | Output tokens generated by the model | O
$C$ | Event: token budget…

**Figure 2.** Trace-quality-normalized decomposition (12 trace-accessible models). We use DeepSeek-R1-Distill-Llama-70B as the reference for trace-quality analysis because it achieves the highest efficiency among trace-accessible models. The $q_{\mathrm{trace}}$ term (signal density) captures efficiency lost to repetition, prompt-copying, and off-task text; $\overline{\mathrm{VO}}_{\mathrm{sig}}$ captures overhead in signal tokens alone; $\kappa_{\mathrm{sig}}$ captures how signal t…

**Figure 3.** Verbalization overhead ($\overline{\mathrm{VO}}$) vs. coupling ($\kappa$). Cross-model verbosity differences are primarily driven by $\overline{\mathrm{VO}}$; $\kappa$ is consistently sublinear ($\kappa < 1$) and varies less across models.

[Plot residue: panels "E_d vs d" (x-axis: intrinsic difficulty d), "E_N vs N" (x-axis: total statements N), and "E vs (needle) [%]"; y-axis E0; legend includes o3-2025-04-16, o4-mini-2025-04-16, gpt-5-2025-08-07, gpt-5-m…]

**Figure 4.** Token efficiency $E_0$ across CogniLoad dimensions. Task length N is the dominant bottleneck: efficiency drops 70–90% from N = 20 to N = 250. Difficulty shows diminishing effects after d = 3; needle fraction has modest U-shaped effects.

N. Detailed Case Studies
N.1. Case Study 1: Degeneracy-Dominated Collapse
Model: DeepSeek-R1-Distill-Qwen-1.5B
Summary statistics.
• $E_0$ (%): 1.26 (rank 25/25)
• $r_{\mathrm{ctx}}$: 0.94
• rl…

Original abstract

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a workable way to split token use in reasoning LLMs into completion, conditional success, and length, then normalizes by a solver-derived workload scale that holds up under the tests they ran.


The core idea is a trace-optional protocol that decomposes token efficiency using completion rate, conditional correctness given completion, and generated length. When workload metadata exists they normalize directly; otherwise they build an auditable solver-derived scale and check its stability with leave-self-out, leave-top-k, and held-out-reference-pool perturbations on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. They run this on 14 open-weight models plus 11 more on CogniLoad and show efficiency rankings stay steadier than accuracy rankings while separating logic-limited, truncation, and verbosity failure modes that accuracy-per-token collapses together. Releasing the artifact and template is a practical plus for anyone who wants to report why a model is inefficient rather than just how inaccurate it is.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a trace-optional evaluation protocol that decomposes token efficiency of LLMs on reasoning tasks using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available it normalizes generated length by declared task-implied work to separate verbalization overhead from workload-dependent scaling; otherwise it defines an auditable solver-derived workload scale whose stability is evaluated under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. Experiments cover 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic plus 11 additional models on CogniLoad, showing that efficiency and overhead rankings are more stable across benchmark pairs than accuracy rankings and that the decomposition distinguishes logic-limited, context-limited, and verbosity-limited failure modes.

Significance. If the decomposition is shown to be exact and free of residual model-specific confounds, the protocol supplies a practical, trace-optional lens for diagnosing why LLMs expend tokens on reasoning tasks. The separation of overhead from workload scaling, the stability of efficiency rankings relative to accuracy, and the public release of an evaluation artifact and reporting template constitute concrete strengths that could improve comparative evaluation and targeted optimization of reasoning models.

major comments (2)

[§3] §3 (protocol and decomposition): the central claim that the three observables together with the solver-derived workload scale 'exactly decompose' token efficiency is load-bearing. The reported perturbation tests establish stability of the scale under leave-self-out, leave-top-k, and held-out-reference-pool conditions, yet do not test whether the scale remains orthogonal to LLM reasoning trajectories that diverge from the solver (e.g., on high-distractor-density instances). Without such a test the decomposition risks being approximate rather than exact.
[§4–5] §4–5 (empirical results): the claim that efficiency rankings are 'more robustly' stable than accuracy rankings across benchmark pairs is central to the practical value. The manuscript should report quantitative rank-correlation or Kendall-τ values comparing efficiency versus accuracy stability; qualitative statements alone leave the magnitude of the improvement unclear.

minor comments (2)

[§3] Notation for the normalized length (generated length divided by workload scale) should be introduced once with an explicit equation and then used consistently; current usage mixes verbal descriptions with symbols.
[§5] The fine-grained difficulty-factor analysis (task length, intrinsic difficulty, distractor density) on CogniLoad is mentioned but its precise operationalization and whether it employs the efficiency decomposition or separate metrics is not fully specified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating planned revisions where appropriate. The comments help clarify the scope of our claims and strengthen the presentation of the evaluation protocol.


Referee: [§3] §3 (protocol and decomposition): the central claim that the three observables together with the solver-derived workload scale 'exactly decompose' token efficiency is load-bearing. The reported perturbation tests establish stability of the scale under leave-self-out, leave-top-k, and held-out-reference-pool conditions, yet do not test whether the scale remains orthogonal to LLM reasoning trajectories that diverge from the solver (e.g., on high-distractor-density instances). Without such a test the decomposition risks being approximate rather than exact.

Authors: We thank the referee for this observation. By construction, the decomposition expresses token efficiency as the product of completion rate, conditional correctness given completion, and normalized length (generated length divided by the workload scale). When the scale is solver-derived, it is defined independently of any LLM trajectory and the decomposition holds exactly with respect to that reference scale. The reported perturbation tests (leave-self-out, leave-top-k, held-out-reference-pool) establish robustness of the scale to changes in the reference solver pool. We acknowledge that these tests do not directly measure orthogonality to LLM-specific paths that diverge from the solver, for example on high-distractor-density instances. Because the protocol is intentionally trace-optional, a complete orthogonality check is not feasible for closed models. In the revision we will (i) explicitly qualify the 'exact' claim as holding with respect to the chosen workload scale, (ii) add a limitations paragraph discussing possible residual model-specific confounds, and (iii) include an auxiliary analysis on the open-weight models for which traces are available, reporting the correlation between the solver-derived scale and LLM-generated lengths on instances where reasoning paths visibly diverge.
revision: partial
Referee: [§4–5] §4–5 (empirical results): the claim that efficiency rankings are 'more robustly' stable than accuracy rankings across benchmark pairs is central to the practical value. The manuscript should report quantitative rank-correlation or Kendall-τ values comparing efficiency versus accuracy stability; qualitative statements alone leave the magnitude of the improvement unclear.

Authors: We agree that quantitative rank-stability metrics would make the central empirical claim more precise and easier to evaluate. The current manuscript supports the claim through direct cross-benchmark comparisons of ranking orderings, but does not report correlation coefficients. In the revised version we will compute and report Kendall-τ values for the stability of accuracy rankings, efficiency rankings, and overhead rankings across all benchmark pairs. These numbers will be added to the relevant tables and discussed in §4–5, allowing readers to assess the magnitude of the robustness improvement directly.
revision: yes
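The rank-stability comparison requested by the referee could be computed along the following lines. This is a minimal sketch using Kendall's tau-a (no tie correction) averaged over benchmark pairs; the function names, the averaging choice, and the toy scores are assumptions for illustration and are not values or procedures from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (ties count as neither)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(scores_by_benchmark):
    """Average tau over all benchmark pairs, using the models shared by each pair.

    scores_by_benchmark maps benchmark name -> {model name: score}.
    """
    taus = []
    for a, b in combinations(sorted(scores_by_benchmark), 2):
        shared = sorted(set(scores_by_benchmark[a]) & set(scores_by_benchmark[b]))
        taus.append(kendall_tau([scores_by_benchmark[a][m] for m in shared],
                                [scores_by_benchmark[b][m] for m in shared]))
    return sum(taus) / len(taus)


# Toy scores, for illustration only.
toy_efficiency = {
    "bench_a": {"m1": 0.10, "m2": 0.20, "m3": 0.30},
    "bench_b": {"m1": 0.05, "m2": 0.15, "m3": 0.40},
}
print(mean_pairwise_tau(toy_efficiency))
```

Running `mean_pairwise_tau` once on accuracy scores and once on efficiency scores gives two directly comparable stability numbers.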

Circularity Check

0 steps flagged

Decomposition protocol remains self-contained with independent stability validation


The paper defines its token-efficiency decomposition directly from three observables (completion rate, conditional correctness given completion, and generated length) that are measurable even for closed models, then normalizes by a solver-derived workload scale only when instance metadata is absent. Stability of that scale is checked via explicit leave-self-out, leave-top-k, and held-out-reference-pool perturbations on four distinct benchmarks (CogniLoad, GSM8K, ProofWriter, ZebraLogic). These perturbation tests constitute external, data-driven checks rather than any reduction of the target efficiency metric to a fitted parameter or self-citation. No equation is shown to be equivalent to its own inputs by construction, and the central claim of separating verbalization overhead from workload scaling rests on observable quantities plus auditable perturbations, not on a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The protocol rests on the domain assumption that the three observables suffice for exact decomposition and that the solver-derived scale is stable; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: The three observables (completion rate, conditional correctness given completion, generated length) exactly decompose token efficiency.
  Directly stated as the basis of the introduced protocol.
domain assumption: The solver-derived workload scale remains stable under leave-self-out, leave-top-k, and held-out-reference-pool perturbations.
  Invoked to justify use of the scale when instance-level metadata is absent.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


We define token efficiency E0 as the ratio of success probability to expected token usage... log E0 = log r_ctx + log r_logic − log E[T]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] Split on whitespace and punctuation (preserving hyphens within words).

[2] Filter to alphanumeric tokens (including hyphenated compounds).

This produces analysis tokens $x_1, \ldots, x_{T^{(w)}}$. We convert analysis-token counts to model-token counts by scaling with the observed ratio $T / T^{(w)}$.

B.2. Grounded-Span Detection

Ontology extraction. For each instance $I$, we extract $P(I)$, the set of (category, value) pairs from generator metadat...

[3] Find all occurrences of value $v$ in the trace (with alias matching for spelling variants).

[4] For each occurrence at position $j$, check if any anchor token from $A_c$ appears in $[j-w, j+w]$.

[5] If anchored, mark positions $[j-w, j+w]$ as grounded.

We use window size $w = 6$ throughout.

Anchor word sets by category. The following anchor tokens are used to validate grounded mentions:
• location: located, at, in, location, move, moved, go, went
• clothes shirt: shirt
• clothes pant: pant, pants
• clothes hat: hat
• clothes socks: sock, socks
• clothes gloves: glove, glov...
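The tokenization and grounded-span steps above can be sketched as follows. This is an illustrative reading only: the regular expression merges the split and filter steps into one pass, lowercasing is an added assumption, values are treated as single tokens, and alias matching is omitted. The function names and the sample sentence are hypothetical.

```python
import re

# Alphanumeric runs, with internal hyphens kept (e.g. "well-known").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def analysis_tokens(text):
    """Split on whitespace/punctuation and keep alphanumeric tokens (lowercased)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def grounded_mask(tokens, pairs, anchors, w=6):
    """Mark tokens within +/- w of a value mention that has an anchor nearby.

    pairs:   iterable of (category, value) tuples for the instance
    anchors: dict mapping category -> set of anchor tokens
    """
    mask = [False] * len(tokens)
    for category, value in pairs:
        anchor_set = anchors.get(category, set())
        for j, tok in enumerate(tokens):
            if tok != value:
                continue
            lo, hi = max(0, j - w), min(len(tokens), j + w + 1)
            if any(tokens[k] in anchor_set for k in range(lo, hi)):
                for k in range(lo, hi):
                    mask[k] = True
    return mask


# Anchor set copied from the location category listed above; the sentence is made up.
anchors = {"location": {"located", "at", "in", "location", "move", "moved", "go", "went"}}
tokens = analysis_tokens("Bob went to the kitchen")
print(tokens, grounded_mask(tokens, [("location", "kitchen")], anchors))
```

With a smaller window the same mention can fail the anchor check, which is why the window size matters for how much of a trace counts as grounded.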
[6] Split trace into lines.

[7] Normalize each line: lowercase, collapse whitespace, strip punctuation.

[8] Identify lines that appear more than once.

[9] Mark all tokens in repeated lines as repetitive.

Repeated n-gram mask

[10] Extract all n-grams from the analysis token sequence (default n = 8).

[11] Identify n-grams appearing more than once.

[12] Mark all tokens participating in repeated n-grams as repetitive.

The repetition mask is the union of these two signals.

B.4. Prompt-Copy Detection
[13] Tokenize the input prompt using the same analysis tokenization.

[14] Extract all n-grams from the prompt (default n = 12).

[15] For each trace token, check if it participates in an n-gram that appears in the prompt.

[16] talking about the right things

Mark such tokens as prompt-copied.

B.5. Signal Computation

The signal mask is $S_j = G_j \land \lnot R_j \land \lnot C_j$, where $G_j$ is grounded, $R_j$ is repetitive, and $C_j$ is prompt-copied.

Per-instance signal fraction: $\sigma(I) = \frac{1}{T^{(w)}} \sum_{j=1}^{T^{(w)}} S_j$.

Model-token signal count: $T_{\mathrm{sig}}(I) = T(I) \cdot \sigma(I)$.

B.6. Auxiliary Grounding Diagnostics

In addition to the decomposition metrics, we report...
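The repetition mask, the prompt-copy mask, and the signal computation described in the steps above can be sketched together as below. This is a hedged illustration: the tokenizer, the choice to keep hyphens when stripping punctuation, and all function names are assumptions, and the grounded mask is taken as a given input rather than computed here.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def analysis_tokens(text):
    """Lowercased alphanumeric tokens, keeping hyphenated compounds whole."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def normalize_line(line):
    """Lowercase, strip punctuation (hyphens kept), collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s-]", "", line.lower()).split())


def mark_ngrams(tokens, n, should_mark):
    """Mark every token inside an n-gram for which should_mark(gram) is true."""
    mask = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if should_mark(tuple(tokens[i:i + n])):
            mask[i:i + n] = [True] * n
    return mask


def repetition_mask(trace, n=8):
    """Union of the repeated-line mask and the repeated n-gram mask."""
    lines = trace.splitlines()
    line_counts = Counter(normalize_line(line) for line in lines)
    tokens, line_mask = [], []
    for line in lines:
        line_tokens = analysis_tokens(line)
        tokens.extend(line_tokens)
        repeated = line_counts[normalize_line(line)] > 1
        line_mask.extend([repeated] * len(line_tokens))
    gram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    gram_mask = mark_ngrams(tokens, n, lambda g: gram_counts[g] > 1)
    return tokens, [a or b for a, b in zip(line_mask, gram_mask)]


def prompt_copy_mask(trace_tokens, prompt, n=12):
    """Mark trace tokens that sit inside an n-gram also present in the prompt."""
    prompt_tokens = analysis_tokens(prompt)
    prompt_grams = {tuple(prompt_tokens[i:i + n])
                    for i in range(len(prompt_tokens) - n + 1)}
    return mark_ngrams(trace_tokens, n, lambda g: g in prompt_grams)


def signal_fraction(grounded, repetitive, copied):
    """sigma(I): share of tokens that are grounded, not repetitive, not copied."""
    if not (len(grounded) == len(repetitive) == len(copied)):
        raise ValueError("masks must have equal length")
    if not grounded:
        return 0.0
    signal = [g and not r and not c for g, r, c in zip(grounded, repetitive, copied)]
    return sum(signal) / len(signal)


def signal_token_count(model_tokens, sigma):
    """T_sig(I) = T(I) * sigma(I), converting the fraction to model-token units."""
    return model_tokens * sigma
```

Sharing one `mark_ngrams` helper keeps the repeated n-gram and prompt-copy masks consistent: both mark every token of a matching window, so overlapping matches extend the marked span.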
