Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

Andrey Kuzmin; Arash Behboodi; Fabio Valerio Massoli

arxiv: 2603.08462 · v2 · pith:CSVCIX3Snew · submitted 2026-03-09 · 💻 cs.LG

Pith reviewed 2026-05-21 11:43 UTC · model grok-4.3

keywords: chain-of-thought · information bottleneck · budget forcing · reinforcement learning · reasoning compression · large language models · conditional IB · semantic prior

The pith

Chain-of-thought generation in transformers is best recast as a conditional information bottleneck that compresses reasoning traces while preserving task reward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard budget-forcing techniques rely on ad-hoc length penalties that can discard useful reasoning along with filler. By treating the reasoning trace as a lossy compression step under the conditional information bottleneck, the authors derive a single reinforcement-learning objective that trades off task performance against a prior over traces. This formulation automatically recovers length penalties as the special case of a uniform prior and replaces token counting with a semantic surprisal cost drawn from a language model at negligible extra training cost. Experiments indicate the resulting objective prunes redundancy while maintaining fluency, producing higher accuracy at moderate compression and smaller accuracy drops under aggressive compression.

Core claim

Modeling CoT generation under the CIB principle, where the reasoning trace Z acts as a computational bridge that contains only the information about the response Y not directly accessible from the prompt X, yields a general RL objective: maximize task reward while compressing completions under a prior over reasoning traces. This subsumes common heuristics such as length penalties as special cases under uniform priors, and replaces naive token-count penalties with a semantic prior based on token surprisal.
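The shape of the objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `compressed_reward`, the trade-off weight `beta`, and the toy probabilities are all assumptions made here for the example. It treats the compression term as the trace's code length (negative log-probability) under some prior and subtracts it from the task reward.

```python
import math
from typing import Sequence


def compressed_reward(task_reward: float,
                      prior_logprobs: Sequence[float],
                      beta: float) -> float:
    """Task reward minus beta times the trace's code length under a prior.

    prior_logprobs[i] is the prior's log-probability (in nats) of the i-th
    token of the reasoning trace. The name and signature are hypothetical.
    """
    # Code length of the whole trace: -sum_i log p_prior(token_i)
    code_length = -sum(prior_logprobs)
    return task_reward - beta * code_length


# Toy values, not taken from the paper: a correct answer (reward 1.0) and a
# three-token trace whose tokens have prior probabilities 0.5, 0.25, 0.5.
logps = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(compressed_reward(1.0, logps, beta=0.1))
```

With `beta = 0` the shaped reward reduces to the plain task reward; larger `beta` pushes the policy toward traces the prior finds cheap to encode.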

What carries the argument

The conditional information bottleneck objective applied to the reasoning trace Z as a bridge between prompt X and response Y.

If this is right

Length penalties arise as the special case of a uniform prior over traces.
Semantic surprisal costs replace crude token counts with negligible added training overhead.
Accuracy improves at moderate compression levels while aggressive compression incurs only minimal accuracy loss.
The same objective works across multiple model families and task domains.
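The first two points can be made concrete with a small sketch. Under a prior that assigns every token the same probability 1/V, the code length of a T-token trace is T·log V, which is a length penalty up to a constant factor. Under a non-uniform prior, two traces of equal length can cost different amounts. The function names, the vocabulary size, and the probabilities below are illustrative assumptions, not values from the paper.

```python
import math


def uniform_prior_cost(num_tokens: int, vocab_size: int) -> float:
    """Code length in nats of a trace under a uniform per-token prior."""
    # Every token has probability 1/vocab_size, so each costs log(vocab_size).
    return num_tokens * math.log(vocab_size)


def surprisal_cost(token_probs) -> float:
    """Code length in nats when each token has its own prior probability."""
    return -sum(math.log(p) for p in token_probs)


# Uniform prior: the cost depends only on the token count (a length penalty).
assert math.isclose(uniform_prior_cost(200, 32000),
                    2 * uniform_prior_cost(100, 32000))

# Non-uniform prior: equal-length traces can differ in cost.
predictable = [0.9, 0.9, 0.9]  # hypothetical prior probabilities
surprising = [0.1, 0.1, 0.1]
print(surprisal_cost(predictable) < surprisal_cost(surprising))  # True
```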

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same compression view could be applied to non-CoT reasoning formats such as tree search or self-consistency ensembles.
Architectures that learn to emit compressed traces by design might emerge from training under this objective from the start.
Smaller models might show larger relative gains because their limited capacity makes explicit compression more valuable.

Load-bearing premise

The premise that attention in transformers violates the Markov property between prompt, reasoning trace, and response, so that a conditional rather than naive information bottleneck is required.

What would settle it

A controlled comparison in which the CIB-derived RL objective produces no measurable improvement in the accuracy-compression curve over standard length-penalty fine-tuning on identical models and benchmarks.

Original abstract

Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods reduce cost via fine-tuning with heuristic length penalties, suppressing both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace $Z$ acts as a computational bridge that contains only the information about the response $Y$ that is not directly accessible from the prompt $X$. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting approaches, we introduce a semantic prior that measures token cost by surprisal under a language model. Crucially, the prior is queried only for token-level log-probabilities, adding negligible overhead to the training loop. Empirically, our CIB objective prunes reasoning redundancy while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop. These gains generalize across model families and task domains, confirming CIB as a domain-agnostic CoT compression framework.
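The abstract's statement that the prior is queried only for token-level log-probabilities can be illustrated as follows. This is a sketch under stated assumptions: it supposes the prior model's per-step logits over the vocabulary are already available for the sampled trace, and the helper names `log_softmax` and `trace_surprisal` are hypothetical, not taken from the paper or any library.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def trace_surprisal(step_logits: Sequence[Sequence[float]],
                    token_ids: Sequence[int]) -> float:
    """Sum of -log p_prior(token) over the trace, in nats.

    step_logits[i] holds the prior's logits at position i and token_ids[i]
    is the token the policy actually emitted there (assumed inputs).
    """
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        total -= log_softmax(logits)[tok]
    return total


# Toy two-token vocabulary with equal logits: each token costs log(2) nats.
print(trace_surprisal([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
```

Only one scalar per emitted token is read out of the prior, which is consistent with the claim that the extra cost in the training loop is small; the actual overhead depends on the prior model's size and is not measured here.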

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CIB framing turns CoT budget forcing into a unified RL objective with a semantic prior, though the step to the loss likely adds modeling choices beyond the bottleneck itself.

This paper models Chain-of-Thought generation as a conditional information bottleneck problem. The reasoning trace becomes a bridge carrying only the information about the answer that the prompt does not already provide, which produces an RL objective that maximizes task reward while compressing the trace under a prior. The setup is meant to recover length penalties and similar heuristics as special cases, for example when the prior is uniform.

Referee Report

2 major / 2 minor

Summary. The paper claims that recasting Chain-of-Thought (CoT) generation as a conditional information bottleneck (CIB) problem resolves a theoretical gap in naive information bottleneck (IB) for transformers—specifically, that attention violates the Markov property between prompt X, reasoning trace Z, and response Y. This modeling choice yields a general RL objective that maximizes task reward while compressing Z under a prior over reasoning traces (with a semantic surprisal-based prior), subsuming heuristic budget-forcing methods (e.g., length penalties under uniform priors) as special cases. The approach is claimed to prune redundancy while preserving accuracy, with empirical gains across model families and domains and negligible overhead from querying the prior.

Significance. If the CIB-to-RL derivation is shown to follow directly and the empirical protocol is fully specified, the work would provide a principled, domain-agnostic unification of budget-forcing techniques under information theory, offering a compression view of reasoning that could guide future efficient-inference methods without ad-hoc penalties.

major comments (2)

[theoretical development (derivation from CIB principle to RL objective)] The central claim that CIB directly produces the stated RL objective (maximize reward while compressing completions under a prior) rests on the premise that Z contains only information about Y inaccessible from X. However, moving from the CIB functional to a tractable RL loss requires specific choices of variational approximation, compression-term form, and independent prior querying; these steps are not automatic consequences of the attention-induced Markov violation alone and constitute additional modeling decisions whose necessity is not demonstrated.
[empirical evaluation and prior definition] The abstract asserts that the semantic prior (token-level log-probabilities from an external LM) adds negligible overhead and enables aggressive compression with minimal accuracy drop, but no error analysis, full empirical protocol, or ablation on prior choice is provided to support that the gains are attributable to CIB rather than the specific prior or RL implementation details.

minor comments (2)

[introduction or background] Notation for the CIB objective and the distinction between naive IB and CIB could be clarified with an explicit equation contrasting the two functionals.
[unification claim] The manuscript should include a table or figure summarizing how common heuristics (length penalty, etc.) emerge as special cases under different priors.
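For readers who want the contrast requested in the first minor comment, the two functionals can be written side by side. This is a sketch, not the paper's own presentation: the first line is the classical IB Lagrangian in its standard textbook form with an assumed trade-off weight \beta, the second is the CIB objective as quoted in the passage at the bottom of this page, and the third is the chain rule of mutual information, which holds in general.

```latex
% Classical IB Lagrangian (standard form; \beta is a trade-off weight)
\mathcal{L}_{\mathrm{IB}} = I(X;Z) - \beta\, I(Z;Y)

% CIB objective as quoted on this page (\mu is the paper's trade-off weight)
\mathcal{L}_{\mathrm{CIB}} = I(X;Z) - \mu\, I(Y;Z \mid X)

% Chain rule of mutual information
I(Y;Z \mid X) = I(Y;X,Z) - I(Y;X)
```

The chain rule shows why conditioning matters: I(Y;Z|X) counts only what Z adds about Y beyond what X already provides, which matches the description of Z as carrying information not directly accessible from the prompt.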

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in both the theoretical exposition and empirical reporting. We address each major comment below and outline the revisions we will make.

Referee: [theoretical development (derivation from CIB principle to RL objective)] The central claim that CIB directly produces the stated RL objective (maximize reward while compressing completions under a prior) rests on the premise that Z contains only information about Y inaccessible from X. However, moving from the CIB functional to a tractable RL loss requires specific choices of variational approximation, compression-term form, and independent prior querying; these steps are not automatic consequences of the attention-induced Markov violation alone and constitute additional modeling decisions whose necessity is not demonstrated.

Authors: The attention mechanism in transformers does indeed break the strict Markov chain X → Z → Y by allowing direct dependencies, which is why we adopt the conditional information bottleneck to ensure Z captures only the relevant information for Y not present in X. The derivation proceeds by expressing the CIB objective as an optimization over the distribution of Z, which naturally leads to a reward term for task performance (related to I(Z;Y|X)) and a compression term regularized by the prior. The specific RL formulation uses a variational approximation to make it tractable for policy optimization in the LLM setting. We agree that these choices require justification, and in the revised manuscript, we will expand the theoretical section to include a step-by-step derivation showing how each modeling decision follows from the CIB principle and the need for practical implementation in transformer-based models.

Revision: yes

Referee: [empirical evaluation and prior definition] The abstract asserts that the semantic prior (token-level log-probabilities from an external LM) adds negligible overhead and enables aggressive compression with minimal accuracy drop, but no error analysis, full empirical protocol, or ablation on prior choice is provided to support that the gains are attributable to CIB rather than the specific prior or RL implementation details.

Authors: We recognize the need for more rigorous empirical validation. The current version includes results across models and domains but lacks detailed ablations and error analysis. In the revision, we will add an appendix with the full experimental protocol, including hyperparameters, number of runs for error bars, and ablations on the prior (e.g., comparing semantic surprisal to length-based and uniform priors). We will also report the measured overhead of prior queries, which in our experiments was minimal (less than 3% additional compute). This will strengthen the claim that the benefits arise from the CIB framework.

Revision: yes

Circularity Check

0 steps flagged

The CIB-to-RL derivation is a self-contained modeling step with no reduction to inputs or self-citations.

The paper's central derivation applies the conditional information bottleneck to CoT generation after noting that transformer attention violates the Markov property between X, Z, and Y. This modeling choice directly produces the stated RL objective of maximizing reward while compressing Z under a prior, with the semantic prior obtained from an external LM and length penalties recovered as the uniform-prior special case. No equations reduce the objective to a fitted parameter or prior form by construction, no self-citations are load-bearing for the uniqueness of CIB, and the derivation introduces an independent variational framing rather than renaming a known result. The framework remains falsifiable via empirical compression-accuracy trade-offs across models and tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on information-theoretic principles and modeling assumptions about transformer information flow; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Attention in transformers violates the Markov property between prompt, reasoning trace, and response.
Cited as the key theoretical gap preventing direct application of naive IB to CoT.
domain assumption: The reasoning trace Z contains only the information about the response Y that is not directly accessible from the prompt X.
Core modeling choice that defines the conditional information bottleneck for CoT.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: L_CIB = I(X;Z) − μ I(Y;Z|X) ... semantic prior that measures token cost by surprisal under a language model

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.