
arXiv: 2604.02343 · v1 · submitted 2026-02-09 · cs.LG · cs.AI · cs.IT · math.IT

Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains

Pith reviewed 2026-05-16 05:21 UTC · model grok-4.3

keywords: LLM compression · question asking compression · lossy compression · arithmetic coding · model capability transfer · interactive protocols · binary questions

The pith

Ten yes/no questions recover 23 to 72 percent of the performance gap between small and large LLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that an interactive protocol lets a small language model recover 23% to 72% of a stronger model's performance advantage on standard benchmarks (7% to 38% on harder ones) by asking only ten binary questions. Each answer transfers exactly one bit, producing compression ratios between 0.0006 and 0.004. This approach improves on both lossless arithmetic coding with LoRA adapters and lossy succinct rewrites, and its compressed size is more than 100 times smaller than that of earlier LLM-based compression methods. The result matters because it shows that model capability can be transferred far more efficiently than by sending complete text responses. The work also maps a broader compression-compute frontier in which greater compression is possible when more compute is spent on interaction.

Core claim

We introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game Twenty Questions. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004.
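As a rough sanity check on the quoted ratios, the sketch below assumes that the compression ratio is the number of transmitted bits divided by the size of the uncompressed response in bits, and that exactly 10 bits are transmitted. Both are assumptions made for illustration; this page does not state the paper's exact accounting. The function name `implied_response_bytes` is hypothetical.

```python
def implied_response_bytes(transmitted_bits: int, ratio: float) -> float:
    """Uncompressed size in bytes implied by a compression ratio.

    Assumes ratio = transmitted_bits / (8 * response_bytes). This is an
    illustrative definition, not necessarily the one used in the paper.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return transmitted_bits / ratio / 8


if __name__ == "__main__":
    # The two endpoints of the ratio range quoted in the core claim.
    for ratio in (0.0006, 0.004):
        size = implied_response_bytes(10, ratio)
        print(f"ratio={ratio}: roughly {size:.0f} bytes uncompressed")
```

Under these assumptions the quoted range corresponds to full responses of a few hundred to roughly two thousand bytes, which is a plausible length for a worked solution.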

What carries the argument

The Question-Asking (QA) protocol in which a small model asks yes/no questions to a stronger model to refine its generated response, with each answer providing one bit of information
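The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `ask`, `answer`, and `revise` are hypothetical callables standing in for the small and stronger models, and revising the draft after every answer is an assumption. The toy models play a number-guessing game purely to make the bit accounting visible.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, bool]]  # (question, yes/no answer) pairs


def qa_compress(
    problem: str,
    draft: str,
    ask: Callable[[str, str, Transcript], str],
    answer: Callable[[str, str], bool],
    revise: Callable[[str, str, Transcript], str],
    budget: int = 10,
) -> Tuple[str, List[int]]:
    """Run a question-asking loop and return (final draft, transmitted bits).

    ask/revise stand in for the small model and answer for the stronger
    model. All three are hypothetical callables, not a real API.
    """
    transcript: Transcript = []
    bits: List[int] = []
    for _ in range(budget):
        question = ask(problem, draft, transcript)  # small model picks a question
        reply = answer(problem, question)           # stronger model says yes or no
        transcript.append((question, reply))
        bits.append(int(reply))                     # exactly one bit per answer
        draft = revise(problem, draft, transcript)  # small model refines its draft
    return draft, bits


def toy_models(secret: int, upper: int = 1024):
    """Toy stand-ins: find an integer in [0, upper) by bisection."""

    def bounds(transcript: Transcript) -> Tuple[int, int]:
        lo, hi = 0, upper
        for question, yes in transcript:
            mid = int(question.rstrip("?").split()[-1])
            if yes:
                lo = mid
            else:
                hi = mid
        return lo, hi

    def ask(problem: str, draft: str, transcript: Transcript) -> str:
        lo, hi = bounds(transcript)
        return f"Is the number at least {(lo + hi) // 2}?"

    def answer(problem: str, question: str) -> bool:
        return secret >= int(question.rstrip("?").split()[-1])

    def revise(problem: str, draft: str, transcript: Transcript) -> str:
        return str(bounds(transcript)[0])

    return ask, answer, revise


if __name__ == "__main__":
    ask, answer, revise = toy_models(secret=777)
    final, bits = qa_compress("Guess the number", "0", ask, answer, revise)
    print(final, bits)
```

In the toy run, ten answers pin down one of 1024 candidates, which is the most that ten bits can distinguish. Real benchmark answers are not a bisection problem, so the example shows only how the transmitted bits are counted.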

If this is right

  • Domain-adapted LoRA adapters improve lossless LLM-based arithmetic coding by a factor of two over the base model alone
  • Prompting a model for a succinct rewrite followed by arithmetic coding yields a compression ratio of approximately 0.03, twice as good as compressing the original response
  • The QA protocol achieves compression ratios over 100 times smaller than prior LLM-based compression methods
  • The results define a compression-compute frontier in which greater compression becomes possible at the cost of additional compute
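The first two points rest on a standard property of arithmetic coding: with a predictive model, the code length approaches the sum of -log2 p over the tokens, so a model that assigns higher probability to the text yields fewer bits. The sketch below illustrates that relationship with made-up per-token probabilities; it does not reproduce the paper's models, adapters, or data, and the function names are hypothetical.

```python
import math
from typing import Sequence


def ideal_code_length_bits(token_probs: Sequence[float]) -> float:
    """Sum of -log2 p over tokens, the length an ideal arithmetic coder approaches."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return sum(-math.log2(p) for p in token_probs)


def compression_ratio(token_probs: Sequence[float], raw_bytes: int) -> float:
    """Compressed bits divided by raw bits, assuming 8 bits per raw byte."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return ideal_code_length_bits(token_probs) / (8 * raw_bytes)


if __name__ == "__main__":
    # Made-up probabilities for the same 8 tokens under two hypothetical models.
    base_probs = [0.25] * 8     # base model: 2 bits per token
    adapted_probs = [0.5] * 8   # better-matched model: 1 bit per token
    raw_bytes = 16              # made-up size of the raw text
    print(compression_ratio(base_probs, raw_bytes))
    print(compression_ratio(adapted_probs, raw_bytes))
```

A practical arithmetic coder adds a small constant overhead on top of this ideal length, which the sketch ignores.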

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interactive binary protocols could support efficient knowledge transfer among multiple AI agents operating under tight bandwidth limits
  • Similar question-asking schemes might be tested for compressing structured outputs such as code or mathematical proofs rather than free text
  • The same mechanism could be explored for low-bandwidth human-AI clarification loops in which the AI asks the human targeted yes/no questions

Load-bearing premise

The small model can generate effective yes/no questions that extract the most useful information without already knowing the answers or introducing errors that cancel out the recovered capability

What would settle it

Running the QA protocol on the eight benchmarks and finding that the small model's final performance stays at or below its unaided baseline, whether because of poor questions or introduced errors

Figures

Figures reproduced from arXiv: 2604.02343 by Annabelle Michael Carrell, Keri Warr, Nicholas Carlini, Roy Rinberg, Simon Henniger.

Figure 1: Compression ratio vs. number of candidates
Figure 2: Overview of the compression mechanism and its use in an interactive protocol between an ...
Figure 3: Accuracy of random selection (dashed) versus best-compression selection (solid) on 90 ...
Figure 4: Absolute compression ratio (top) and relative compression ratio normalized to Temperature ...
Figure 5: Absolute compression ratio (top) and relative compression ratio normalized to Temperature ...
Figure 6: Compression ratio vs. number of candidates
Figure 7: Accuracy of random selection (dashed) versus best-compression selection (solid) on 90 ...
Figure 8: Compression ratio (bits-per-character) for verbose original solutions versus succinct ...
Figure 9: GSM8K Q&A compression accuracy (SLM=Haiku, Claude 4.5). GSM8K shows consistent ...
Figure 10: MATH Algebra Q&A compression accuracy (SLM=Haiku, Claude 4.5).
Figure 11: MATH Geometry Q&A compression accuracy (SLM=Haiku, Claude 4.5).
Figure 12: MATH Number Theory Q&A compression accuracy (SLM=Haiku, Claude 4.5). Number ...
Figure 13: GPQA (MC) Q&A compression accuracy (SLM=Haiku, Claude 4.5). The high proportion ...
Figure 14: MBPP Q&A compression accuracy (SLM=Haiku, Claude 4.5). Code generation proves ...
Figure 15: AIME Q&A compression accuracy (SLM=Haiku, Claude 4.5). Competition math ...
Figure 16: HLE Q&A compression accuracy (SLM=Haiku, Claude 4.5). Despite the high Very Hard ...
Figure 17: GSM8K Q&A compression accuracy (SLM=Haiku, Claude 3.5/4). Strong improvement ...
Figure 18: MATH Algebra Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).
Figure 19: MATH Geometry Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).
Figure 20: MATH Number Theory Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).
Figure 21: GPQA (MC) Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).
Figure 22: MBPP Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).
Figure 23: AIME Q&A compression accuracy (SLM=Haiku, Claude 3.5/4). The older Haiku model ...
Figure 24: HLE Q&A compression accuracy (SLM=Haiku, Claude 3.5/4). Q&A shows improvement ...

Original abstract

We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2x improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game 'Twenty Questions'. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100x smaller than prior LLM-based compression (Deletang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript studies compression of LLM-generated text in lossless and lossy regimes, characterizing a compression-compute frontier. It shows that domain-adapted LoRA adapters improve LLM-based arithmetic coding by 2x for lossless compression. For lossy compression, succinct rewrites followed by arithmetic coding achieve ratios of ~0.03 (2x better than compressing the original response). The central contribution is Question-Asking (QA) compression: a small model iteratively asks up to 10 yes/no questions to a larger model to refine its initial response, recovering 23-72% of the capability gap on standard benchmarks and 7-38% on harder ones across 8 math/science/code tasks, with ratios of 0.0006-0.004 (over 100x smaller than Deletang et al. 2024).

Significance. If the empirical claims are substantiated with full controls, this work would demonstrate that interactive bit-by-bit protocols can transfer task-relevant knowledge far more efficiently than transmitting full responses or using static compression. The reported recovery of substantial capability gaps with only 10 bits, combined with the 100x improvement over prior LLM compression, would be a notable advance for efficient model interaction and deployment in low-bandwidth settings.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the abstract states quantitative results on eight benchmarks but provides no details on exact benchmark definitions, statistical significance, variance across runs, or how the capability gap is measured; central claims rest on unreported experimental controls.
  2. [QA Compression Protocol] QA protocol description: no ablation isolating question source (small-model generation vs. large-model answers) is described, leaving open whether the headline recovery numbers (23-72% and 7-38%) conflate the value of the answers with the small model's own question-formulation ability.
  3. [Compression Ratios] Compression ratio claims: the reported ratios of 0.0006-0.004 and the 100x improvement over Deletang et al. (2024) require explicit accounting of base model size, transmitted bits, and encoding overhead to be verifiable; these numbers are load-bearing for the compression frontier narrative.
minor comments (2)
  1. [Notation and Metrics] Clarify the precise definition and measurement of the 'capability gap' between small and large models, including any normalization across benchmarks.
  2. [References] Provide full citation details and year for Deletang et al. (2024) in the references.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your constructive feedback. We will revise the manuscript to address the concerns regarding experimental details, ablations, and compression calculations, as detailed in our point-by-point responses below.

  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the abstract states quantitative results on eight benchmarks but provides no details on exact benchmark definitions, statistical significance, variance across runs, or how the capability gap is measured; central claims rest on unreported experimental controls.

    Authors: We agree that additional transparency is required. In the revised manuscript, we will expand §4 with exact benchmark definitions, report means and standard deviations across multiple runs with statistical significance tests, and explicitly define the recovered fraction of the capability gap as (small_with_QA - small) / (large - small). These details will also be referenced from the abstract. revision: yes

  2. Referee: [QA Compression Protocol] QA protocol description: no ablation isolating question source (small-model generation vs. large-model answers) is described, leaving open whether the headline recovery numbers (23-72% and 7-38%) conflate the value of the answers with the small model's own question-formulation ability.

    Authors: We will add an ablation study in the revised §4 comparing the standard protocol (small model generates questions) against variants where the large model generates the questions or answers are provided independently, to isolate the contribution of question formulation from answer quality. revision: yes

  3. Referee: [Compression Ratios] Compression ratio claims: the reported ratios of 0.0006-0.004 and the 100x improvement over Deletang et al. (2024) require explicit accounting of base model size, transmitted bits, and encoding overhead to be verifiable; these numbers are load-bearing for the compression frontier narrative.

    Authors: We will revise the relevant sections to provide an explicit breakdown including base model sizes, the 10 transmitted bits plus protocol overhead, and a direct comparison table with Deletang et al. (2024) to substantiate the reported ratios and improvement factor. revision: yes
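The formula quoted in the first response, (small_with_QA - small) / (large - small), can be written out as a short function. This is a minimal sketch of that expression only; the function name `gap_recovery` and the accuracies in the example are hypothetical and are not results from the paper.

```python
def gap_recovery(small: float, small_with_qa: float, large: float) -> float:
    """Fraction of the small-to-large accuracy gap recovered by the QA protocol.

    Implements (small_with_QA - small) / (large - small). A negative value
    means the protocol left the small model worse off than its baseline.
    """
    gap = large - small
    if gap <= 0:
        raise ValueError("large model must outperform the small model")
    return (small_with_qa - small) / gap


if __name__ == "__main__":
    # Hypothetical accuracies chosen for easy arithmetic, not paper results.
    print(gap_recovery(small=0.5, small_with_qa=0.625, large=0.75))
```

The metric is undefined when the large model does not outperform the small one, which is why the sketch raises an error in that case rather than returning a number.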

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper presents empirical results from benchmark experiments measuring compression ratios and capability recovery percentages via an interactive QA protocol between small and large models. No equations, parameter fittings, or derivations are described that reduce outputs to inputs by construction. Claims rely on direct experimental comparisons to baselines and prior work (e.g., Deletang et al.); no self-citation carries the central results, and no ansatz is smuggled in. The 23-72% recovery figures are measured outcomes, not tautological renamings or fitted predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all quantitative claims rest on unreported experimental setup.

pith-pipeline@v0.9.0 · 5542 in / 1301 out tokens · 27572 ms · 2026-05-16T05:21:49.603878+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

  1. [1]

    URL https://arxiv.org/abs/2601.10678. AI-MO. Aimo validation aime, 2024. URL https://huggingface.co/datasets/AI-MO/aimo-validation-aime. Validation dataset for the AIMO Progress Prize, derived from AIME 2022–2024 problems. Anthropic. Activating ai safety level 3 protections. https://www.anthropic.com/news/activating-asl3-protections, May 2025. Online, pu...

  2. [2]

    Alex Wang, Kyunghyun Cho, and Mike Lewis

    URL https://arxiv.org/abs/2306.04050. Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and answering questions to evaluate the factual consistency of summaries, 2020. URL https://arxiv.org/abs/2004.04228. Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987. doi: 10.1145/2...

  3. [3]

    Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak

    URL https://arxiv.org/abs/2311.04378. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild, 2024. URL https://arxiv.org/abs/2405.01470. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. G...

  4. [4]

    This does not require a reference solution, but relies on the judge’s ability to assess correctness independently

    Objective (standalone) judging. The judge evaluates the SLM’s answer on its own merits, scoring solution quality on a 1–10 scale. This does not require a reference solution, but relies on the judge’s ability to assess correctness independently

  5. [5]

    This gives the judge a concrete reference, but the LLM’s own answer may itself be wrong on hard problems

    Comparison judging. The judge compares the SLM’s answer against the LLM’s own solution, scoring how similar or aligned they are. This gives the judge a concrete reference, but the LLM’s own answer may itself be wrong on hard problems. We evaluate both options below, along with ablations on the quality threshold and gold-answer access. E.1.1 Quality-Thre...

  6. [6]

    If the score ≥ 7, the protocol accepts the current answer and early-stops

    Quality thresholding: The iterative variant adds a judge that evaluates the SLM’s answer on a 1–10 scale (mathematical soundness, calculation correctness, reasoning clarity) after each batch of 5 questions. If the score ≥ 7, the protocol accepts the current answer and early-stops

  7. [7]

    Standard

    Gold answer access: In the standard protocol, the LLM answering questions is given the gold answer as reference. In the iterative variant, the LLM generates its own solution first, then uses it as reference for answering questions. These two changes are confounded: the iterative variant both removes gold answer access and adds judging. However, since QA co...

  8. [8]

    score 9.7, 98.7% early stopped)

    Easy-to-judge datasets (MATH Algebra: avg. score 9.7, 98.7% early stopped). The judge gives high scores and accepts the answer after just one round of 5 questions, before the SLM has received enough guidance from the Q&A exchange. The no-judge protocol would have continued for a full 10 questions, giving the SLM more information to work with

  9. [9]

    the answer improved but is still imperfect

    Hard-to-judge datasets (AIME: avg. score 3.0, 80.4% not early stopped; HLE: avg. score 5.4, 49.3% not early stopped). The judge scores are persistently low, so the protocol runs all 10 questions but the final answer is still scored poorly. On AIME, problems that were not early stopped show severe regression: 7 out of 41 were initially correct, but only 1 rema...

  10. [10]

    This isolates whether the threshold level is the primary issue

    Higher threshold (≥ 9): Raising the quality threshold from 7 to 9 should reduce premature early stopping, since the judge will accept fewer answers. This isolates whether the threshold level is the primary issue

  11. [11]

    Standard

    Gold-answer judge: The judge is given the gold answer for evaluation, while the LLM still answers questions using its own solution. This isolates the judge’s evaluation quality from the LLM’s question-answering quality. Higher threshold (≥ 9). Table 18 compares recovery rates under the default threshold (≥ 7) and a stricter threshold (≥ 9). The stricte...