Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains
Pith reviewed 2026-05-16 05:21 UTC · model grok-4.3
The pith
Ten yes/no questions recover 23 to 72 percent of the performance gap between small and large LLMs
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game Twenty Questions. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004.
What carries the argument
The Question-Asking (QA) protocol in which a small model asks yes/no questions to a stronger model to refine its generated response, with each answer providing one bit of information
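The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`qa_refine`, `ask`, `answer`, `revise`), the choice to revise after every answer, and the toy number-guessing stand-ins for the two models are all assumptions made here. The toy mirrors Twenty Questions: ten one-bit answers are enough to pin down one value out of 1024.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, bool]]  # (question, one-bit answer) pairs


def qa_refine(
    problem: str,
    draft: str,
    ask: Callable[[str, str, Transcript], str],      # hypothetical small-model question generator
    answer: Callable[[str, str], bool],              # hypothetical large-model yes/no oracle
    revise: Callable[[str, str, Transcript], str],   # hypothetical small-model reviser
    budget_bits: int = 10,
) -> Tuple[str, Transcript]:
    """Refine `draft` using at most `budget_bits` yes/no answers."""
    transcript: Transcript = []
    response = draft
    for _ in range(budget_bits):
        question = ask(problem, response, transcript)
        bit = answer(problem, question)  # the only information the stronger side sends
        transcript.append((question, bit))
        # Assumption: the draft is revised after every answer.
        response = revise(problem, response, transcript)
    return response, transcript


# Toy stand-ins (not LLMs): the "large model" knows a hidden integer in [0, 1024).
HIDDEN = 700


def _threshold(question: str) -> int:
    return int(question.split(">=")[1].rstrip("?"))


def _interval(transcript: Transcript) -> Tuple[int, int]:
    lo, hi = 0, 1024
    for question, bit in transcript:
        mid = _threshold(question)
        lo, hi = (mid, hi) if bit else (lo, mid)
    return lo, hi


def toy_ask(problem: str, response: str, transcript: Transcript) -> str:
    lo, hi = _interval(transcript)
    return f"Is the number >= {(lo + hi) // 2}?"


def toy_answer(problem: str, question: str) -> bool:
    return HIDDEN >= _threshold(question)


def toy_revise(problem: str, response: str, transcript: Transcript) -> str:
    return str(_interval(transcript)[0])


final, log = qa_refine("guess the number", "0", toy_ask, toy_answer, toy_revise)
print(final, len(log))
```

In the toy, every question halves the remaining search space, which is the best case for a one-bit answer. Whether a small LLM can pose questions that informative is exactly the load-bearing premise discussed further down.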
If this is right
- Domain-adapted LoRA adapters improve lossless LLM-based arithmetic coding by a factor of two over the base model alone
- Prompting a model for a succinct rewrite followed by arithmetic coding yields a compression ratio of approximately 0.03, twice as good as compressing the original response
- The QA protocol achieves compression ratios over 100 times smaller than prior LLM-based compression methods
- The results define a compression-compute frontier in which greater compression becomes possible at the cost of additional compute
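As a rough sanity check on the quoted ratios, the arithmetic below assumes that a compression ratio is simply transmitted bits divided by the bit length of the response that would otherwise be sent, with no protocol overhead. The summary does not spell out the paper's exact accounting, so this definition and both helper names are assumptions made for illustration.

```python
def compression_ratio(transmitted_bits: float, original_bits: float) -> float:
    """Assumed definition: bits actually sent over bits of the uncompressed payload."""
    return transmitted_bits / original_bits


def implied_original_bits(transmitted_bits: float, ratio: float) -> float:
    """Invert the assumed definition to see what payload size a ratio implies."""
    return transmitted_bits / ratio


# Ten one-bit answers at the two ends of the reported range.
for ratio in (0.0006, 0.004):
    bits = implied_original_bits(10, ratio)
    print(f"ratio {ratio}: about {bits:.0f} bits, or {bits / 8:.0f} bytes, of original response")
```

Under that assumption, 10 transmitted bits at ratios of 0.004 and 0.0006 correspond to original responses of roughly 2,500 and 16,700 bits respectively.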
Where Pith is reading between the lines
- Interactive binary protocols could support efficient knowledge transfer among multiple AI agents operating under tight bandwidth limits
- Similar question-asking schemes might be tested for compressing structured outputs such as code or mathematical proofs rather than free text
- The same mechanism could be explored for low-bandwidth human-AI clarification loops in which the AI asks the human targeted yes/no questions
Load-bearing premise
The small model can generate effective yes/no questions that extract the most useful information without already knowing the answers or introducing errors that cancel out the recovered capability
What would settle it
Running the QA protocol on the eight benchmarks and finding that the small model's final performance stays at or below its unaided baseline, whether because of poor questions or because of introduced errors, would refute the claim.
Original abstract
We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2x improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game 'Twenty Questions'. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100x smaller than prior LLM-based compression (Deletang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies compression of LLM-generated text in lossless and lossy regimes, characterizing a compression-compute frontier. It shows domain-adapted LoRA adapters improve LLM-based arithmetic coding by 2x for lossless compression. For lossy compression, succinct rewrites followed by arithmetic coding achieve ratios of ~0.03 (2x better than compressing the original response). The central contribution is Question-Asking (QA) compression: a small model iteratively asks up to 10 yes/no questions to a larger model to refine its initial response, recovering 23-72% of the capability gap on standard benchmarks and 7-38% on harder ones across 8 math/science/code tasks, with ratios of 0.0006-0.004 (over 100x smaller than Deletang et al. 2024).
Significance. If the empirical claims are substantiated with full controls, this work would demonstrate that interactive bit-by-bit protocols can transfer task-relevant knowledge far more efficiently than transmitting full responses or using static compression. The reported recovery of substantial capability gaps with only 10 bits, combined with the 100x improvement over prior LLM compression, would be a notable advance for efficient model interaction and deployment in low-bandwidth settings.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the abstract states quantitative results on eight benchmarks but provides no details on exact benchmark definitions, statistical significance, variance across runs, or how the capability gap is measured; central claims rest on unreported experimental controls.
- [QA Compression Protocol] QA protocol description: no ablation isolating question source (small-model generation vs. large-model answers) is described, leaving open whether the headline recovery numbers (23-72% and 7-38%) conflate the value of the answers with the small model's own question-formulation ability.
- [Compression Ratios] Compression ratio claims: the reported ratios of 0.0006-0.004 and the 100x improvement over Deletang et al. (2024) require explicit accounting of base model size, transmitted bits, and encoding overhead to be verifiable; these numbers are load-bearing for the compression frontier narrative.
minor comments (2)
- [Notation and Metrics] Clarify the precise definition and measurement of the 'capability gap' between small and large models, including any normalization across benchmarks.
- [References] Provide full citation details and year for Deletang et al. (2024) in the references.
Simulated Author's Rebuttal
Thank you for your constructive feedback. We will revise the manuscript to address the concerns regarding experimental details, ablations, and compression calculations, as detailed in our point-by-point responses below.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the abstract states quantitative results on eight benchmarks but provides no details on exact benchmark definitions, statistical significance, variance across runs, or how the capability gap is measured; central claims rest on unreported experimental controls.
  Authors: We agree that additional transparency is required. In the revised manuscript, we will expand §4 with exact benchmark definitions, report means and standard deviations across multiple runs with statistical significance tests, and explicitly define the recovered fraction of the capability gap as (small_with_QA - small) / (large - small). These details will also be referenced from the abstract.
  revision: yes
- Referee: [QA Compression Protocol] QA protocol description: no ablation isolating question source (small-model generation vs. large-model answers) is described, leaving open whether the headline recovery numbers (23-72% and 7-38%) conflate the value of the answers with the small model's own question-formulation ability.
  Authors: We will add an ablation study in the revised §4 comparing the standard protocol (small model generates questions) against variants where the large model generates the questions or answers are provided independently, to isolate the contribution of question formulation from answer quality.
  revision: yes
- Referee: [Compression Ratios] Compression ratio claims: the reported ratios of 0.0006-0.004 and the 100x improvement over Deletang et al. (2024) require explicit accounting of base model size, transmitted bits, and encoding overhead to be verifiable; these numbers are load-bearing for the compression frontier narrative.
  Authors: We will revise the relevant sections to provide an explicit breakdown including base model sizes, the 10 transmitted bits plus protocol overhead, and a direct comparison table with Deletang et al. (2024) to substantiate the reported ratios and improvement factor.
  revision: yes
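The recovery metric the authors commit to in their first response can be written as a small helper. This is a sketch under stated assumptions: the function name `gap_recovered` is hypothetical, and the accuracies in the usage line are invented purely to illustrate the arithmetic, not taken from the paper.

```python
def gap_recovered(small: float, small_with_qa: float, large: float) -> float:
    """Fraction of the small-to-large score gap closed by the QA protocol.

    Implements (small_with_QA - small) / (large - small) from the rebuttal.
    All three inputs are scores on the same benchmark and the same scale.
    """
    gap = large - small
    if gap <= 0:
        # The ratio is undefined (or misleading) when the large model is not better.
        raise ValueError("large model must outscore the small model")
    return (small_with_qa - small) / gap


# Invented numbers: small model 40%, large model 80%, small model with QA 60%.
print(round(gap_recovered(0.40, 0.60, 0.80), 3))
```

The explicit guard matters for the referee's normalization concern: on a benchmark where the two models score nearly the same, the denominator is close to zero and the recovered fraction becomes unstable.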
Circularity Check
No circularity in derivation chain
The paper presents empirical results from benchmark experiments measuring compression ratios and capability recovery percentages via an interactive QA protocol between small and large models. No equations, parameter fittings, or derivations are described that reduce outputs to inputs by construction. Claims rely on direct experimental comparisons to baselines and prior work (e.g., Deletang et al.), with no load-bearing self-citation behind the central results and no smuggled ansatz. The 23-72% recovery figures are measured outcomes, not tautological renamings or fitted predictions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] URL https://arxiv.org/abs/2601.10678. AI-MO. Aimo validation aime, 2024. URL https://huggingface.co/datasets/AI-MO/aimo-validation-aime. Validation dataset for the AIMO Progress Prize, derived from AIME 2022–2024 problems. Anthropic. Activating ai safety level 3 protections. https://www.anthropic.com/news/activating-asl3-protections, May 2025. Online, pu...
- [2] URL https://arxiv.org/abs/2306.04050. Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and answering questions to evaluate the factual consistency of summaries, 2020. URL https://arxiv.org/abs/2004.04228. Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987. doi: 10.1145/2...
- [3] Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak
  URL https://arxiv.org/abs/2311.04378. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild, 2024. URL https://arxiv.org/abs/2405.01470. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. G...
- [4] Objective (standalone) judging. The judge evaluates the SLM’s answer on its own merits, scoring solution quality on a 1–10 scale. This does not require a reference solution, but relies on the judge’s ability to assess correctness independently.
- [5] Comparison judging. The judge compares the SLM’s answer against the LLM’s own solution, scoring how similar or aligned they are. This gives the judge a concrete reference, but the LLM’s own answer may itself be wrong on hard problems. We evaluate both options below, along with ablations on the quality threshold and gold-answer access. E.1.1 Quality-Thre...
- [6] Quality thresholding: The iterative variant adds a judge that evaluates the SLM’s answer on a 1–10 scale (mathematical soundness, calculation correctness, reasoning clarity) after each batch of 5 questions. If the score ≥ 7, the protocol accepts the current answer and early-stops.
- [7] Gold answer access: In the standard protocol, the LLM answering questions is given the gold answer as reference. In the iterative variant, the LLM generates its own solution first, then uses it as reference for answering questions. These two changes are confounded: the iterative variant both removes gold answer access and adds judging. However, since QA co...
- [8] Easy-to-judge datasets (MATH Algebra: avg. score 9.7, 98.7% early stopped). The judge gives high scores and accepts the answer after just one round of 5 questions, before the SLM has received enough guidance from the Q&A exchange. The no-judge protocol would have continued for a full 10 questions, giving the SLM more information to work with.
- [9] the answer improved but is still imperfect
  Hard-to-judge datasets (AIME: avg. score 3.0, 80.4% not early stopped; HLE: avg. score 5.4, 49.3% not early stopped). The judge scores are persistently low, so the protocol runs all 10 questions but the final answer is still scored poorly. On AIME, problems that were not early stopped show severe regression: 7 out of 41 were initially correct, but only 1 rema...
- [10] Higher threshold (≥ 9): Raising the quality threshold from 7 to 9 should reduce premature early stopping, since the judge will accept fewer answers. This isolates whether the threshold level is the primary issue.
- [11] Gold-answer judge: The judge is given the gold answer for evaluation, while the LLM still answers questions using its own solution. This isolates the judge’s evaluation quality from the LLM’s question-answering quality. Higher threshold (≥ 9). Table 18 compares recovery rates under the default threshold (≥ 7) and a stricter threshold (≥ 9). The stricte...
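The judge-gated variant quoted in the excerpts above (a 1–10 score after each batch of 5 questions, accepting at a score of 7 or more) can be sketched as follows. This is an illustrative reading of those excerpts only: `refine_with_judge`, `refine_batch`, and `judge` are hypothetical names, and the toy stand-ins below are not models.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")


def refine_with_judge(
    answer: A,
    refine_batch: Callable[[A, int], A],  # hypothetical: run n questions, return the revised answer
    judge: Callable[[A], int],            # hypothetical: score the answer on a 1-10 scale
    threshold: int = 7,
    batch_size: int = 5,
    max_questions: int = 10,
) -> Tuple[A, int, bool]:
    """Return (answer, questions_asked, accepted_by_judge)."""
    asked = 0
    while asked < max_questions:
        answer = refine_batch(answer, batch_size)
        asked += batch_size
        if judge(answer) >= threshold:
            return answer, asked, True  # judge accepts; stop spending bits
    return answer, asked, False


# Toy stand-ins: the "answer" is just a count of bits absorbed so far,
# and the judge always reports 8, mimicking an easy-to-judge dataset.
def toy_refine_batch(bits_so_far: int, n: int) -> int:
    return bits_so_far + n


def toy_judge(bits_so_far: int) -> int:
    return 8


print(refine_with_judge(0, toy_refine_batch, toy_judge, threshold=7))
print(refine_with_judge(0, toy_refine_batch, toy_judge, threshold=9))
```

With the generous toy judge, the default threshold stops after the first batch of 5 questions, while the stricter threshold lets all 10 run. That is the premature-stopping behavior the higher-threshold ablation is meant to probe.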