Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains

Annabelle Michael Carrell; Keri Warr; Nicholas Carlini; Roy Rinberg; Simon Henniger

arXiv: 2604.02343 · v1 · submitted 2026-02-09 · 💻 cs.LG · cs.AI · cs.IT · math.IT

Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, Keri Warr

Pith reviewed 2026-05-16 05:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · math.IT

keywords LLM compression · question asking compression · lossy compression · arithmetic coding · model capability transfer · interactive protocols · binary questions


The pith

Ten yes/no questions recover 23 to 72 percent of the performance gap between small and large LLMs on standard benchmarks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an interactive protocol lets a small language model recover a large share of a stronger model's benchmark performance by asking only ten binary questions. Each answer transfers exactly one bit, producing compression ratios between 0.0006 and 0.004. This approach improves on both lossless arithmetic coding with LoRA adapters and lossy succinct rewrites, and it outperforms earlier LLM compression methods by more than two orders of magnitude. A sympathetic reader would care because the result shows that model capability can be transferred far more efficiently than by sending complete text responses. The work also maps a broader compression-compute frontier in which greater compression is possible when more compute is spent on interaction.

Core claim

We introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game Twenty Questions. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004.

What carries the argument

The Question-Asking (QA) protocol in which a small model asks yes/no questions to a stronger model to refine its generated response, with each answer providing one bit of information
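
The exchange described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function names (`qa_round`, `ask`, `answer`, `revise`) are hypothetical, the models are replaced by toy stand-ins that play a number-guessing game, and the sketch assumes the small model revises once after all answers arrive, which the source does not specify.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, bool]]  # (question, yes/no answer) pairs


def qa_round(
    problem: str,
    draft: str,
    ask: Callable[[str, str, History], str],      # small model: next yes/no question
    answer: Callable[[str, str], bool],           # strong model: one-bit reply
    revise: Callable[[str, str, History], str],   # small model: updated response
    num_questions: int = 10,
) -> Tuple[str, List[bool]]:
    """Run one exchange; return the revised response and the bits received."""
    history: History = []
    for _ in range(num_questions):
        question = ask(problem, draft, history)
        history.append((question, answer(problem, question)))
    return revise(problem, draft, history), [bit for _, bit in history]


# Toy stand-ins (not LLMs): the "strong model" knows a secret integer in
# [0, 1023] and the "small model" bisects, so 10 bits pin it down exactly.
def _bounds(history: History) -> Tuple[int, int]:
    lo, hi = 0, 1023
    for question, yes in history:
        mid = int(question.rstrip("?").split()[-1])
        lo, hi = (mid + 1, hi) if yes else (lo, mid)
    return lo, hi


def toy_ask(problem, draft, history):
    lo, hi = _bounds(history)
    return f"Is the number greater than {(lo + hi) // 2}?"


def toy_revise(problem, draft, history):
    return str(_bounds(history)[0])


def make_toy_answer(secret: int):
    def toy_answer(problem, question):
        return secret > int(question.rstrip("?").split()[-1])
    return toy_answer


final, bits = qa_round("Guess the number", "0", toy_ask, make_toy_answer(700), toy_revise)
print(final, len(bits))  # prints: 700 10
```

In the toy game every question halves the remaining candidates, which is the best case for one-bit answers. With real models the value of each bit depends on how well the small model chooses its questions.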

If this is right

Domain-adapted LoRA adapters improve lossless LLM-based arithmetic coding by a factor of two over the base model alone
Prompting a model for a succinct rewrite followed by arithmetic coding yields a compression ratio of approximately 0.03, twice as good as compressing the original response
The QA protocol achieves compression ratios over 100 times smaller than prior LLM-based compression methods
The results define a compression-compute frontier in which greater compression becomes possible at the cost of additional compute
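
To give a sense of scale for the reported ratios, the arithmetic below assumes that a compression ratio means transmitted bits divided by the bits of the uncompressed response, and that the uncompressed text uses 8 bits per character. Both are assumptions made for this sketch; the source summary does not state the exact definition. The helper names are hypothetical.

```python
def compression_ratio(compressed_bits: float, original_bits: float) -> float:
    """Ratio of transmitted size to original size (smaller is better)."""
    return compressed_bits / original_bits


def implied_original_bytes(compressed_bits: float, ratio: float, bits_per_byte: int = 8) -> float:
    """Original size (in bytes) that would yield `ratio` for a given payload."""
    return compressed_bits / ratio / bits_per_byte


# Ten one-bit answers at the two ends of the reported range.
for ratio in (0.0006, 0.004):
    size = implied_original_bytes(10, ratio)
    print(f"ratio {ratio}: original response of roughly {size:.0f} bytes")
```

Under these assumptions, 10 bits at ratios of 0.0006 and 0.004 correspond to full responses of roughly 2,000 and 300 bytes respectively.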

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interactive binary protocols could support efficient knowledge transfer among multiple AI agents operating under tight bandwidth limits
Similar question-asking schemes might be tested for compressing structured outputs such as code or mathematical proofs rather than free text
The same mechanism could be explored for low-bandwidth human-AI clarification loops in which the AI asks the human targeted yes/no questions

Load-bearing premise

The small model can generate effective yes/no questions that extract the most useful information without already knowing the answers or introducing errors that cancel out the recovered capability

What would settle it

Running the QA protocol on the eight benchmarks and finding that the small model's final performance stays at or below its unaided baseline, whether because of poor questions or because of introduced errors

Figures

Figures reproduced from arXiv: 2604.02343 by Annabelle Michael Carrell, Keri Warr, Nicholas Carlini, Roy Rinberg, Simon Henniger.

**Figure 2.** Overview of the compression mechanism and its use in an interactive protocol between an …

**Figure 3.** Accuracy of random selection (dashed) versus best-compression selection (solid) on 90 …

**Figure 4.** Absolute compression ratio (top) and relative compression ratio normalized to Temperature …

**Figure 5.** Absolute compression ratio (top) and relative compression ratio normalized to Temperature …

**Figure 6.** Compression ratio vs. number of candidates

**Figure 7.** Accuracy of random selection (dashed) versus best-compression selection (solid) on 90 …

**Figure 8.** Compression ratio (bits-per-character) for verbose original solutions versus succinct …

**Figure 9.** GSM8K Q&A compression accuracy (SLM=Haiku, Claude 4.5). GSM8K shows consistent …

**Figure 10.** MATH Algebra Q&A compression accuracy (SLM=Haiku, Claude 4.5).

**Figure 11.** MATH Geometry Q&A compression accuracy (SLM=Haiku, Claude 4.5).

**Figure 12.** MATH Number Theory Q&A compression accuracy (SLM=Haiku, Claude 4.5). Number …

**Figure 13.** GPQA (MC) Q&A compression accuracy (SLM=Haiku, Claude 4.5). The high proportion …

**Figure 14.** MBPP Q&A compression accuracy (SLM=Haiku, Claude 4.5). Code generation proves …

**Figure 15.** AIME Q&A compression accuracy (SLM=Haiku, Claude 4.5). Competition math …

**Figure 16.** HLE Q&A compression accuracy (SLM=Haiku, Claude 4.5). Despite the high Very Hard …

**Figure 17.** GSM8K Q&A compression accuracy (SLM=Haiku, Claude 3.5/4). Strong improvement …

**Figure 18.** MATH Algebra Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).

**Figure 19.** MATH Geometry Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).

**Figure 20.** MATH Number Theory Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).

**Figure 21.** GPQA (MC) Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).

**Figure 22.** MBPP Q&A compression accuracy (SLM=Haiku, Claude 3.5/4).

**Figure 23.** AIME Q&A compression accuracy (SLM=Haiku, Claude 3.5/4). The older Haiku model …

**Figure 24.** HLE Q&A compression accuracy (SLM=Haiku, Claude 3.5/4). Q&A shows improvement …

Original abstract

We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2x improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game 'Twenty Questions'. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100x smaller than prior LLM-based compression (Deletang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.
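
For readers unfamiliar with LLM-based arithmetic coding, the sketch below shows the quantity it approximates. An arithmetic coder driven by a predictive model spends close to -log2 p bits on each token, where p is the probability the model assigned to the token that actually occurred, so a model that predicts the text better (for example, a domain-adapted one) yields a shorter code. The probabilities, the 8-bits-per-character baseline, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math
from typing import Sequence


def ideal_code_length_bits(token_probs: Sequence[float]) -> float:
    """Sum of -log2 p over tokens: the length an arithmetic coder approaches."""
    return sum(-math.log2(p) for p in token_probs)


def ratio_vs_raw_text(token_probs: Sequence[float], num_chars: int, bits_per_char: int = 8) -> float:
    """Ideal compressed size divided by the raw size of the text."""
    return ideal_code_length_bits(token_probs) / (num_chars * bits_per_char)


# Hypothetical per-token probabilities for the same 12-character text
# under a generic model and under a better-adapted model.
generic = [0.5, 0.25, 0.25]
adapted = [0.5, 0.5, 0.5]

print(ideal_code_length_bits(generic))  # 1 + 2 + 2 = 5.0 bits
print(ideal_code_length_bits(adapted))  # 1 + 1 + 1 = 3.0 bits
print(ratio_vs_raw_text(generic, num_chars=12))
```

A real coder adds a small constant overhead on top of this sum, and the decoder must run the same model to reproduce the probabilities.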

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's new interactive QA protocol lets a small model recover 23-72% of the capability gap to a large model with just 10 yes/no bits on standard benchmarks, but the gains depend on the small model already asking good questions.


The core new thing is the Question-Asking compression setup: a small model iteratively asks yes/no questions to a stronger model to refine its own initial response, transferring one bit per answer. This sits on top of their broader framing of a compression-compute frontier, where LoRA-adapted arithmetic coding improves lossless results by 2x and succinct rewrites plus coding reach 0.03 ratios. The reported numbers on eight math/science/code benchmarks are the clearest empirical hook, showing 100x better compression than Deletang et al. 2024 while closing a non-trivial fraction of the capability gap at very low bit cost. That interactive angle is distinct from the static baselines they cite and deserves attention if the controls are solid.

The main soft spot is exactly the one the stress-test flags: the small model has to generate the right questions in the first place. Recovery falls to 7-38% on harder benchmarks, which is consistent with the small model lacking enough task understanding to target the actual deficits. No ablation appears that swaps in questions from the large model or scores question quality directly, so the headline gains may partly reflect the small model's own strengths rather than pure knowledge transfer. The abstract also leaves out benchmark definitions, variance across runs, and how the capability gap is quantified, which makes the 23-72% range hard to evaluate without the full methods.

This is worth a serious referee for anyone working on efficient inference or multi-model collaboration. The idea is clean enough and the compression claims large enough that a careful review could tighten the experiments without discarding the contribution. I'd bring it to a reading group to discuss the protocol itself.

Referee Report

3 major / 2 minor

Summary. The manuscript studies compression of LLM-generated text in lossless and lossy regimes, characterizing a compression-compute frontier. It shows domain-adapted LoRA adapters improve LLM-based arithmetic coding by 2x for lossless compression. For lossy compression, succinct rewrites followed by arithmetic coding achieve ratios of ~0.03 (2x better than baselines). The central contribution is Question-Asking (QA) compression: a small model iteratively asks up to 10 yes/no questions to a larger model to refine its initial response, recovering 23-72% of the capability gap on standard benchmarks and 7-38% on harder ones across 8 math/science/code tasks, with ratios 0.0006-0.004 (over 100x smaller than Deletang et al. 2024).

Significance. If the empirical claims are substantiated with full controls, this work would demonstrate that interactive bit-by-bit protocols can transfer task-relevant knowledge far more efficiently than transmitting full responses or using static compression. The reported recovery of substantial capability gaps with only 10 bits, combined with the 100x improvement over prior LLM compression, would be a notable advance for efficient model interaction and deployment in low-bandwidth settings.

major comments (3)

[Abstract and §4 (Experiments)] The abstract states quantitative results on eight benchmarks but provides no details on exact benchmark definitions, statistical significance, variance across runs, or how the capability gap is measured; central claims rest on unreported experimental controls.
[QA Compression Protocol] No ablation isolating question source (small-model generation vs. large-model answers) is described, leaving open whether the headline recovery numbers (23-72% and 7-38%) conflate the value of the answers with the small model's own question-formulation ability.
[Compression Ratios] The reported ratios of 0.0006-0.004 and the 100x improvement over Deletang et al. (2024) require explicit accounting of base model size, transmitted bits, and encoding overhead to be verifiable; these numbers are load-bearing for the compression frontier narrative.

minor comments (2)

[Notation and Metrics] Clarify the precise definition and measurement of the 'capability gap' between small and large models, including any normalization across benchmarks.
[References] Provide full citation details and year for Deletang et al. (2024) in the references.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your constructive feedback. We will revise the manuscript to address the concerns regarding experimental details, ablations, and compression calculations, as detailed in our point-by-point responses below.


Referee: [Abstract and §4 (Experiments)] The abstract states quantitative results on eight benchmarks but provides no details on exact benchmark definitions, statistical significance, variance across runs, or how the capability gap is measured; central claims rest on unreported experimental controls.

Authors: We agree that additional transparency is required. In the revised manuscript, we will expand §4 with exact benchmark definitions, report means and standard deviations across multiple runs with statistical significance tests, and explicitly define the recovered fraction of the capability gap as (small_with_QA - small) / (large - small), where each term is the corresponding model's benchmark score. These details will also be referenced from the abstract.

revision: yes

Referee: [QA Compression Protocol] No ablation isolating question source (small-model generation vs. large-model answers) is described, leaving open whether the headline recovery numbers (23-72% and 7-38%) conflate the value of the answers with the small model's own question-formulation ability.

Authors: We will add an ablation study in the revised §4 comparing the standard protocol (small model generates questions) against variants where the large model generates the questions or answers are provided independently, to isolate the contribution of question formulation from answer quality.

revision: yes

Referee: [Compression Ratios] The reported ratios of 0.0006-0.004 and the 100x improvement over Deletang et al. (2024) require explicit accounting of base model size, transmitted bits, and encoding overhead to be verifiable; these numbers are load-bearing for the compression frontier narrative.

Authors: We will revise the relevant sections to provide an explicit breakdown including base model sizes, the 10 transmitted bits plus protocol overhead, and a direct comparison table with Deletang et al. (2024) to substantiate the reported ratios and improvement factor.

revision: yes
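
The normalization mentioned in the first response, (small_with_QA - small) / (large - small), can be written as a small helper. The function name and the example accuracies are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def gap_recovered(small_acc: float, small_with_qa_acc: float, large_acc: float) -> float:
    """Fraction of the small-to-large accuracy gap closed by the QA protocol."""
    gap = large_acc - small_acc
    if gap <= 0:
        # The ratio is undefined when the large model is not better.
        raise ValueError("large model must outperform the small model")
    return (small_with_qa_acc - small_acc) / gap


# Hypothetical scores: small 60%, small with QA 75%, large 90%.
print(f"{gap_recovered(0.60, 0.75, 0.90):.2f}")  # prints: 0.50
```

A value of 0 means the questions did not help, 1 means the small model matched the large one, and a negative value means the exchange made the small model worse.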

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper presents empirical results from benchmark experiments measuring compression ratios and capability recovery percentages via an interactive QA protocol between small and large models. No equations, parameter fittings, or derivations are described that reduce outputs to inputs by construction. Claims rely on direct experimental comparisons to baselines and prior work (e.g., Deletang et al.), with no load-bearing self-citation behind the central results and no smuggled ansatz. The 23-72% recovery figures are measured outcomes, not tautological renamings or fitted predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all quantitative claims rest on unreported experimental setup.

pith-pipeline@v0.9.0 · 5542 in / 1301 out tokens · 27572 ms · 2026-05-16T05:21:49.603878+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

Relation between the paper passage and the cited Recognition theorem.

A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1]

URL https://arxiv.org/abs/2601.10678. AI-MO. Aimo validation aime, 2024. URL https://huggingface.co/datasets/AI-MO/aimo-validation-aime. Validation dataset for the AIMO Progress Prize, derived from AIME 2022–2024 problems. Anthropic. Activating ai safety level 3 protections. https://www.anthropic.com/news/activating-asl3-protections, May 2025. Online, pu...

work page doi:10.64434/tml.20250910 2024
[2]

Alex Wang, Kyunghyun Cho, and Mike Lewis

URL https://arxiv.org/abs/2306.04050. Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and answering questions to evaluate the factual consistency of summaries, 2020. URL https://arxiv.org/abs/2004.04228. Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987. doi: 10.1145/2...

work page doi:10.1145/214762.214771 2020
[3]

Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak

URL https://arxiv.org/abs/2311.04378. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild, 2024. URL https://arxiv.org/abs/2405.01470. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. G...

work page arXiv 2024
[4]

Objective (standalone) judging. The judge evaluates the SLM’s answer on its own merits, scoring solution quality on a 1–10 scale. This does not require a reference solution, but relies on the judge’s ability to assess correctness independently.

work page
[5]

Comparison judging. The judge compares the SLM’s answer against the LLM’s own solution, scoring how similar or aligned they are. This gives the judge a concrete reference, but the LLM’s own answer may itself be wrong on hard problems. We evaluate both options below, along with ablations on the quality threshold and gold-answer access. E.1.1 Quality-Thre...

work page
[6]

Quality thresholding: The iterative variant adds a judge that evaluates the SLM’s answer on a 1–10 scale (mathematical soundness, calculation correctness, reasoning clarity) after each batch of 5 questions. If the score ≥ 7, the protocol accepts the current answer and early-stops.

work page
[7]

Standard

Gold answer access: In the standard protocol, the LLM answering questions is given the gold answer as reference. In the iterative variant, the LLM generates its own solution first, then uses it as reference for answering questions. These two changes are confounded: the iterative variant both removes gold answer access and adds judging. However, since QA co...

work page
[8]

Easy-to-judge datasets (MATH Algebra: avg. score 9.7, 98.7% early stopped). The judge gives high scores and accepts the answer after just one round of 5 questions, before the SLM has received enough guidance from the Q&A exchange. The no-judge protocol would have continued for a full 10 questions, giving the SLM more information to work with.

work page
[9]

the answer improved but is still imperfect

Hard-to-judge datasets (AIME: avg. score 3.0, 80.4% not early stopped; HLE: avg. score 5.4, 49.3% not early stopped). The judge scores are persistently low, so the protocol runs all 10 questions but the final answer is still scored poorly. On AIME, problems that were not early stopped show severe regression: 7 out of 41 were initially correct, but only 1 rema...

work page
[10]

Higher threshold (≥ 9): Raising the quality threshold from 7 to 9 should reduce premature early stopping, since the judge will accept fewer answers. This isolates whether the threshold level is the primary issue.

work page
[11]

Standard

Gold-answer judge: The judge is given the gold answer for evaluation, while the LLM still answers questions using its own solution. This isolates the judge’s evaluation quality from the LLM’s question-answering quality. Higher threshold (≥ 9). Table 18 compares recovery rates under the default threshold (≥ 7) and a stricter threshold (≥ 9). The stricte...

work page 2025
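
The quality-thresholding variant quoted in item [6] above (a judge scores the answer after each batch of 5 questions and the protocol stops once the score is at least 7) can be sketched as follows. The callables `run_batch` and `judge` are hypothetical stand-ins for the model calls, and the cap of 10 questions is taken from the quoted excerpts; everything else is an assumption made for illustration.

```python
from typing import Callable, Tuple


def iterative_qa(
    draft: str,
    run_batch: Callable[[str, int], str],  # asks `n` questions, returns a revised answer
    judge: Callable[[str], float],         # scores an answer on a 1-10 scale
    threshold: float = 7,
    batch_size: int = 5,
    max_questions: int = 10,
) -> Tuple[str, int]:
    """Return the final answer and how many questions (bits) were spent."""
    asked = 0
    while asked < max_questions:
        draft = run_batch(draft, batch_size)
        asked += batch_size
        if judge(draft) >= threshold:
            break  # early stop: the judge accepts the current answer
    return draft, asked


# Stub models: each batch appends a marker; the judges are constant.
def stub_batch(draft: str, n: int) -> str:
    return draft + "+"


print(iterative_qa("a", stub_batch, judge=lambda d: 9.7))  # lenient judge: stops after 5
print(iterative_qa("a", stub_batch, judge=lambda d: 3.0))  # strict judge: runs all 10
```

The first call mirrors the easy-to-judge case in item [8], where a high score ends the exchange after one batch; raising `threshold` to 9 corresponds to the stricter setting in item [10].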
