The Detection-Extraction Gap: Models Know the Answer Before They Can Say It

Hanyang Wang; Mingxuan Zhu

arXiv: 2604.06613 · v2 · submitted 2026-04-08 · 💻 cs.CL · cs.AI · cs.IT · cs.LG · math.IT

Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

keywords: chain-of-thought reasoning, detection-extraction gap, early exit, language models, total variation bound, adaptive generation, black-box methods

The pith

Language models determine the correct answer early in chain-of-thought but generate most tokens afterward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that across multiple models, families, and benchmarks, 52 to 88 percent of chain-of-thought tokens are produced after the answer becomes recoverable from a partial prefix. This reveals a detection-extraction gap in which free continuations from early prefixes recover the correct answer, yet standard forced decoding from the same prefixes often fails. A total-variation bound quantifies the shift induced by the extraction prompt. If the gap is real, models hold internal knowledge sooner than their output text indicates, which enables early-exit strategies that reduce generation length while preserving or raising accuracy.

Core claim

Across five model configurations, two families, and three benchmarks, 52-88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. The mismatch is formalized via a total-variation bound between free and forced continuation distributions.
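To make the 52-88% statistic concrete, the sketch below computes a post-commitment fraction for a single trace from per-checkpoint agreement scores. It is a minimal illustration rather than the paper's procedure: the function names, the rule "first checkpoint whose agreement reaches theta", and the example numbers are all assumptions, since this page does not give the exact definition of "recoverable".

```python
from typing import Optional, Sequence


def commitment_point(
    fractions: Sequence[float], agreement: Sequence[float], theta: float
) -> Optional[float]:
    """Earliest prefix fraction whose agreement score reaches theta.

    Assumption (not from the paper): agreement[i] is the share of free
    continuations from the prefix at fractions[i] that reach the final answer.
    """
    if len(fractions) != len(agreement):
        raise ValueError("fractions and agreement must have the same length")
    for f, a in zip(fractions, agreement):
        if a >= theta:
            return f
    return None  # the answer never became recoverable at any checkpoint


def post_commitment_fraction(
    fractions: Sequence[float], agreement: Sequence[float], theta: float
) -> float:
    """Share of the trace generated after the commitment point (0.0 if none)."""
    f = commitment_point(fractions, agreement, theta)
    return 0.0 if f is None else 1.0 - f


# Hypothetical trace probed at nine checkpoints (10% to 90% of the tokens).
fractions = [i / 10 for i in range(1, 10)]
agreement = [0.25, 0.5, 0.75, 0.875, 1.0, 1.0, 1.0, 1.0, 1.0]
print(post_commitment_fraction(fractions, agreement, theta=0.625))
```

In this made-up example the agreement first reaches the threshold at the 30% checkpoint, so roughly 70% of the trace counts as post-commitment. A stricter rule (for instance, requiring agreement to stay above the threshold afterwards) would give a later commitment point.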

What carries the argument

The detection-extraction gap: the mismatch between answers recoverable via free continuations from early prefixes and the failure of prompt-conditioned decoding to extract the same answers.

If this is right

Black-box Adaptive Early Exit truncates 70-78% of serial generation while improving accuracy by 1-5 percentage points.
For thinking-mode models, early exit prevents post-commitment overwriting and yields gains up to 5.8 percentage points.
A cost-optimized variant achieves a 68-73% reduction at a median of 9 API calls.
The total-variation bound provides quantitative estimates of how much the suffix prompt shifts the continuation distribution.
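The early-exit idea in the list above can be sketched as an offline majority-vote loop. This is a hedged illustration, not the authors' implementation: `early_exit_offline`, the `Sampler` hook, and `toy_sampler` are hypothetical names, the full trace is assumed to be available in advance (an offline simulation), and the defaults of 8 samples and a 0.625 threshold are only example values.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical black-box hook: given a prefix of the trace and a sample count k,
# return the final answers parsed from k free continuations (None if unparsable).
Sampler = Callable[[Sequence[str], int], List[Optional[str]]]


def early_exit_offline(
    trace: Sequence[str],
    sample_answers: Sampler,
    checkpoints: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
    k: int = 8,
    theta: float = 0.625,
) -> Tuple[Optional[str], float]:
    """Return (answer, exit_fraction); (None, 1.0) means no early exit fired."""
    for f in checkpoints:
        prefix = trace[: max(1, round(f * len(trace)))]
        answers = [a for a in sample_answers(prefix, k) if a is not None]
        if not answers:
            continue
        top_answer, votes = Counter(answers).most_common(1)[0]
        # Exit once the majority answer holds at least theta of the k samples.
        if votes / k >= theta:
            return top_answer, f
    return None, 1.0


def toy_sampler(prefix: Sequence[str], k: int) -> List[Optional[str]]:
    """Stand-in for a model API: disagrees early, converges after 30 tokens."""
    if len(prefix) < 30:
        return [str(i) for i in range(k)]
    return ["42"] * (k - 2) + ["7"] * 2


trace = ["tok"] * 100
print(early_exit_offline(trace, toy_sampler))
```

The same free continuations serve both to detect agreement and to supply the returned answer, which is the property the page attributes to BAEE. Each checkpoint costs k sampled continuations, so the loop trades serial depth for parallel width.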

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standard decoding may systematically delay the surfacing of knowledge that is already present internally.
Similar gaps could appear in tasks outside benchmarks, such as factual recall or code generation.
Direct use of hidden-state probes might bypass text generation and detect knowledge even earlier than free continuations.

Load-bearing premise

Free continuations from early prefixes faithfully reflect the model's internal knowledge state without distortion by the particular continuation prompt or sampling procedure.

What would settle it

If forced extraction from the identical early prefixes recovers the answer at the same rate as free continuations, the claimed gap would not exist.

Figures

Figures reproduced from arXiv: 2604.06613 by Hanyang Wang, Mingxuan Zhu.

**Figure 1.** Two core findings. (a) Commitment map for 32B-Think: 69% of CoT tokens follow the commitment boundary (75% under PSC threshold; …

**Figure 2.** Method pipeline. At prefix fraction f=0.1, PSC (free continuation) recovers the correct answer 82% of the time, while EFA (forced extraction) succeeds on only 34%. BAEE uses free continuations for both detection and extraction, sidestepping the gap entirely.

**Figure 4.** Commitment maps. (a) 32B-Think: 69% post-commitment.

**Figure 3.** Main results. (a) EFA accuracy by prefix length. (b) PSC agreement (>70% from the first checkpoint). (c) Commitment distributions (Think median ∼25%, NoThink ∼40%). (d) Commitment increases monotonically with problem difficulty

**Figure 6.** MATH-500 vs GPQA-Diamond. (a,b) PSC vs EFA across prefixes. (c,d) Commitment distributions.

**Figure 5.** (a) Failure taxonomy for 208 gap instances.

**Figure 7.** Entropy analysis. (a) Per-token entropy along the CoT. (b) Wrong problems (dashed) show higher entropy.

**Figure 8.** Offline majority-vote BAEE under aggressive thresholds. (a) Accuracy change vs full CoT across …

**Figure 9.** Outcome breakdown at θ = 0.625 (offline majority-vote simulation). Green = overthinking corrected (wrong under full CoT, correct under BAEE); red = BAEE harmed (correct under full CoT, wrong under BAEE); blue = always correct; gray = always wrong. [E.2 Threshold Sweep and Operating Points] To reduce post-hoc thresholding concerns, we report a fixed sweep over θ ∈ {1/8, 2/8, …, 1} (aligned with 8-sample …

**Figure 10.** Threshold-sweep frontier under PSC-8. Left: NoThink FP–savings trade-off on 500-problem runs. Right: …

**Figure 11.** Discriminative signals for filtering high- …

**Figure 12.** PSC monotonicity: MATH-500 (non-monotone) vs GPQA-Diamond (monotone).

**Figure 13.** Post-commitment fraction across MATH-500 and GPQA-Diamond. Solid bars = MATH-500, hatched = …

**Figure 14.** Commitment map for GPQA-Diamond (same visualization as Figure 4). The commitment boundary …

**Figure 15.** HumanEval results. (a) PSC agreement across prefix fractions. (b) EFA accuracy (code extraction is harder than math extraction). (c) Commitment distributions: all models commit by 10–20%. (d) Detection–extraction gap persists on code generation.

**Figure 16.** Post-commitment fraction across three benchmarks (MATH-500, GPQA-Diamond, HumanEval) and five …

Original abstract

Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52-88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection-extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70-78% of serial generation while improving accuracy by 1-5pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8pp; a cost-optimized variant achieves 68-73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Models often know the answer after 10-20% of CoT tokens but standard prompts fail to extract it, and this paper turns that into a practical early-exit method that cuts tokens while holding or improving accuracy.

The paper's core finding is that LLMs in chain-of-thought mode often settle on the correct answer after generating only 10-20% of the trace, but a forced-extraction prompt applied at that point often fails to elicit it. Free sampling from those early prefixes recovers the answer far more often, creating what they call the detection-extraction gap. They back this up with results on five model setups, two families, and three benchmarks, where 52-88% of tokens come after the answer is already recoverable. The total-variation bound formalizes the shift between free and forced continuations, and they turn the asymmetry into Black-box Adaptive Early Exit (BAEE). BAEE cuts serial generation by 70-78% and lifts accuracy by 1-5 points across the reported models, with gains of up to 5.8 points on thinking-mode models by avoiding later overwriting. The work is practical and the code is out, which helps. The numbers look consistent across the reported setups.

The main soft spot is whether the free-continuation prompt is truly neutral. If the template used for detection adds its own pressure toward an answer, then the early recovery rates could be inflated, and the gap partly reflects the measurement method rather than pure model behavior. The paper would benefit from testing a few different detection prompts or showing that the effect holds without extra instructions. The circularity in deriving the TV bound from the same free/forced pairs is minor but worth noting.

This is for people focused on inference efficiency and CoT optimization. It deserves a serious referee because the empirical claim is testable and the method is simple enough to adopt if it holds.

Referee Report

3 major / 2 minor

Summary. The paper claims that reasoning models determine the correct answer early in chain-of-thought traces (52-88% of tokens generated post-recovery across five configurations, two families, and three benchmarks), yet forced extraction from the same prefixes fails in 42% of cases. It identifies this as a detection-extraction gap, formalizes the mismatch via a total-variation bound on suffix-induced distributional shift between free and forced continuations, and introduces Black-box Adaptive Early Exit (BAEE) that uses free sampling for both detection and extraction to truncate 70-78% of generation while gaining 1-5pp accuracy (up to 5.8pp for thinking-mode models).

Significance. If the core empirical pattern holds, the work identifies a structural inefficiency in how current models commit to and surface answers during extended reasoning, with direct implications for inference optimization. The BAEE method provides a practical, black-box intervention that reduces serial token generation while preserving or improving accuracy, and the total-variation framing offers a quantitative handle on prompt-induced shifts that could generalize beyond the reported benchmarks.

major comments (3)

[Experimental protocol section] The central statistic (52-88% post-commitment tokens) relies on free-continuation detection from 10%-prefixes, yet the manuscript provides no ablation or control showing that the continuation template itself is distributionally neutral relative to the original CoT prompt; any cue introduced by the detection prompt would directly inflate early-recovery rates and render the detection-extraction gap partly an artifact of the measurement procedure rather than an intrinsic model property.
[§4 (total-variation bound)] The quantitative estimates of suffix-induced shift are computed from the same observed free/forced continuation pairs used to measure the gap, creating a circular dependence that prevents the bound from serving as an independent validation of the claimed asymmetry.
[Results tables, e.g., Table 2 or equivalent] The reported accuracy gains for BAEE (1-5pp) and early-exit reductions (70-78%) lack error bars, number of runs, or statistical significance tests, so it is impossible to determine whether the improvements are robust across the five model configurations or sensitive to sampling variance.
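As an illustration of the kind of uncertainty reporting the third comment asks for, the sketch below estimates a percentile confidence interval for a paired accuracy gain with a bootstrap over problems. It is one common option under stated assumptions, not something the paper reports: the function name, the resample count, and the example outcomes are hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_gain(
    baseline_correct: Sequence[bool],
    method_correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gain, CI lower, CI upper) for method minus baseline accuracy.

    Both sequences must score the same problems in the same order, so each
    resample keeps a problem's two outcomes together (a paired bootstrap).
    """
    if len(baseline_correct) != len(method_correct) or len(baseline_correct) == 0:
        raise ValueError("need two non-empty, equally long outcome lists")
    n = len(baseline_correct)
    diffs = [int(m) - int(b) for b, m in zip(baseline_correct, method_correct)]
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Hypothetical outcomes on 100 problems: 80 correct at baseline, 85 with the method.
baseline = [True] * 80 + [False] * 20
method = [True] * 85 + [False] * 15
gain, lo, hi = paired_bootstrap_gain(baseline, method)
print(f"gain={gain:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

An interval that includes zero would mean the gain cannot be separated from sampling variance at that sample size. Resampling over problems does not capture run-to-run decoding variance, which would need repeated runs as well.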

minor comments (2)

[Notation] The term 'recoverable' is used throughout without a precise, reproducible definition (e.g., majority-vote threshold, exact sampling parameters, or prefix-length schedule).
[Results section] The abstract and introduction cite 'three benchmarks' but the results section does not tabulate per-benchmark breakdowns, making it hard to assess whether the 52-88% range is driven by a single outlier task.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and commit to revisions that strengthen the manuscript while remaining faithful to our original findings and experiments.

Referee: Experimental protocol section: the central statistic (52-88% post-commitment tokens) relies on free-continuation detection from 10%-prefixes, yet the manuscript provides no ablation or control showing that the continuation template itself is distributionally neutral relative to the original CoT prompt; any cue introduced by the detection prompt would directly inflate early-recovery rates and render the detection-extraction gap partly an artifact of the measurement procedure rather than an intrinsic model property.

Authors: We appreciate the concern about potential distributional effects from the continuation template. Our template is a minimal continuation prompt ('Continue the reasoning from here:') chosen to avoid new instructions. To address this rigorously, we will add an ablation in the revised experimental protocol section comparing multiple neutral templates (empty continuation, varied phrasings, and direct suffix sampling) across a subset of configurations. This will test whether early-recovery rates are robust to template choice rather than inflated by it, and therefore whether the gap is an intrinsic property.
revision: yes

Referee: §4 (total-variation bound): the quantitative estimates of suffix-induced shift are computed from the same observed free/forced continuation pairs used to measure the gap, creating a circular dependence that prevents the bound from serving as an independent validation of the claimed asymmetry.

Authors: We agree that the total-variation estimates are derived from the same observed pairs and thus serve as a formalization of the measured shift rather than an independent validation. We will revise Section 4 to explicitly clarify this distinction, stating that the bound provides a quantitative handle on the free-vs-forced distributional mismatch to support interpretation of the detection-extraction gap, while the primary evidence remains the direct empirical comparison of answer recovery rates.
revision: partial

Referee: Results tables (e.g., Table 2 or equivalent): the reported accuracy gains for BAEE (1-5pp) and early-exit reductions (70-78%) lack error bars, number of runs, or statistical significance tests, so it is impossible to determine whether the improvements are robust across the five model configurations or sensitive to sampling variance.

Authors: We acknowledge the need for statistical robustness reporting. In the revised manuscript we will add error bars (standard deviation over multiple independent runs), explicitly state the number of runs per configuration, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the accuracy gains and token reductions. These will be incorporated into the results tables and discussed in the text.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on direct empirical measurements.

The paper's primary results consist of empirical statistics (52-88% post-commitment tokens, 42% forced-extraction failures) obtained by running free-continuation and forced-extraction experiments across five model configurations, two families, and three benchmarks. The total-variation bound is introduced only as a post-hoc formalization of the observed free/forced mismatch and does not generate any new quantitative predictions that reduce to fitted parameters or prior self-citations by construction. The BAEE method is a practical heuristic that exploits the measured asymmetry rather than deriving its performance from the same inputs it measures. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided abstract or description that would collapse the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that free continuations recover answers earlier than forced ones, plus the assumption that the total-variation distance meaningfully quantifies the suffix-induced distribution shift. No explicit free parameters or new invented entities are stated in the abstract.
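For readers who want the total-variation quantity in concrete terms: for two discrete distributions P and Q over answers, TV(P, Q) = 1/2 · Σ |P(x) − Q(x)|, and for any event A (such as "the answer is correct") |P(A) − Q(A)| ≤ TV(P, Q). The sketch below computes a plug-in estimate from sampled answers. It illustrates the standard definition only and is not the paper's bound, which is not reproduced on this page; the function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def empirical_tv(
    free_answers: Sequence[Hashable], forced_answers: Sequence[Hashable]
) -> float:
    """Plug-in total-variation distance between two empirical answer distributions."""
    if len(free_answers) == 0 or len(forced_answers) == 0:
        raise ValueError("both samples must be non-empty")
    p, q = Counter(free_answers), Counter(forced_answers)
    n_p, n_q = len(free_answers), len(forced_answers)
    # Counter returns 0 for answers that one sample never produced.
    return 0.5 * sum(abs(p[a] / n_p - q[a] / n_q) for a in set(p) | set(q))


# Hypothetical answers parsed from 8 free and 8 forced continuations of one prefix.
free = ["42"] * 7 + ["7"]
forced = ["42"] * 3 + ["7"] * 3 + ["4"] * 2
tv = empirical_tv(free, forced)

# The gap in the probability of the correct answer ("42") cannot exceed TV.
success_gap = abs(free.count("42") / len(free) - forced.count("42") / len(forced))
print(tv, success_gap)
```

With only a handful of samples per prefix, a plug-in estimate like this tends to overstate the true distance, so it is better read as a rough indicator than as a precise measurement.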

axioms (1)

Domain assumption: free continuations from partial prefixes recover the model's internal answer state more reliably than prompt-conditioned forced extraction.
This is the load-bearing premise that defines the detection-extraction gap and justifies BAEE.

pith-pipeline@v0.9.0 · 5537 in / 1203 out tokens · 62910 ms · 2026-05-10T18:16:15.172075+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Passage: Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases... formalized via a total-variation bound between free and forced continuation distributions

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Passage: BAEE: Black-box Adaptive Early Exit... truncating 70–78% of serial generation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

commitment: PSC measures behavioral recoverability, which upper-bounds latent commitment (§5)

Recoverability vs. commitment: PSC measures behavioral recoverability, which upper-bounds latent commitment (§5). Multiple controls (difficulty stratification, common-solved subsets, and three-benchmark validation) partially address this gap. (Preprint, April 10, 2026)

work page 2026
[2]

To assess sensitivity, we run finer-grained probing on 50 problems with checkpoints at {2%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 40%, 50%} (Appendix K)

Temporal resolution: The main experiments use 9 checkpoints (10%–90%). To assess sensitivity, we run finer-grained probing on 50 problems with checkpoints at {2%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 40%, 50%} (Appendix K). PSC agreement at 2% already reaches 90%, confirming that the 10% grid does not artificially inflate post-commitment fractio...

work page
[3]

Competition-level mathematics and common-sense reasoning remain future work

Benchmark scope: Results span MATH-500, GPQA-Diamond, and HumanEval (Appendices O, Q), covering math, science, and code generation. Competition-level mathematics and common-sense reasoning remain future work

work page
[4]

White-box comparison: Our indirect comparison (§5) shows estimates consistent with white-box reports; a direct comparison on the same model is future work

work page
[5]

EFA suffix bias: 9–16% of EFA probes on unsolvable problems return “correct” answers; this is accounted for in our gap analysis and does not affect PSC-based metrics

work page
[6]

answer}

Total-token cost: BAEE trades serial depth for parallel width, increasing total tokens 3–5× under empirical continuation lengths (MATH-500: 3.6–5.0×; GPQA-Diamond: 3.1–5.0×; §4.5.3), always below SC-8-full’s fixed 8.0×. Under parallel execution (standard in API deployments), the latency reduction (63–76%) is the operationally relevant metric; in token-budg...

work page 2024
[7]

commitments

Early-agreement gate: require PSC@10% ≥ 0.50. This eliminates 74.2% of FPs while retaining 89.2% of TPs. The intuition is that genuine commitments are evident from the earliest checkpoint; late-onset “commitments” are suspect

work page
[8]

This eliminates 93.9% of FPs but also removes 39.0% of TPs, making it too aggressive for general use but effective as a high-confidence filter

Monotonicity check: require ≤ 2 PSC drops across the trajectory. This eliminates 93.9% of FPs but also removes 39.0% of TPs, making it too aggressive for general use but effective as a high-confidence filter

work page
[9]

\boxed{” which biases the model toward emitting whatever is currently “closest to an answer

Variance + non-monotonicity: flag if PSC variance > 0.06 and drops ≥ 3. This catches 54.5% of FPs while losing only 9.6% of TPs, making it a practical operating point for deployment. Practical recommendation. For deployment on new domains, we recommend a two-stage protocol: (1) apply the standard θ threshold for early exit; (2) post-filter triggered problems ...

work page 2026
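The three false-positive filters quoted in entries [7] to [9] can be sketched as simple checks on a PSC trajectory. This is a hedged illustration: the numeric thresholds come from the quoted passages, but the function names, the definition of a "drop" (any decrease between consecutive checkpoints), and the use of population variance are assumptions that the excerpts do not settle.

```python
from statistics import pvariance
from typing import Sequence


def count_drops(psc: Sequence[float]) -> int:
    """Number of consecutive checkpoint pairs where PSC agreement decreases."""
    return sum(1 for prev, cur in zip(psc, psc[1:]) if cur < prev)


def passes_early_agreement_gate(psc: Sequence[float], min_first: float = 0.50) -> bool:
    """Entry [7]: require agreement at the first (10%) checkpoint >= 0.50."""
    return psc[0] >= min_first


def passes_monotonicity_check(psc: Sequence[float], max_drops: int = 2) -> bool:
    """Entry [8]: require at most 2 drops across the trajectory."""
    return count_drops(psc) <= max_drops


def flagged_by_variance_rule(
    psc: Sequence[float], var_threshold: float = 0.06, min_drops: int = 3
) -> bool:
    """Entry [9]: flag if variance > 0.06 and drops >= 3 (population variance assumed)."""
    return pvariance(psc) > var_threshold and count_drops(psc) >= min_drops


# Two hypothetical trajectories over nine checkpoints (10% to 90%).
steady = [0.75, 0.875, 0.875, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
noisy = [0.25, 1.0, 0.125, 1.0, 0.25, 0.875, 0.125, 1.0, 1.0]

for name, psc in (("steady", steady), ("noisy", noisy)):
    print(
        name,
        passes_early_agreement_gate(psc),
        passes_monotonicity_check(psc),
        flagged_by_variance_rule(psc),
    )
```

In this toy data the steady trajectory passes both gates and is not flagged, while the noisy one fails both gates and is flagged, which matches the stated intuition that genuine commitments are visible from the earliest checkpoint.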
[10]

answer now

Premature termination (59% of failures): the model emits a short (≤ 2 character) output, typically a single number or symbol. This suggests the forcing suffix triggers an “answer now” reflex that bypasses the model’s normal multi-step evaluation. The analogy is forcing a student to write a final answer mid-calculation: they write whatever is on their scrat...

work page
[11]

fast-forward

Intermediate-value extraction (30%): the model outputs a recognizable intermediate result (e.g., an unsimplified expression, a partial sum, or the result of the first step of a nested computation). These failures are informative: the model has clearly begun the correct computation but has not yet completed it. The gap here is temporal, not informational: th...

work page
[12]

near-misses

Sign/parity errors (11%): the model produces an answer with the correct magnitude but wrong sign, parity, or off-by-one index. These are the closest to “near-misses” and suggest the forcing suffix disrupts bookkeeping operations (tracking alternating signs, counting iterations) that the model maintains implicitly during free generation. Why free continuati...

work page 2026
[13]

The null-prefix baseline is substantially lower (59–78% vs 88% on MATH), confirming that GPQA requires genuine multi-step reasoning that cold-start sampling cannot easily replicate

work page
[14]

Wait”, “Alternatively

The prefix’s incremental contribution grows with prefix length: GPT-OSS gains +3.4 pp at f = 0.10 but +8.8 pp at f = 0.50, consistent with GPQA’s monotonically increasing PSC trajectory (§4.3). Together, these results paint a coherent picture: the prefix encodes a progressively richer representation of the model’s computation. On easy benchmarks (MATH), the mod...

work page 2026
