
arxiv: 2605.09329 · v2 · pith:PS6UM5SLnew · submitted 2026-05-10 · 💻 cs.CL · cs.LG

Test-Time Speculation

Pith reviewed 2026-05-20 22:49 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords speculative decoding · test-time adaptation · online distillation · LLM inference · acceptance length · draft model · target model

The pith

Adapting the draft model online during generation keeps acceptance lengths high for long LLM outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding speeds up inference by letting a fast draft model propose several tokens that a slower target model then verifies in a single parallel pass. With current speculators, the number of tokens accepted per round falls toward one after only a few thousand output tokens, because they are trained offline on short sequences. This paper proposes test-time speculation, an online method that treats the target model's verification outputs as a free training signal for updating the draft model continuously as generation proceeds. Each round of speculation supplies data for a small adaptation step that improves how well the draft matches the target on the current long sequence. The result is higher acceptance lengths, with gains over fixed speculators that grow as the output gets longer.
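
To make "acceptance length" concrete, here is a minimal sketch of one verification round under greedy decoding. It assumes the common convention that a round commits the matching draft prefix plus one token chosen by the target, which is consistent with an acceptance length near 1 meaning no speedup. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def accepted_prefix_len(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Count leading draft tokens that equal the target's greedy choice."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def tokens_committed(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Return the tokens one round adds to the output.

    Assumption: target_tokens holds the target's greedy token at each draft
    position plus one extra position, so it has len(draft_tokens) + 1 entries.
    The round keeps the matching prefix and then the target's own next token.
    """
    k = accepted_prefix_len(draft_tokens, target_tokens)
    return list(target_tokens[: k + 1])


# Illustrative token ids only: the draft matches at two positions, then diverges.
committed = tokens_committed([5, 7, 9], [5, 7, 2, 4])
acceptance_length = len(committed)
```

When the draft diverges at the first position, the round still commits the target's single token, which is the degenerate case the summary describes.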

Core claim

By running online distillation at test time, where each verification step supplies the target model's outputs as training labels for the draft model at no extra cost, the speculator adapts to the specific long generation in progress and thereby maintains substantially longer acceptance lengths than any fixed offline-trained speculator.

What carries the argument

Test-Time Speculation (TTS), the online adaptation loop that updates the draft model after each verification round using the target's token predictions as supervision.
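
The adaptation loop can be illustrated with a toy example. The sketch below assumes a draft that is a single softmax over three tokens and a fixed target distribution for one context, and it uses a plain gradient step on KL(target || draft). The loss direction, the learning rate, and the update rule are all assumptions made for illustration; this page does not specify the paper's loss, optimizer, or which draft parameters are updated.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(draft_logits: List[float], target_probs: List[float], lr: float) -> List[float]:
    """One gradient step on KL(target || softmax(draft_logits)).

    The gradient with respect to the logits is softmax(logits) - target_probs.
    """
    q = softmax(draft_logits)
    return [z - lr * (qi - pi) for z, qi, pi in zip(draft_logits, q, target_probs)]


# Toy setup (hypothetical numbers): the teacher distribution would come from
# the target model's verification pass; the student starts out uniform.
target_probs = [0.7, 0.2, 0.1]
draft_logits = [0.0, 0.0, 0.0]
kl_history = []
for _ in range(50):
    kl_history.append(kl_divergence(target_probs, softmax(draft_logits)))
    draft_logits = distill_step(draft_logits, target_probs, lr=0.5)
```

In this toy setting the divergence shrinks round after round, which is the mechanism the summary attributes to TTS. A real draft model would need a backward pass through its network for each update.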

If this is right

  • Acceptance lengths rise by up to 72 percent over current state-of-the-art speculators on the tested models.
  • Average gains reach 41 percent across Qwen-3, Qwen-3.5, and Llama 3.1 families.
  • The improvement grows larger as the number of generated tokens increases.
  • Speculative decoding retains useful speedup on long-response tasks where it previously provided almost none.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verification-driven update idea could be applied to other inference-time components that drift from their training distribution during extended outputs.
  • Choosing different frequencies or learning rates for the online updates might further improve the speed-accuracy trade-off.
  • This form of test-time adaptation may become standard for any draft model used in production systems that handle variable-length responses.

Load-bearing premise

The verification steps already provide enough training signal to improve the draft model's accuracy over time without adding net latency or causing instability in the updates.

What would settle it

Running TTS on a long-generation benchmark and observing that acceptance length still drops to near one after several thousand tokens, or that the added update steps increase total latency, would falsify the claim.
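
One way to run the first half of this check is to log each speculation round and average acceptance length by output position. The sketch below assumes a hypothetical log of `(output_position_at_round_start, tokens_committed)` pairs and a bucket width of 1000 tokens. The sample values are synthetic placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def acceptance_by_bucket(
    rounds: Iterable[Tuple[int, int]], bucket_size: int = 1000
) -> Dict[int, float]:
    """Mean tokens committed per round, grouped by output position.

    Keys are the starting position of each bucket (0, 1000, 2000, ...).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket index -> [token sum, round count]
    for position, committed in rounds:
        bucket = position // bucket_size
        totals[bucket][0] += committed
        totals[bucket][1] += 1
    return {b * bucket_size: s / n for b, (s, n) in sorted(totals.items())}


# Synthetic log for illustration only.
sample_rounds = [(0, 4), (10, 2), (1500, 1), (1800, 2)]
per_bucket = acceptance_by_bucket(sample_rounds)
```

If the later buckets stay near 1 with TTS enabled, the collapse has not been fixed. The latency half of the check needs separate wall-clock measurements.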

Figures

Figures reproduced from arXiv: 2605.09329 by Avinash Kumar, Poulami Das, Sujay Sanghavi.

Figure 1: (a) Acceptance length (AL) for the LiveCodeBench dataset on Qwen3-8B with increasing …
Figure 2: Acceptance Length of four tasks using (a) DFlash, (b) EAGLE-3, and (c) PARD speculators …
Figure 3: Acceptance Length of four tasks using DFlash speculator on (a) Qwen3.5-35B, (b) Qwen3.6- …
Figure 4: Distribution entropy (in nats) for Llama3.1-8B (target) with EAGLE-3 (draft). (a) Target …
Figure 5: Acceptance Length of TTS versus DFlash for (a) AIME 2024 and (b) LiveCodeBench on …
Figure 6: (a) Acceptance Length (AL) of TTS on Qwen3-8B with optimization steps per round ( …
Figure 7: Execution timeline of TTS with strided updates and asynchronous pipelining. Every …
Original abstract

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3, and PARD, degrades with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test-time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$, and by $41\%$ on average, with the benefits scaling with increased generation lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that acceptance lengths in speculative decoding degrade with longer generation sequences because speculators are trained offline on short sequences. It proposes Test-Time Speculation (TTS), an online distillation method that adapts the draft model during inference using target-model verification signals, which the paper describes as coming at no additional cost. Experiments on the Qwen-3, Qwen-3.5, and Llama3.1 families show acceptance length improvements of up to 72%, and 41% on average, with benefits growing as generation length increases.

Significance. If the no-cost adaptation claim holds and delivers net wall-clock gains, TTS could extend speculative decoding to long-form tasks where current methods fail. The multi-model family evaluation is a strength, but missing experimental details limit assessment of practical impact.

major comments (2)
  1. [Abstract] Abstract: The claim that the verification step supplies the training signal 'at no additional cost' is load-bearing for the net speedup argument. Any online update (gradient step, LoRA, or optimizer) on the draft model requires forward/backward passes beyond the target verification pass; without a timing or FLOPs breakdown separating these, it is unclear whether the reported acceptance-length gains already net out adaptation overhead or apply only to long generations.
  2. [Results] Results: The stated 41% average and up to 72% gains, plus the scaling-with-length claim, are presented without tables, error bars, run counts, or exact generation-length ranges. This prevents verification that improvements are robust rather than post-hoc and directly affects the central empirical contribution.
minor comments (2)
  1. [Abstract] Abstract: 'speculators' is used without explicit definition on first occurrence; clarify whether it denotes the draft model, the full algorithm, or both.
  2. Consider adding a plot of acceptance length versus output length for TTS versus baselines to make the scaling observation visually concrete.
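
The first major comment can be made concrete with a rough cost model. The sketch below assumes that all costs are expressed relative to one target forward pass, that verifying a block of draft tokens costs about one such pass, and that the baseline is plain autoregressive decoding at one pass per token. The function names and the example numbers are hypothetical; measured timings would replace them.

```python
def estimated_speedup(
    acceptance_length: float,
    draft_tokens_per_round: int,
    draft_cost: float,
    update_cost: float = 0.0,
) -> float:
    """Tokens per unit cost relative to autoregressive decoding.

    Assumption: one round costs one target pass (1.0), plus the draft passes,
    plus any online-update work charged to that round.
    """
    round_cost = 1.0 + draft_tokens_per_round * draft_cost + update_cost
    return acceptance_length / round_cost


def breakeven_update_cost(
    al_before: float, al_after: float, draft_tokens_per_round: int, draft_cost: float
) -> float:
    """Per-round update cost at which adaptation neither helps nor hurts."""
    base_cost = 1.0 + draft_tokens_per_round * draft_cost
    return base_cost * (al_after / al_before - 1.0)


# Hypothetical numbers: adaptation lifts acceptance length from 2.0 to 3.0.
budget = breakeven_update_cost(2.0, 3.0, draft_tokens_per_round=4, draft_cost=0.05)
with_update = estimated_speedup(3.0, 4, 0.05, update_cost=budget)
without_update = estimated_speedup(2.0, 4, 0.05)
```

Under this model an acceptance-length gain yields a net speedup only if the per-round update cost stays below the break-even budget, which is the breakdown the referee asks the authors to report.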

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the empirical support and cost analysis without overstating our current results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the verification step supplies the training signal 'at no additional cost' is load-bearing for the net speedup argument. Any online update (gradient step, LoRA, or optimizer) on the draft model requires forward/backward passes beyond the target verification pass; without a timing or FLOPs breakdown separating these, it is unclear whether the reported acceptance-length gains already net out adaptation overhead or apply only to long generations.

    Authors: We appreciate the referee highlighting the importance of net wall-clock impact. The target verification pass is performed in any case during speculative decoding and supplies the necessary logits for the distillation signal. However, the subsequent gradient-based update on the draft model does require additional forward and backward computation on the smaller draft model. We agree that a quantitative breakdown is needed to substantiate the net-gain claim. In the revision we will add a dedicated subsection with measured wall-clock overhead and FLOPs for the adaptation steps across generation lengths, showing that the overhead remains small relative to target-model savings and that net speedup improves with sequence length. revision: yes

  2. Referee: [Results] Results: The stated 41% average and up to 72% gains, plus the scaling-with-length claim, are presented without tables, error bars, run counts, or exact generation-length ranges. This prevents verification that improvements are robust rather than post-hoc and directly affects the central empirical contribution.

    Authors: We agree that the current presentation lacks sufficient detail for independent verification. The revised manuscript will include full tables of acceptance lengths for each model family and generation-length bucket, report means and standard deviations over five independent runs, and explicitly state the tested output-length ranges (512–4096 tokens). We will also add a plot of acceptance length versus generation length to illustrate the scaling trend with error bands. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical online adaptation with independent experimental support

Full rationale

The paper describes TTS as an online distillation process that reuses the existing target-model verification step for draft adaptation, with performance gains demonstrated through direct experiments on acceptance lengths across Qwen and Llama families. No equations, derivations, or fitted parameters are shown that reduce the claimed improvements to self-referential quantities by construction. The method is presented as an empirical procedure whose benefits are measured externally rather than defined into existence, and no load-bearing self-citations or uniqueness theorems are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that verification calls supply usable training signals for effective online updates without hidden costs or degradation.

axioms (1)
  • domain assumption: The token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost.
    Explicitly stated as the key insight enabling zero-cost adaptation.

pith-pipeline@v0.9.0 · 5807 in / 1307 out tokens · 48062 ms · 2026-05-20T22:49:51.137412+00:00 · methodology


