Test-Time Speculation

Avinash Kumar; Poulami Das; Sujay Sanghavi

arxiv: 2605.09329 · v2 · pith:PS6UM5SLnew · submitted 2026-05-10 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-20 22:49 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords speculative decoding · test-time adaptation · online distillation · LLM inference · acceptance length · draft model · target model

The pith

Adapting the draft model online during generation keeps acceptance lengths high for long LLM outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding speeds up inference by letting a fast draft model propose several tokens that a slower target model then verifies together in a single forward pass. The number of tokens accepted per round falls toward one after only a few thousand output tokens, because current speculators are trained offline on short sequences. This paper proposes test-time speculation, an online method that treats the target model's verification outputs as a free training signal and updates the draft model continuously as generation proceeds. Each round of speculation supplies data for a small adaptation step that improves how well the draft matches the target on the current long sequence. The result is higher acceptance lengths, with gains over fixed speculators that grow with output length rather than collapsing.
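The draft-then-verify round described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses greedy (exact-match) verification, the names `speculate_round`, `toy_target`, and `toy_draft` are hypothetical, and the per-position loop over the target only emulates what a real system would obtain from one batched forward pass. The length of the returned list is the acceptance length under the convention that counts the target's own token, so a value of 1 means no speedup.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a model maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculate_round(prefix: List[int], draft: NextToken,
                    target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: compare each proposal with the target's own choice.
    committed: List[int] = []
    for i, token in enumerate(proposed):
        expected = target(prefix + proposed[:i])
        if token != expected:
            # First mismatch: keep the target's token and end the round.
            committed.append(expected)
            return committed
        committed.append(token)

    # Every proposal matched, so the target contributes one bonus token.
    committed.append(target(prefix + proposed))
    return committed


def toy_target(prefix: Sequence[int]) -> int:
    """Toy target (assumption): counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: Sequence[int]) -> int:
    """Toy draft (assumption): wrong whenever the prefix length is a multiple of 4."""
    return -1 if len(prefix) % 4 == 0 else (prefix[-1] + 1) % 10


good_round = speculate_round([0], toy_draft, toy_target, k=4)
bad_round = speculate_round([0, 1, 2, 3], toy_draft, toy_target, k=4)
print(len(good_round), len(bad_round))
```

In this toy run the first round commits four tokens (three accepted drafts plus the target's correction), while the second commits only the target's own token, which is the collapsed regime the paper attributes to long outputs.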

Core claim

By running online distillation at test time, where each verification step supplies the target model's outputs as training labels for the draft model at no extra cost, the speculator adapts to the specific long generation in progress and thereby maintains substantially longer acceptance lengths than any fixed offline-trained speculator.

What carries the argument

Test-Time Speculation (TTS), the online adaptation loop that updates the draft model after each verification round using the target's token predictions as supervision.
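One way to picture a single adaptation step is as ordinary distillation on one token position. The sketch below is an assumption-laden toy, not the paper's update rule: the summary does not specify the loss, optimizer, or which draft parameters are trained, so this example treats the draft's logits for one context as the trainable parameters, uses cross-entropy against the target's distribution, and applies plain gradient descent. The helper names are hypothetical. The acceptance probability `sum(min(p, q))` is the standard per-token acceptance rate of speculative sampling.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a small vocabulary."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_step(draft_logits: List[float], target_probs: List[float],
                 lr: float) -> List[float]:
    """One gradient step on cross-entropy(target_probs, softmax(draft_logits))."""
    draft_probs = softmax(draft_logits)
    # The gradient of that cross-entropy with respect to the logits is (q - p).
    return [z - lr * (q - p)
            for z, q, p in zip(draft_logits, draft_probs, target_probs)]


def acceptance_prob(draft_probs: List[float], target_probs: List[float]) -> float:
    """Per-token acceptance probability under speculative sampling."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))


# Hypothetical three-token vocabulary; the target distribution stands in for
# the probabilities the verification pass already computed.
target_probs = [0.7, 0.2, 0.1]
draft_logits = [0.0, 0.0, 0.0]

before = acceptance_prob(softmax(draft_logits), target_probs)
for _ in range(200):
    draft_logits = distill_step(draft_logits, target_probs, lr=0.5)
after = acceptance_prob(softmax(draft_logits), target_probs)
print(before, after)
```

The design point the toy makes is that the supervision (`target_probs`) comes from a pass that speculative decoding performs anyway; the draft-side gradient step is the part that still costs compute.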

If this is right

Acceptance lengths rise by up to 72 percent over current state-of-the-art speculators on the tested models.
Average gains reach 41 percent across Qwen-3, Qwen-3.5, and Llama 3.1 families.
The improvement grows larger as the number of generated tokens increases.
Speculative decoding retains useful speedup on long-response tasks where it previously provided almost none.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verification-driven update idea could be applied to other inference-time components that drift from their training distribution during extended outputs.
Choosing different frequencies or learning rates for the online updates might further improve the speed-accuracy trade-off.
This form of test-time adaptation may become standard for any draft model used in production systems that handle variable-length responses.

Load-bearing premise

The verification steps already provide enough training signal to improve the draft model's accuracy over time without adding net latency or causing instability in the updates.

What would settle it

Running TTS on a long-generation benchmark and observing that acceptance length still drops to near one after several thousand tokens, or that the added update steps increase total latency, would falsify the claim.
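The first half of that test can be made operational by grouping rounds according to where in the output they started and comparing the mean acceptance length per group. The sketch below assumes a log of tokens committed per round is available; the function name `acceptance_by_position`, the bucket size, and the sample log are all hypothetical.

```python
from typing import Dict, List


def acceptance_by_position(round_lengths: List[int],
                           bucket_size: int) -> Dict[int, float]:
    """Mean tokens committed per round, keyed by the bucket's starting position."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    sums: Dict[int, int] = {}
    counts: Dict[int, int] = {}
    position = 0  # output position at which the current round starts
    for committed in round_lengths:
        bucket = position // bucket_size
        sums[bucket] = sums.get(bucket, 0) + committed
        counts[bucket] = counts.get(bucket, 0) + 1
        position += committed
    return {b * bucket_size: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical log: long early rounds, short late rounds.
curve = acceptance_by_position([4, 4, 2, 1, 1], bucket_size=8)
print(curve)
```

A curve whose later buckets stay near 1 with adaptation enabled would count against the claim; the latency half of the test needs separate wall-clock measurements.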

Figures

Figures reproduced from arXiv: 2605.09329 by Avinash Kumar, Poulami Das, Sujay Sanghavi.

**Figure 2.** Acceptance Length of four tasks using (a) DFlash, (b) EAGLE-3, and (c) PARD speculators

**Figure 3.** Acceptance Length of four tasks using DFlash speculator on (a) Qwen3.5-35B, (b) Qwen3.6-…

**Figure 4.** Distribution entropy (in nats) for Llama3.1-8B (target) with EAGLE-3 (draft). (a) Target…

**Figure 5.** Acceptance Length of TTS versus DFlash for (a) AIME 2024 and (b) LiveCodeBench on…

**Figure 6.** (a) Acceptance Length (AL) of TTS on Qwen3-8B with optimization steps per round (…

**Figure 7.** Execution timeline of TTS with strided updates and asynchronous pipelining. Every…

Original abstract

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3, and PARD, degrades with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test-time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$, and by $41\%$ on average, with the benefits scaling with increased generation lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TTS uses online updates from target verification to keep speculator acceptance lengths high on long outputs, but the net wall-clock benefit still needs timing data to confirm.

Colleague, the core observation here is that standard speculators trained offline see their acceptance length fall toward 1 after a few thousand tokens, and the proposed fix is to keep adapting the draft model on the fly using the target model's verification signals as supervision. That adaptation is presented as free because the target is already being queried for each draft token. The reported gains (up to 72% better acceptance length and 41% on average across the Qwen and Llama families, with larger gains at longer lengths) are the main empirical claim. If the updates really add negligible overhead, this would be a practical way to make speculative decoding usable for the long responses that matter in deployment.

The insight that verification already supplies a training signal is straightforward and worth testing. What is less clear is exactly how the updates are implemented (full gradients or LoRA, how often they run, what optimizer state is kept) and whether any of that work is measured in the wall-clock numbers. The abstract gives acceptance-length improvements but no breakdown separating verification cost from adaptation cost, so the summary cannot show whether the 41% average gain survives as a net speedup once update overhead is counted. For moderate-length generations the cumulative cost of even small updates could offset the acceptance gain before the scaling benefit kicks in. The experiments span several model families, which helps, but without error bars, controls for sequence length distribution, or explicit timing tables the strength of the result is hard to judge from the summary alone.

This work is aimed at people who care about inference latency and cost for long generations rather than at core model training. A reader who already runs speculative decoding pipelines would find the idea worth trying, provided the full paper supplies the missing implementation and measurement details. I would send it to peer review so the authors can add those controls and timings; the underlying problem is real and the direction is reasonable even if the current evidence is still preliminary.

Referee Report

2 major / 2 minor

Summary. The paper claims that acceptance lengths in speculative decoding degrade with longer generation sequences because speculators are trained offline on short sequences. It proposes Test-Time Speculation (TTS), an online distillation method that adapts the draft model using target model verification signals during inference at no additional cost. Experiments on the Qwen-3, Qwen-3.5, and Llama3.1 families show acceptance length improvements of up to 72%, and 41% on average, with benefits scaling as generation length increases.

Significance. If the no-cost adaptation claim holds and delivers net wall-clock gains, TTS could extend speculative decoding to long-form tasks where current methods fail. The multi-model family evaluation is a strength, but missing experimental details limit assessment of practical impact.

major comments (2)

[Abstract] The claim that the verification step supplies the training signal 'at no additional cost' is load-bearing for the net speedup argument. Any online update (gradient step, LoRA, or optimizer) on the draft model requires forward/backward passes beyond the target verification pass; without a timing or FLOPs breakdown separating these, it is unclear whether the reported acceptance-length gains already net out adaptation overhead or apply only to long generations.
[Results] The stated 41% average and up to 72% gains, plus the scaling-with-length claim, are presented without tables, error bars, run counts, or exact generation-length ranges. This prevents verification that improvements are robust rather than post-hoc and directly affects the central empirical contribution.

minor comments (2)

[Abstract] 'speculators' is used without explicit definition on first occurrence; clarify whether it denotes the draft model, the full algorithm, or both.
Consider adding a plot of acceptance length versus output length for TTS versus baselines to make the scaling observation visually concrete.
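The first major comment can be made concrete with a back-of-the-envelope cost model. This is a simplified sketch, not a measurement and not the paper's analysis: costs are expressed in units of one target forward pass, drafting is assumed to cost `k * draft_cost` per round, and the draft update is assumed to run synchronously once every `stride` rounds. Figure 7's caption mentions strided updates and asynchronous pipelining; overlap is not modelled here, so the estimate is pessimistic on that point. The function names and all numeric inputs except the 41% average gain are hypothetical.

```python
def net_speedup(accept_len: float, k: int, draft_cost: float,
                update_cost: float = 0.0, stride: int = 1) -> float:
    """Speedup over plain autoregressive decoding under the assumed cost model.

    Each round pays 1 for verification, k * draft_cost for drafting, and
    update_cost once every `stride` rounds for adapting the draft.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    round_cost = 1.0 + k * draft_cost + update_cost / stride
    return accept_len / round_cost


def break_even_update_cost(base_len: float, adapted_len: float, k: int,
                           draft_cost: float, stride: int = 1) -> float:
    """Per-update cost at which the adapted draft only ties the fixed draft."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return stride * (1.0 + k * draft_cost) * (adapted_len / base_len - 1.0)


# Hypothetical inputs: baseline acceptance length 2.0, the summary's 41%
# average gain (2.0 * 1.41 = 2.82), k = 4 draft tokens, draft_cost = 0.05.
fixed = net_speedup(2.0, k=4, draft_cost=0.05)
adapted_every_round = net_speedup(2.82, k=4, draft_cost=0.05, update_cost=0.3)
adapted_strided = net_speedup(2.82, k=4, draft_cost=0.05, update_cost=0.3, stride=4)
limit = break_even_update_cost(2.0, 2.82, k=4, draft_cost=0.05)
print(fixed, adapted_every_round, adapted_strided, limit)
```

Under these assumed numbers a 41% acceptance-length gain is cancelled once each per-round update costs about half a target pass, which is why the requested timing breakdown matters; a larger stride raises that limit proportionally.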

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the empirical support and cost analysis without overstating our current results.

Referee: [Abstract] The claim that the verification step supplies the training signal 'at no additional cost' is load-bearing for the net speedup argument. Any online update (gradient step, LoRA, or optimizer) on the draft model requires forward/backward passes beyond the target verification pass; without a timing or FLOPs breakdown separating these, it is unclear whether the reported acceptance-length gains already net out adaptation overhead or apply only to long generations.

Authors: We appreciate the referee highlighting the importance of net wall-clock impact. The target verification pass is performed in any case during speculative decoding and supplies the necessary logits for the distillation signal. However, the subsequent gradient-based update on the draft model does require additional forward and backward computation on the smaller draft model. We agree that a quantitative breakdown is needed to substantiate the net-gain claim. In the revision we will add a dedicated subsection with measured wall-clock overhead and FLOPs for the adaptation steps across generation lengths, showing that the overhead remains small relative to target-model savings and that net speedup improves with sequence length.

revision: yes

Referee: [Results] The stated 41% average and up to 72% gains, plus the scaling-with-length claim, are presented without tables, error bars, run counts, or exact generation-length ranges. This prevents verification that improvements are robust rather than post-hoc and directly affects the central empirical contribution.

Authors: We agree that the current presentation lacks sufficient detail for independent verification. The revised manuscript will include full tables of acceptance lengths for each model family and generation-length bucket, report means and standard deviations over five independent runs, and explicitly state the tested output-length ranges (512–4096 tokens). We will also add a plot of acceptance length versus generation length to illustrate the scaling trend with error bands.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical online adaptation with independent experimental support

The paper describes TTS as an online distillation process that reuses the existing target-model verification step for draft adaptation, with performance gains demonstrated through direct experiments on acceptance lengths across Qwen and Llama families. No equations, derivations, or fitted parameters are shown that reduce the claimed improvements to self-referential quantities by construction. The method is presented as an empirical procedure whose benefits are measured externally rather than defined into existence, and no load-bearing self-citations or uniqueness theorems are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that verification calls supply usable training signals for effective online updates without hidden costs or degradation.

axioms (1)

domain assumption: The token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost.
Explicitly stated as the key insight enabling zero-cost adaptation.

pith-pipeline@v0.9.0 · 5807 in / 1307 out tokens · 48062 ms · 2026-05-20T22:49:51.137412+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "TTS improves acceptance lengths over state-of-the-art speculators by up to 72% and 41% on average, with the benefits scaling with increased generation lengths."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
