Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Aram Galstyan; Bhavana Ganesh; Jinwoo Shin; Sai Muralidhar Jayanthi; Saket Dingliwal; Sravan Babu Bodapati; Woomin Song

arxiv: 2506.04708 · v3 · pith:WKOWY7ILnew · submitted 2025-06-05 · 💻 cs.CL

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Pith reviewed 2026-05-22 12:29 UTC · model grok-4.3

keywords: speculative decoding, test-time scaling, language models, reasoning, N-gram, inference acceleration, model-free

The pith

STAND reduces inference latency by 60-65% on reasoning tasks using model-free speculative sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes STAND as a way to accelerate test-time scaling for language models on hard reasoning problems. It works by noticing that reasoning steps often repeat similar patterns and using that to draft possible next tokens with a simple N-gram memory instead of running a full separate model. This allows more tokens to be accepted per forward pass through the main model, cutting the total time needed for long reasoning chains by more than half without changing the final answer quality. The result is a plug-in accelerator that works on any existing model for tasks like math competitions and science questions.

Core claim

STAND is a stochastic adaptive N-gram drafting method that exploits redundancy in reasoning trajectories. By storing logit information in an N-gram module and using stochastic drafting with Gumbel-Top-K sampling and data-driven trees, it achieves higher acceptance rates. This leads to 60-65% lower latency than standard autoregressive decoding on benchmarks including AIME-2024, GPQA-Diamond, and LiveCodeBench, while matching accuracy and beating other speculative methods in multiple inference modes.
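
As a rough illustration of the Gumbel-Top-K sampling mentioned above: adding independent Gumbel noise to each logit and keeping the k largest perturbed values yields k distinct tokens drawn without replacement in proportion to the softmax of the logits. The sketch below uses only the Python standard library. The function name `gumbel_top_k` and its signature are assumptions made for illustration, and the paper's optimized variant is not reproduced here.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Return k distinct indices sampled without replacement from softmax(logits).

    Illustrative helper (not from the paper): perturb each logit with
    Gumbel(0, 1) noise and keep the indices of the k largest perturbed values.
    """
    perturbed = []
    for idx, logit in enumerate(logits):
        # random() may return exactly 0.0, so clamp before taking logs.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, idx))
    perturbed.sort(reverse=True)
    return [idx for _, idx in perturbed[:k]]


# Example: draw 3 distinct draft candidates from a 5-token vocabulary.
candidates = gumbel_top_k([2.0, 1.0, 0.5, 0.1, -1.0], 3, random.Random(0))
```

Because the k candidates are distinct by construction, sibling branches of a draft tree never waste a slot on the same token.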

What carries the argument

The memory-efficient logit-based N-gram module combined with stochastic adaptive drafting, which predicts and verifies tokens from observed reasoning patterns without a separate draft model.
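
One way to picture such a module is a lookup table that maps a short token context to an accumulated next-token distribution, from which drafts are sampled rather than chosen greedily. The following is a minimal sketch under stated assumptions: the class `NGramDraftTable`, its sparse dictionary storage, and the summing of probabilities are hypothetical choices for illustration. The summary above does not specify the paper's actual memory layout.

```python
import random
from collections import defaultdict


class NGramDraftTable:
    """Hypothetical table: (order - 1)-token context -> accumulated next-token mass."""

    def __init__(self, order=3):
        if order < 2:
            raise ValueError("order must be at least 2")
        self.order = order
        self.table = defaultdict(lambda: defaultdict(float))

    def _context(self, tokens):
        # The last (order - 1) tokens form the lookup key.
        return tuple(tokens[-(self.order - 1):])

    def update(self, tokens, next_token_probs):
        """Add the target model's sparse next-token probabilities for this context."""
        context = self._context(tokens)
        for token, prob in next_token_probs.items():
            self.table[context][token] += prob

    def draft(self, tokens, rng):
        """Sample one draft token for the current context, or None on a miss."""
        dist = self.table.get(self._context(tokens))
        if not dist:
            return None
        candidates = list(dist.keys())
        weights = list(dist.values())
        return rng.choices(candidates, weights=weights, k=1)[0]


table = NGramDraftTable(order=3)
table.update([5, 7, 9], {11: 0.9, 12: 0.1})
drafted = table.draft([0, 7, 9], random.Random(0))  # context (7, 9) was seen
```

Storing probabilities instead of only the most frequent continuation is what allows the drafter to sample, which is the "stochastic" part of the method as summarized here.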

If this is right

STAND cuts inference latency by 60-65% while keeping accuracy the same as full autoregressive decoding.
It works better than current speculative decoding techniques for single path, batched, and tree search decoding.
The approach requires no extra training and applies directly to any language model.
Test-time scaling methods become more efficient, allowing deeper reasoning searches under the same compute budget.
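
For readers who think in speedup factors rather than latency percentages, a latency reduction r corresponds to a speedup of 1 / (1 - r). The helper below is only arithmetic on the headline range, and the function name is an illustrative choice.

```python
def speedup_from_latency_reduction(reduction):
    """Convert a fractional latency reduction (e.g. 0.60) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


low = speedup_from_latency_reduction(0.60)   # about 2.5x
high = speedup_from_latency_reduction(0.65)  # about 2.86x
```

So the reported 60-65% latency reduction is equivalent to roughly a 2.5x to 2.9x speedup over autoregressive decoding.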

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If reasoning patterns are reusable across different models, STAND could transfer between models without retuning.
Extending the N-gram memory to capture longer contexts might further increase acceptance rates on complex tasks.
Combining STAND with quantization or other optimizations could yield even larger speedups in practice.

Load-bearing premise

Reasoning trajectories contain enough repeated patterns that an N-gram predictor can reliably draft correct tokens at high rates.

What would settle it

Measuring the token acceptance rate on a new set of reasoning problems and finding it no higher than that of existing speculative decoding baselines would undercut the claimed efficiency gain.
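
To make that test concrete, here is a sketch of how a per-token acceptance rate could be computed. It assumes the standard speculative sampling rule, in which a drafted token x is accepted with probability min(1, p(x) / q(x)) for target distribution p and draft distribution q, so the expected acceptance is the sum over tokens of min(p(x), q(x)). Whether STAND's verification step matches this rule exactly is not confirmed by the summary above, and both function names are illustrative.

```python
def expected_acceptance(target_probs, draft_probs):
    """Expected single-token acceptance under standard speculative sampling.

    Both arguments are sparse dicts mapping token -> probability.
    """
    tokens = set(target_probs) | set(draft_probs)
    return sum(
        min(target_probs.get(t, 0.0), draft_probs.get(t, 0.0)) for t in tokens
    )


def empirical_acceptance(accepted_flags):
    """Fraction of drafted tokens that the target model accepted."""
    if not accepted_flags:
        raise ValueError("no drafted tokens to score")
    return sum(1 for flag in accepted_flags if flag) / len(accepted_flags)


# A deterministic draft that always proposes "a" against a 50/50 target.
half = expected_acceptance({"a": 0.5, "b": 0.5}, {"a": 1.0})
```

A draft distribution identical to the target gives an expected acceptance of 1, while a greedy draft against a flat target gives a much lower value.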

Original abstract

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STAND shows a workable model-free n-gram drafting trick that cuts reasoning latency by 60-65% on the tested benchmarks, but the key reuse assumption gets little direct measurement.

STAND is a model-free speculative decoding method that drafts tokens with stochastic adaptive n-grams plus a logit memory module and data-driven trees. The headline result is a 60-65% latency drop versus plain autoregressive decoding on AIME-2024, GPQA-Diamond, and LiveCodeBench while accuracy stays flat, and it beats some existing speculative baselines in single-trajectory, batch, and tree-search settings. No extra draft model or training is required, so it is genuinely plug-and-play for any existing LM.

Referee Report

2 major / 2 minor

Summary. The paper introduces STAND (STochastic Adaptive N-gram Drafting), a model-free speculative decoding method for accelerating test-time scaling in language model reasoning. It exploits redundancy in reasoning trajectories via stochastic n-gram drafting, a logit-based memory-efficient module, Gumbel-Top-K sampling, and data-driven tree construction. Evaluations on AIME-2024, GPQA-Diamond, and LiveCodeBench claim 60-65% latency reduction versus autoregressive decoding with no accuracy loss, plus consistent outperformance of prior speculative methods across single-trajectory, batch, and tree-search settings, all without training or draft models.

Significance. If the results hold, the work offers a practical, training-free plug-and-play acceleration for test-time compute methods such as best-of-N and tree search. The model-free design is a clear strength, broadening applicability to any existing LM. However, the core n-gram reuse assumption is not quantified directly and ablation detail is limited, which weakens the empirical claims relative to work that reports acceptance rates or avoids tunable parameters.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The 60-65% latency reduction and outperformance claims rest on the premise that reasoning paths reuse similar n-gram patterns, yet no direct measurements (n-gram hit rates, acceptance statistics, or sensitivity to n-gram order/memory size) are reported. The listed free parameters (n-gram order, stochastic drafting temperature) are not ablated, leaving open whether the speedup would persist on out-of-distribution reasoning steps where reuse is weaker.
[§4] §4 (Experimental Results): Latency and accuracy figures are presented without error bars, exact per-benchmark acceptance-rate tables, or variance across runs. This makes it difficult to evaluate the reliability of the consistent outperformance versus state-of-the-art speculative decoding baselines in single-trajectory, batch, and tree-search regimes.
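
The n-gram hit rate requested in the first major comment could be measured along the following lines. This is a sketch under an assumed definition: a position counts as a hit when the n-gram ending at that position has already appeared earlier in the same token sequence. The paper may define reuse differently, and `ngram_hit_rate` is a hypothetical name.

```python
def ngram_hit_rate(tokens, order):
    """Fraction of n-grams in `tokens` that were already seen earlier in the sequence."""
    if order < 1:
        raise ValueError("order must be at least 1")
    seen = set()
    hits = 0
    total = 0
    for end in range(order - 1, len(tokens)):
        gram = tuple(tokens[end - order + 1 : end + 1])
        total += 1
        if gram in seen:
            hits += 1
        seen.add(gram)
    return hits / total if total else 0.0


# Toy sequence with one repeated pattern; token ids are arbitrary.
rate_bigram = ngram_hit_rate([1, 2, 3, 1, 2, 3], order=2)
rate_trigram = ngram_hit_rate([1, 2, 3, 1, 2, 3], order=3)
```

Sweeping `order` over a range of values on real reasoning traces would give the sensitivity curve the referee asks for.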

minor comments (2)

[Method] Clarify in the method section how the logit-based N-gram module exactly preserves probabilistic information and interfaces with Gumbel-Top-K sampling; the current description is high-level.
[Related Work] Add a reference to prior n-gram speculative decoding work in the related-work section to better situate the model-free contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the empirical support for our claims. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract and §4] Abstract and §4 (Experiments): The 60-65% latency reduction and outperformance claims rest on the premise that reasoning paths reuse similar n-gram patterns, yet no direct measurements (n-gram hit rates, acceptance statistics, or sensitivity to n-gram order/memory size) are reported. The listed free parameters (n-gram order, stochastic drafting temperature) are not ablated, leaving open whether the speedup would persist on out-of-distribution reasoning steps where reuse is weaker.

Authors: We agree that direct quantification of the n-gram reuse assumption would strengthen the paper. In the revision we will add a dedicated analysis section (or appendix) reporting n-gram hit rates, acceptance statistics, and sensitivity to n-gram order and memory size across the evaluated benchmarks. We will also include ablations on n-gram order and stochastic drafting temperature, with discussion of behavior on reasoning steps exhibiting weaker pattern reuse. revision: yes

Referee: [§4] §4 (Experimental Results): Latency and accuracy figures are presented without error bars, exact per-benchmark acceptance-rate tables, or variance across runs. This makes it difficult to evaluate the reliability of the consistent outperformance versus state-of-the-art speculative decoding baselines in single-trajectory, batch, and tree-search regimes.

Authors: We acknowledge that reporting variance and detailed acceptance rates improves interpretability. In the revised version we will rerun key experiments across multiple random seeds, add error bars to latency and accuracy plots, and include exact per-benchmark tables for acceptance rates and speedups in both the main text and appendix. revision: yes
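
The multi-seed reporting promised in this response amounts to a mean and sample standard deviation per configuration. A minimal sketch with Python's `statistics` module follows. The latency values are placeholders chosen for illustration and are not measurements from the paper, and `summarize_runs` is a hypothetical helper.

```python
import statistics


def summarize_runs(latencies_s):
    """Return (mean, sample standard deviation) of per-seed latencies in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(latencies_s), statistics.stdev(latencies_s)


# Placeholder values for illustration only; not results from the paper.
mean_s, std_s = summarize_runs([10.0, 12.0, 14.0])
print(f"{mean_s:.1f} s +/- {std_s:.1f} s")  # prints "12.0 s +/- 2.0 s"
```

Reporting the spread alongside each mean would let readers judge whether the gaps between STAND and the baselines exceed run-to-run noise.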

Circularity Check

0 steps flagged

No significant circularity; empirical results on public benchmarks

The paper introduces STAND as a model-free speculative decoding technique that exploits observed redundancy in reasoning trajectories for n-gram based drafting. Latency reductions of 60-65% and outperformance claims are presented as direct empirical measurements across AIME-2024, GPQA-Diamond, and LiveCodeBench rather than quantities derived from equations or parameters that reduce to the method's own inputs by construction. The enabling observation about reusable patterns is framed as an analysis result supporting the approach, with no self-citations, fitted inputs renamed as predictions, or uniqueness theorems invoked in a load-bearing way. The derivation chain remains self-contained and externally falsifiable through independent benchmark runs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on one domain assumption about pattern reuse in reasoning and introduces a small number of implementation choices whose exact values are not enumerated in the abstract.

free parameters (2)

n-gram order and memory size
Size of the adaptive n-gram module and logit store must be chosen; treated as design parameters.
stochastic drafting temperature or acceptance parameters
Parameters controlling randomness and tree construction are data-driven but still require selection.

axioms (1)

Domain assumption: reasoning paths frequently reuse similar reasoning patterns
Invoked as the enabling premise for model-free token prediction without draft models.

pith-pipeline@v0.9.0 · 5785 in / 1121 out tokens · 40793 ms · 2026-05-22T12:29:03.304482+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.