arxiv: 2604.15672 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.CL

Faster LLM Inference via Sequential Monte Carlo

Yahya Emara, Mauricio Barba da Costa, Chi-Chih Chang, Cameron Freer, Tim Vieira, Ryan Cotterell, Mohamed S. Abdelfattah

Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3

keywords speculative decoding · sequential Monte Carlo · large language models · inference acceleration · importance sampling · rejection sampling · autoregressive generation

The pith

By reweighting populations of draft particles instead of rejecting mismatches, sequential Monte Carlo speculative decoding accelerates LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces sequential Monte Carlo speculative decoding (SMC-SD) to improve the speed of large language model inference. Standard speculative decoding drafts tokens from a cheap proposal model and rejects the entire draft block at the first mismatch with the target model, which reduces throughput when the models diverge. SMC-SD instead draws multiple draft particles and reweights them with importance sampling, converting verification into a parallel fixed-size operation that avoids truncation and rollback. Because inference is memory-bandwidth bound, the added arithmetic costs little, and the method supplies per-step bounds on approximation error. Experiments report 2.36 times the speed of ordinary speculative decoding and 5.2 times the speed of plain autoregressive decoding while staying within 3 percent accuracy on reasoning, instruction, and coding tasks.

Core claim

SMC-SD replaces the rejection-sampling step of speculative decoding with importance-weighted resampling over a population of draft particles. This change turns verification into a vectorized, fixed-length computation that uses idle arithmetic capacity without any rollback. The procedure remains a principled approximate inference method that preserves theoretical bounds on per-step error while trading exact token matching for higher throughput.

What carries the argument

Importance-weighted resampling over multiple draft particles in sequential Monte Carlo, which replaces token-level rejection to enable parallel, non-truncating verification.
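
A minimal Python sketch of one draft, reweight, resample round may make the mechanism concrete. It is a generic sequential importance resampling step under stated assumptions, not the authors' implementation: the function `smc_step`, the two-token vocabulary, the context-independent toy distributions `Q` and `P`, and multinomial resampling after every round are all hypothetical choices made for illustration.

```python
import math
import random


def smc_step(particles, sample_draft, draft_logp, target_logp, K, rng):
    """Extend every particle by K drafted tokens, weight it, then resample.

    particles: list of token lists (the current population).
    sample_draft(prefix, rng) -> token drawn from the draft model.
    draft_logp / target_logp(prefix, token) -> log-probability under each model.
    """
    extended, log_w = [], []
    for prefix in particles:
        seq, lw = list(prefix), 0.0
        for _ in range(K):
            tok = sample_draft(seq, rng)
            # Importance weight accumulates log(target / draft) per token.
            lw += target_logp(seq, tok) - draft_logp(seq, tok)
            seq.append(tok)
        extended.append(seq)
        log_w.append(lw)

    # Normalize in log space for numerical stability.
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    weights = [w / total for w in unnorm]

    # Multinomial resampling: the population size stays fixed at N.
    resampled = rng.choices(extended, weights=weights, k=len(extended))
    return [list(s) for s in resampled], weights


# Hypothetical toy models over a two-token vocabulary (assumption, not from the paper).
VOCAB = [0, 1]
Q = {0: 0.5, 1: 0.5}   # draft distribution
P = {0: 0.9, 1: 0.1}   # target distribution


def sample_draft(prefix, rng):
    return rng.choices(VOCAB, weights=[Q[t] for t in VOCAB])[0]


def draft_logp(prefix, tok):
    return math.log(Q[tok])


def target_logp(prefix, tok):
    return math.log(P[tok])


rng = random.Random(0)
population, weights = smc_step(
    [[] for _ in range(8)], sample_draft, draft_logp, target_logp, K=4, rng=rng
)
```

In this sketch every particle advances by exactly K tokens regardless of how well draft and target agree, which is the fixed-size property the summary describes; disagreement shows up in the weights rather than in a truncated block.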

If this is right

Verification becomes a vectorized fixed-size operation with no rollback or early truncation of drafts.
The method delivers 2.36 times the throughput of standard speculative decoding and 5.2 times the throughput of autoregressive decoding.
Accuracy stays within 3 percent of the target model on reasoning, instruction-following, and coding benchmarks.
Theoretical per-step approximation error remains bounded even though exact matching is relaxed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same resampling idea could apply to other token-generation settings where strict rejection is costly.
Varying the number of particles dynamically according to observed divergence might improve the speed-accuracy curve further.
On hardware where arithmetic is not nearly free, the relative gains from parallel scoring would need separate measurement.
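
The idea of adapting the particle count could be prototyped with the effective sample size (ESS), a standard SMC diagnostic for weight degeneracy. The sketch below is an assumption-laden illustration of that editorial extension, not something the paper is stated to do: `next_particle_count`, its thresholds `low` and `high`, and the doubling/halving rule are hypothetical.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum(W_i^2) for normalized weights W_i."""
    total = sum(weights)
    norm = [w / total for w in weights]
    return 1.0 / sum(w * w for w in norm)


def next_particle_count(weights, n_min=2, n_max=64, low=0.25, high=0.75):
    """Hypothetical rule: grow N when weights degenerate, shrink when they are flat."""
    n = len(weights)
    frac = effective_sample_size(weights) / n
    if frac < low:       # few particles carry most of the mass: draft diverges
        return min(n_max, 2 * n)
    if frac > high:      # weights nearly uniform: draft tracks the target well
        return max(n_min, n // 2)
    return n
```

A low ESS fraction signals that draft and target disagree on the current context, which is the "observed divergence" the extension refers to.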

Load-bearing premise

LLM inference is memory-bandwidth bound, so the extra arithmetic needed to draft and score multiple particles in parallel adds almost no cost.
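
The premise can be illustrated with a simplified roofline estimate. This is a back-of-the-envelope sketch, not a measurement: it assumes roughly 2 FLOPs per parameter per token, counts only weight traffic (ignoring the KV cache and activations), and uses hypothetical hardware numbers chosen so that the ridge point is 295 FLOPs per byte, echoing the R = 295 in the Figure 3 caption on the assumption that R denotes the ridge point.

```python
def forward_pass_time(tokens, params, peak_flops, mem_bw, bytes_per_param=2):
    """Roofline estimate (seconds) for one forward pass scoring `tokens` positions."""
    flops = 2.0 * params * tokens            # ~2 FLOPs per parameter per token
    bytes_moved = bytes_per_param * params   # weights read once; KV cache ignored
    return max(flops / peak_flops, bytes_moved / mem_bw)


# Hypothetical values for illustration only (ridge = PEAK_FLOPS / MEM_BW = 295).
PEAK_FLOPS = 295e12   # FLOP/s
MEM_BW = 1e12         # bytes/s
PARAMS = 8e9          # an 8B-parameter target model

t_single = forward_pass_time(1, PARAMS, PEAK_FLOPS, MEM_BW)    # plain decoding
t_block = forward_pass_time(64, PARAMS, PEAK_FLOPS, MEM_BW)    # e.g. N * K = 64
t_large = forward_pass_time(590, PARAMS, PEAK_FLOPS, MEM_BW)   # past the ridge
```

Under these assumptions, scoring 64 positions costs the same as scoring one because weight traffic dominates, while 590 positions crosses the ridge and the estimate doubles. Real hardware adds cache, scheduling, and KV-cache effects that this model leaves out, which is why the premise needs measurement.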

What would settle it

A timing measurement on hardware where arithmetic is the bottleneck rather than memory, or an accuracy test showing more than 3 percent drop on the reported benchmarks, would falsify the claimed speedups and fidelity.

Figures

Figures reproduced from arXiv: 2604.15672 by Cameron Freer, Chi-Chih Chang, Mauricio Barba da Costa, Mohamed S. Abdelfattah, Ryan Cotterell, Tim Vieira, Yahya Emara.

**Figure 1.** Speed-up of SMC-SD on a Llama 1B → 70B draft-target pair relative to an autoregressive baseline, optimized tree-based SD (SGLang), and Speculative Speculative Decoding (SSD; Kumar et al., 2026) on the ShareGPT dataset. AR, SGLang SD, and SMC-SD run on 4 H100 GPUs, while SSD runs on 5 H100 GPUs.

**Figure 2.** Our approach (bottom) compared to standard speculative decoding (top). In standard speculative decoding, a draft model generates a single sequence of draft tokens. A target model then performs a verification step, accepting a valid prefix of verified tokens while discarding rejected tokens. SMC-SD maintains a set of N candidate sequences (particles). In each iteration, the draft model extends the N sequenc…

**Figure 3.** Left, middle: Theoretical speed-up of SMC-SD over autoregressive decoding for a Llama 1B → 8B pair as a function of draft length K (left) and number of particles N (middle), with ρ = 1/8, B = 1, and R = 295. Dashed lines mark the ridge point where the target forward pass transitions from memory-bound to compute-bound. In the memory-bound regime (left of the ridge point), increasing K increases the speed-up…

**Figure 4.** Speed–accuracy Pareto frontier on GSM8K (top left), MATH500 (top right), AlpacaEval (bottom left), and DS1000 (bottom right). Blue/solid: Llama 3.2-1B → 3.1-8B; orange/dashed: Qwen (SD: 0.5B→14B; SMC-SD: 3B→14B). Italic labels show (N, K) configurations along the SMC-SD Pareto frontier. Experiments were conducted using a single H100 SXM GPU.

**Figure 5.** Throughput per GPU vs. batch size on ShareGPT for SGLang SD, SSD, and SMC-SD. 1B→8B uses 1 GPU for SGLang SD and SMC-SD, 2 GPUs for SSD. 1B→70B uses 4 H100 GPUs for SGLang SD and SMC-SD, 5 H100 GPUs for SSD. SMC-SD achieves the highest per-GPU throughput across most batch sizes for both model pairs.

**Figure 6.** SMC-SD / SD throughput ratio vs. temperature (left) and raw throughput in tokens/s (right). SMC-SD maintains near-constant TPS while SD degrades as acceptance rates fall, widening the speed-up from ∼1.5× at T=0.2 to ∼3× at T=1.0. On average, SMC-SD sequences incur only a ∼5% increase in NLL under the target model.

**Figure 7.** Power Sampling helps improve Pass@k quality.

Original abstract

Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free -- SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves 2.36x speed-up over speculative decoding and a 5.2x speed-up over autoregressive decoding, while remaining within 3% of the target model's accuracy on reasoning, instruction-following, and coding benchmarks.
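
For contrast, the rejection-based verification that the abstract describes can be sketched as follows. The sketch follows the commonly described accept/reject rule of speculative sampling: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q) and stop. The function name `verify_draft` and the per-position probability dictionaries are illustrative assumptions, and the bonus token usually sampled after a fully accepted block is omitted.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Return the verified tokens; the block is truncated at the first rejection.

    q_probs[i] / p_probs[i]: dicts mapping token -> probability at position i
    under the draft and target model respectively (toy stand-ins for logits).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        ratio = p.get(tok, 0.0) / q[tok]
        if rng.random() < min(1.0, ratio):
            out.append(tok)          # accepted: keep going
            continue
        # Rejected: draw a replacement from the residual max(0, p - q) ...
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        toks = list(residual)
        out.append(rng.choices(toks, weights=[residual[t] for t in toks])[0])
        break                        # ... and discard the rest of the draft
    return out
```

The `break` is the truncation the abstract refers to: the number of tokens gained per target call varies with how closely the draft matches the target, whereas a reweighting scheme keeps the block length fixed.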

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SMC-SD swaps rejection sampling for importance resampling over draft particles, which keeps verification fixed-size and parallel, but the speed claims rest on extra arithmetic staying nearly free in a memory-bound regime.

The core move here is replacing the early truncation in speculative decoding with sequential Monte Carlo resampling. Instead of discarding a whole draft block at the first mismatch, the method keeps a population of particles, reweights them by importance, and resamples. This turns the verification step into a vectorized operation that does not depend on how far the draft matches the target. The paper shows this preserves per-step error bounds and reports 2.36x throughput over standard speculative decoding and 5.2x over plain autoregressive decoding, with accuracy within 3% on reasoning, instruction-following, and coding tasks.

That framing is the actual novelty; prior speculative decoding work stays with rejection sampling, and the SMC lens is not in the cited literature. The empirical results are concrete enough to be worth testing on hardware. The authors avoid introducing new fitted parameters and keep the method grounded in existing SMC and speculative decoding machinery.

The soft spot is the central performance assumption. The abstract states that because LLM inference is memory-bandwidth-bound, the extra drafting and scoring of particles comes nearly for free. If the additional forward passes or weight computations increase cache pressure or scheduler overhead, the wall-clock gains shrink. The provided text does not isolate timing measurements that separate memory traffic from compute time across different particle counts, so the reported speedups depend on an unquantified claim. A reader would want to see those breakdowns before accepting the 2x+ numbers as general.

This paper is for people already running speculative decoding in production or research settings. Anyone tuning serving latency or exploring approximate inference during generation would get direct value from the resampling idea. It is coherent on its own terms and shows clear engagement with the relevant literature, so it deserves a serious referee even if the hardware measurements need tightening.

Referee Report

3 major / 2 minor

Summary. The paper introduces Sequential Monte Carlo Speculative Decoding (SMC-SD), which replaces token-level rejection sampling in speculative decoding with importance-weighted resampling over a population of draft particles. It asserts theoretical per-step approximation error bounds for this scheme and reports empirical speedups of 2.36x over standard speculative decoding and 5.2x over autoregressive decoding, while keeping accuracy within 3% of the target model on reasoning, instruction-following, and coding benchmarks. The central justification is that LLM inference is memory-bandwidth-bound, rendering the extra arithmetic for parallel particle drafting and scoring nearly free and converting verification into a fixed-size vectorized operation without rollback.

Significance. If the speedups and accuracy preservation are confirmed under controlled conditions, the work could provide a practical advance in LLM inference efficiency by exploiting idle compute in bandwidth-limited regimes to avoid rollback penalties. Framing the method as a principled SMC-based approximate inference procedure offers a theoretically motivated alternative to heuristic rejection sampling, with potential extensions to other sequential generation settings. The emphasis on hardware-aware design is a constructive contribution.

major comments (3)

Abstract: The assertion that 'the arithmetic needed to draft particles and to score them in parallel comes nearly for free' because inference is memory-bandwidth-bound is load-bearing for both the 2.36x and 5.2x speedup claims, yet no measurements isolating compute versus memory time, cache effects, or scheduler overhead for N > 1 particles are provided to quantify or validate the assumption.
Abstract: Theoretical per-step approximation error bounds are asserted without any derivation, explicit statement of the bounds, or proof outline, which is required to support the claim that SMC-SD is a 'principled approximate inference scheme' trading exactness for speed.
Abstract: The reported speedups (2.36x over SD, 5.2x over AR) and accuracy claim ('within 3%') lack any reference to experimental controls, number of trials, error bars, specific models, batch sizes, or benchmark datasets, making it impossible to assess whether the central performance results hold under the stated conditions.

minor comments (2)

The abstract does not name the specific benchmarks or tasks used for the reasoning, instruction-following, and coding evaluations.
A short description of how importance weights are computed and normalized across particles would improve clarity of the SMC procedure.
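
For orientation, the textbook form of self-normalized importance weights in this setting is given below. It is the generic SMC expression, written under the assumption that each of the N particles drafts a block of K tokens from the draft model q and is scored by the target model p given a shared context c; the reviewed text does not reproduce the paper's own formula, so the symbols here are illustrative.

```latex
% Unnormalized weight of particle i after drafting K tokens x^{(i)}_{1:K} from q:
w^{(i)} \;=\; \prod_{k=1}^{K}
  \frac{p\!\left(x^{(i)}_{k} \mid c,\, x^{(i)}_{<k}\right)}
       {q\!\left(x^{(i)}_{k} \mid c,\, x^{(i)}_{<k}\right)},
\qquad i = 1, \dots, N.

% Self-normalized weights used for resampling:
W^{(i)} \;=\; \frac{w^{(i)}}{\sum_{j=1}^{N} w^{(j)}},
\qquad \sum_{i=1}^{N} W^{(i)} = 1.
```

Resampling then draws N particles with replacement according to the normalized weights, after which the weights are reset to uniform.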

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the presentation of our claims.

Referee: The assertion that 'the arithmetic needed to draft particles and to score them in parallel comes nearly for free' because inference is memory-bandwidth-bound is load-bearing for both the 2.36x and 5.2x speedup claims, yet no measurements isolating compute versus memory time, cache effects, or scheduler overhead for N > 1 particles are provided to quantify or validate the assumption.

Authors: We agree that quantitative profiling would strengthen the central hardware-aware justification. In the revised manuscript we will add hardware-level timing breakdowns (compute vs. memory bandwidth) for varying particle counts N, together with cache-miss statistics and scheduler overhead measurements on the evaluation platform. These data will directly support the claim that the additional arithmetic is nearly free under the stated conditions.

revision: yes

Referee: Theoretical per-step approximation error bounds are asserted without any derivation, explicit statement of the bounds, or proof outline, which is required to support the claim that SMC-SD is a 'principled approximate inference scheme' trading exactness for speed.

Authors: We acknowledge that the abstract currently asserts the existence of per-step bounds without stating them or outlining the derivation. We will revise the abstract to include an explicit statement of the bound and a concise proof sketch, while ensuring the full derivation remains in Section 3. This change will make the principled nature of the approximation immediately verifiable from the abstract.

revision: yes

Referee: The reported speedups (2.36x over SD, 5.2x over AR) and accuracy claim ('within 3%') lack any reference to experimental controls, number of trials, error bars, specific models, batch sizes, or benchmark datasets, making it impossible to assess whether the central performance results hold under the stated conditions.

Authors: The full experimental protocol (models, batch sizes, benchmarks, number of independent trials, and error bars) is reported in Section 4. To address the abstract-level concern we will add a short clause referencing the evaluation setup and pointing to the detailed controls in the experiments section, while respecting abstract length constraints.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces SMC-SD by combining standard speculative decoding with sequential Monte Carlo resampling and importance weighting. The central performance argument rests on the external hardware claim that LLM inference is memory-bandwidth-bound so that extra parallel arithmetic is nearly free, but this is presented as an architectural observation rather than a quantity derived from or defined in terms of the method's own equations. No load-bearing step reduces by construction to a fitted parameter, a self-referential definition, or a self-citation chain; the claimed approximation-error bounds are stated to follow from existing SMC theory without redefinition inside the paper. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard sequential Monte Carlo assumptions for approximate inference and the memory-bandwidth-bound characterization of LLM inference; no free parameters or new entities are introduced in the abstract.

axioms (2)

standard math: Sequential Monte Carlo provides a principled approximate inference scheme with bounded per-step error. Invoked to justify trading exactness for speed while preserving theoretical guarantees.
domain assumption: LLM inference is memory-bandwidth-bound, so parallel particle scoring incurs negligible extra latency. Used to claim that the added arithmetic comes nearly for free.

pith-pipeline@v0.9.0 · 5516 in / 1267 out tokens · 58378 ms · 2026-05-10T08:33:47.153268+00:00 · methodology
