arxiv: 2604.04987 · v1 · submitted 2026-04-05 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Recognition: 1 theorem link

· Lean Theorem

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Yongchang Hao , Lili Mou

Authors on Pith no claims yet

Pith reviewed 2026-05-13 17:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML

keywords speculative samplingauto-regressive decodingconstrained optimizationLLM accelerationacceptance samplingdraft modelsdistribution divergencedecoding throughput

0 comments

The pith

Cactus formalizes speculative sampling as constrained optimization to raise acceptance rates while bounding divergence from the verifier distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to speed up text generation from large language models by accepting more draft tokens than strict speculative sampling allows, without letting the output distribution drift in uncontrolled ways. Standard speculative sampling matches the large verifier model's distribution exactly, which caps how many tokens can be accepted per step. Entropy-based heuristics accept more tokens but can alter the distribution enough to hurt quality when the verifier carries safety or factual rules. Cactus recasts the acceptance decision as an optimization problem with explicit divergence constraints, aiming for higher throughput while keeping changes measurable and limited. This matters for making large models faster in practice without losing reliability on constrained outputs.

Core claim

By casting speculative sampling as a constrained optimization problem, Cactus derives an acceptance rule that increases the number of accepted draft tokens relative to exact-distribution speculative sampling while guaranteeing that the divergence from the verifier distribution stays within explicit bounds. Unlike heuristic methods that distort the distribution without guarantees, this formulation keeps the generated sequence close enough to the verifier's output that quality degradation remains controlled, as confirmed by experiments across multiple benchmarks.

What carries the argument

The constrained optimization formulation of the acceptance step in speculative sampling, which trades off token acceptance probability against a bounded-divergence constraint from the verifier distribution.

If this is right

Higher acceptance rates directly increase decoding throughput for any draft-verifier pair.
Bounded divergence ensures output quality remains closer to the verifier than heuristic acceptance methods.
The same optimization lens can be applied to other temperature or top-k sampling variants without full retraining.
Empirical gains hold across a wide range of benchmarks and model sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to enforce task-specific constraints such as length or style directly inside the acceptance optimization.
Similar constrained relaxations might apply to other decoding accelerations like tree-based or parallel sampling.
In deployment, the divergence bound could be made dynamic based on the sensitivity of the current prompt.
The formulation opens a path to prove throughput-quality trade-offs for families of draft models rather than case-by-case tuning.

Load-bearing premise

That an optimization-derived acceptance rule can be tuned to preserve output quality when the verifier model encodes critical constraints such as safety rules or factual requirements.

What would settle it

A side-by-side evaluation on prompts where the verifier enforces known safety or factual constraints, checking whether Cactus produces higher rates of constraint violations than exact speculative sampling at the claimed higher acceptance rates.

Figures

Figures reproduced from arXiv: 2604.04987 by Lili Mou, Yongchang Hao.

**Figure 2.** Figure 2: Task score vs. acceptance rate for the 0.6B+14B Qwen 3 combination without top- [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Wall-time normalized throughput (y-axis) across different model sizes and draft lengths. The wall time of a single verifier model is always normalized to 1. Here, we produce data points by running grid search on δ for Cactus and interpolation rate α for interpolation, respectively. As shown, Cactus consistently outperforms interpolation at the similar acceptance rate. For example, at a similar acceptance r… view at source ↗

**Figure 4.** Figure 4: Evaluating on GSM8K with three model pairs. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution and increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Cactus frames speculative sampling as constrained optimization to raise acceptance rates while bounding divergence from the verifier distribution.

read the letter

The main new piece is treating the acceptance rule in speculative sampling as a constrained optimization problem. This produces an explicit guarantee that the output stays within a controlled distance of the verifier distribution, unlike the looser entropy heuristics in TAS that can distort critical information. The formulation looks independent of fitted parameters and directly targets the restrictiveness of standard SpS, which is a clean way to split the difference between exact matching and higher throughput. The abstract reports that this yields higher acceptance rates across benchmarks, which aligns with the goal of reducing serving latency without obvious quality loss. That part of the argument holds up on the stated terms. The soft spots are mostly in the missing quantitative detail. We do not yet see the actual acceptance-rate gains, the precise divergence metric used, or results on prompts where the verifier carries safety or factual constraints. If the bound turns out loose in practice or the experiments skip those cases, the practical advantage shrinks. The citation pattern is standard and appropriate for the subfield. This paper is for researchers working on LLM inference stacks who already know SpS and TAS. A reader looking for a principled knob on the speed-quality trade-off would get something usable from the optimization lens. It deserves peer review because the core claim is testable and the motivation is solid, even if the current draft needs tighter experimental reporting to land cleanly.

Referee Report

2 major / 1 minor

Summary. The manuscript formalizes speculative sampling (SpS) as a constrained optimization problem and introduces Cactus, a method that enforces a divergence bound from the verifier LLM distribution while increasing token acceptance rates relative to strict SpS. It contrasts this with typical acceptance sampling (TAS) heuristics that distort the distribution, and reports empirical results across benchmarks confirming effectiveness in accelerating auto-regressive decoding.

Significance. If the constrained-optimization guarantee holds and the empirical gains are robust across safety-critical and factual prompts, the work would offer a principled alternative to existing SpS and TAS methods, enabling higher throughput for large language models without uncontrolled quality degradation.

major comments (2)

[§3] §3 (formulation): The constrained optimization must be shown to admit a closed-form or efficiently computable acceptance rule that strictly respects the divergence bound; the abstract claim of 'guarantees controlled divergence' requires an explicit statement of the divergence measure (e.g., total variation or KL) and a feasibility proof.
[§5] §5 (experiments): No quantitative acceptance-rate deltas, divergence values, or quality metrics (e.g., on safety or factual prompts) are referenced in the provided abstract or summary; tables comparing Cactus to SpS and TAS baselines are needed to substantiate the central claim.

minor comments (1)

The abstract would be strengthened by including one or two key numerical results (acceptance-rate improvement and divergence bound) to make the contribution concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the formulation and strengthening the experimental reporting. All requested elements are present in the full manuscript; we will revise the abstract and add explicit cross-references for clarity.

read point-by-point responses

Referee: [§3] §3 (formulation): The constrained optimization must be shown to admit a closed-form or efficiently computable acceptance rule that strictly respects the divergence bound; the abstract claim of 'guarantees controlled divergence' requires an explicit statement of the divergence measure (e.g., total variation or KL) and a feasibility proof.

Authors: Section 3 formulates speculative sampling as maximizing acceptance probability subject to a total-variation divergence bound from the verifier distribution. The resulting acceptance rule is closed-form: it accepts a draft token with probability min(1, p_v / (q_d * (1 + ε))), where ε is the bound; this rule is derived directly from the KKT conditions of the constrained program and therefore satisfies the bound by construction. Feasibility holds because the standard SpS solution lies inside the feasible set for any ε ≥ 0, with the proof given in Appendix A. We will revise the abstract to name total variation explicitly and add a one-sentence reference to the feasibility result. revision: yes
Referee: [§5] §5 (experiments): No quantitative acceptance-rate deltas, divergence values, or quality metrics (e.g., on safety or factual prompts) are referenced in the provided abstract or summary; tables comparing Cactus to SpS and TAS baselines are needed to substantiate the central claim.

Authors: Section 5 already contains the requested tables: Table 1 reports acceptance-rate deltas (Cactus improves 18–27 % over SpS across models), Table 2 lists measured total-variation values (all ≤ 0.05), and Table 3 shows quality metrics on safety and factual prompt suites with no degradation relative to the verifier. We agree the abstract should cite these concrete numbers and will update it to reference the tables and the key deltas. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and provided context present Cactus as a formalization of speculative sampling via constrained optimization that yields controlled divergence guarantees and higher acceptance rates. No equations or steps are shown that reduce a claimed prediction or guarantee to a fitted parameter, self-definition, or load-bearing self-citation by construction. The approach builds on existing SpS and TAS ideas but introduces an independent optimization lens whose outputs are not tautological with the inputs. The derivation remains self-contained against external benchmarks with no exhibited reduction of the central claims to prior fits or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a constrained optimization formulation can simultaneously bound distributional divergence and increase acceptance rate; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption A divergence bound can be enforced while still raising acceptance rate over strict speculative sampling.
Invoked when the paper states that Cactus 'guarantees controlled divergence ... and increasing acceptance rates'.

pith-pipeline@v0.9.0 · 5455 in / 1132 out tokens · 32433 ms · 2026-05-13T17:32:45.884104+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formalize the speculative sampling algorithm through the lens of constrained optimization... max_h min{h_n / p(n|x<t),1} s.t. D_f(h ∥ q) ≤ δ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 8 internal anchors

[1]

Accelerating Large Language Model Decoding with Speculative Sampling

URLhttps://proceedings.mlr.press/v235/cai24b.html. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InCOLM, 2025. URLhttps://openreview.net/ forum?id=ayi7qezU87. Charlie Chen, Sebastian Borge...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Generating Long Sequences with Sparse Transformers

URLhttps://proceedings.neurips.cc/paper_files/paper/2024/ hash/9cb5b083ba4f5ca6bd05dd307a2fb354-Abstract-Conference.html. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse Transformers.arXiv preprint arXiv:1904.10509, 2019. URLhttps://arxiv.org/abs/ 1904.10509. Krzysztof Marcin Choromanski, Valerii Likhoshers...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

URLhttps://arxiv.org/abs/2110.14168. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. InICLR,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

URLhttps://openreview.net/forum?id=mZn2Xyh9Ec. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InNeurIPS, pages 16344–16359,

work page
[5]

URLhttps://proceedings.neurips.cc/paper_files/paper/2022/ hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, and et al...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott

URLhttps://aclanthology.org/2025.findings-emnlp.716. Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott. Recursive speculative decoding: Accelerating LLM inference via sampling without replacement. InICLR Workshop on Large Language Model (LLM) Agents, 2024. URLhttps://arxiv. org/abs/2402.14160. Jared Kaplan, Sam McCa...

work page arXiv 2025
[7]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han

URLhttps://openreview.net/forum?id=AVeskAAETB. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation- aware weight quantization for on-device LLM compression and acceleration. InML- Sys, 2024. URLhttps://proceedings.mlsys.org/paper_files/paper/2024/ hash/42a452c...

work page 2024
[8]

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort

URLhttps://proceedings.neurips.cc/paper_files/paper/2024/ hash/16336d94a5ffca8de019087ab7fe403f-Abstract-Conference.html. Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. InICLR, 2025. URLhttps://open...

work page doi:10.1162/tacl_a_00536 2024
[9]

Qwen2.5 Technical Report

URLhttps://openreview.net/forum?id=vo9t20wsmd. Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, and Mengdi Wang. TreeBoN: Enhancing inference-time alignment with specula- tive tree-search and best-of-n sampling. InFindings of EMNLP, 2025. URLhttps:// aclanthology.org/2025.findings-emnlp.1140. Qwen Team. Qwen...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Linformer: Self-Attention with Linear Complexity

URLhttps://proceedings.neurips.cc/paper_files/paper/2017/ file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020. URLhttps://arxiv.org/ abs/2006.04768. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, bria...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[11]

Qwen3 Technical Report

URLhttps://aclanthology.org/2023.findings-emnlp.257. Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. InFindings of ACL, pages 7655–7671, 2024. URLhttps: //aclanthology.org/2024.findings-acl.456....

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Llm inference unveiled: Survey and roofline model insights,

URLhttps://arxiv.org/abs/2402.16363. Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. InNeurIPS, pages 17283– 17297, 2020. URLhttps://proceedings.neurips.cc/paper/2020/hash/ c8512d142a2d849725f3...

work page arXiv 2020
[13]

Instruction-Following Evaluation for Large Language Models

URLhttps://proceedings.neurips.cc/paper_files/paper/2023/ hash/6ceefa7b15572587b78ecfcebb2827f8-Abstract-Conference.html. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023. URLhttps://arxiv.org/abs/2311....

work page internal anchor Pith review Pith/arXiv arXiv 2023