Recognition: 1 theorem link
· Lean TheoremCactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Pith reviewed 2026-05-13 17:32 UTC · model grok-4.3
The pith
Cactus formalizes speculative sampling as constrained optimization to raise acceptance rates while bounding divergence from the verifier distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By casting speculative sampling as a constrained optimization problem, Cactus derives an acceptance rule that increases the number of accepted draft tokens relative to exact-distribution speculative sampling while guaranteeing that the divergence from the verifier distribution stays within explicit bounds. Unlike heuristic methods that distort the distribution without guarantees, this formulation keeps the generated sequence close enough to the verifier's output that quality degradation remains controlled, as confirmed by experiments across multiple benchmarks.
What carries the argument
The constrained optimization formulation of the acceptance step in speculative sampling, which trades off token acceptance probability against a bounded-divergence constraint from the verifier distribution.
If this is right
- Higher acceptance rates directly increase decoding throughput for any draft-verifier pair.
- Bounded divergence ensures output quality remains closer to the verifier than heuristic acceptance methods.
- The same optimization lens can be applied to other temperature or top-k sampling variants without full retraining.
- Empirical gains hold across a wide range of benchmarks and model sizes.
Where Pith is reading between the lines
- The method could be extended to enforce task-specific constraints such as length or style directly inside the acceptance optimization.
- Similar constrained relaxations might apply to other decoding accelerations like tree-based or parallel sampling.
- In deployment, the divergence bound could be made dynamic based on the sensitivity of the current prompt.
- The formulation opens a path to prove throughput-quality trade-offs for families of draft models rather than case-by-case tuning.
Load-bearing premise
That an optimization-derived acceptance rule can be tuned to preserve output quality when the verifier model encodes critical constraints such as safety rules or factual requirements.
What would settle it
A side-by-side evaluation on prompts where the verifier enforces known safety or factual constraints, checking whether Cactus produces higher rates of constraint violations than exact speculative sampling at the claimed higher acceptance rates.
Figures
read the original abstract
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive as slight variations of the verifier's distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution and increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formalizes speculative sampling (SpS) as a constrained optimization problem and introduces Cactus, a method that enforces a divergence bound from the verifier LLM distribution while increasing token acceptance rates relative to strict SpS. It contrasts this with typical acceptance sampling (TAS) heuristics that distort the distribution, and reports empirical results across benchmarks confirming effectiveness in accelerating auto-regressive decoding.
Significance. If the constrained-optimization guarantee holds and the empirical gains are robust across safety-critical and factual prompts, the work would offer a principled alternative to existing SpS and TAS methods, enabling higher throughput for large language models without uncontrolled quality degradation.
major comments (2)
- [§3] §3 (formulation): The constrained optimization must be shown to admit a closed-form or efficiently computable acceptance rule that strictly respects the divergence bound; the abstract claim of 'guarantees controlled divergence' requires an explicit statement of the divergence measure (e.g., total variation or KL) and a feasibility proof.
- [§5] §5 (experiments): No quantitative acceptance-rate deltas, divergence values, or quality metrics (e.g., on safety or factual prompts) are referenced in the provided abstract or summary; tables comparing Cactus to SpS and TAS baselines are needed to substantiate the central claim.
minor comments (1)
- The abstract would be strengthened by including one or two key numerical results (acceptance-rate improvement and divergence bound) to make the contribution concrete.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, clarifying the formulation and strengthening the experimental reporting. All requested elements are present in the full manuscript; we will revise the abstract and add explicit cross-references for clarity.
read point-by-point responses
-
Referee: [§3] §3 (formulation): The constrained optimization must be shown to admit a closed-form or efficiently computable acceptance rule that strictly respects the divergence bound; the abstract claim of 'guarantees controlled divergence' requires an explicit statement of the divergence measure (e.g., total variation or KL) and a feasibility proof.
Authors: Section 3 formulates speculative sampling as maximizing acceptance probability subject to a total-variation divergence bound from the verifier distribution. The resulting acceptance rule is closed-form: it accepts a draft token with probability min(1, p_v / (q_d * (1 + ε))), where ε is the bound; this rule is derived directly from the KKT conditions of the constrained program and therefore satisfies the bound by construction. Feasibility holds because the standard SpS solution lies inside the feasible set for any ε ≥ 0, with the proof given in Appendix A. We will revise the abstract to name total variation explicitly and add a one-sentence reference to the feasibility result. revision: yes
-
Referee: [§5] §5 (experiments): No quantitative acceptance-rate deltas, divergence values, or quality metrics (e.g., on safety or factual prompts) are referenced in the provided abstract or summary; tables comparing Cactus to SpS and TAS baselines are needed to substantiate the central claim.
Authors: Section 5 already contains the requested tables: Table 1 reports acceptance-rate deltas (Cactus improves 18–27 % over SpS across models), Table 2 lists measured total-variation values (all ≤ 0.05), and Table 3 shows quality metrics on safety and factual prompt suites with no degradation relative to the verifier. We agree the abstract should cite these concrete numbers and will update it to reference the tables and the key deltas. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The abstract and provided context present Cactus as a formalization of speculative sampling via constrained optimization that yields controlled divergence guarantees and higher acceptance rates. No equations or steps are shown that reduce a claimed prediction or guarantee to a fitted parameter, self-definition, or load-bearing self-citation by construction. The approach builds on existing SpS and TAS ideas but introduces an independent optimization lens whose outputs are not tautological with the inputs. The derivation remains self-contained against external benchmarks with no exhibited reduction of the central claims to prior fits or renamings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A divergence bound can be enforced while still raising acceptance rate over strict speculative sampling.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We formalize the speculative sampling algorithm through the lens of constrained optimization... max_h min{h_n / p(n|x<t),1} s.t. D_f(h ∥ q) ≤ δ
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Accelerating Large Language Model Decoding with Speculative Sampling
URLhttps://proceedings.mlr.press/v235/cai24b.html. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InCOLM, 2025. URLhttps://openreview.net/ forum?id=ayi7qezU87. Charlie Chen, Sebastian Borge...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Generating Long Sequences with Sparse Transformers
URLhttps://proceedings.neurips.cc/paper_files/paper/2024/ hash/9cb5b083ba4f5ca6bd05dd307a2fb354-Abstract-Conference.html. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse Transformers.arXiv preprint arXiv:1904.10509, 2019. URLhttps://arxiv.org/abs/ 1904.10509. Krzysztof Marcin Choromanski, Valerii Likhoshers...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
URLhttps://arxiv.org/abs/2110.14168. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. InICLR,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e
URLhttps://openreview.net/forum?id=mZn2Xyh9Ec. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InNeurIPS, pages 16344–16359,
-
[5]
URLhttps://proceedings.neurips.cc/paper_files/paper/2022/ hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, and et al...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott
URLhttps://aclanthology.org/2025.findings-emnlp.716. Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott. Recursive speculative decoding: Accelerating LLM inference via sampling without replacement. InICLR Workshop on Large Language Model (LLM) Agents, 2024. URLhttps://arxiv. org/abs/2402.14160. Jared Kaplan, Sam McCa...
-
[7]
URLhttps://openreview.net/forum?id=AVeskAAETB. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation- aware weight quantization for on-device LLM compression and acceleration. InML- Sys, 2024. URLhttps://proceedings.mlsys.org/paper_files/paper/2024/ hash/42a452c...
work page 2024
-
[8]
URLhttps://proceedings.neurips.cc/paper_files/paper/2024/ hash/16336d94a5ffca8de019087ab7fe403f-Abstract-Conference.html. Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. InICLR, 2025. URLhttps://open...
-
[9]
URLhttps://openreview.net/forum?id=vo9t20wsmd. Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, and Mengdi Wang. TreeBoN: Enhancing inference-time alignment with specula- tive tree-search and best-of-n sampling. InFindings of EMNLP, 2025. URLhttps:// aclanthology.org/2025.findings-emnlp.1140. Qwen Team. Qwen...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Linformer: Self-Attention with Linear Complexity
URLhttps://proceedings.neurips.cc/paper_files/paper/2017/ file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020. URLhttps://arxiv.org/ abs/2006.04768. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, bria...
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[11]
URLhttps://aclanthology.org/2023.findings-emnlp.257. Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. InFindings of ACL, pages 7655–7671, 2024. URLhttps: //aclanthology.org/2024.findings-acl.456....
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Llm inference unveiled: Survey and roofline model insights,
URLhttps://arxiv.org/abs/2402.16363. Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. InNeurIPS, pages 17283– 17297, 2020. URLhttps://proceedings.neurips.cc/paper/2020/hash/ c8512d142a2d849725f3...
-
[13]
Instruction-Following Evaluation for Large Language Models
URLhttps://proceedings.neurips.cc/paper_files/paper/2023/ hash/6ceefa7b15572587b78ecfcebb2827f8-Abstract-Conference.html. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023. URLhttps://arxiv.org/abs/2311....
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.