MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference
Pith reviewed 2026-06-29 08:53 UTC · model grok-4.3
The pith
MarginGate restores deterministic batch LLM decoding by verifying only low-margin tokens
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MarginGate keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. Across five models on four datasets it restores 100 percent sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56 percent and 15.05 percent verifier trigger rates while reducing LLM-42 latency overhead by 2.23x and 1.99x; on DSR1-Distill-Qwen-7B it reaches determinism at 49.50 percent triggers.
What carries the argument
MarginGate, a sparse verification policy that triggers full checks only when top-1/top-2 logit margins are low and repairs mismatches via K/V column replacement.
If this is right
- Deterministic sequence output is achieved without verifying every token.
- The added latency of verification drops by a factor of about two relative to always-on methods.
- The policy calibrated on MATH500 transfers directly to GSM8K, SharedGPT, and HumanEval.
- Models with denser flip regimes still reach determinism but require higher trigger rates.
Where Pith is reading between the lines
- If logit margins reliably flag instability, the same signal could guide selective recomputation for other sources of non-determinism such as mixed-precision effects.
- Flat K/V perturbations before flips suggest margin monitoring could support preventive cache adjustments ahead of any mismatch.
- Lower verification frequency may make reproducible inference practical at larger batch sizes in production settings.
Load-bearing premise
Low top-1/top-2 logit margins expose essentially all the steps where batch-induced token flips occur.
What would settle it
A dataset or model where a substantial share of token flips occur at high top-1/top-2 margin steps would show the policy misses many mismatches.
Figures
read the original abstract
Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on $0.48\%$ of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, SharedGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MarginGate, a sparse margin-triggered verification policy for achieving batch-invariant deterministic decoding during temperature-zero BF16 LLM inference. It reports that batch-induced token flips are sparse (0.3-1.3% on MATH500 across models), that K/V perturbations stay flat before flips while low top-1/top-2 logit margins correlate with flip risk, and that verifying only low-margin steps restores 100% sequence-level determinism on Llama-3.1-8B, Qwen2.5-14B, and DSR1-Distill-Qwen-7B at trigger rates of 18.56%, 15.05%, and 49.50% respectively, cutting the latency overhead of always-on verification (LLM-42) by roughly 2x. The policy is calibrated on MATH500 and transferred to GSM8K, SharedGPT, and HumanEval.
Significance. If the empirical correlation and transfer results hold, MarginGate would offer a practical, low-overhead route to deterministic batched inference by exploiting the observed sparsity of flips. The multi-model evaluation and cross-dataset transfer constitute a concrete strength of the work.
major comments (2)
- [Abstract and §4 (results)] The abstract and results sections state precise quantitative claims (100% determinism, exact trigger rates, 2.23x/1.99x latency reductions) yet supply no experimental protocol, batch-size distribution, hardware platform, margin-threshold selection procedure, or error analysis. This absence prevents evaluation of whether the reported numbers are reproducible or statistically supported.
- [§3 (policy) and results tables] The policy rests on the claim that low top-1/top-2 margins capture essentially all flip risk. No per-step breakdown, ROC-style analysis, or false-negative rate for the chosen margin threshold is provided, leaving open whether the 100% determinism result is robust or specific to the calibration set.
minor comments (2)
- [§2] Notation for the margin threshold and the exact definition of a 'flip' should be formalized with an equation rather than prose.
- [Figures 2-4] Figure captions should explicitly state the batch sizes and models used in each panel.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater experimental transparency and robustness analysis. We will revise the manuscript to incorporate the requested details and additional supporting analysis while preserving the core empirical claims.
read point-by-point responses
-
Referee: [Abstract and §4 (results)] The abstract and results sections state precise quantitative claims (100% determinism, exact trigger rates, 2.23x/1.99x latency reductions) yet supply no experimental protocol, batch-size distribution, hardware platform, margin-threshold selection procedure, or error analysis. This absence prevents evaluation of whether the reported numbers are reproducible or statistically supported.
Authors: We agree that the current presentation lacks sufficient protocol details for full reproducibility. In the revised manuscript we will add an expanded experimental setup subsection in §4 that specifies: batch-size distributions (powers of two from 1–32), hardware platform (NVIDIA A100 80 GB GPUs, CUDA 12.4, PyTorch 2.4, vLLM backend), margin-threshold selection (grid search over [0.01, 0.20] on MATH500 to minimize trigger rate subject to 100% sequence determinism), and error analysis (standard deviations over five independent runs with different random seeds for batch ordering). These additions will directly support the reported quantitative claims. revision: yes
-
Referee: [§3 (policy) and results tables] The policy rests on the claim that low top-1/top-2 margins capture essentially all flip risk. No per-step breakdown, ROC-style analysis, or false-negative rate for the chosen margin threshold is provided, leaving open whether the 100% determinism result is robust or specific to the calibration set.
Authors: The 100% sequence-level determinism result implies a zero false-negative rate for the chosen threshold on the evaluated models and calibration set. To make this explicit and address robustness, the revision will include (i) a per-step margin histogram contrasting flip versus non-flip steps, (ii) the false-negative rate (reported as zero on MATH500 for the selected threshold), and (iii) a brief trade-off curve of trigger rate versus achieved determinism. These additions will clarify that the policy is not merely calibration-set specific while retaining the observed transfer performance on GSM8K, SharedGPT, and HumanEval. revision: yes
Circularity Check
No significant circularity
full rationale
The paper's central contribution is an empirically derived verification policy: observations of sparse batch-induced flips (0.3-1.3% on MATH500) and correlation with low top-1/top-2 margins are used to trigger verification only on low-margin steps. The policy is calibrated on MATH500 and evaluated for transfer on other datasets, with reported determinism rates and latency reductions measured directly against always-on baselines. No equations, derivations, or self-citations reduce the claimed outcomes to inputs by construction; the method rests on external experimental measurements rather than tautological redefinitions or fitted quantities renamed as predictions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
online" 'onlinestring :=
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[2]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. https://www.usenix.org/conference/osdi24/presentation/agrawal Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve . In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), p...
2024
-
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others. 2021. Evaluating large language models trained on code. arXi...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna : An open-source chatbot impressing GPT-4 with 90\ https://lmsys.org/blog/2023-03-30-vicuna/
2023
-
[6]
Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, and Wei Zhang. 2025 a . https://doi.org/10.48550/arXiv.2508.08438 Selective KV -cache sharing to mitigate timing side-channels in LLM inference . arXiv:2508.08438
-
[7]
Kexin Chu, Zixu Shen, Sheng-Ru Cheng, Dawei Xiang, Ziqin Liu, and Wei Zhang. 2025 b . https://doi.org/10.1109/ICDCS63083.2025.00062 MCaM : Efficient LLM inference with multi-tier KV cache management . In Proceedings of the 45th IEEE International Conference on Distributed Computing Systems, pages 571--581
-
[8]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5--48
1991
-
[10]
En- abling Determinism in LLM Inference with Verified Speculation
Raja Gond, Aditya K. Kamath, Ramachandran Ramjee, and Ashish Panwar. 2026. https://arxiv.org/abs/2601.17768 LLM-42 : Enabling determinism in LLM inference with verified speculation . arXiv:2601.17768
-
[11]
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM : Knowledge distillation of large language models. In International Conference on Learning Representations
2024
-
[12]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and DeepSeek-AI . 2025. https://doi.org/10.1038/s41586-025-09422-z DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning . Nature, 645:633--638
-
[13]
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems ( NeurIPS )
2021
-
[14]
Nicholas J. Higham. 2002. Accuracy and Stability of Numerical Algorithms, 2nd edition. SIAM
2002
-
[15]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention . In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611--626
2023
-
[16]
Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. https://proceedings.mlr.press/v202/leviathan23a.html Fast inference from transformers via speculative decoding . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19274--19286. PMLR
2023
-
[17]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024 a . https://doi.org/10.1145/3651890.3672274 CacheGen : KV cache compression and streaming for fast large language model serving . In Proceedings of the A...
-
[18]
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024 b . https://proceedings.mlr.press/v235/liu24bz.html KIVI : A tuning-free asymmetric 2bit quantization for KV cache . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research...
2024
-
[19]
Meta AI . 2024. The Llama 3 herd of models. https://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
SGLang Project . 2026. Deterministic inference. https://sgl-project.github.io/advanced_features/deterministic_inference.html
2026
-
[21]
Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. 2025. https://doi.org/10.18653/v1/2025.naacl-long.211 The good, the bad, and the greedy: Evaluation of LLM s should not ignore non-determinism . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies ...
-
[22]
Thinking Machines Lab . 2026. batch\_invariant\_ops: Batch-invariant implementations of reduction operations. https://github.com/thinking-machines-lab/batch_invariant_ops
2026
-
[23]
vLLM Project . 2026. Batch invariance. https://docs.vllm.ai/en/v0.19.0/features/batch_invariance/
2026
-
[24]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca : A distributed serving system for Transformer -based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521--538
2022
-
[25]
Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. 2025. Understanding and mitigating numerical sources of nondeterminism in LLM inference. In Advances in Neural Information Processing Systems ( NeurIPS ) . ArXiv:2506.09501
-
[26]
Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, and Zirui Liu. 2025. https://arxiv.org/abs/2511.17826 Deterministic inference across tensor parallel sizes that eliminates training-inference mismatch . arXiv:2511.17826
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. https://doi.org/10.52202/079017-2000 SGLang : Efficient execution of structured language model programs . In Advances in Neural Information Processing Systems, volume 37
-
[28]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin DistServe : Disaggregating prefill and decoding for goodput-optimized large language model serving . In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pag...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.