pith. sign in

arxiv: 2605.30218 · v1 · pith:CHY7RR5Fnew · submitted 2026-05-28 · 💻 cs.LG · cs.PF

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Pith reviewed 2026-06-29 08:53 UTC · model grok-4.3

classification 💻 cs.LG cs.PF
keywords batch-invariant inferencesparse verificationlogit margindeterministic decodingK/V cachetemperature-zero inferenceflip detectionLLM inference
0
0 comments X

The pith

MarginGate restores deterministic batch LLM decoding by verifying only low-margin tokens

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that batch-induced token flips during temperature-zero LLM inference are sparse, appearing on 0.3 to 1.3 percent of decode steps across models and benchmarks. It shows these flips are largely predictable from low top-1 over top-2 logit margins while K/V perturbations stay flat beforehand. MarginGate therefore runs standard BF16 decoding on high-margin steps and applies verification plus K/V repair only on low-margin steps. The approach delivers full sequence-level determinism on Llama-3.1-8B and Qwen2.5-14B at trigger rates of 18.56 and 15.05 percent, cutting the latency cost of always-on verification by roughly half.

Core claim

MarginGate keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. Across five models on four datasets it restores 100 percent sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56 percent and 15.05 percent verifier trigger rates while reducing LLM-42 latency overhead by 2.23x and 1.99x; on DSR1-Distill-Qwen-7B it reaches determinism at 49.50 percent triggers.

What carries the argument

MarginGate, a sparse verification policy that triggers full checks only when top-1/top-2 logit margins are low and repairs mismatches via K/V column replacement.

If this is right

  • Deterministic sequence output is achieved without verifying every token.
  • The added latency of verification drops by a factor of about two relative to always-on methods.
  • The policy calibrated on MATH500 transfers directly to GSM8K, SharedGPT, and HumanEval.
  • Models with denser flip regimes still reach determinism but require higher trigger rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If logit margins reliably flag instability, the same signal could guide selective recomputation for other sources of non-determinism such as mixed-precision effects.
  • Flat K/V perturbations before flips suggest margin monitoring could support preventive cache adjustments ahead of any mismatch.
  • Lower verification frequency may make reproducible inference practical at larger batch sizes in production settings.

Load-bearing premise

Low top-1/top-2 logit margins expose essentially all the steps where batch-induced token flips occur.

What would settle it

A dataset or model where a substantial share of token flips occur at high top-1/top-2 margin steps would show the policy misses many mismatches.

Figures

Figures reproduced from arXiv: 2605.30218 by Kexin Chu, Wei Zhang, Yang Zhou.

Figure 1
Figure 1. Figure 1: Divergence-aligned K/V-cache deviation for [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MarginGate decode step. The BF16 engine produces tentative outputs. The controller accepts high￾margin steps as fast commits. Low-margin steps send a repair request to the deterministic repair engine; verified commits either accept the tentative state or repair the current K/V column. only the top two values, g logit t = ℓq,(1) − ℓq,(2), and triggers the verifier when g logit t < τ . These are the “uncerta… view at source ↗
Figure 3
Figure 3. Figure 3: Flip-aligned final-layer K/V-cache error trajec [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qwen2.5-14B: max-err per layer for K (left) [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Oracle single-column repair traces across [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on $0.48\%$ of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, SharedGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MarginGate, a sparse margin-triggered verification policy for achieving batch-invariant deterministic decoding during temperature-zero BF16 LLM inference. It reports that batch-induced token flips are sparse (0.3-1.3% on MATH500 across models), that K/V perturbations stay flat before flips while low top-1/top-2 logit margins correlate with flip risk, and that verifying only low-margin steps restores 100% sequence-level determinism on Llama-3.1-8B, Qwen2.5-14B, and DSR1-Distill-Qwen-7B at trigger rates of 18.56%, 15.05%, and 49.50% respectively, cutting the latency overhead of always-on verification (LLM-42) by roughly 2x. The policy is calibrated on MATH500 and transferred to GSM8K, SharedGPT, and HumanEval.

Significance. If the empirical correlation and transfer results hold, MarginGate would offer a practical, low-overhead route to deterministic batched inference by exploiting the observed sparsity of flips. The multi-model evaluation and cross-dataset transfer constitute a concrete strength of the work.

major comments (2)
  1. [Abstract and §4 (results)] The abstract and results sections state precise quantitative claims (100% determinism, exact trigger rates, 2.23x/1.99x latency reductions) yet supply no experimental protocol, batch-size distribution, hardware platform, margin-threshold selection procedure, or error analysis. This absence prevents evaluation of whether the reported numbers are reproducible or statistically supported.
  2. [§3 (policy) and results tables] The policy rests on the claim that low top-1/top-2 margins capture essentially all flip risk. No per-step breakdown, ROC-style analysis, or false-negative rate for the chosen margin threshold is provided, leaving open whether the 100% determinism result is robust or specific to the calibration set.
minor comments (2)
  1. [§2] Notation for the margin threshold and the exact definition of a 'flip' should be formalized with an equation rather than prose.
  2. [Figures 2-4] Figure captions should explicitly state the batch sizes and models used in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater experimental transparency and robustness analysis. We will revise the manuscript to incorporate the requested details and additional supporting analysis while preserving the core empirical claims.

read point-by-point responses
  1. Referee: [Abstract and §4 (results)] The abstract and results sections state precise quantitative claims (100% determinism, exact trigger rates, 2.23x/1.99x latency reductions) yet supply no experimental protocol, batch-size distribution, hardware platform, margin-threshold selection procedure, or error analysis. This absence prevents evaluation of whether the reported numbers are reproducible or statistically supported.

    Authors: We agree that the current presentation lacks sufficient protocol details for full reproducibility. In the revised manuscript we will add an expanded experimental setup subsection in §4 that specifies: batch-size distributions (powers of two from 1–32), hardware platform (NVIDIA A100 80 GB GPUs, CUDA 12.4, PyTorch 2.4, vLLM backend), margin-threshold selection (grid search over [0.01, 0.20] on MATH500 to minimize trigger rate subject to 100% sequence determinism), and error analysis (standard deviations over five independent runs with different random seeds for batch ordering). These additions will directly support the reported quantitative claims. revision: yes

  2. Referee: [§3 (policy) and results tables] The policy rests on the claim that low top-1/top-2 margins capture essentially all flip risk. No per-step breakdown, ROC-style analysis, or false-negative rate for the chosen margin threshold is provided, leaving open whether the 100% determinism result is robust or specific to the calibration set.

    Authors: The 100% sequence-level determinism result implies a zero false-negative rate for the chosen threshold on the evaluated models and calibration set. To make this explicit and address robustness, the revision will include (i) a per-step margin histogram contrasting flip versus non-flip steps, (ii) the false-negative rate (reported as zero on MATH500 for the selected threshold), and (iii) a brief trade-off curve of trigger rate versus achieved determinism. These additions will clarify that the policy is not merely calibration-set specific while retaining the observed transfer performance on GSM8K, SharedGPT, and HumanEval. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central contribution is an empirically derived verification policy: observations of sparse batch-induced flips (0.3-1.3% on MATH500) and correlation with low top-1/top-2 margins are used to trigger verification only on low-margin steps. The policy is calibrated on MATH500 and evaluated for transfer on other datasets, with reported determinism rates and latency reductions measured directly against always-on baselines. No equations, derivations, or self-citations reduce the claimed outcomes to inputs by construction; the method rests on external experimental measurements rather than tautological redefinitions or fitted quantities renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5826 in / 1153 out tokens · 35936 ms · 2026-06-29T08:53:26.958804+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. https://www.usenix.org/conference/osdi24/presentation/agrawal Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve . In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), p...

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others. 2021. Evaluating large language models trained on code. arXi...

  5. [5]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna : An open-source chatbot impressing GPT-4 with 90\ https://lmsys.org/blog/2023-03-30-vicuna/

  6. [6]

    Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, and Wei Zhang. 2025 a . https://doi.org/10.48550/arXiv.2508.08438 Selective KV -cache sharing to mitigate timing side-channels in LLM inference . arXiv:2508.08438

  7. [7]

    Kexin Chu, Zixu Shen, Sheng-Ru Cheng, Dawei Xiang, Ziqin Liu, and Wei Zhang. 2025 b . https://doi.org/10.1109/ICDCS63083.2025.00062 MCaM : Efficient LLM inference with multi-tier KV cache management . In Proceedings of the 45th IEEE International Conference on Distributed Computing Systems, pages 571--581

  8. [8]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  9. [9]

    David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5--48

  10. [10]

    En- abling Determinism in LLM Inference with Verified Speculation

    Raja Gond, Aditya K. Kamath, Ramachandran Ramjee, and Ashish Panwar. 2026. https://arxiv.org/abs/2601.17768 LLM-42 : Enabling determinism in LLM inference with verified speculation . arXiv:2601.17768

  11. [11]

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM : Knowledge distillation of large language models. In International Conference on Learning Representations

  12. [12]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and DeepSeek-AI . 2025. https://doi.org/10.1038/s41586-025-09422-z DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning . Nature, 645:633--638

  13. [13]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems ( NeurIPS )

  14. [14]

    Nicholas J. Higham. 2002. Accuracy and Stability of Numerical Algorithms, 2nd edition. SIAM

  15. [15]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention . In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611--626

  16. [16]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. https://proceedings.mlr.press/v202/leviathan23a.html Fast inference from transformers via speculative decoding . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19274--19286. PMLR

  17. [17]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024 a . https://doi.org/10.1145/3651890.3672274 CacheGen : KV cache compression and streaming for fast large language model serving . In Proceedings of the A...

  18. [18]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024 b . https://proceedings.mlr.press/v235/liu24bz.html KIVI : A tuning-free asymmetric 2bit quantization for KV cache . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research...

  19. [19]

    Meta AI . 2024. The Llama 3 herd of models. https://arxiv.org/abs/2407.21783

  20. [20]

    SGLang Project . 2026. Deterministic inference. https://sgl-project.github.io/advanced_features/deterministic_inference.html

  21. [21]

    Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. 2025. https://doi.org/10.18653/v1/2025.naacl-long.211 The good, the bad, and the greedy: Evaluation of LLM s should not ignore non-determinism . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies ...

  22. [22]

    Thinking Machines Lab . 2026. batch\_invariant\_ops: Batch-invariant implementations of reduction operations. https://github.com/thinking-machines-lab/batch_invariant_ops

  23. [23]

    vLLM Project . 2026. Batch invariance. https://docs.vllm.ai/en/v0.19.0/features/batch_invariance/

  24. [24]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca : A distributed serving system for Transformer -based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521--538

  25. [25]

    Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. 2025. Understanding and mitigating numerical sources of nondeterminism in LLM inference. In Advances in Neural Information Processing Systems ( NeurIPS ) . ArXiv:2506.09501

  26. [26]

    Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, and Zirui Liu. 2025. https://arxiv.org/abs/2511.17826 Deterministic inference across tensor parallel sizes that eliminates training-inference mismatch . arXiv:2511.17826

  27. [27]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. https://doi.org/10.52202/079017-2000 SGLang : Efficient execution of structured language model programs . In Advances in Neural Information Processing Systems, volume 37

  28. [28]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin DistServe : Disaggregating prefill and decoding for goodput-optimized large language model serving . In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pag...