Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment
Pith reviewed 2026-06-29 18:31 UTC · model grok-4.3
The pith
Kernel contracts bound divergence between training and inference kernels in post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a kernel contract C = (N, S, R, O, Pi) can specify the acceptable divergence between the training kernel K_train and the inference kernel K_inf. A derived chain of bounds runs from logit drift through total-variation distance to bounded reward drift. Specialized to RL post-training, per-token importance-ratio drift bounds policy-gradient bias under explicit support and norm assumptions. The paper also describes a four-stage promotion pipeline, an online routing loop, and a minimal YAML DSL for contract artifacts.
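To make the shape of the bound chain concrete, here is a worked sketch for a single decoding step that uses only standard inequalities. The notation (z_train, z_inf, p_train, p_inf, epsilon, delta, R_max) is assumed here for illustration and is not taken from the paper, whose exact statements and constants may differ.

```latex
% Assumed notation: z_train and z_inf are the logit vectors produced by
% K_train and K_inf at identical weights and context; p = softmax(z).
\begin{align*}
\|z_{\mathrm{train}} - z_{\mathrm{inf}}\|_\infty \le \varepsilon
  &\;\Longrightarrow\;
  \left|\log\frac{p_{\mathrm{train}}(i)}{p_{\mathrm{inf}}(i)}\right| \le 2\varepsilon
  \quad\text{for every token } i, \\
\mathrm{TV}(p_{\mathrm{train}}, p_{\mathrm{inf}})
  &= \tfrac12 \sum_i p_{\mathrm{inf}}(i)
     \left|\frac{p_{\mathrm{train}}(i)}{p_{\mathrm{inf}}(i)} - 1\right|
  \le \tfrac12\left(e^{2\varepsilon} - 1\right) =: \delta, \\
\left|\mathbb{E}_{p_{\mathrm{train}}}[r] - \mathbb{E}_{p_{\mathrm{inf}}}[r]\right|
  &\le R_{\max} \sum_i \left|p_{\mathrm{train}}(i) - p_{\mathrm{inf}}(i)\right|
  \le 2 R_{\max}\,\delta
  \quad\text{when } |r| \le R_{\max}.
\end{align*}
```

The first implication holds because each logit moves by at most epsilon and the log-sum-exp normalizer is 1-Lipschitz in the sup norm, so it also moves by at most epsilon. For small epsilon, delta is approximately epsilon, so in this sketch the reward drift scales linearly with the observed logit drift.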
What carries the argument
Kernel contract C = (N, S, R, O, Pi) that combines numerical, statistical, runtime and observability clauses with an escalation policy from violations to routing actions.
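A minimal Python sketch of how such a contract could be represented and evaluated. Every name here (`Clause`, `KernelContract`, `Action`, the metric names, the thresholds) is hypothetical, and reading N, S, R, O as the numerical, statistical, runtime, and observability families is an assumption based on the order in which the abstract lists them. The paper's own artifact format is a YAML DSL that this page does not reproduce.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Tuple


class Action(IntEnum):
    """Hypothetical routing actions, ordered by severity."""
    ALLOW = 0
    ALERT = 1
    FALLBACK = 2  # e.g. route traffic to a reference kernel


@dataclass(frozen=True)
class Clause:
    family: str            # assumed labels: "N", "S", "R", or "O"
    metric: str            # hypothetical metric name
    max_value: float       # largest acceptable value for the metric
    on_violation: Action   # escalation entry for this clause


@dataclass(frozen=True)
class KernelContract:
    clauses: Tuple[Clause, ...]

    def evaluate(self, metrics: Dict[str, float]) -> Tuple[List[str], Action]:
        """Return the violated metric names and the most severe action."""
        violated: List[str] = []
        action = Action.ALLOW
        for clause in self.clauses:
            value = metrics.get(clause.metric)
            # Design choice of this sketch: a missing metric is a violation,
            # since an unobserved quantity cannot be certified.
            if value is None or value > clause.max_value:
                violated.append(clause.metric)
                action = max(action, clause.on_violation)
        return violated, action


# Illustrative thresholds only; none of these numbers come from the paper.
contract = KernelContract(clauses=(
    Clause("N", "max_abs_logit_drift", 0.05, Action.ALERT),
    Clause("S", "mean_token_tv", 0.01, Action.FALLBACK),
    Clause("R", "p99_latency_ms", 250.0, Action.ALERT),
))
```

Calling `contract.evaluate(...)` with a dictionary of measured metrics returns the violated clauses together with the single most severe action, which is one simple way to express an escalation policy.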
If this is right
- If the bound chain holds, controlling per-token importance-ratio drift controls policy-gradient bias in RL.
- Contract violations escalate to routing actions that can switch kernels or models.
- The four-stage promotion pipeline uses contracts to validate models before inference deployment.
- The online routing loop applies contracts at runtime to detect and respond to divergence.
Where Pith is reading between the lines
- This approach could extend to monitoring other types of model divergence in production serving systems.
- The YAML DSL might enable automated contract enforcement in ML pipelines beyond the described stages.
- One could test the bounds by simulating logit drift in small RL setups and checking the resulting bias limits.
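The last suggestion above can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the paper's experiment: the bound `0.5 * (exp(2 * eps) - 1)` is a standard consequence of a sup-norm logit perturbation of size `eps`, and the function names, vocabulary size, and noise model are all assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_bound(eps):
    """Worst-case TV implied by a sup-norm logit drift of at most eps."""
    return min(1.0, 0.5 * (math.exp(2.0 * eps) - 1.0))


def simulate(vocab_size=64, eps=0.05, trials=1000, seed=0):
    """Perturb random logits by at most eps and track the worst observed TV."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        z_train = [rng.gauss(0.0, 3.0) for _ in range(vocab_size)]
        # Stand-in for kernel divergence: bounded noise on every logit.
        z_inf = [z + rng.uniform(-eps, eps) for z in z_train]
        tv = total_variation(softmax(z_train), softmax(z_inf))
        worst = max(worst, tv)
    return worst, tv_bound(eps)
```

Comparing the two values returned by `simulate()` shows how much slack the worst-case bound leaves for a given noise model; random perturbations are typically far from the adversarial case.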
Load-bearing premise
The support and norm assumptions hold when bounding policy-gradient bias from per-token importance-ratio drift.
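One plausible form of this premise, written out for a single action, shows where each assumption enters. The notation (rho, A, G, A_max, g, g-hat) is assumed here and the paper's actual statement may differ.

```latex
% Assumed notation: rho(a) = pi_train(a|s) / pi_inf(a|s); A is the advantage.
% Support assumption: pi_inf(a|s) > 0 wherever pi_train(a|s) > 0.
% Norm assumptions: ||grad_theta log pi_train|| <= G and |A| <= A_max.
\begin{align*}
g &= \mathbb{E}_{a \sim \pi_{\mathrm{inf}}}\!\left[\rho(a)\, A(a)\,
     \nabla_\theta \log \pi_{\mathrm{train}}(a)\right],
\qquad
\hat g = \mathbb{E}_{a \sim \pi_{\mathrm{inf}}}\!\left[A(a)\,
     \nabla_\theta \log \pi_{\mathrm{train}}(a)\right], \\
\|g - \hat g\|
  &= \left\|\mathbb{E}_{a \sim \pi_{\mathrm{inf}}}\!\left[(\rho(a) - 1)\, A(a)\,
     \nabla_\theta \log \pi_{\mathrm{train}}(a)\right]\right\|
  \le G\, A_{\max}\; \mathbb{E}_{a \sim \pi_{\mathrm{inf}}}\left|\rho(a) - 1\right|.
\end{align*}
```

Here g is the importance-corrected gradient and g-hat is the estimator that ignores the kernel mismatch. The support assumption is what makes g equal to the on-policy gradient under the training kernel, and the norm assumptions turn the expected ratio drift into a bias bound. Under the same support assumption, the expected ratio drift equals twice the total-variation distance between the two per-token distributions.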
What would settle it
Finding an RL post-training scenario where the per-token importance-ratio drift is bounded but the policy-gradient bias is not would falsify the specialized bound.
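Checking such a scenario requires measuring the per-token importance-ratio drift in the first place. The sketch below assumes that rollouts are sampled with the inference kernel and scored with the training kernel, so the ratio is pi_train / pi_inf for each sampled token. The function names and the `clip_eps` parameter are hypothetical.

```python
import math


def importance_ratios(logp_train, logp_inf):
    """Per-token ratios pi_train(a_t) / pi_inf(a_t) from log-probabilities."""
    ratios = []
    for lt, li in zip(logp_train, logp_inf):
        # A log-probability of -inf means one kernel assigns zero mass to a
        # sampled token, which breaks the support assumption.
        if math.isinf(lt) or math.isinf(li):
            raise ValueError("support assumption violated: zero-probability token")
        ratios.append(math.exp(lt - li))
    return ratios


def ratio_drift_report(logp_train, logp_inf, clip_eps=0.2):
    """Summarize how far the per-token ratios sit from 1."""
    ratios = importance_ratios(logp_train, logp_inf)
    drift = [abs(r - 1.0) for r in ratios]
    outside = sum(1 for d in drift if d > clip_eps)
    return {
        "max_drift": max(drift),
        "mean_drift": sum(drift) / len(drift),
        "frac_outside_clip": outside / len(drift),
    }
```

A falsifying scenario in the sense above would show a small `max_drift` alongside a large measured gradient bias, which the norm assumptions are meant to rule out.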
Original abstract
A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes kernel contracts as a contract-first framework for specifying acceptable divergence between training kernels (optimized for autograd) and inference kernels (optimized for low-precision serving) in post-training pipelines. A contract is formalized as C = (N, S, R, O, Pi) combining numerical, statistical, runtime, and observability clauses with an escalation policy. The central technical contribution is a derivation chain from logit drift to total-variation distance to bounded reward drift, specialized to RL post-training where per-token importance-ratio drift is shown to bound policy-gradient bias under explicit support and norm assumptions. The work also outlines a four-stage promotion pipeline, an online routing loop, and a minimal YAML DSL for contract artifacts, and is explicitly positioned as a framework/vocabulary paper without production-scale empirical validation.
Significance. If the bound chain holds under the stated assumptions, the framework offers a principled vocabulary and tooling layer for managing training-inference discrepancies that benchmarks often miss, with direct relevance to RL post-training where policy-gradient bias is a practical concern. The explicit enumeration of support and norm assumptions, together with the concrete DSL and pipeline description, strengthens the proposal as a usable artifact rather than purely abstract. The absence of empirical checks is transparently acknowledged, so significance rests on the conceptual linkage from low-level kernel differences to reward-level impacts and the potential for subsequent validation.
major comments (2)
- [RL specialization section] RL specialization section: the bound on policy-gradient bias from per-token importance-ratio drift is load-bearing for the RL claim and rests on the explicit support and norm assumptions; the manuscript should include a short argument or reference showing these assumptions are satisfiable (or routinely checked) in standard RL post-training regimes rather than leaving applicability entirely to the reader.
- [Bound derivation section] Bound derivation (logit drift → TV → reward drift): while the chain is described as starting from observable logit drift, the manuscript should verify in the relevant derivation section that the final bound on reward drift does not collapse to a quantity fitted by construction once the support/norm assumptions are imposed.
minor comments (3)
- [Introduction] The contract tuple C = (N, S, R, O, Pi) is introduced without an immediate expanded definition of each component; a one-paragraph unpacking in the introduction would improve readability.
- The YAML DSL is described as 'minimal' but no example artifact is shown; including a short concrete YAML snippet would make the practical contribution more immediate.
- Consider adding a brief related-work paragraph contrasting kernel contracts with existing distribution-shift or serving-consistency literature.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.
Point-by-point responses
Referee: [RL specialization section] RL specialization section: the bound on policy-gradient bias from per-token importance-ratio drift is load-bearing for the RL claim and rests on the explicit support and norm assumptions; the manuscript should include a short argument or reference showing these assumptions are satisfiable (or routinely checked) in standard RL post-training regimes rather than leaving applicability entirely to the reader.
Authors: We agree that a short discussion of satisfiability strengthens the RL specialization. In the revised version we will add a concise paragraph noting that the support assumption holds under standard clipped importance sampling or reference-policy regularization in PPO-style RL post-training, while the norm bounds follow from bounded reward models and typical policy parameterizations; we will cite representative works on importance-ratio monitoring in RLHF pipelines. revision: yes
Referee: [Bound derivation section] Bound derivation (logit drift → TV → reward drift): while the chain is described as starting from observable logit drift, the manuscript should verify in the relevant derivation section that the final bound on reward drift does not collapse to a quantity fitted by construction once the support/norm assumptions are imposed.
Authors: We will revise the bound derivation section to include an explicit verification remark. The support and norm assumptions supply worst-case multipliers applied to the independently measured logit drift; the resulting reward-drift bound therefore remains a non-trivial function of the observed kernel divergence rather than becoming tautological under the assumptions. revision: yes
Circularity Check
No significant circularity; derivation is self-contained under explicit assumptions
The manuscript derives a bound chain (logit drift to TV distance to reward drift) and specializes it to per-token importance-ratio drift bounding policy-gradient bias. All steps rest on explicitly stated support and norm assumptions rather than on fitted parameters, self-citations, or definitional loops. The work is framed as a contract vocabulary and framework proposal without empirical validation or production claims that could introduce circularity. No load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: explicit support and norm assumptions are available to bound policy-gradient bias from importance-ratio drift.
invented entities (1)
- Kernel contract C = (N, S, R, O, Pi): no independent evidence