
arxiv: 2606.07581 · v1 · pith:BHE5APMKnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · cs.ET

Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment

Pith reviewed 2026-06-29 18:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.ET
keywords kernel contracts · training inference divergence · logit drift · total variation distance · policy gradient · RL post-training · contract framework · reward drift

The pith

Kernel contracts bound divergence between training and inference kernels in post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes kernel contracts as a way to specify acceptable divergence between a training kernel and an inference kernel that evaluate the same policy weights. It derives a chain of mathematical bounds from logit drift to total-variation distance and then to bounded reward drift. The chain is specialized to RL post-training, showing how per-token importance-ratio drift bounds policy-gradient bias when support and norm conditions hold. The framework also includes a promotion pipeline, routing loop, and YAML syntax for contracts. Readers would care if they want to prevent silent distribution shifts in deployed models that training benchmarks miss.

Core claim

The central claim is that kernel contracts C = (N, S, R, O, Pi) allow specification of acceptable divergence between K_train and K_inf. A derived bound chain runs from logit drift through total-variation distance to bounded reward drift. Specializing to RL post-training, per-token importance-ratio drift bounds policy-gradient bias under explicit support and norm assumptions. The paper describes a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for the contracts.
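One way the bound chain could go in the simplest single-step case is sketched below. This is an illustration under stated assumptions, not the paper's derivation: it assumes a softmax over a finite vocabulary, a sup-norm logit drift of at most \varepsilon between the two kernels, and a reward bounded by R_max. The symbols z, p, \varepsilon, r, and R_max are introduced here for illustration, and the paper's own constants and conditions may differ.

```latex
\begin{align*}
\|z_{\mathrm{train}} - z_{\mathrm{inf}}\|_\infty \le \varepsilon
  &\;\Longrightarrow\;
  \bigl|\log p_{\mathrm{train}}(i) - \log p_{\mathrm{inf}}(i)\bigr| \le 2\varepsilon
  \quad \text{for all tokens } i, \\
  &\;\Longrightarrow\;
  \mathrm{TV}(p_{\mathrm{train}}, p_{\mathrm{inf}})
  = \tfrac12 \sum_i p_{\mathrm{inf}}(i)\,
    \Bigl|\tfrac{p_{\mathrm{train}}(i)}{p_{\mathrm{inf}}(i)} - 1\Bigr|
  \le \tfrac12\bigl(e^{2\varepsilon} - 1\bigr), \\
  &\;\Longrightarrow\;
  \bigl|\mathbb{E}_{p_{\mathrm{train}}}[r] - \mathbb{E}_{p_{\mathrm{inf}}}[r]\bigr|
  \le 2 R_{\max}\,\mathrm{TV}(p_{\mathrm{train}}, p_{\mathrm{inf}})
  \le R_{\max}\bigl(e^{2\varepsilon} - 1\bigr).
\end{align*}
```

The first step uses the fact that log-sum-exp is 1-Lipschitz in the sup norm, so both the logit and the normalizer move by at most \varepsilon. The last step uses |r| \le R_max. For small \varepsilon the reward gap is roughly 2 R_max \varepsilon, which is linear in the measured logit drift.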

What carries the argument

Kernel contract C = (N, S, R, O, Pi) that combines numerical, statistical, runtime and observability clauses with an escalation policy from violations to routing actions.
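As a rough illustration of how such a tuple could be represented in code, the sketch below models a contract as a small Python data structure. Everything in it is hypothetical: the field names, units, thresholds, the `Action` values, and the rule that the number of violated clauses selects the action are assumptions made for this example. The mapping of the letters N, S, R, O, Pi to clauses follows the order in which the abstract lists them, which is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # Hypothetical routing actions; the paper only speaks of "routing actions".
    SERVE = "serve"
    FALLBACK_KERNEL = "fallback_kernel"
    BLOCK = "block"


@dataclass(frozen=True)
class KernelContract:
    # All field names and units are illustrative assumptions, not the paper's schema.
    max_logit_drift: float   # N: numerical clause (sup-norm logit gap)
    max_tv_distance: float   # S: statistical clause
    max_latency_ms: float    # R: runtime clause
    logged_metrics: tuple = ("logit_drift", "tv_distance", "latency_ms")  # O
    # Pi: escalation policy, here keyed by the number of violated clauses.
    escalation: dict = field(default_factory=lambda: {
        0: Action.SERVE,
        1: Action.FALLBACK_KERNEL,
        2: Action.BLOCK,
    })

    def violations(self, metrics: dict) -> list:
        """Return the labels of the clauses that the measured metrics violate."""
        violated = []
        if metrics["logit_drift"] > self.max_logit_drift:
            violated.append("N")
        if metrics["tv_distance"] > self.max_tv_distance:
            violated.append("S")
        if metrics["latency_ms"] > self.max_latency_ms:
            violated.append("R")
        return violated

    def route(self, metrics: dict) -> Action:
        """Map the violation count to an action, saturating at the strictest one."""
        level = min(len(self.violations(metrics)), max(self.escalation))
        return self.escalation[level]


contract = KernelContract(max_logit_drift=0.05, max_tv_distance=0.01, max_latency_ms=50.0)
healthy = {"logit_drift": 0.01, "tv_distance": 0.001, "latency_ms": 10.0}
drifting = {"logit_drift": 0.10, "tv_distance": 0.001, "latency_ms": 10.0}
print(contract.route(healthy), contract.route(drifting))
```

Keeping the thresholds and the escalation table in one immutable object means a contract can be versioned and reviewed like any other artifact, which seems to be the spirit of the proposal.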

If this is right

  • If the bound chain holds, controlling per-token importance-ratio drift controls policy-gradient bias in RL.
  • Contract violations escalate to routing actions that can switch kernels or models.
  • The four-stage promotion pipeline uses contracts to validate models before inference deployment.
  • The online routing loop applies contracts at runtime to detect and respond to divergence.
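A minimal sketch of what an online routing loop of this kind might look like, assuming a stream of per-request drift measurements, a fixed threshold, and a sliding window. The function name `routing_loop`, its parameters, and the "primary" and "fallback" labels are hypothetical; the paper's actual loop and escalation rules are not reproduced here.

```python
from collections import deque


def routing_loop(drift_stream, threshold, window=4, max_violations=2):
    """Choose a kernel per request from recent drift measurements.

    A request is routed to the fallback kernel when at least `max_violations`
    of the last `window` measurements exceeded `threshold`.
    """
    recent = deque(maxlen=window)  # sliding window of violation flags
    decisions = []
    for drift in drift_stream:
        recent.append(drift > threshold)
        if sum(recent) >= max_violations:
            decisions.append("fallback")
        else:
            decisions.append("primary")
    return decisions


decisions = routing_loop([0.01, 0.2, 0.3, 0.01, 0.01, 0.01, 0.01], threshold=0.1)
print(decisions)
```

The window gives the loop some hysteresis: a single noisy measurement does not trigger a switch, and routing returns to the primary kernel only after the violations have aged out of the window.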

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to monitoring other types of model divergence in production serving systems.
  • The YAML DSL might enable automated contract enforcement in ML pipelines beyond the described stages.
  • One could test the bounds by simulating logit drift in small RL setups and checking the resulting bias limits.
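The last suggestion can be tried on a toy example. The sketch below uses NumPy to simulate logit drift by rounding random logits to float16, then compares the measured total-variation distance and reward gap against simple worst-case bounds. The float16 cast is only a stand-in for a real low-precision inference kernel, and the bounds used (TV at most (e^{2ε} − 1)/2 for sup-norm logit drift ε, and reward gap at most 2·R_max·TV) are standard single-step inequalities assumed here, not necessarily the paper's exact constants.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def drift_report(z_train, z_inf, rewards):
    """Measure logit drift, TV distance, and reward gap, with worst-case bounds."""
    eps = float(np.abs(z_train - z_inf).max())            # sup-norm logit drift
    p_train = np.exp(log_softmax(z_train))
    p_inf = np.exp(log_softmax(z_inf))
    tv = float(0.5 * np.abs(p_train - p_inf).sum(axis=-1).max())   # worst row
    reward_gap = float(np.abs((p_train - p_inf) @ rewards).max())  # worst row
    r_max = float(np.abs(rewards).max())
    tv_bound = float(0.5 * np.expm1(2.0 * eps))
    return {
        "eps": eps,
        "tv": tv,
        "tv_bound": tv_bound,
        "reward_gap": reward_gap,
        "reward_bound": 2.0 * r_max * tv_bound,
    }


rng = np.random.default_rng(0)
z_train = 3.0 * rng.normal(size=(64, 1000))              # "training kernel" logits
z_inf = z_train.astype(np.float16).astype(np.float64)    # stand-in for low precision
rewards = rng.uniform(-1.0, 1.0, size=1000)              # hypothetical per-token reward

report = drift_report(z_train, z_inf, rewards)
for name, value in report.items():
    print(f"{name}: {value:.3e}")
```

On random logits the measured quantities typically sit well below the worst-case bounds, which is expected: the bounds cover adversarial drift, whereas rounding error is close to zero-mean.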

Load-bearing premise

The support and norm assumptions hold when bounding policy-gradient bias from per-token importance-ratio drift.
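To make the role of these assumptions concrete, here is a single-step sketch of how such a bound can arise. It is an illustration, not the paper's statement. It assumes rollouts are sampled from the inference kernel while gradients are taken through the training kernel, that \pi_inf(a|s) > 0 implies \pi_train(a|s) > 0 (support), and that the score function and advantage are bounded by G and A_max (norm). The symbols \rho, \delta, G, and A_max are introduced here for illustration.

```latex
\begin{align*}
\rho(a \mid s) &= \frac{\pi_{\mathrm{inf}}(a \mid s)}{\pi_{\mathrm{train}}(a \mid s)}, \\
g &= \mathbb{E}_{a \sim \pi_{\mathrm{train}}}
     \bigl[\nabla_\theta \log \pi_{\mathrm{train}}(a \mid s)\, A(s,a)\bigr], \qquad
\hat g = \mathbb{E}_{a \sim \pi_{\mathrm{inf}}}
     \bigl[\nabla_\theta \log \pi_{\mathrm{train}}(a \mid s)\, A(s,a)\bigr], \\
\hat g - g &= \mathbb{E}_{a \sim \pi_{\mathrm{train}}}
     \bigl[(\rho(a \mid s) - 1)\,\nabla_\theta \log \pi_{\mathrm{train}}(a \mid s)\, A(s,a)\bigr], \\
\|\hat g - g\| &\le \delta\, G\, A_{\max}
\quad \text{whenever } |\rho - 1| \le \delta,\;
\|\nabla_\theta \log \pi_{\mathrm{train}}\| \le G,\; |A| \le A_{\max}.
\end{align*}
```

The support assumption is what allows the expectation under \pi_inf to be rewritten as a reweighted expectation under \pi_train. Without it the ratio \rho is undefined on part of the sample space and the third line fails.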

What would settle it

Finding an RL post-training scenario where the per-token importance-ratio drift is bounded but the policy-gradient bias is not would falsify the specialized bound.
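A toy version of this check can be run on a softmax bandit, where the policy gradient is available in closed form. The sketch below is a sanity check under assumptions, not evidence for or against the paper's bound: it assumes a single state, a softmax policy over logits `theta`, an inference kernel modeled as the same logits plus small noise, and the simple bound (max ratio drift) × (max score norm) × (max advantage). All names in it are hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def pg_bias_and_bound(theta, logit_noise, rewards):
    """Exact policy-gradient bias in a softmax bandit, plus a worst-case bound.

    Samples come from pi_inf = softmax(theta + logit_noise); gradients are
    taken through pi_train = softmax(theta).
    """
    pi_train = softmax(theta)
    pi_inf = softmax(theta + logit_noise)
    # Row a holds grad_theta log pi_train(a) = e_a - pi_train.
    scores = np.eye(len(theta)) - pi_train
    advantages = rewards - pi_train @ rewards
    g_on = (pi_train * advantages) @ scores    # expectation under pi_train
    g_off = (pi_inf * advantages) @ scores     # expectation under pi_inf
    bias = float(np.linalg.norm(g_off - g_on))
    delta = float(np.abs(pi_inf / pi_train - 1.0).max())   # importance-ratio drift
    score_norm = float(np.linalg.norm(scores, axis=1).max())
    adv_max = float(np.abs(advantages).max())
    return bias, delta * score_norm * adv_max


rng = np.random.default_rng(0)
theta = rng.normal(size=8)
rewards = rng.uniform(-1.0, 1.0, size=8)
noise = rng.uniform(-0.01, 0.01, size=8)   # stand-in for kernel-induced logit drift

bias, bound = pg_bias_and_bound(theta, noise, rewards)
print(f"bias={bias:.3e}  bound={bound:.3e}")
```

Because softmax has full support, the support assumption holds automatically here. A falsifying scenario would need a setting where it fails or where the score norm is unbounded, which this toy does not exercise.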

Original abstract

A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes kernel contracts as a contract-first framework for specifying acceptable divergence between training kernels (optimized for autograd) and inference kernels (optimized for low-precision serving) in post-training pipelines. A contract is formalized as C = (N, S, R, O, Pi) combining numerical, statistical, runtime, and observability clauses with an escalation policy. The central technical contribution is a derivation chain from logit drift to total-variation distance to bounded reward drift, specialized to RL post-training where per-token importance-ratio drift is shown to bound policy-gradient bias under explicit support and norm assumptions. The work also outlines a four-stage promotion pipeline, an online routing loop, and a minimal YAML DSL for contract artifacts, and is explicitly positioned as a framework/vocabulary paper without production-scale empirical validation.

Significance. If the bound chain holds under the stated assumptions, the framework offers a principled vocabulary and tooling layer for managing training-inference discrepancies that benchmarks often miss, with direct relevance to RL post-training where policy-gradient bias is a practical concern. The explicit enumeration of support and norm assumptions, together with the concrete DSL and pipeline description, strengthens the proposal as a usable artifact rather than purely abstract. The absence of empirical checks is transparently acknowledged, so significance rests on the conceptual linkage from low-level kernel differences to reward-level impacts and the potential for subsequent validation.

major comments (2)
  1. [RL specialization section] The bound on policy-gradient bias from per-token importance-ratio drift is load-bearing for the RL claim and rests on the explicit support and norm assumptions; the manuscript should include a short argument or reference showing these assumptions are satisfiable (or routinely checked) in standard RL post-training regimes rather than leaving applicability entirely to the reader.
  2. [Bound derivation section] Bound derivation (logit drift → TV → reward drift): while the chain is described as starting from observable logit drift, the manuscript should verify in the relevant derivation section that the final bound on reward drift does not collapse to a quantity fitted by construction once the support/norm assumptions are imposed.
minor comments (3)
  1. [Introduction] The contract tuple C = (N, S, R, O, Pi) is introduced without an immediate expanded definition of each component; a one-paragraph unpacking in the introduction would improve readability.
  2. The YAML DSL is described as 'minimal' but no example artifact is shown; including a short concrete YAML snippet would make the practical contribution more immediate.
  3. Consider adding a brief related-work paragraph contrasting kernel contracts with existing distribution-shift or serving-consistency literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [RL specialization section] The bound on policy-gradient bias from per-token importance-ratio drift is load-bearing for the RL claim and rests on the explicit support and norm assumptions; the manuscript should include a short argument or reference showing these assumptions are satisfiable (or routinely checked) in standard RL post-training regimes rather than leaving applicability entirely to the reader.

    Authors: We agree that a short discussion of satisfiability strengthens the RL specialization. In the revised version we will add a concise paragraph noting that the support assumption holds under standard clipped importance sampling or reference-policy regularization in PPO-style RL post-training, while the norm bounds follow from bounded reward models and typical policy parameterizations; we will cite representative works on importance-ratio monitoring in RLHF pipelines. revision: yes

  2. Referee: [Bound derivation section] Bound derivation (logit drift → TV → reward drift): while the chain is described as starting from observable logit drift, the manuscript should verify in the relevant derivation section that the final bound on reward drift does not collapse to a quantity fitted by construction once the support/norm assumptions are imposed.

    Authors: We will revise the bound derivation section to include an explicit verification remark. The support and norm assumptions supply worst-case multipliers applied to the independently measured logit drift; the resulting reward-drift bound therefore remains a non-trivial function of the observed kernel divergence rather than becoming tautological under the assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under explicit assumptions

Full rationale

The manuscript derives a bound chain (logit drift to TV distance to reward drift) and specializes it to per-token importance-ratio drift bounding policy-gradient bias. All steps rest on explicitly stated support and norm assumptions rather than on fitted parameters, self-citations, or definitional loops. The work is framed as a contract vocabulary and framework proposal without empirical validation or production claims that could introduce circularity. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on introducing the contract tuple and deriving bounds under domain assumptions about support and norms; no numerical free parameters are mentioned.

axioms (1)
  • domain assumption: Explicit support and norm assumptions are available to bound policy-gradient bias from importance-ratio drift
    Required for the RL specialization stated in the abstract.
invented entities (1)
  • Kernel contract C = (N, S, R, O, Pi) (no independent evidence)
    purpose: To combine numerical, statistical, runtime, and observability clauses with an escalation policy for train-inference divergence
    New artifact introduced to specify acceptable kernel differences.

pith-pipeline@v0.9.1-grok · 5722 in / 1280 out tokens · 33848 ms · 2026-06-29T18:31:02.472539+00:00 · methodology

