Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment

Bruce Changlong Xu; Lan Wu

arxiv: 2606.07581 · v1 · pith:BHE5APMKnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · cs.ET

Pith reviewed 2026-06-29 18:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.ET

keywords kernel contracts · training inference divergence · logit drift · total variation distance · policy gradient · RL post-training · contract framework · reward drift

The pith

Kernel contracts bound divergence between training and inference kernels in post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes kernel contracts as a way to specify acceptable divergence between a training kernel and an inference kernel that evaluate the same policy weights. It derives a chain of mathematical bounds from logit drift to total-variation distance and then to bounded reward drift. The chain is specialized to RL post-training, showing how per-token importance-ratio drift bounds policy-gradient bias when support and norm conditions hold. The framework also includes a promotion pipeline, routing loop, and YAML syntax for contracts. Readers would care if they want to prevent silent distribution shifts in deployed models that training benchmarks miss.

Core claim

The central claim is that kernel contracts C = (N, S, R, O, Pi) allow specification of acceptable divergence between K_train and K_inf. A derived bound chain runs from logit drift through total-variation distance to bounded reward drift. Specializing to RL post-training, per-token importance-ratio drift bounds policy-gradient bias under explicit support and norm assumptions. The paper describes a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for the contracts.
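One standard way to obtain a chain of this shape, written for a single next-token distribution, is sketched below. It is not taken from the paper, whose derivation is not reproduced on this page. The symbols $z$, $z'$, $\varepsilon$, $p$, $p'$ and $r$ are assumptions introduced for illustration.

```latex
% Assumed setup: logits z (training kernel) and z' (inference kernel) over a
% finite vocabulary, p = softmax(z), p' = softmax(z'), and a bounded reward r.
\begin{align*}
\|z - z'\|_\infty \le \varepsilon
  &\;\Longrightarrow\; \bigl|\log p_i - \log p'_i\bigr| \le 2\varepsilon
     \quad \text{for all } i, \\
  &\;\Longrightarrow\; \mathrm{TV}(p, p') = \tfrac12 \sum_i |p_i - p'_i|
     \le \tfrac12 \bigl(e^{2\varepsilon} - 1\bigr), \\
  &\;\Longrightarrow\; \bigl|\mathbb{E}_{p}[r] - \mathbb{E}_{p'}[r]\bigr|
     \le \bigl(\sup r - \inf r\bigr)\,\mathrm{TV}(p, p').
\end{align*}
```

The first step uses the fact that log-sum-exp is 1-Lipschitz in the sup norm, so each log-probability moves by at most $2\varepsilon$. For small $\varepsilon$ the total-variation bound is approximately $\varepsilon$. Carrying the bound from one token to a full sequence needs a further argument that this sketch does not attempt.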

What carries the argument

Kernel contract C = (N, S, R, O, Pi) that combines numerical, statistical, runtime and observability clauses with an escalation policy from violations to routing actions.
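To make the tuple concrete, here is a minimal Python sketch of how such a contract might be represented and evaluated. Every name in it (`Clause`, `KernelContract`, `Action`, the metric names and the thresholds) is hypothetical; the paper's actual clause schema and action set are not shown on this page.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical routing actions, ordered by severity."""
    ALLOW = 0
    ALERT = 1
    FALLBACK_KERNEL = 2


@dataclass(frozen=True)
class Clause:
    """One threshold check: the named metric must stay at or below `limit`."""
    metric: str
    limit: float

    def violated(self, measurements: dict) -> bool:
        # A missing measurement counts as a violation (fail closed).
        value = measurements.get(self.metric)
        return value is None or value > self.limit


@dataclass
class KernelContract:
    numerical: tuple = ()       # N: e.g. logit drift limits
    statistical: tuple = ()     # S: e.g. total-variation limits
    runtime: tuple = ()         # R: e.g. latency limits
    observability: tuple = ()   # O: metric names that must be reported
    escalation: dict = field(default_factory=dict)  # Pi: clause group -> Action

    def evaluate(self, measurements: dict) -> Action:
        """Return the most severe action triggered by any violated clause group."""
        triggered = [Action.ALLOW]
        groups = {
            "numerical": self.numerical,
            "statistical": self.statistical,
            "runtime": self.runtime,
        }
        for name, clauses in groups.items():
            if any(c.violated(measurements) for c in clauses):
                triggered.append(self.escalation.get(name, Action.ALERT))
        if any(m not in measurements for m in self.observability):
            triggered.append(self.escalation.get("observability", Action.ALERT))
        return max(triggered)


# Illustrative contract with made-up thresholds.
contract = KernelContract(
    numerical=(Clause("max_logit_drift", 0.05),),
    statistical=(Clause("tv_distance", 0.02),),
    runtime=(Clause("p99_latency_ms", 50.0),),
    observability=("max_logit_drift", "tv_distance"),
    escalation={
        "numerical": Action.ALERT,
        "statistical": Action.FALLBACK_KERNEL,
        "runtime": Action.ALERT,
    },
)
```

Calling `contract.evaluate(...)` with a dictionary of measured metrics returns the action to take. Treating a missing metric as a violation is a design choice of this sketch, not something the paper is known to prescribe.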

If this is right

If the bound chain holds, controlling per-token importance-ratio drift controls policy-gradient bias in RL.
Contract violations escalate to routing actions that can switch kernels or models.
The four-stage promotion pipeline uses contracts to validate models before inference deployment.
The online routing loop applies contracts at runtime to detect and respond to divergence.
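The paper's routing loop is not reproduced on this page. As a rough illustration of the idea only, the Python sketch below routes traffic to a fallback kernel while a rolling mean of measured drift exceeds a limit. The class name `DriftRouter`, the route labels, the window size and the limit are all hypothetical.

```python
from collections import deque


class DriftRouter:
    """Pick a kernel per request from a rolling mean of observed drift.

    Hypothetical sketch: a real loop would likely track several contract
    clauses, not a single scalar.
    """

    def __init__(self, limit: float, window: int = 4):
        self.limit = limit
        self.samples = deque(maxlen=window)  # keeps only the last `window` values

    def observe(self, drift: float) -> str:
        """Record one drift measurement and return the route to use next."""
        self.samples.append(drift)
        mean_drift = sum(self.samples) / len(self.samples)
        return "fallback" if mean_drift > self.limit else "primary"


router = DriftRouter(limit=0.05, window=2)
routes = [router.observe(d) for d in [0.01, 0.02, 0.20, 0.01, 0.01]]
# routes == ["primary", "primary", "fallback", "fallback", "primary"]
```

The windowed mean means a single spike keeps the fallback active until it ages out of the window, which trades responsiveness for fewer route flips.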

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to monitoring other types of model divergence in production serving systems.
The YAML DSL might enable automated contract enforcement in ML pipelines beyond the described stages.
One could test the bounds by simulating logit drift in small RL setups and checking the resulting bias limits.
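A simulation of that kind could look like the Python sketch below, which uses only the standard library. It perturbs random logits by at most `eps` per coordinate and checks two standard inequalities: TV ≤ (e^{2·eps} − 1)/2, and reward gap ≤ (max r − min r)·TV. These inequalities are assumed here as stand-ins, since the paper's own bounds are not shown on this page; the vocabulary size, noise model and reward range are arbitrary choices.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def check_chain(vocab=50, eps=0.05, trials=200, seed=0):
    """Return (worst observed TV, TV bound, whether both bounds held in every trial)."""
    rng = random.Random(seed)
    tv_bound = 0.5 * (math.exp(2 * eps) - 1)
    tol = 1e-12  # slack for floating-point rounding
    worst_tv = 0.0
    all_hold = True
    for _ in range(trials):
        z = [rng.gauss(0.0, 2.0) for _ in range(vocab)]
        # Simulated inference-kernel logits: sup-norm drift of at most eps.
        z_inf = [x + rng.uniform(-eps, eps) for x in z]
        reward = [rng.uniform(0.0, 1.0) for _ in range(vocab)]
        p, q = softmax(z), softmax(z_inf)
        tv = tv_distance(p, q)
        gap = abs(sum((a - b) * r for a, b, r in zip(p, q, reward)))
        span = max(reward) - min(reward)
        all_hold = all_hold and tv <= tv_bound + tol and gap <= span * tv + tol
        worst_tv = max(worst_tv, tv)
    return worst_tv, tv_bound, all_hold


worst_tv, tv_bound, all_hold = check_chain()
```

Comparing `worst_tv` with `tv_bound` gives a rough sense of how loose the worst-case bound is for random, rather than adversarial, drift.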

Load-bearing premise

The support and norm assumptions hold when bounding policy-gradient bias from per-token importance-ratio drift.
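The role of the two assumptions can be seen in a standard importance-sampling argument, sketched here for a single decision step. This is an illustration under assumed notation ($q$, $\rho$, $\delta$, $A_{\max}$, $G$), not the paper's theorem.

```latex
% Assumed setup: actions are sampled by the inference kernel (policy q) while
% gradients are taken through the training kernel (policy \pi_\theta).
% \rho(a \mid s) = \pi_\theta(a \mid s) / q(a \mid s) is the per-token importance ratio.
\begin{align*}
g &= \mathbb{E}_{a \sim \pi_\theta}\bigl[A(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s)\bigr]
   = \mathbb{E}_{a \sim q}\bigl[\rho(a \mid s)\,A(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s)\bigr]
   && \text{(support: } q > 0 \text{ wherever } \pi_\theta > 0\text{)} \\
\hat g &= \mathbb{E}_{a \sim q}\bigl[A(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s)\bigr]
   && \text{(ratio silently treated as } 1\text{)} \\
\|g - \hat g\| &\le \mathbb{E}_{a \sim q}\bigl[\,|\rho - 1|\,|A|\,\|\nabla_\theta \log \pi_\theta\|\,\bigr]
   \le \delta\, A_{\max}\, G
   && \text{(norm: } |\rho - 1| \le \delta,\ |A| \le A_{\max},\ \|\nabla_\theta \log \pi_\theta\| \le G\text{)}
\end{align*}
```

Without the support condition the change of measure in the first line fails, and without the norm conditions the last inequality has no finite right-hand side.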

What would settle it

Finding an RL post-training scenario where the per-token importance-ratio drift is bounded but the policy-gradient bias is not would falsify the specialized bound.
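A toy version of such a check is sketched below in Python for a single-step softmax policy, with expectations computed exactly rather than sampled. The bound tested, δ·A_max·G with δ = max |ρ − 1|, is a standard importance-sampling bound assumed for illustration, not the paper's statement. A softmax policy has full support, so the support assumption holds trivially here and the sketch cannot probe its failure.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(action, probs):
    """Gradient of log softmax(theta)[action] w.r.t. theta: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def expected_gradient(sampling_probs, policy_probs, advantages):
    """Exact E_{a ~ sampling}[A(a) * grad log pi(a)] for a discrete action set."""
    k = len(policy_probs)
    g = [0.0] * k
    for a in range(k):
        grad = grad_log_pi(a, policy_probs)
        for i in range(k):
            g[i] += sampling_probs[a] * advantages[a] * grad[i]
    return g


def bias_and_bound(k=8, eps=0.05, seed=0):
    """Return (norm of policy-gradient bias, delta * A_max * G) for one random instance."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(k)]
    theta_inf = [t + rng.uniform(-eps, eps) for t in theta]  # simulated kernel drift
    adv = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    pi, q = softmax(theta), softmax(theta_inf)
    g_true = expected_gradient(pi, pi, adv)    # on-policy gradient
    g_biased = expected_gradient(q, pi, adv)   # samples from q, ratio ignored
    bias = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_true, g_biased)))
    delta = max(abs(p / qq - 1.0) for p, qq in zip(pi, q))
    a_max = max(abs(a) for a in adv)
    g_max = max(math.sqrt(sum(x * x for x in grad_log_pi(a, pi))) for a in range(k))
    return bias, delta * a_max * g_max


bias, bound = bias_and_bound()
```

A counterexample in the sense of the paragraph above would need a setting where the ratio drift stays bounded yet `bias` grows without limit, which in this toy model the norm terms rule out.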

Original abstract

A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Kernel contracts is a transparent framework proposal for bounding train-inference divergence in RL, but it stays conceptual with no derivations or tests shown.

The main takeaway is that this paper defines kernel contracts as a structured way to specify acceptable divergence between training and inference kernels, then derives a bound chain from logit drift through total variation distance to reward drift and specializes it to bound policy-gradient bias from per-token importance-ratio drift in RL post-training.

It does a solid job laying out the contract tuple with its numerical, statistical, runtime, and observability clauses plus an escalation policy. The four-stage promotion pipeline and the YAML DSL give the idea some operational shape. The authors are explicit about the support and norm assumptions needed for the RL bound and upfront that this is a framework paper without empirical validation or production checks. That transparency is useful.

The soft spots are straightforward. No equations, proofs, or data appear in the manuscript, so the actual tightness or practicality of the bound chain cannot be checked. The support assumption in particular looks restrictive for real models. It is also unclear how much the overall framing overlaps with prior work on model drift or serving contracts.

This is for engineers and researchers focused on reliable RL deployment who want a vocabulary and escalation mechanism for kernel differences. A reader dealing with post-training pipelines could pick up some useful structure from it.

I would send it to peer review. The problem is real and the approach is stated clearly enough to merit referee time, though it would need the missing derivations and some validation work to become more than a proposal.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes kernel contracts as a contract-first framework for specifying acceptable divergence between training kernels (optimized for autograd) and inference kernels (optimized for low-precision serving) in post-training pipelines. A contract is formalized as C = (N, S, R, O, Pi) combining numerical, statistical, runtime, and observability clauses with an escalation policy. The central technical contribution is a derivation chain from logit drift to total-variation distance to bounded reward drift, specialized to RL post-training where per-token importance-ratio drift is shown to bound policy-gradient bias under explicit support and norm assumptions. The work also outlines a four-stage promotion pipeline, an online routing loop, and a minimal YAML DSL for contract artifacts, and is explicitly positioned as a framework/vocabulary paper without production-scale empirical validation.

Significance. If the bound chain holds under the stated assumptions, the framework offers a principled vocabulary and tooling layer for managing training-inference discrepancies that benchmarks often miss, with direct relevance to RL post-training where policy-gradient bias is a practical concern. The explicit enumeration of support and norm assumptions, together with the concrete DSL and pipeline description, strengthens the proposal as a usable artifact rather than purely abstract. The absence of empirical checks is transparently acknowledged, so significance rests on the conceptual linkage from low-level kernel differences to reward-level impacts and the potential for subsequent validation.

major comments (2)

[RL specialization section] The bound on policy-gradient bias from per-token importance-ratio drift is load-bearing for the RL claim and rests on the explicit support and norm assumptions; the manuscript should include a short argument or reference showing these assumptions are satisfiable (or routinely checked) in standard RL post-training regimes rather than leaving applicability entirely to the reader.
[Bound derivation section] Bound derivation (logit drift → TV → reward drift): while the chain is described as starting from observable logit drift, the manuscript should verify in the relevant derivation section that the final bound on reward drift does not collapse to a quantity fitted by construction once the support/norm assumptions are imposed.

minor comments (3)

[Introduction] The contract tuple C = (N, S, R, O, Pi) is introduced without an immediate expanded definition of each component; a one-paragraph unpacking in the introduction would improve readability.
The YAML DSL is described as 'minimal' but no example artifact is shown; including a short concrete YAML snippet would make the practical contribution more immediate.
Consider adding a brief related-work paragraph contrasting kernel contracts with existing distribution-shift or serving-consistency literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.

Referee: [RL specialization section] The bound on policy-gradient bias from per-token importance-ratio drift is load-bearing for the RL claim and rests on the explicit support and norm assumptions; the manuscript should include a short argument or reference showing these assumptions are satisfiable (or routinely checked) in standard RL post-training regimes rather than leaving applicability entirely to the reader.

Authors: We agree that a short discussion of satisfiability strengthens the RL specialization. In the revised version we will add a concise paragraph noting that the support assumption holds under standard clipped importance sampling or reference-policy regularization in PPO-style RL post-training, while the norm bounds follow from bounded reward models and typical policy parameterizations; we will cite representative works on importance-ratio monitoring in RLHF pipelines.

Revision: yes

Referee: [Bound derivation section] Bound derivation (logit drift → TV → reward drift): while the chain is described as starting from observable logit drift, the manuscript should verify in the relevant derivation section that the final bound on reward drift does not collapse to a quantity fitted by construction once the support/norm assumptions are imposed.

Authors: We will revise the bound derivation section to include an explicit verification remark. The support and norm assumptions supply worst-case multipliers applied to the independently measured logit drift; the resulting reward-drift bound therefore remains a non-trivial function of the observed kernel divergence rather than becoming tautological under the assumptions.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under explicit assumptions

The manuscript derives a bound chain (logit drift to TV distance to reward drift) and specializes it to per-token importance-ratio drift bounding policy-gradient bias. All steps rest on explicitly stated support and norm assumptions rather than on fitted parameters, self-citations, or definitional loops. The work is framed as a contract vocabulary and framework proposal without empirical validation or production claims that could introduce circularity. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on introducing the contract tuple and deriving bounds under domain assumptions about support and norms; no numerical free parameters are mentioned.

axioms (1)

domain assumption: Explicit support and norm assumptions are available to bound policy-gradient bias from importance-ratio drift.
Required for the RL specialization stated in the abstract.

invented entities (1)

Kernel contract C = (N, S, R, O, Pi) · no independent evidence
purpose: To combine numerical, statistical, runtime, and observability clauses with an escalation policy for train-inference divergence.
New artifact introduced to specify acceptable kernel differences.

pith-pipeline@v0.9.1-grok · 5722 in / 1280 out tokens · 33848 ms · 2026-06-29T18:31:02.472539+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 8 canonical work pages · 7 internal anchors

[1] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters for on-policy deep actor-critic methods? A large-scale study. International Conference on Learning Representations (ICLR), 2021.
[2] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
[3] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems (NeurIPS), 2022.
[4] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case study on PPO and TRPO. International Conference on Learning Representations (ICLR), 2020.
[5] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. International Conference on Learning Representations (ICLR), 2023.
[6] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy (S&P), 2018.
[7] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017.
[8] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. AAAI, 2018.
[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[10] Sara Hooker. The hardware lottery. Communications of the ACM, 2021.
[11] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 2022.
[12] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification (CAV), 2017.
[13] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP), 2023.
[14] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
[15] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022.
[16] Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[17] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback.... 2022
[18] Rohit Patel. Understanding reinforcement learning for model training, and future directions with grape. arXiv preprint arXiv:2509.04501, 2025.
[19] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (NeurIPS), 2023.
[20] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
[21] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[22] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024.
[23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
