pith. machine review for the scientific record.

arxiv: 2605.08737 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords on-policy distillation · structured outputs · extrapolation threshold · format collapse · clip safety · LLM post-training · JSON tasks · Bernoulli reduction

The pith

In on-policy distillation of structured outputs, extrapolation past a closed-form lambda* switches training from format-preserving to format-collapsing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy distillation with reward extrapolation coefficient lambda can improve student performance beyond the teacher, yet crosses a sharp threshold lambda* where the output format contract is violated on near-deterministic structured tasks. In a single-position Bernoulli reduction the authors derive an exact expression for this clip-safety threshold in terms of the teacher's modal probability, the warm-start mass, and the importance-sampling clip strength. Below the threshold the extrapolated fixed point stays inside the safe region and format adherence is retained; above it the fixed point exits the region and training produces format collapse. The rule is shown to extend to calibrated K-ary listwise JSON tasks in which one binding equivalence class dominates the output contract. Experiments on Amazon Fashion confirm the predicted cliff location with pre-registered tests whose outcomes fall inside their locked windows.
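The importance-sampling clip that defines the safe region can be sketched in a few lines. This is an illustrative stand-in, not the paper's exact rule: the function name and the one-sided upper clip are assumptions.

```python
def clipped_is_weight(p_target: float, p_student: float, c: float) -> float:
    """Per-token importance-sampling ratio rho = target / student,
    clipped from above at strength c (one-sided clip is an assumption)."""
    rho = p_target / p_student
    return min(rho, c)

# Near-deterministic position: the sharpened target puts ~0.999 on the
# modal token, but a student holding only 0.02 there saturates the clip,
# so the update can move mass toward the mode at most at the clipped rate.
print(clipped_is_weight(0.999, 0.02, c=5.0))  # ratio ~50, clipped to 5.0
```

Once the ratio saturates, the student's residual tail can only shrink at a clip-bounded rate, which is consistent with the clipped tail mass (1−p)/c cited in the figures.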

Core claim

Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. In the single-position Bernoulli reduction this threshold is given in closed form by lambda*(p,b,c) using the teacher modal probability p, the warm-start mass b, and the clip strength c. The same boundary governs calibrated K-ary listwise JSON tasks when a single equivalence class dominates and SFT retains parse headroom.

What carries the argument

The clip-safety threshold lambda*(p,b,c) obtained from the single-position Bernoulli reduction of the structured-output dynamics.

If this is right

  • Operating just below lambda* lets a 1.7B Qwen3 student reach in-domain parity with an 8B-SFT baseline while preserving parse validity.
  • NDCG@1 on parsed outputs stays flat across lambda while parse validity changes sharply at the predicted boundary.
  • The cliff location is independent of the downstream rubric and can be read directly from the three measurable quantities.
  • Small-clip cross-prediction reproduces the closed-form value below grid resolution.
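The cliff-reading procedure behind these bullets can be sketched as a simple grid diagnostic. The parse rates below are placeholder numbers shaped like the Fig. 3 sweep, not values from the paper's tables.

```python
# Locating the parse-validity cliff from a lambda sweep (illustrative data).
lambdas     = [1.00, 1.05, 1.10, 1.15, 1.20, 1.25, 1.30, 1.35]
parse_rates = [0.96, 0.95, 0.94, 0.92, 0.55, 0.12, 0.08, 0.05]

# The cliff interval is the grid step with the largest drop in parse rate.
drops = [parse_rates[i] - parse_rates[i + 1] for i in range(len(lambdas) - 1)]
k = max(range(len(drops)), key=drops.__getitem__)
cliff_interval = (lambdas[k], lambdas[k + 1])

lam_star = 1.22  # closed-form prediction, as reported in Fig. 1
inside = cliff_interval[0] <= lam_star <= cliff_interval[1]
print(cliff_interval, inside)  # -> (1.2, 1.25) True
```

Because the diagnostic reads only parse validity, not the downstream rubric, the same sweep works for any contract-based evaluation.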

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same Bernoulli-derived cliff may appear in any structured generation setting whose output contract is dominated by a single equivalence class.
  • Choosing lambda immediately below the predicted threshold offers a practical rule for safe extrapolation without exhaustive search.
  • Because the cliff diagnostic does not rely on the evaluator rubric, it can be applied to any parse-based or contract-based evaluation.
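A minimal version of the rubric-independent, parse-based diagnostic, assuming a hypothetical contract of a JSON array of K distinct item indices (the paper's exact Fashion schema is not reproduced here):

```python
import json

K = 8  # list length; this schema is a stand-in for the binding contract

def strict_parse_ok(text: str, k: int = K) -> bool:
    """Rubric-free check: does the output satisfy the format contract?"""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, list) and len(obj) == k
            and all(isinstance(x, int) for x in obj)
            and len(set(obj)) == k)

print(strict_parse_ok("[3, 1, 7, 0, 5, 2, 6, 4]"))  # True
print(strict_parse_ok("[3, 1, 7, 0, 5, 2, 6"))      # False: truncated
```

Averaging this boolean over a validation batch yields the strict parse rate plotted in the figures.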

Load-bearing premise

The single-position Bernoulli reduction together with the dominance of one equivalence class accurately captures the contract dynamics of the full K-ary listwise JSON output.

What would settle it

If a fine-grid experiment on the Amazon Fashion task places the observed format-collapse point outside the interval predicted by lambda*(p,b,c) at the stated grid resolution, the closed-form threshold is falsified.
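The falsification criterion reduces to a one-line interval check; the λ*, observed collapse point, and grid step below are hypothetical stand-ins, not the paper's measured values.

```python
def threshold_consistent(lam_star: float, observed: float, grid: float) -> bool:
    """The closed form survives if the observed collapse point lies within
    one grid step of the predicted lambda*; otherwise it is falsified."""
    return abs(observed - lam_star) <= grid

print(threshold_consistent(lam_star=1.22, observed=1.20, grid=0.05))  # True
print(threshold_consistent(lam_star=1.22, observed=1.40, grid=0.05))  # False
```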

Figures

Figures reproduced from arXiv: 2605.08737 by Annan Wang, Chau Yuen, Hao Jiang, Xin Li, Yichi Zhang.

Figure 1. The extrapolation cliff in miniature. Left: the IS-clip-safe geometry; the sharpened fixed point exits at the base-neutral marker λ⋆(p_typ=0.9993, c=5)=1.22 (the full base-relative prediction bracket [λ⋆_safe, λ⋆_typ] = [1.18, 1.28] at b=0.81 is in Tab. 1). Right: strict parse rate on Fashion K=8 listwise (Qwen3 1.7B×4B, N=212) collapses in the same [1.18, 1.28] band; the clipped tail mass (1−p)/c enforc…

Figure 2. ListOPD pipeline. Left: teacher π_T at a scaffolding position concentrates on a modal token; base-relative extrapolation sharpens this target as λ grows. Middle: the student rolls out the listwise JSON token by token. Right: the per-token IS ratio ρ_t is clipped at c; scaffolding positions whose asymptotic fixed point sits in the clip-unsafe region (×) drift to parse collapse, the regime characterised by Thm. 4.…

Figure 3. Closed-form clip-safe threshold. Left: Fashion λ-sweep (1.7B×4B, 3-epoch); strict parse and FMC (K−1 truncation) transition in [1.15, 1.25], around the base-neutral λ⋆=1.22. Right: USEFUL for SFT vs. ListOPD across 0.6B–8B Qwen3 students; sub-threshold ListOPD flattens the size curve to USEFUL ∈ [0.873, 0.897] (0.6B / 8B single-seed).

Figure 4. Finite-N first-passage at λ=1.15. Thin blue: 5-seed strict parse-rate trajectories at 42 steps (mean 0.921 ± 0.019, near the sub-critical boundary). Thick red: the same λ extended to 70 steps (seed 42) crosses the clip-safe boundary between steps 60 and 70 (parse 0.887 → 0.675). Sub-critical λ⋆=1.22 > 1.15 does not forbid this crossing; Thm. 4.2 gives a budget-dependent first-passage time, and 28 extra steps s…

Figure 5. Thm. 4.2 budget-N pre-registered test. Left: observed parse rate vs. λ at three Fashion 1.7B×4B OPD budgets (N=42 blue, N=70 medium blue, N=200 red); the cliff curve shifts visibly leftward as N grows. The λ=1.05 red point shows the multi-seed mean (where computed). Right: cliff midpoint vs. N in linear-interpolated coordinates; the dashed line is the two-point 1/N fit committed in the prereg, the green band is th…

Figure 6. Teacher/student peak IS ratio max_t ρ_t^TS along λ and c. Left: the final-step peak pre-clip teacher/student IS ratio climbs from ≈9 at λ=1.0 to 30.9 at λ=1.4 (one grid step past the cliff midpoint) and collapses to ≈5 post-cliff, the boundary-seeking flow of Thm. 4.1 followed by the post-cliff degenerate regime. Right: the peak ratio across c ∈ {1.5, 2, 5, ∞} at fixed λ=1.15, N=42 is non-monotone (45, 4, 5, 13);…

Figure 7. Multi-λ first-passage trajectory and IS mechanism trace at N=200. Top: strict val parse rate vs. optimizer step at λ ∈ {1.00, 1.05, 1.10}, single seed 42, with the green errorbar showing the 3-seed mean at λ=1.05, step 196 (0.742 ± 0.107). λ=1.00 stays clip-safe; λ=1.05 crosses the 0.90 band in [80, 120] and lands at 0.703; λ=1.10 crosses earlier and collapses to 0.500. Bottom: per-step peak training-side I…

Figure 8. Per-step val-reward trajectories for the 8-point GSM8K …
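The two-point 1/N extrapolation committed in the Fig. 5 prereg can be sketched as follows; the budgets and midpoints here are placeholder numbers, not the paper's.

```python
def one_over_n_fit(n1, m1, n2, m2):
    """Fit m(N) = a + b/N through two (budget, cliff-midpoint) points and
    return (a, b); a is the extrapolated infinite-budget midpoint."""
    x1, x2 = 1.0 / n1, 1.0 / n2
    b = (m1 - m2) / (x1 - x2)
    a = m1 - b * x1
    return a, b

# Illustrative budgets in the spirit of Fig. 4's N=42 and N=70 points;
# the midpoint shifts leftward as N grows, matching Fig. 5's trend.
a, b = one_over_n_fit(42, 1.20, 70, 1.15)
print(round(a, 3), round(b, 2))  # -> 1.075 5.25
```

With a third budget (N=200 in the paper) the fit becomes a prediction that the locked window can confirm or falsify.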
Original abstract

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a derivation of a closed-form clip-safety threshold λ*(p,b,c) for on-policy distillation in a single-position Bernoulli reduction, where p is the teacher modal probability, b the warm-start mass, and c the importance-sampling clip strength. It argues that exceeding this threshold causes the extrapolated fixed point to exit the clip-safe region, shifting from format-preserving to format-collapsing behavior in structured outputs. The rule is extended to calibrated K-ary listwise JSON tasks assuming a dominating equivalence class and retained parse headroom from SFT. Pre-registered experiments on Amazon Fashion, including a fine-grid cliff interval, budget-extension test, and small-clip cross-prediction, confirm the predictions, with a 1.7B Qwen3 student achieving in-domain parity with an 8B-SFT baseline primarily through improved format adherence.

Significance. Should the result hold, the work provides a useful, parameter-light diagnostic for safe reward extrapolation in OPD on near-deterministic structured tasks, which is increasingly relevant for LLM post-training. Strengths include the closed-form expression depending only on externally measurable quantities, the pre-registered nature of the tests, and the empirical match including exact prediction for small-clip. The observation that NDCG remains flat while parse validity changes at the boundary offers clear evidence for the mechanism. This could help practitioners avoid format collapse while gaining from extrapolation.

major comments (2)
  1. The derivation of λ*(p,b,c) is performed under a single-position Bernoulli reduction. The subsequent extension to K-ary listwise JSON tasks depends critically on the assumption that a single binding equivalence class dominates the output contract and that SFT retains parse headroom. However, in the full multi-token, multi-position setting, joint probability mass and token dependencies could alter the location of the extrapolated fixed point relative to the clip boundary. The Amazon Fashion experiments match the predicted windows but do not ablate or isolate this assumption, leaving open whether the reduction remains faithful when lifted.
  2. The three pre-registered tests fall within their locked prediction windows, which is positive. However, to fully support the cross-prediction claim, the paper should report the exact numerical value of the closed-form prediction for the small-clip case and the observed experimental value, rather than stating it matches below grid resolution.
minor comments (2)
  1. The abstract refers to 'the small-clip value matching the closed-form prediction below grid resolution' without providing the specific numbers; including them would improve transparency.
  2. The quantities p, b, and c should be explicitly defined with their measurement procedures in the main body early on for readers to follow the closed-form expression.
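Of the three inputs, the teacher modal probability p is the most mechanical to measure: softmax the teacher's logits at the binding position and take the argmax mass. How p is aggregated across positions, and the precise measurement of b and c, are left to the manuscript; this sketch covers p only.

```python
import math

def modal_probability(logits):
    """Teacher modal probability p at one binding position: the softmax
    mass on the argmax token (numerically stabilised by max-shift)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

# A near-deterministic position: one logit far above the rest.
print(round(modal_probability([12.0, 3.0, 2.5, 0.1]), 4))  # -> 0.9998
```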

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We respond to each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: The derivation of λ*(p,b,c) is performed under a single-position Bernoulli reduction. The subsequent extension to K-ary listwise JSON tasks depends critically on the assumption that a single binding equivalence class dominates the output contract and that SFT retains parse headroom. However, in the full multi-token, multi-position setting, joint probability mass and token dependencies could alter the location of the extrapolated fixed point relative to the clip boundary. The Amazon Fashion experiments match the predicted windows but do not ablate or isolate this assumption, leaving open whether the reduction remains faithful when lifted.

    Authors: The closed-form threshold is derived exactly in the single-position Bernoulli reduction because that setting isolates the dominant modal probability p at the binding position. The extension to K-ary listwise JSON is explicitly conditioned on the two assumptions stated in the manuscript: a single equivalence class dominates the output contract and SFT retains sufficient parse headroom. These assumptions are motivated by the near-deterministic character of the structured tasks we target. While joint token dependencies in the unrestricted multi-token case could in principle shift the fixed point, the Amazon Fashion experiments already operate in the full multi-position JSON regime and reproduce the predicted cliff location. We therefore view the reduction as a faithful first-order model for the regime of interest. That said, we agree that an explicit ablation isolating the dominance assumption would be valuable; we will add a dedicated limitations paragraph discussing the conditions under which the reduction may cease to be accurate and note the absence of such an ablation as an open direction. revision: partial

  2. Referee: The three pre-registered tests fall within their locked prediction windows, which is positive. However, to fully support the cross-prediction claim, the paper should report the exact numerical value of the closed-form prediction for the small-clip case and the observed experimental value, rather than stating it matches below grid resolution.

    Authors: We accept this recommendation. The manuscript currently notes only that the small-clip result lies within grid resolution of the closed-form prediction. In the revision we will state the exact closed-form λ* (computed from the measured teacher modal probability p, warm-start mass b, and clip strength c for that condition) together with the experimentally observed cliff location, allowing readers to judge the numerical agreement directly. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no reduction to inputs by construction

full rationale

The paper presents a closed-form derivation of lambda*(p,b,c) inside an explicit single-position Bernoulli reduction, expressed directly in terms of three externally measurable quantities (teacher modal probability p, warm-start mass b, clip strength c). This is not a fit to data but a mathematical threshold whose inputs are observable outside the model. The extension to K-ary listwise JSON tasks is stated under additional modeling assumptions rather than derived as a necessary consequence. Pre-registered experiments test the predicted windows without the outcome being forced by the equations themselves. No self-citations appear as load-bearing premises, no parameters are fitted then relabeled as predictions, and no ansatz or uniqueness claim is smuggled via prior work. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the Bernoulli reduction for the single-position case and the dominance of one equivalence class in K-ary tasks; these are domain assumptions rather than new entities or fitted free parameters. The three quantities in lambda* are treated as measurable inputs.

axioms (2)
  • domain assumption Single-position Bernoulli reduction models the structured output contract
    Invoked to derive the closed-form base-relative clip-safety threshold lambda*(p,b,c).
  • domain assumption A single binding equivalence class dominates the output contract and SFT retains parse headroom in K-ary listwise JSON tasks
    Required for extending the rule from the Bernoulli case to full structured tasks.

pith-pipeline@v0.9.0 · 5586 in / 1481 out tokens · 60067 ms · 2026-05-12T03:38:43.020903+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 14 internal anchors


    crosses the clip-safe boundary between steps 60 and 70 (parse 0.887→0.675 ). Sub-critical λ⋆=1.22>1.15 does not forbid this crossing; Thm. 4.2 gives a budget-dependent first-passage time and 28 extra steps suffice. E.2 Pre-registered budget-Ntest of Thm. 4.2 The two budget points already in Fig. 4 (N=42 and N=70) support Thm. 4.2’s qualitative leftward- d...