pith. machine review for the scientific record.

arxiv: 2605.08737 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords on-policy distillation · structured outputs · extrapolation threshold · format collapse · clip safety · LLM post-training · JSON tasks · Bernoulli reduction

The pith

In on-policy distillation of structured outputs, extrapolation past a closed-form lambda* switches training from format-preserving to format-collapsing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy distillation with reward extrapolation coefficient lambda can improve student performance beyond the teacher, yet crosses a sharp threshold lambda* where the output format contract is violated on near-deterministic structured tasks. In a single-position Bernoulli reduction the authors derive an exact expression for this clip-safety threshold in terms of the teacher's modal probability, the warm-start mass, and the importance-sampling clip strength. Below the threshold the extrapolated fixed point stays inside the safe region and format adherence is retained; above it the fixed point exits the region and training produces format collapse. The rule is shown to extend to calibrated K-ary listwise JSON tasks in which one binding equivalence class dominates the output contract. Experiments on Amazon Fashion confirm the predicted cliff location with pre-registered tests whose outcomes fall inside their locked windows.
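The importance-sampling clip that defines the safe region can be sketched in a few lines. This is an illustrative stand-in, not the paper's exact rule: the function name and the one-sided upper clip are assumptions.

```python
def clipped_is_weight(p_target: float, p_student: float, c: float) -> float:
    """Per-token importance-sampling ratio rho = target / student,
    clipped from above at strength c (one-sided clip is an assumption)."""
    rho = p_target / p_student
    return min(rho, c)

# Near-deterministic position: the sharpened target puts ~0.999 on the
# modal token, but a student holding only 0.02 there saturates the clip,
# so the update can move mass toward the mode at most at the clipped rate.
print(clipped_is_weight(0.999, 0.02, c=5.0))  # ratio ~50, clipped to 5.0
```

Once the ratio saturates, the student's residual tail can only shrink at a clip-bounded rate, which is consistent with the clipped tail mass (1−p)/c cited in the figures.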

Core claim

Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. In the single-position Bernoulli reduction this threshold is given in closed form by lambda*(p,b,c) using the teacher modal probability p, the warm-start mass b, and the clip strength c. The same boundary governs calibrated K-ary listwise JSON tasks when a single equivalence class dominates and SFT retains parse headroom.

What carries the argument

The clip-safety threshold lambda*(p,b,c) obtained from the single-position Bernoulli reduction of the structured-output dynamics.

If this is right

  • Operating just below lambda* lets a 1.7B Qwen3 student reach in-domain parity with an 8B-SFT baseline while preserving parse validity.
  • NDCG@1 on parsed outputs stays flat across lambda while parse validity changes sharply at the predicted boundary.
  • The cliff location is independent of the downstream rubric and can be read directly from the three measurable quantities.
  • Small-clip cross-prediction reproduces the closed-form value below grid resolution.
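The cliff-reading procedure behind these bullets can be sketched as a simple grid diagnostic. The parse rates below are placeholder numbers shaped like the Fig. 3 sweep, not values from the paper's tables.

```python
# Locating the parse-validity cliff from a lambda sweep (illustrative data).
lambdas     = [1.00, 1.05, 1.10, 1.15, 1.20, 1.25, 1.30, 1.35]
parse_rates = [0.96, 0.95, 0.94, 0.92, 0.55, 0.12, 0.08, 0.05]

# The cliff interval is the grid step with the largest drop in parse rate.
drops = [parse_rates[i] - parse_rates[i + 1] for i in range(len(lambdas) - 1)]
k = max(range(len(drops)), key=drops.__getitem__)
cliff_interval = (lambdas[k], lambdas[k + 1])

lam_star = 1.22  # closed-form prediction, as reported in Fig. 1
inside = cliff_interval[0] <= lam_star <= cliff_interval[1]
print(cliff_interval, inside)  # -> (1.2, 1.25) True
```

Because the diagnostic reads only parse validity, not the downstream rubric, the same sweep works for any contract-based evaluation.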

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same Bernoulli-derived cliff may appear in any structured generation setting whose output contract is dominated by a single equivalence class.
  • Choosing lambda immediately below the predicted threshold offers a practical rule for safe extrapolation without exhaustive search.
  • Because the cliff diagnostic does not rely on the evaluator rubric, it can be applied to any parse-based or contract-based evaluation.
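A minimal version of the rubric-independent, parse-based diagnostic, assuming a hypothetical contract of a JSON array of K distinct item indices (the paper's exact Fashion schema is not reproduced here):

```python
import json

K = 8  # list length; this schema is a stand-in for the binding contract

def strict_parse_ok(text: str, k: int = K) -> bool:
    """Rubric-free check: does the output satisfy the format contract?"""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, list) and len(obj) == k
            and all(isinstance(x, int) for x in obj)
            and len(set(obj)) == k)

print(strict_parse_ok("[3, 1, 7, 0, 5, 2, 6, 4]"))  # True
print(strict_parse_ok("[3, 1, 7, 0, 5, 2, 6"))      # False: truncated
```

Averaging this boolean over a validation batch yields the strict parse rate plotted in the figures.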

Load-bearing premise

The single-position Bernoulli reduction together with the dominance of one equivalence class accurately captures the contract dynamics of the full K-ary listwise JSON output.

What would settle it

If a fine-grid experiment on the Amazon Fashion task places the observed format-collapse point outside the interval predicted by lambda*(p,b,c) at the stated grid resolution, the closed-form threshold is falsified.
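The falsification criterion reduces to a one-line interval check; the λ*, observed collapse point, and grid step below are hypothetical stand-ins, not the paper's measured values.

```python
def threshold_consistent(lam_star: float, observed: float, grid: float) -> bool:
    """The closed form survives if the observed collapse point lies within
    one grid step of the predicted lambda*; otherwise it is falsified."""
    return abs(observed - lam_star) <= grid

print(threshold_consistent(lam_star=1.22, observed=1.20, grid=0.05))  # True
print(threshold_consistent(lam_star=1.22, observed=1.40, grid=0.05))  # False
```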

Figures

Figures reproduced from arXiv: 2605.08737 by Annan Wang, Chau Yuen, Hao Jiang, Xin Li, Yichi Zhang.

Figure 1. The extrapolation cliff in miniature. Left: the IS-clip-safe geometry; the sharpened fixed point exits at the base-neutral marker λ⋆(p_typ=0.9993, c=5)=1.22 (the full base-relative prediction bracket [λ⋆_safe, λ⋆_typ] = [1.18, 1.28] at b=0.81 is in Tab. 1). Right: strict parse rate on Fashion K=8 listwise (Qwen3 1.7B×4B, N=212) collapses in the same [1.18, 1.28] band; the clipped tail mass (1−p)/c enforc…

Figure 2. ListOPD pipeline. Left: teacher π_T at a scaffolding position concentrates on a modal token; base-relative extrapolation sharpens this target as λ grows. Middle: the student rolls out the listwise JSON token by token. Right: the per-token IS ratio ρ_t is clipped at c; scaffolding positions whose asymptotic fixed point sits in the clip-unsafe region (×) drift to parse collapse, the regime characterised by Thm. 4.…

Figure 3. Closed-form clip-safe threshold. Left: Fashion λ-sweep (1.7B×4B, 3-epoch); strict parse and FMC (K−1 truncation) transition in [1.15, 1.25], around the base-neutral λ⋆=1.22. Right: USEFUL for SFT vs. ListOPD across 0.6B–8B Qwen3 students; sub-threshold ListOPD flattens the size curve to USEFUL ∈ [0.873, 0.897] (0.6B / 8B single-seed).

Figure 4. Finite-N first-passage at λ=1.15. Thin blue: 5-seed strict parse-rate trajectories at 42 steps (mean 0.921 ± 0.019, near the sub-critical boundary). Thick red: the same λ extended to 70 steps (seed 42) crosses the clip-safe boundary between steps 60 and 70 (parse 0.887 → 0.675). Sub-critical λ⋆=1.22 > 1.15 does not forbid this crossing; Thm. 4.2 gives a budget-dependent first-passage time, and 28 extra steps s…

Figure 5. Thm. 4.2 budget-N pre-registered test. Left: observed parse rate vs. λ at three Fashion 1.7B×4B OPD budgets (N=42 blue, N=70 medium blue, N=200 red); the cliff curve shifts visibly leftward as N grows. The λ=1.05 red point shows the multi-seed mean (where computed). Right: cliff midpoint vs. N in linear-interpolated coordinates; the dashed line is the two-point 1/N fit committed in the prereg, the green band is th…

Figure 6. Teacher/student peak IS ratio max_t ρ_t^TS along λ and c. Left: the final-step peak pre-clip teacher/student IS ratio climbs from ≈9 at λ=1.0 to 30.9 at λ=1.4 (one grid step past the cliff midpoint) and collapses to ≈5 post-cliff, the boundary-seeking flow of Thm. 4.1 followed by the post-cliff degenerate regime. Right: the peak ratio across c ∈ {1.5, 2, 5, ∞} at fixed λ=1.15, N=42 is non-monotone (45, 4, 5, 13);…

Figure 7. Multi-λ first-passage trajectory and IS mechanism trace at N=200. Top: strict val parse rate vs. optimizer step at λ ∈ {1.00, 1.05, 1.10}, single seed 42, with the green errorbar showing the 3-seed mean at λ=1.05, step 196 (0.742 ± 0.107). λ=1.00 stays clip-safe; λ=1.05 crosses the 0.90 band in [80, 120] and lands at 0.703; λ=1.10 crosses earlier and collapses to 0.500. Bottom: per-step peak training-side I…

Figure 8. Per-step val-reward trajectories for the 8-point GSM8K …
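The two-point 1/N extrapolation committed in the Fig. 5 prereg can be sketched as follows; the budgets and midpoints here are placeholder numbers, not the paper's.

```python
def one_over_n_fit(n1, m1, n2, m2):
    """Fit m(N) = a + b/N through two (budget, cliff-midpoint) points and
    return (a, b); a is the extrapolated infinite-budget midpoint."""
    x1, x2 = 1.0 / n1, 1.0 / n2
    b = (m1 - m2) / (x1 - x2)
    a = m1 - b * x1
    return a, b

# Illustrative budgets in the spirit of Fig. 4's N=42 and N=70 points;
# the midpoint shifts leftward as N grows, matching Fig. 5's trend.
a, b = one_over_n_fit(42, 1.20, 70, 1.15)
print(round(a, 3), round(b, 2))  # -> 1.075 5.25
```

With a third budget (N=200 in the paper) the fit becomes a prediction that the locked window can confirm or falsify.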
Original abstract

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a derivation of a closed-form clip-safety threshold λ*(p,b,c) for on-policy distillation in a single-position Bernoulli reduction, where p is the teacher modal probability, b the warm-start mass, and c the importance-sampling clip strength. It argues that exceeding this threshold causes the extrapolated fixed point to exit the clip-safe region, shifting from format-preserving to format-collapsing behavior in structured outputs. The rule is extended to calibrated K-ary listwise JSON tasks assuming a dominating equivalence class and retained parse headroom from SFT. Pre-registered experiments on Amazon Fashion, including a fine-grid cliff interval, budget-extension test, and small-clip cross-prediction, confirm the predictions, with a 1.7B Qwen3 student achieving in-domain parity with an 8B-SFT baseline primarily through improved format adherence.

Significance. Should the result hold, the work provides a useful, parameter-light diagnostic for safe reward extrapolation in OPD on near-deterministic structured tasks, which is increasingly relevant for LLM post-training. Strengths include the closed-form expression depending only on externally measurable quantities, the pre-registered nature of the tests, and the empirical match including exact prediction for small-clip. The observation that NDCG remains flat while parse validity changes at the boundary offers clear evidence for the mechanism. This could help practitioners avoid format collapse while gaining from extrapolation.

major comments (2)
  1. The derivation of λ*(p,b,c) is performed under a single-position Bernoulli reduction. The subsequent extension to K-ary listwise JSON tasks depends critically on the assumption that a single binding equivalence class dominates the output contract and that SFT retains parse headroom. However, in the full multi-token, multi-position setting, joint probability mass and token dependencies could alter the location of the extrapolated fixed point relative to the clip boundary. The Amazon Fashion experiments match the predicted windows but do not ablate or isolate this assumption, leaving open whether the reduction remains faithful when lifted.
  2. The three pre-registered tests fall within their locked prediction windows, which is positive. However, to fully support the cross-prediction claim, the paper should report the exact numerical value of the closed-form prediction for the small-clip case and the observed experimental value, rather than stating it matches below grid resolution.
minor comments (2)
  1. The abstract refers to 'the small-clip value matching the closed-form prediction below grid resolution' without providing the specific numbers; including them would improve transparency.
  2. The quantities p, b, and c should be explicitly defined with their measurement procedures in the main body early on for readers to follow the closed-form expression.
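Of the three inputs, the teacher modal probability p is the most mechanical to measure: softmax the teacher's logits at the binding position and take the argmax mass. How p is aggregated across positions, and the precise measurement of b and c, are left to the manuscript; this sketch covers p only.

```python
import math

def modal_probability(logits):
    """Teacher modal probability p at one binding position: the softmax
    mass on the argmax token (numerically stabilised by max-shift)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

# A near-deterministic position: one logit far above the rest.
print(round(modal_probability([12.0, 3.0, 2.5, 0.1]), 4))  # -> 0.9998
```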

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We respond to each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: The derivation of λ*(p,b,c) is performed under a single-position Bernoulli reduction. The subsequent extension to K-ary listwise JSON tasks depends critically on the assumption that a single binding equivalence class dominates the output contract and that SFT retains parse headroom. However, in the full multi-token, multi-position setting, joint probability mass and token dependencies could alter the location of the extrapolated fixed point relative to the clip boundary. The Amazon Fashion experiments match the predicted windows but do not ablate or isolate this assumption, leaving open whether the reduction remains faithful when lifted.

    Authors: The closed-form threshold is derived exactly in the single-position Bernoulli reduction because that setting isolates the dominant modal probability p at the binding position. The extension to K-ary listwise JSON is explicitly conditioned on the two assumptions stated in the manuscript: a single equivalence class dominates the output contract and SFT retains sufficient parse headroom. These assumptions are motivated by the near-deterministic character of the structured tasks we target. While joint token dependencies in the unrestricted multi-token case could in principle shift the fixed point, the Amazon Fashion experiments already operate in the full multi-position JSON regime and reproduce the predicted cliff location. We therefore view the reduction as a faithful first-order model for the regime of interest. That said, we agree that an explicit ablation isolating the dominance assumption would be valuable; we will add a dedicated limitations paragraph discussing the conditions under which the reduction may cease to be accurate and note the absence of such an ablation as an open direction. revision: partial

  2. Referee: The three pre-registered tests fall within their locked prediction windows, which is positive. However, to fully support the cross-prediction claim, the paper should report the exact numerical value of the closed-form prediction for the small-clip case and the observed experimental value, rather than stating it matches below grid resolution.

    Authors: We accept this recommendation. The manuscript currently notes only that the small-clip result lies within grid resolution of the closed-form prediction. In the revision we will state the exact closed-form λ* (computed from the measured teacher modal probability p, warm-start mass b, and clip strength c for that condition) together with the experimentally observed cliff location, allowing readers to judge the numerical agreement directly. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no reduction to inputs by construction

full rationale

The paper presents a closed-form derivation of lambda*(p,b,c) inside an explicit single-position Bernoulli reduction, expressed directly in terms of three externally measurable quantities (teacher modal probability p, warm-start mass b, clip strength c). This is not a fit to data but a mathematical threshold whose inputs are observable outside the model. The extension to K-ary listwise JSON tasks is stated under additional modeling assumptions rather than derived as a necessary consequence. Pre-registered experiments test the predicted windows without the outcome being forced by the equations themselves. No self-citations appear as load-bearing premises, no parameters are fitted then relabeled as predictions, and no ansatz or uniqueness claim is smuggled via prior work. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the Bernoulli reduction for the single-position case and the dominance of one equivalence class in K-ary tasks; these are domain assumptions rather than new entities or fitted free parameters. The three quantities in lambda* are treated as measurable inputs.

axioms (2)
  • domain assumption Single-position Bernoulli reduction models the structured output contract
    Invoked to derive the closed-form base-relative clip-safety threshold lambda*(p,b,c).
  • domain assumption A single binding equivalence class dominates the output contract and SFT retains parse headroom in K-ary listwise JSON tasks
    Required for extending the rule from the Bernoulli case to full structured tasks.

pith-pipeline@v0.9.0 · 5586 in / 1481 out tokens · 60067 ms · 2026-05-12T03:38:43.020903+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 14 internal anchors


    crosses the clip-safe boundary between steps 60 and 70 (parse 0.887→0.675 ). Sub-critical λ⋆=1.22>1.15 does not forbid this crossing; Thm. 4.2 gives a budget-dependent first-passage time and 28 extra steps suffice. E.2 Pre-registered budget-Ntest of Thm. 4.2 The two budget points already in Fig. 4 (N=42 and N=70) support Thm. 4.2’s qualitative leftward- d...