PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Bo Zhao; Jiang Bian; Junjie Li; Lei Song; Rui Wang; Shizhao Sun; Xumeng Wen; Yang Tian

arxiv: 2606.09348 · v1 · pith:QM5VVB33new · submitted 2026-06-08 · 💻 cs.LG · cs.CL

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

Pith reviewed 2026-06-27 17:20 UTC · model grok-4.3

keywords: credit assignment, reinforcement learning, self-distillation, long-horizon tasks, policy optimization, Bayesian methods, agentic tasks, privileged information

The pith

PBSD turns sparse outcome rewards into Bayes-calibrated turn-level credit signals via privileged self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PBSD to solve the credit assignment problem in long-horizon reinforcement learning tasks with only final outcome rewards. It employs Bayes' rule to re-express the quality of a trajectory as a likelihood ratio between a student policy and an answer-conditioned teacher model. This ratio is then decomposed autoregressively to assign credit to each turn. A reader would care because this provides a principled way to guide policy updates in complex agent behaviors without requiring dense rewards.

Core claim

PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome.
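The conversion described in the core claim can be written out as a short derivation. This is a sketch under assumed notation, not the paper's own equations: $q$ is the task, $a$ the verified answer, and $\tau = (y_1, \dots, y_K)$ the trajectory viewed as a sequence of $K$ turns. How environment observations enter the trajectory is not specified in the abstract and is left out here.

```latex
% Bayes' rule: the answer-side ratio equals a trajectory-side likelihood ratio.
\[
\frac{P(a \mid q, \tau)}{P(a \mid q)}
  = \frac{P(\tau \mid q, a)}{P(\tau \mid q)}
\]

% Chain rule over turns: the log-ratio splits into one term per turn.
\[
\log \frac{P(\tau \mid q, a)}{P(\tau \mid q)}
  = \sum_{k=1}^{K} \log
    \frac{P(y_k \mid q, y_{<k}, a)}{P(y_k \mid q, y_{<k})}
\]
```

Under this reading, the numerator of each term would be scored by the answer-conditioned teacher and the denominator by the student, so a positive term marks a turn that becomes more likely once the answer is known. Both identities are exact for true probabilities; any gap between the models and those probabilities is an approximation error.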

What carries the argument

The likelihood ratio between the student model and the privileged answer-conditioned teacher model, which serves as the Bayesian evidence score for reweighting trajectories.

If this is right

Provides turn-level credit signals compatible with standard policy optimization.
Enhances performance across in-domain and out-of-domain settings.
Enables effective transfer from short-context training to long-context inference.
Identifies supporting and undermining actions in successful and failed trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could apply to other sparse-reward settings in RL where privileged information is accessible during training.
Bayesian decomposition might help in other multi-step decision processes beyond agents.
Similar ratios could extend to sequence generation tasks without explicit agents.

Load-bearing premise

The privileged answer-conditioned teacher model yields a tractable and unbiased likelihood ratio that produces valid turn-level credit signals without systematic biases from the conditioning or model mismatch.

What would settle it

If controlled experiments comparing PBSD's signals against baseline outcome-only supervision on long-horizon tasks showed no consistent performance gains, or showed losses, the claimed effectiveness of the credit assignment would be falsified.

Original abstract

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PBSD tries to fix long-horizon credit assignment with a privileged teacher and Bayes decomposition, but the conditioning likely injects hindsight that the abstract does not address.

The core idea is to turn a final answer reward into per-turn credits by taking the ratio of a privileged answer-conditioned teacher to the student policy, then breaking that ratio into a product over turns. The paper shows this works in practice: experiments report gains on both in-domain and out-of-domain tasks plus some transfer when training on short contexts and testing on long ones.
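As a rough illustration of the product-over-turns idea described above, the following Python sketch sums token-level log-ratios within each turn. The function name `turn_log_ratios`, the input layout, and the teacher-minus-student sign convention are assumptions made for illustration; the paper's actual implementation is not shown in the available text, and the numbers are invented.

```python
from collections import defaultdict


def turn_log_ratios(student_logps, teacher_logps, turn_ids):
    """Sum token-level log-ratios (teacher minus student) within each turn.

    student_logps[i]: log-prob of token i under the student (answer not in context).
    teacher_logps[i]: log-prob of the same token under the answer-conditioned teacher.
    turn_ids[i]:      index of the turn that token i belongs to.
    """
    if not (len(student_logps) == len(teacher_logps) == len(turn_ids)):
        raise ValueError("inputs must have the same length")
    credit = defaultdict(float)
    for s, t, k in zip(student_logps, teacher_logps, turn_ids):
        credit[k] += t - s
    return dict(sorted(credit.items()))


# Toy values, invented for illustration only.
student = [-1.0, -2.0, -0.5, -1.5]
teacher = [-0.5, -1.0, -1.0, -2.5]
turns = [0, 0, 1, 1]

per_turn = turn_log_ratios(student, teacher, turns)
print(per_turn)  # {0: 1.5, 1: -1.5}
```

In this toy case the first turn is more likely once the answer is known and the second is less likely. The per-turn values always sum to the trajectory-level log-ratio, which is what makes the decomposition a redistribution of one score rather than a set of independent estimates.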

That experimental pattern is the main positive. It suggests the derived signals are at least directionally useful for policy updates.

The soft spot is exactly the one flagged in the stress-test note. Conditioning the teacher on the verified answer gives it outcome information the student never sees at generation time. Early turns can therefore get credit or blame based on consistency with the known answer rather than causal contribution. The abstract calls the result Bayes-calibrated without showing a correction term or mismatch analysis, so the unbiased claim rests on an assumption that is not obviously true. Without the equations or controls in front of me it is impossible to tell how large the bias is.

This paper is aimed at people working on RL for multi-turn agents and tool-use systems. Anyone already thinking about credit assignment will find the experimental results worth looking at, even if the derivation needs more scrutiny.

It is worth sending to referees because the underlying problem is real and the experimental signal is concrete enough to check.

Referee Report

2 major / 1 minor

Summary. The paper proposes PBSD, a method for long-horizon credit assignment in outcome-based RL. It converts the posterior-to-prior ratio over a verified answer into a tractable likelihood ratio between a standard student policy and a privileged answer-conditioned teacher via Bayes' rule, then autoregressively decomposes the ratio into per-turn credit signals. These signals are used to reweight trajectories for standard policy optimization. Experiments report consistent gains in in-domain and out-of-domain settings plus improved short-to-long context transfer.

Significance. If the central derivation is free of systematic bias from the privileged conditioning, PBSD would supply a parameter-free, Bayes-calibrated reweighting scheme that turns sparse final rewards into fine-grained turn-level credits. This could meaningfully advance credit assignment for multi-turn agents. The reported experimental improvements and transfer results would then constitute useful evidence of practical utility.

major comments (2)

[Abstract / §3] Abstract and §3 (method description): the central claim that the autoregressive decomposition of the likelihood ratio P(trajectory | answer)/P(trajectory) yields unbiased turn-level credits rests on an unverified assumption that the privileged teacher approximation introduces no systematic hindsight bias. The provided text supplies no equations, no correction term, and no analysis of the mismatch between the answer-conditioned teacher and the student at generation time.
The stress-test concern lands: early-turn probabilities under the privileged teacher can reflect consistency with the known answer rather than causal support for the outcome. Without an explicit bias analysis or empirical control (e.g., comparison against an oracle teacher or ablation removing answer conditioning), the 'Bayes-calibrated' property cannot be confirmed and the credit signals may be circular by construction.

minor comments (1)

[Experiments] The abstract states that PBSD 'remains fully compatible with standard policy optimization' but does not specify which RL algorithm or loss is used in the experiments; this should be stated explicitly in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater rigor around the privileged-teacher approximation in PBSD. We address each major comment below and commit to strengthening the theoretical and empirical treatment of potential bias in the revision.

Referee: [Abstract / §3] Abstract and §3 (method description): the central claim that the autoregressive decomposition of the likelihood ratio P(trajectory | answer)/P(trajectory) yields unbiased turn-level credits rests on an unverified assumption that the privileged teacher approximation introduces no systematic hindsight bias. The provided text supplies no equations, no correction term, and no analysis of the mismatch between the answer-conditioned teacher and the student at generation time.

Authors: We agree that §3 presents the Bayes-rule conversion and the subsequent autoregressive decomposition via the chain rule but does not supply an explicit bias analysis or correction term for the privileged conditioning. The derivation itself is exact under the modeling assumption that the teacher approximates the answer-conditioned distribution; any mismatch between teacher and student at generation time is therefore an approximation error whose effect on credit calibration is not quantified in the current text. We will add (i) the full set of equations showing the exact Bayes conversion and the per-turn decomposition, (ii) a dedicated paragraph discussing the hindsight-bias concern and the conditions under which the approximation remains calibrated, and (iii) a new ablation that trains a non-privileged teacher (answer conditioning removed) to measure the resulting change in credit-signal quality.
Revision: yes

Referee: [—] The stress-test concern lands: early-turn probabilities under the privileged teacher can reflect consistency with the known answer rather than causal support for the outcome. Without an explicit bias analysis or empirical control (e.g., comparison against an oracle teacher or ablation removing answer conditioning), the 'Bayes-calibrated' property cannot be confirmed and the credit signals may be circular by construction.

Authors: The concern is valid: because the teacher is conditioned on the verified answer, its early-turn probabilities necessarily incorporate future information, which is precisely what allows the ratio to serve as a credit signal. This is by design in the Bayesian framing, yet it does open the possibility of non-causal consistency effects. The manuscript currently offers no oracle-teacher comparison or ablation that isolates the contribution of answer conditioning. We will therefore add the suggested ablation (privileged teacher vs. answer-removed teacher) on the same trajectories and report both the resulting credit-signal statistics and downstream policy performance. If the ablation shows substantial degradation when conditioning is removed, we will qualify the 'Bayes-calibrated' claim accordingly in the revised text.
Revision: yes
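
The credit-signal statistics promised for the ablation could be as simple as the following Python sketch. It is a hypothetical summary, not a procedure taken from the paper: the function name `compare_credit_signals`, the chosen statistics, and the example values are all assumptions. Note one certain edge case: if the answer-removed teacher shares weights and context with the student, every per-turn log-ratio is exactly zero, so that baseline carries no credit signal at all.

```python
def compare_credit_signals(privileged, ablated):
    """Summarize how per-turn credits change when answer conditioning is removed.

    privileged[i]: credit for turn i from the answer-conditioned teacher.
    ablated[i]:    credit for the same turn from the answer-removed teacher.
    """
    if len(privileged) != len(ablated) or not privileged:
        raise ValueError("need two non-empty lists of equal length")
    n = len(privileged)
    # Fraction of turns on which both signals agree about "positive credit".
    agree = sum((p > 0) == (a > 0) for p, a in zip(privileged, ablated))
    return {
        "mean_abs_privileged": sum(abs(x) for x in privileged) / n,
        "mean_abs_ablated": sum(abs(x) for x in ablated) / n,
        "sign_agreement": agree / n,
    }


# Invented values; the ablated list mimics the degenerate shared-weights case.
stats = compare_credit_signals([1.5, -1.5, 0.5, -0.5], [0.0, 0.0, 0.0, 0.0])
print(stats)
```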

Circularity Check

0 steps flagged

No significant circularity; derivation applies exact Bayes identity to define credits

The paper's central derivation applies Bayes' rule to equate the posterior-to-prior answer ratio with a likelihood ratio between student and privileged teacher models, then decomposes the latter autoregressively into turn-level terms. This is an exact mathematical identity (P(A|T)/P(A) = P(T|A)/P(T)), not a fitted parameter renamed as prediction, not a self-citation load-bearing claim, and not a self-definitional loop. The privileged teacher is an explicit modeling choice in the method rather than an unverified assumption smuggled in; the resulting reweighting scheme supplies independent content for credit assignment even if downstream bias from conditioning remains a separate correctness question. No equations reduce the output to the input by construction, and no self-citations are invoked for uniqueness or ansatz. The derivation is therefore self-contained against external benchmarks.
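
The identity quoted above, P(A|T)/P(A) = P(T|A)/P(T), can be checked numerically on a small discrete example. The joint distribution below is invented purely for illustration and uses exact fractions to avoid rounding; it says nothing about whether learned teacher and student models satisfy the identity.

```python
from fractions import Fraction as F

# Hypothetical joint distribution over (trajectory T, answer A); values invented.
joint = {
    ("t1", "a1"): F(3, 10),
    ("t1", "a2"): F(1, 10),
    ("t2", "a1"): F(1, 10),
    ("t2", "a2"): F(5, 10),
}


def marginal(index, value):
    """Marginal probability of T (index 0) or A (index 1) taking the given value."""
    return sum(p for key, p in joint.items() if key[index] == value)


def answer_side_ratio(t, a):
    """P(A=a | T=t) / P(A=a): posterior-to-prior ratio of the answer."""
    return (joint[(t, a)] / marginal(0, t)) / marginal(1, a)


def trajectory_side_ratio(t, a):
    """P(T=t | A=a) / P(T=t): likelihood ratio of the trajectory."""
    return (joint[(t, a)] / marginal(1, a)) / marginal(0, t)


for t, a in joint:
    assert answer_side_ratio(t, a) == trajectory_side_ratio(t, a)

print(answer_side_ratio("t1", "a1"))  # 15/8
```

Both sides reduce to P(T, A) / (P(T) P(A)), which is why the identity holds for any joint distribution with nonzero marginals.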

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all such elements are unknown from available text.



[1] [1]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[2] [3]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

Pith/arXiv arXiv 2024

[3] [4]

Tongyi deepresearch technical report

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025

Pith/arXiv arXiv 2025

[4] [5]

Dr-venus: Towards frontier edge-scale deep research agents with only 10k open data

Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yusheng Song, Guoqing Wang, Xiaofeng Wu, Yuqi Zhou, Shuo Yang, Zhenzhe Ying, et al. Dr-venus: Towards frontier edge-scale deep research agents with only 10k open data. arXiv preprint arXiv:2604.19859, 2026

Pith/arXiv arXiv 2026

[5] [6]

Opensearch-vl: An open recipe for frontier multimodal search agents

Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, and Tianyu Pang. Opensearch-vl: An open recipe for frontier multimodal search agents. arXiv preprint arXiv:2605.05185, 2026

Pith/arXiv arXiv 2026

[6] [7]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

Pith/arXiv arXiv 2026

[7] [8]

Mind deepresearch technical report.arXiv preprint arXiv:2604.14518, 2026

MindDR Team and Li Auto Inc. Mind deepresearch technical report.arXiv preprint arXiv:2604.14518, 2026

Pith/arXiv arXiv 2026

[8] [9]

Tree search for llm agent reinforcement learning.arXiv preprint arXiv:2509.21240, 2025

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for llm agent reinforcement learning.arXiv preprint arXiv:2509.21240, 2025

arXiv 2025

[9] [10]

Treerpo: Tree relative policy optimization

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. Treerpo: Tree relative policy optimization. arXiv preprint arXiv:2506.05183, 2025

arXiv 2025

[10] [11]

On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025

[11] [12]

Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

Pith/arXiv arXiv 2026

[12] [13]

Self-distilled reasoner: On-policy self-distillation for large language models, 2026

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026. URLhttps://arxiv.org/abs/2601.18734

Pith/arXiv arXiv 2026

[13] [14]

Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

Pith/arXiv arXiv 2026

[14] [15]

Criticsearch: Fine-grained credit assignment for search agents via a retrospective critic.arXiv preprint arXiv:2511.12159, 2025

Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, and Dongbin Zhao. Criticsearch: Fine-grained credit assignment for search agents via a retrospective critic.arXiv preprint arXiv:2511.12159, 2025

arXiv 2025

[15] [16]

Rubricem: Meta-rl with rubric-guided policy decomposition beyond verifiable rewards

Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T Le, Rujun Han, George Lee, Hanghang Tong, et al. Rubricem: Meta-rl with rubric-guided policy decomposition beyond verifiable rewards. arXiv preprint arXiv:2605.10899, 2026

Pith/arXiv arXiv 2026

[16] [17]

Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, and Yunzhong He. Reward hacking in rubric-based reinforcement learning. arXiv preprint arXiv:2605.12474, 2026

[17] [18]

Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025

[18] [19]

Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, et al. Reinforcing multi-turn reasoning in llm agents via turn-level reward design. arXiv preprint arXiv:2505.11821, 2025

[19] [20]

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW

[20] [21]

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026. URL https://arxiv.org/abs/2603.25562

[21] [22]

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802

[22] [23]

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

[23] [24]

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. In ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving, 2026. URL https://openreview.net/forum?id=HlWA3V6iKF

[24] [25]

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. In The 1st Workshop on Scaling Post-training for LLMs, 2026. URL https://openreview.net/forum?id=FbJu6NEBQR

[25] [26]

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433, 2026

[26] [27]

MiroMind Team, S Bai, L Bing, L Lei, R Li, X Li, X Lin, E Min, L Su, B Wang, et al. Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification. arXiv preprint arXiv:2603.15726, 2026

[27] [28]

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, et al. Mirothinker: Pushing the performance boundaries of open-source research agents via model, context, and interactive scaling. arXiv preprint arXiv:2511.11793, 2025

[28] [29]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

[29] [30]

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 414–431, 2025

[30] [31]

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing moe reinforcement learning by aligning training and inference routers. arXiv e-prints, pages arXiv–2510, 2025

[31] [32]

Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, and Siheng Chen. Openseeker: Democratizing frontier search agents by fully open-sourcing training data. arXiv preprint arXiv:2603.15594, 2026

[32] [33]

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

[33] [34]

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese. arXiv e-prints, pages arXiv–2504, 2025

[34] [35]

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

[35] [36]

Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025

[36] [37]

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

[37] [38]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[38] [39]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

[39] [40]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

[40] [41]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024