arxiv: 2606.09348 · v1 · pith:QM5VVB33new · submitted 2026-06-08 · 💻 cs.LG · cs.CL

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Pith reviewed 2026-06-27 17:20 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords credit assignment · reinforcement learning · self-distillation · long-horizon tasks · policy optimization · Bayesian methods · agentic tasks · privileged information

The pith

PBSD turns sparse outcome rewards into Bayes-calibrated turn-level credit signals via privileged self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PBSD to solve the credit assignment problem in long-horizon reinforcement learning tasks with only final outcome rewards. It employs Bayes' rule to re-express the quality of a trajectory as a likelihood ratio between a student policy and an answer-conditioned teacher model. This ratio is then decomposed autoregressively to assign credit to each turn. A reader would care because this provides a principled way to guide policy updates in complex agent behaviors without requiring dense rewards.

Core claim

PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome.
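
In symbols, the identity behind this claim can be sketched as follows. The notation is an assumption of this note, since the abstract gives no equations: x is the task prompt, a the verified answer, and \tau = (y_1, \dots, y_T) the trajectory split into T turns. The numerator on the right is what an answer-conditioned teacher would score, and the denominator is what the student would score.

```latex
\begin{align}
\frac{p(a \mid x, \tau)}{p(a \mid x)}
  &= \frac{p(\tau \mid x, a)}{p(\tau \mid x)}
  && \text{(Bayes' rule)} \\
\log \frac{p(\tau \mid x, a)}{p(\tau \mid x)}
  &= \sum_{t=1}^{T} \log \frac{p(y_t \mid x, a, y_{<t})}{p(y_t \mid x, y_{<t})}
  && \text{(chain rule over turns)}
\end{align}
```

Under this reading, a positive term in the sum marks a turn that is more likely once the answer is known, and a negative term marks a turn that is less likely.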

What carries the argument

The likelihood ratio between the student model and the privileged answer-conditioned teacher model, which serves as the Bayesian evidence score for reweighting trajectories.
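
A minimal sketch of how such a score could be computed per turn, assuming token log-probabilities for the same trajectory are available from both models. The function name `turn_log_ratios` and all numbers are hypothetical illustrations, not taken from the paper.

```python
import math
from typing import List, Sequence


def turn_log_ratios(
    teacher_logps: Sequence[Sequence[float]],
    student_logps: Sequence[Sequence[float]],
) -> List[float]:
    """Return one log-likelihood-ratio score per turn.

    Each inner sequence holds the token log-probabilities of one turn, scored
    once with the answer in context (teacher) and once without (student).
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("teacher and student must cover the same turns")
    scores = []
    for teacher_turn, student_turn in zip(teacher_logps, student_logps):
        if len(teacher_turn) != len(student_turn):
            raise ValueError("each turn must have the same tokens under both models")
        # Sum of per-token log ratios = log ratio of the whole turn.
        scores.append(math.fsum(t - s for t, s in zip(teacher_turn, student_turn)))
    return scores


# Made-up log-probabilities for a three-turn trajectory (illustration only).
teacher = [[-0.5, -0.2], [-1.5], [-0.1, -0.1, -0.3]]
student = [[-0.9, -0.4], [-0.7], [-0.2, -0.4, -0.3]]

scores = turn_log_ratios(teacher, student)
trajectory_score = math.fsum(scores)  # per-turn scores add up to the trajectory log ratio
for turn, score in enumerate(scores, start=1):
    label = "supports" if score > 0 else "does not support"
    print(f"turn {turn}: {score:+.2f} ({label} the verified answer)")
print(f"trajectory: {trajectory_score:+.2f}")
```

Working in log space keeps the per-turn scores additive, so they sum to the trajectory-level score.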

If this is right

  • Provides turn-level credit signals compatible with standard policy optimization.
  • Enhances performance across in-domain and out-of-domain settings.
  • Enables effective transfer from short-context training to long-context inference.
  • Identifies supporting and undermining actions in successful and failed trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could apply to other sparse-reward settings in RL where privileged information is accessible during training.
  • Bayesian decomposition might help in other multi-step decision processes beyond agents.
  • Similar ratios could extend to sequence generation tasks without explicit agents.

Load-bearing premise

The privileged answer-conditioned teacher model yields a tractable and unbiased likelihood ratio that produces valid turn-level credit signals without systematic biases from the conditioning or model mismatch.

What would settle it

The claim would be falsified if experiments comparing PBSD's turn-level signals against plain outcome supervision on long-horizon tasks showed no consistent performance gains, or showed losses.

Original abstract

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes PBSD, a method for long-horizon credit assignment in outcome-based RL. It converts the posterior-to-prior ratio over a verified answer into a tractable likelihood ratio between a standard student policy and a privileged answer-conditioned teacher via Bayes' rule, then autoregressively decomposes the ratio into per-turn credit signals. These signals are used to reweight trajectories for standard policy optimization. Experiments report consistent gains in in-domain and out-of-domain settings plus improved short-to-long context transfer.

Significance. If the central derivation is free of systematic bias from the privileged conditioning, PBSD would supply a parameter-free, Bayes-calibrated reweighting scheme that turns sparse final rewards into fine-grained turn-level credits. This could meaningfully advance credit assignment for multi-turn agents. The reported experimental improvements and transfer results would then constitute useful evidence of practical utility.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (method description): the central claim that the autoregressive decomposition of the likelihood ratio P(trajectory | answer)/P(trajectory) yields unbiased turn-level credits rests on an unverified assumption that the privileged teacher approximation introduces no systematic hindsight bias. The provided text supplies no equations, no correction term, and no analysis of the mismatch between the answer-conditioned teacher and the student at generation time.
  2. The stress-test concern lands: early-turn probabilities under the privileged teacher can reflect consistency with the known answer rather than causal support for the outcome. Without an explicit bias analysis or empirical control (e.g., comparison against an oracle teacher or ablation removing answer conditioning), the 'Bayes-calibrated' property cannot be confirmed and the credit signals may be circular by construction.
minor comments (1)
  1. [Experiments] The abstract states that PBSD 'remains fully compatible with standard policy optimization' but does not specify which RL algorithm or loss is used in the experiments; this should be stated explicitly in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater rigor around the privileged-teacher approximation in PBSD. We address each major comment below and commit to strengthening the theoretical and empirical treatment of potential bias in the revision.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method description): the central claim that the autoregressive decomposition of the likelihood ratio P(trajectory | answer)/P(trajectory) yields unbiased turn-level credits rests on an unverified assumption that the privileged teacher approximation introduces no systematic hindsight bias. The provided text supplies no equations, no correction term, and no analysis of the mismatch between the answer-conditioned teacher and the student at generation time.

    Authors: We agree that §3 presents the Bayes-rule conversion and the subsequent autoregressive decomposition via the chain rule but does not supply an explicit bias analysis or correction term for the privileged conditioning. The derivation itself is exact under the modeling assumption that the teacher approximates the answer-conditioned distribution; any mismatch between teacher and student at generation time is therefore an approximation error whose effect on credit calibration is not quantified in the current text. We will add (i) the full set of equations showing the exact Bayes conversion and the per-turn decomposition, (ii) a dedicated paragraph discussing the hindsight-bias concern and the conditions under which the approximation remains calibrated, and (iii) a new ablation that trains a non-privileged teacher (answer conditioning removed) to measure the resulting change in credit-signal quality. revision: yes

  2. Referee: [—] The stress-test concern lands: early-turn probabilities under the privileged teacher can reflect consistency with the known answer rather than causal support for the outcome. Without an explicit bias analysis or empirical control (e.g., comparison against an oracle teacher or ablation removing answer conditioning), the 'Bayes-calibrated' property cannot be confirmed and the credit signals may be circular by construction.

    Authors: The concern is valid: because the teacher is conditioned on the verified answer, its early-turn probabilities necessarily incorporate future information, which is precisely what allows the ratio to serve as a credit signal. This is by design in the Bayesian framing, yet it does open the possibility of non-causal consistency effects. The manuscript currently offers no oracle-teacher comparison or ablation that isolates the contribution of answer conditioning. We will therefore add the suggested ablation (privileged teacher vs. answer-removed teacher) on the same trajectories and report both the resulting credit-signal statistics and downstream policy performance. If the ablation shows substantial degradation when conditioning is removed, we will qualify the 'Bayes-calibrated' claim accordingly in the revised text. revision: yes
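
One way the promised comparison of credit-signal statistics could be set up is sketched below. It assumes per-turn scores have already been computed on the same trajectories with the privileged teacher and with the answer-removed teacher. The function `compare_credit_signals`, the chosen statistics, and the numbers are hypothetical, not the authors' protocol.

```python
import math
from statistics import mean
from typing import Dict, List


def compare_credit_signals(
    privileged: List[List[float]],
    ablated: List[List[float]],
) -> Dict[str, float]:
    """Summarise how per-turn scores change when answer conditioning is removed."""
    if len(privileged) != len(ablated):
        raise ValueError("both runs must score the same trajectories")
    if any(len(p) != len(q) for p, q in zip(privileged, ablated)):
        raise ValueError("both runs must score the same turns")
    # Flatten the per-trajectory lists into one list of turn scores per run.
    p_flat = [s for traj in privileged for s in traj]
    q_flat = [s for traj in ablated for s in traj]
    if not p_flat:
        raise ValueError("no turns to compare")
    return {
        "mean_abs_privileged": mean(abs(s) for s in p_flat),
        "mean_abs_ablated": mean(abs(s) for s in q_flat),
        # Fraction of turns where both runs agree on "positive" vs "not positive".
        "sign_agreement": mean(
            1.0 if (p > 0) == (q > 0) else 0.0 for p, q in zip(p_flat, q_flat)
        ),
    }


# Made-up per-turn scores for two trajectories (illustration only).
privileged_scores = [[0.6, -0.8, 0.4], [-0.3, 0.2]]
ablated_scores = [[0.1, -0.05, -0.02], [0.0, 0.03]]

stats = compare_credit_signals(privileged_scores, ablated_scores)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A much smaller mean absolute score under the ablated teacher would indicate that most of the signal comes from answer conditioning. Whether that signal is causal or merely consistent with the answer would still need the downstream policy comparison.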

Circularity Check

0 steps flagged

No significant circularity; derivation applies exact Bayes identity to define credits

Full rationale

The paper's central derivation applies Bayes' rule to equate the posterior-to-prior answer ratio with a likelihood ratio between student and privileged teacher models, then decomposes the latter autoregressively into turn-level terms. This is an exact mathematical identity (P(A|T)/P(A) = P(T|A)/P(T)), not a fitted parameter renamed as prediction, not a self-citation load-bearing claim, and not a self-definitional loop. The privileged teacher is an explicit modeling choice in the method rather than an unverified assumption smuggled in; the resulting reweighting scheme supplies independent content for credit assignment even if downstream bias from conditioning remains a separate correctness question. No equations reduce the output to the input by construction, and no self-citations are invoked for uniqueness or ansatz. The derivation is therefore self-contained against external benchmarks.
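
The identity quoted above can be checked numerically on a small example. The sketch below uses a made-up joint distribution over two binary turns and a binary answer. It confirms that the answer-side ratio, the trajectory-side ratio, and the product of per-turn ratios coincide when all quantities come from one consistent joint distribution. It does not show whether a learned teacher and student satisfy the same relation.

```python
import math
from itertools import product

# Made-up joint distribution p(y1, y2, a); keys are (y1, y2, a), values sum to 1.
JOINT = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}
INDEX = {"y1": 0, "y2": 1, "a": 2}


def prob(**fixed: int) -> float:
    """Marginal probability of the variables fixed by keyword (y1, y2, a)."""
    return math.fsum(
        p for outcome, p in JOINT.items()
        if all(outcome[INDEX[name]] == value for name, value in fixed.items())
    )


def ratios(y1: int, y2: int, a: int):
    """Return (answer-side ratio, trajectory-side ratio, product of per-turn ratios)."""
    # P(A|T) / P(A)
    answer_side = (prob(y1=y1, y2=y2, a=a) / prob(y1=y1, y2=y2)) / prob(a=a)
    # P(T|A) / P(T)
    trajectory_side = (prob(y1=y1, y2=y2, a=a) / prob(a=a)) / prob(y1=y1, y2=y2)
    # Per-turn factors: p(y1|a)/p(y1) and p(y2|y1,a)/p(y2|y1)
    turn_1 = (prob(y1=y1, a=a) / prob(a=a)) / prob(y1=y1)
    turn_2 = (prob(y1=y1, y2=y2, a=a) / prob(y1=y1, a=a)) / (
        prob(y1=y1, y2=y2) / prob(y1=y1)
    )
    return answer_side, trajectory_side, turn_1 * turn_2


for y1, y2, a in product((0, 1), repeat=3):
    answer_side, trajectory_side, turn_product = ratios(y1, y2, a)
    assert math.isclose(answer_side, trajectory_side)
    assert math.isclose(trajectory_side, turn_product)
print("identity and per-turn factorisation hold on all 8 outcomes")
```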

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; whether the full paper introduces any cannot be determined from the available text.


