arxiv: 2605.13779 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI· cs.DC

Recognition: 2 theorem links

· Lean Theorem

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Mind Lab: Song Cao , Vic Cao , Andrew Chen , Kaijie Chen , Cleon Cheng , Steven Chiang , Kaixuan Fan , Hera Feng

show 53 more authors

Huan Feng Arthur Fu Jun Gao Hongquan Gu Aaron Guan Nolan Ho Mutian Hong Hailee Hou Peixuan Hua Charles Huang Miles Jiang Nora Jiang Yuyi Jiang Qiuyu Jin Fancy Kong Andrew Lei Kyrie Lei Alexy Li Lucian Li Ray Li Theo Li Zhihui Li Jiayi Lin Kairus Liu Kieran Liu Logan Liu Xiang Liu Irvine Lu Maeve Luo Runze Lv Pony Ma Verity Niu Anson Qiu Vincent Wang Rio Yang Maxwell Yao Carrie Ye Regis Ye Wenlin Ye Josh Ying Danney Zeng Yuhan Zhan Anya Zhang Di Zhang Ruijia Zhang Sueky Zhang Ya Zhang Wei Zhao Ada Zhou Changhai Zhou Yuhua Zhou Xinyue Zhu Murphy Zhuang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.DC

keywords LoRALLM infrastructuremodel servingadapter managementdistributed trainingpolicy catalogsMoE modelsbase model sharing

0 comments

The pith

MinT manages million-scale LoRA policy catalogs by training and serving only small adapter revisions over shared 1T-class base models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MinT is a managed infrastructure system for Low-Rank Adaptation post-training and online serving that targets many trained policies produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves only the exported LoRA adapter revisions through their full lifecycle of rollout, update, export, evaluation, serving, and rollback. The system hides distributed training, serving, scheduling, and data movement behind a service interface while scaling along three axes to support frontier models beyond 1T parameters and catalogs of up to a million addressable policies.

Core claim

MinT scales LoRA RL to frontier-scale dense and MoE architectures including MLA and DSA attention paths with training and serving validated beyond 1T total parameters, reduces adapter-only handoff steps by 18.3x on a 4B dense model and 2.85x on a 30B MoE, shortens wall time by 1.77x and 1.45x with concurrent multi-policy GRPO without raising peak memory, and supports 10^6-scale addressable catalogs with thousand-adapter active waves while improving live engine loading by 8.5-8.7x through packed MoE LoRA tensors.

What carries the argument

The LoRA adapter revision lifecycle manager that keeps base models resident in shared deployments and moves only small exported adapters (under 1% of base size in rank-1 settings) while separating durable policy addressability from CPU/GPU working sets.

If this is right

Adapter-only handoff reduces measured step time and memory footprint while enabling concurrent multi-policy training without peak memory increases.
Million-scale policy addressability becomes practical by treating cold loading as scheduled service work separate from active GPU sets.
Shared base-model deployments can host many selected adapter revisions for training and serving without materializing full checkpoints for each policy.
Packed MoE LoRA tensors improve live engine loading speeds by nearly an order of magnitude at cluster scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This design could let organizations maintain and switch among thousands of specialized policies on the same hardware without proportional storage growth.
Treating adapter movement as scheduled service work opens the possibility of dynamic policy waves where active sets change frequently based on demand.
The separation of addressable catalogs from working sets may simplify rollback and evaluation pipelines for large numbers of model variants.

Load-bearing premise

Distributed training, serving, scheduling, and data movement can be hidden behind a service interface without unacceptable latency or resource contention at 1T-parameter scales and million-scale policy catalogs.

What would settle it

A cluster deployment at 1T parameters showing that serving or training 100,000 policies simultaneously produces sustained latency above 100 ms or causes GPU memory contention that reduces throughput below single-policy baselines.

read the original abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MinT delivers useful measurements on adapter management for LoRA at scale but extrapolates its biggest claims from smaller experiments.

read the letter

The main thing to know about this paper is that MinT is a system for managing large numbers of LoRA policies on shared base models by keeping the base resident and only moving the small adapters through training, serving, and updates. It reports concrete improvements from this approach on the models they tested. What the work does well is lay out an architecture that separates durable policy catalogs from the active GPU working sets. They measure an 18.3 times reduction in steps for adapter handoff on a 4B dense model, 2.85 times on a 30B MoE, and loading speedups of 8.5 to 8.7 times with packed tensors. Concurrent training of multiple policies shortens wall time without increasing peak memory, and they demonstrate single-engine handling of 100K adapters. Extending LoRA to larger architectures with specific attention paths is also a plus, with validation beyond 1T total parameters mentioned. The soft spots are around the largest scales. The million-scale catalog and 1T base claims build on the 100K sweeps and smaller model runs. Direct evidence for behavior at full cluster scale with thousands of active adapters on 1T models is not shown, so issues like network contention or memory bandwidth at that point remain open. The paper treats cold loading as scheduled service work, but without measurements at the headline sizes, that assumption is not fully stress-tested. This paper is for people who build or operate large-scale LLM serving and training infrastructure, especially those dealing with many customized models from a few bases. A reader interested in practical systems for multi-policy RL or adapter management would find the metrics and design choices relevant. I would send it to peer review. The reported numbers are specific enough to merit checking, and the overall problem of scaling LoRA catalogs is worth discussion even if the biggest claims need more data in revision.

Referee Report

3 major / 1 minor

Summary. The paper presents MinT, a managed infrastructure for LoRA post-training and online serving that keeps a small number of expensive base models resident while moving only exported adapter revisions through the lifecycle of rollout, update, evaluation, serving, and rollback. It claims to scale along three axes: Scale Up to frontier dense/MoE models beyond 1T total parameters (including MLA/DSA attention), Scale Down via adapter-only handoff yielding 18.3x step reduction on 4B dense and 2.85x on 30B MoE plus 1.77x/1.45x wall-time gains from concurrent multi-policy GRPO, and Scale Out to 10^6-scale addressable catalogs (measured to 100K) with 8.5-8.7x packed-MoE loading improvements while treating cold loading as scheduled service work.

Significance. If the scaling and overhead claims hold, MinT would materially reduce the cost of maintaining and serving large catalogs of fine-tuned policies without materializing full checkpoints, enabling practical multi-policy serving over shared 1T-class bases. The concrete mid-scale speedups and the separation of durable addressability from working sets are practical contributions that could influence production LLM platforms.

major comments (3)

[Abstract] Abstract: the statement that training and serving are 'validated beyond 1T total parameters' is not supported by any reported experiment; all quantitative results use 4B dense and 30B MoE models. This directly undercuts the central Scale Up claim.
[Abstract] Abstract: the 10^6-scale catalog claim rests on 'measured single-engine sweeps through 100K' with no data shown for thousand-adapter active waves at cluster scale on 1T-class bases. The Scale Out argument therefore lacks evidence at the operating point asserted in the abstract.
[Abstract] Abstract: the assertion that 'cold loading can be treated as scheduled service work' without measurable impact on latency or contention is presented without measurements at 1T parameter scale or with thousands of concurrent adapters; this assumption is load-bearing for the service-interface hiding claim.

minor comments (1)

[Abstract] The abstract mixes measured results with extrapolated claims; a short table separating the two would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that several claims require qualification to align precisely with the reported experiments, and we will revise the manuscript to address each point.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that training and serving are 'validated beyond 1T total parameters' is not supported by any reported experiment; all quantitative results use 4B dense and 30B MoE models. This directly undercuts the central Scale Up claim.

Authors: We agree that the quantitative results are reported only for the 4B dense and 30B MoE models. The abstract phrasing 'validated beyond 1T total parameters' was intended to convey that the LoRA pipelines and attention extensions (MLA/DSA) have been implemented to support architectures at that scale, but no end-to-end performance measurements at >1T are presented. We will revise the abstract to state that the system architecture supports models beyond 1T parameters while the reported scaling results use the 4B and 30B models. revision: yes
Referee: [Abstract] Abstract: the 10^6-scale catalog claim rests on 'measured single-engine sweeps through 100K' with no data shown for thousand-adapter active waves at cluster scale on 1T-class bases. The Scale Out argument therefore lacks evidence at the operating point asserted in the abstract.

Authors: The 10^6-scale addressable catalog is an architectural target enabled by separating durable policy identifiers from GPU working sets. Empirical data are limited to single-engine sweeps through 100K adapters; we have not reported cluster-scale runs with thousands of active adapters on 1T-class bases. We will revise the abstract to specify the measurement basis (single-engine sweeps to 100K) and describe the million-scale figure as the supported capacity rather than a directly measured operating point. revision: yes
Referee: [Abstract] Abstract: the assertion that 'cold loading can be treated as scheduled service work' without measurable impact on latency or contention is presented without measurements at 1T parameter scale or with thousands of concurrent adapters; this assumption is load-bearing for the service-interface hiding claim.

Authors: We acknowledge that no measurements of cold-loading latency or contention at 1T scale with thousands of concurrent adapters are provided. The claim rests on the design principle that only small adapters are moved and that cold loads can be scheduled as background service work. We will revise the abstract to present this as a design assumption supported by the adapter-only handoff and smaller-scale observations, rather than asserting zero measurable impact at the full claimed scale. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive system implementation with empirical measurements only

full rationale

The paper presents MinT as a managed infrastructure for LoRA training and serving, describing architectural decisions for hiding distributed operations behind a service interface and reporting concrete measurements (e.g., 18.3x step reduction on 4B models, 8.5-8.7x loading improvements). No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct benchmarks and scaling descriptions rather than any self-referential reduction or self-citation chain that would force results by construction. This is a standard non-circular system paper whose central claims are externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on standard assumptions in ML infrastructure about the efficiency of LoRA and the feasibility of large-scale distributed management.

axioms (2)

domain assumption LoRA adapters remain effective for post-training at large scales
Relied upon for the scale up claims.
domain assumption Distributed systems can schedule adapter movements without significant overhead
Central to hiding complexity behind service interface.

invented entities (1)

MinT managed infrastructure no independent evidence
purpose: To handle training and serving of many LoRA policies
The core contribution is this new system.

pith-pipeline@v0.9.0 · 5826 in / 1346 out tokens · 69247 ms · 2026-05-14T19:21:06.133261+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 29 canonical work pages · 11 internal anchors

[1]

Anthropic

Accessed 2026-05. Anthropic. Measuring AI agent autonomy in practice. Anthropic research,

2026
[2]

AsyncFlow Authors

Accessed 2026-05. AsyncFlow Authors. AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training.arXiv preprint arXiv:2507.01663,

work page arXiv 2026
[3]

20 StevenChiang, YiwenLu, QihanLiu, AndrewChen, PonyMa, andMindLab

arXiv:2310.18547. 20 StevenChiang, YiwenLu, QihanLiu, AndrewChen, PonyMa, andMindLab. RouterreplayR3: Whyitfailedandhow we fixed it. Mind Lab: A Lab for Experiential Intelligence, 2026a. https://macaron.im/mindlab/research/router- replay-r3-why-it-failed-and-how-we-fixed-it. Steven Chiang, Yiwen Lu, Qihan Liu, Nolan Ho, Andrew Chen, Pony Ma, and Mind Lab....

work page arXiv 2026
[4]

QLoRA: Efficient Finetuning of Quantized LLMs

arXiv:2305.14314. Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, and Jidong Ge. LawBench: Benchmarking legal knowledge of large language models.Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Wei Fu et al

arXiv:2309.16289. Wei Fu et al. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning.arXiv preprint arXiv:2505.24298,

work page arXiv
[6]

Compress then serve: Serving thousands of LoRA adapters with little overhead

Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, and Justin Solomon. Compress then serve: Serving thousands of LoRA adapters with little overhead. arXiv preprint arXiv:2407.00066,

work page arXiv
[7]

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team. GLM-5: From vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

FinEval: A chinese financial domain knowledge evaluation benchmark for large language models.arXiv preprint arXiv:2308.09975,

Xin Guo, Haotian Xia, Zhaowei Liu, Hanyang Cao, Zhi Yang, Zhiqiang Liu, Sizhe Wang, Jinyi Niu, Chuqi Wang, Yanhui Wang, Xiaolong Liang, Xiaoming Huang, Bing Zhu, Zhongyu Wei, Yun Chen, Weining Shen, and Liwen Zhang. FinEval: A chinese financial domain knowledge evaluation benchmark for large language models.arXiv preprint arXiv:2308.09975,

work page arXiv
[9]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu et al. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework.arXiv preprint arXiv:2405.11143,

work page internal anchor Pith review arXiv
[10]

Serving heterogeneous LoRA adapters in distributed LLM inference systems.arXiv preprint arXiv:2511.22880,

Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, and Josep Torrellas. Serving heterogeneous LoRA adapters in distributed LLM inference systems.arXiv preprint arXiv:2511.22880,

work page arXiv
[11]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

arXiv preprint arXiv:2510.18855 , year=

Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model.arXiv preprint arXiv:2510.18855,

work page arXiv
[13]

arXiv preprint arXiv:2510.18855 , year=

doi: 10.48550/arXiv.2510.18855. URLhttps://arxiv.org/abs/2510.18855. Introduces IcePop token-level discrepancy masking and clipping. Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Jianbin Chang, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, an...

work page doi:10.48550/arxiv.2510.18855
[14]

Xiao-Yang Liu, Guoxuan Wang, Hongyang Yang, and Daochen Zha

https://macaron.im/mindlab/research/building-trillion-parameter-reasoning-rl-with-10-gpus. Xiao-Yang Liu, Guoxuan Wang, Hongyang Yang, and Daochen Zha. FinGPT: Democratizing internet-scale data for financial large language models.arXiv preprint arXiv:2307.10485,

work page arXiv
[15]

Stabilizing MoE reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370,

21 Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing MoE reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370,

work page arXiv
[16]

2024 american invitational mathematics examination.https://maa.org/ maa-invitational-competitions/,

Mathematical Association of America. 2024 american invitational mathematics examination.https://maa.org/ maa-invitational-competitions/,

2024
[17]

Kimi K2: Open Agentic Intelligence

Accessed 2026-05. Moonshot AI. Kimi K2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

doi: 10.48550/arXiv.2507. 20534. URLhttps://arxiv.org/abs/2507.20534. Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. InUSENIX Symposium on Operating Systems Design and Impleme...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507
[19]

Efficient large-scale language model training on GPU clusters using Megatron-LM.arXiv preprint arXiv:2104.04473,

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM.arXiv preprint arXiv:2104.04473,

work page arXiv
[20]

Accessed 2026-04. OpenAI. Introducing GPT-5.5. OpenAI blog,

2026
[21]

OpenTinker Authors

Accessed 2026-05. OpenTinker Authors. OpenTinker: A reinforcement learning as a service framework for agentic workflows.arXiv preprint arXiv:2601.07376,

work page arXiv 2026
[22]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Qwen3 Technical Report

doi: 10.48550/arXiv.2505.09388. URL https://arxiv.org/abs/2505.09388. Qwen Team. Qwen3.5. Qwen blog,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388
[24]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

Accessed 2026-05. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models.arXiv preprint arXiv:1910.02054,

work page arXiv 2026
[25]

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Relax Authors. Relax: A service-oriented asynchronous reinforcement learning engine for post-training.arXiv preprint arXiv:2604.11554,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Gerald Shen et al

doi: 10.64434/tml.20250929.https://thinkingmachines.ai/blog/lora/. Gerald Shen et al. NeMo-Aligner: Scalable toolkit for efficient model alignment.arXiv preprint arXiv:2405.01481,

work page doi:10.64434/tml.20250929.https://thinkingmachines.ai/blog/lora/
[27]

HybridFlow: A Flexible and Efficient RLHF Framework , url=

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Laminar: A scalable asynchronous RL post-training framework. arXiv preprint arXiv:2510.12633, 2025a. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, a...

work page doi:10.1145/3689031.3696075
[28]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053,

work page internal anchor Pith review Pith/arXiv arXiv 1909
[29]

AnnouncingTinker.https://thinkingmachines.ai/blog/announcing-tinker/, 2025a

22 ThinkingMachinesLab. AnnouncingTinker.https://thinkingmachines.ai/blog/announcing-tinker/, 2025a. Accessed 2026-04. Thinking Machines Lab. Tinker Cookbook. GitHub repository,https://github.com/thinking-machines-lab/ tinker-cookbook, 2025b. Post-training with Tinker. Accessed 2026-04. Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin ...

2026
[30]

Jet-RL: Enabling on-policy FP8 reinforcement learning with unified training and rollout precision flow.arXiv preprint arXiv:2601.14243,

Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, and Ligeng Zhu. Jet-RL: Enabling on-policy FP8 reinforcement learning with unified training and rollout precision flow.arXiv preprint arXiv:2601.14243,

work page arXiv
[31]

On the rollout-training mismatch in modern RL systems.NeurIPS 2025 Workshop on Efficient Reasoning,

Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. On the rollout-training mismatch in modern RL systems.NeurIPS 2025 Workshop on Efficient Reasoning,

2025
[32]

Zhengmao Ye, Dengchun Li, Zetao Hu, Tingfeng Lan, Jian Sha, Sicong Zhang, Lei Duan, Jie Zuo, Hui Lu, Yuanchun Zhou, and Mingjie Tang

Accessed 2026-05. Zhengmao Ye, Dengchun Li, Zetao Hu, Tingfeng Lan, Jian Sha, Sicong Zhang, Lei Duan, Jie Zuo, Hui Lu, Yuanchun Zhou, and Mingjie Tang. mLoRA: Fine-tuning LoRA adapters via highly-efficient pipeline parallelism in multiple GPUs.arXiv preprint arXiv:2312.02515,

work page arXiv 2026
[33]

Improving the serving performance of multi- LoRA large language models via efficient LoRA and KV cache management

Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. Improving the serving per- formance of multi-LoRA large language models via efficient LoRA and KV cache management.arXiv preprint arXiv:2505.03756,

work page arXiv
[34]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

arXiv:2303.10512. Yusen Zhong et al. StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation. arXiv preprint,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

The unique- adapter rows remove this locality and measure how many distinct adapters can become cached near one engine before the run stops being a clean warm-path claim

The repeated-hotset rows model adapter locality after routing has found a useful engine placement. The unique- adapter rows remove this locality and measure how many distinct adapters can become cached near one engine before the run stops being a clean warm-path claim. These measurements define the CPU-side tier between the durable adapter catalog and the...

2048