MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Aaron Guan; Ada Zhou; Alexy Li; Andrew Chen; Andrew Lei; Anson Qiu; Anya Zhang; Arthur Fu; Carrie Ye; Changhai Zhou

arXiv: 2605.13779 · v2 · pith:X2JYNWFRnew · submitted 2026-05-13 · cs.LG · cs.AI · cs.DC

Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

Pith reviewed 2026-06-30 21:44 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.DC

keywords: LoRA, adapter serving, LLM infrastructure, policy catalog, distributed training, MoE models, model management

The pith

MinT keeps trillion-parameter base models fixed while moving only small LoRA adapters to train and serve million-scale policy catalogs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MinT is a managed system for LoRA post-training and serving that targets many policies produced over a small number of expensive base-model deployments. Instead of materializing a merged full checkpoint for each policy, the system keeps the base model resident on hardware and moves only exported LoRA adapter revisions through the lifecycle of rollout, update, export, evaluation, serving, and rollback. This design hides distributed training, serving, scheduling, and data movement behind a service interface. The approach scales along three axes: extending LoRA RL to models beyond 1T total parameters, reducing data movement by shipping adapters that can be under 1% of base-model size in rank-1 settings, and separating durable policy addressability from active CPU/GPU working sets to support 10^6-scale catalogs. A sympathetic reader would care because the method directly attacks the storage, movement, and memory costs that otherwise limit how many specialized policies can run concurrently over shared frontier hardware.

Core claim

MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback pipelines. It scales along three axes: Scale Up extends LoRA RL to frontier-scale dense and MoE architectures with training and serving validated beyond 1T total parameters; Scale Down moves only the exported adapter, which reduces measured step time by 18.3x on a 4B dense model and 2.85x on a 30B MoE while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory; Scale Out separates durable policy addressability from CPU/GPU working sets to support 10^6-scale addressable catalogs, with packed MoE LoRA tensors improving live engine loading by 8.5-8.7x.

What carries the argument

LoRA adapter-only handoff over a fixed shared base model, which carries the argument by moving only the small exported adapter revisions through all pipelines instead of materializing full model checkpoints.
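The lifecycle named above (export, serve, roll back) can be pictured with a small revision ledger. This is only a minimal sketch under stated assumptions: `AdapterRevision`, `PolicyHistory`, and the `store://` URIs are hypothetical names invented for illustration, not MinT's actual schema or API. The point it shows is that a rollback only changes which adapter revision is served, while the base model is never touched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AdapterRevision:
    """One exported LoRA adapter revision (hypothetical record, not MinT's schema)."""
    policy_id: str
    revision: int
    uri: str  # hypothetical location of the exported adapter tensors


@dataclass
class PolicyHistory:
    """Tracks exported revisions and which one is currently served for a policy."""
    revisions: list = field(default_factory=list)
    served: int = -1  # index into revisions; -1 means nothing is served yet

    def export(self, policy_id: str, uri: str) -> AdapterRevision:
        # A new revision is appended; earlier revisions stay addressable.
        rev = AdapterRevision(policy_id, len(self.revisions), uri)
        self.revisions.append(rev)
        return rev

    def promote(self, revision: int) -> AdapterRevision:
        if not 0 <= revision < len(self.revisions):
            raise ValueError(f"unknown revision {revision}")
        self.served = revision
        return self.revisions[revision]

    def rollback(self) -> AdapterRevision:
        # Rolling back re-points serving at the previous adapter revision.
        if self.served <= 0:
            raise RuntimeError("no earlier revision to roll back to")
        return self.promote(self.served - 1)


history = PolicyHistory()
history.export("policy-a", "store://policy-a/0")
history.export("policy-a", "store://policy-a/1")
history.promote(1)
rolled_back = history.rollback()
print(rolled_back.revision, rolled_back.uri)
```

In this toy version a rollback is a pointer change over small records, which is the property the adapter-only design relies on; how MinT actually stores and addresses revisions is not described on this page.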

If this is right

Supports training and serving of dense and MoE models beyond 1T total parameters, including MLA and DSA attention paths.
Adapter-only handoff reduces step time by 18.3x on 4B dense models and 2.85x on 30B MoE models.
Concurrent multi-policy GRPO shortens wall time by up to 1.77x without increasing peak memory.
Enables 10^6-scale addressable policy catalogs with thousand-adapter active waves at cluster scale.
Packed MoE LoRA tensors improve live engine loading by 8.5-8.7x.
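The separation between a large addressable catalog and a small active working set can be sketched as a bounded cache in front of a lookup table. This is an illustrative sketch, not MinT's implementation: the class `AdapterWorkingSet`, the LRU eviction policy, and the `store://` locations are all assumptions made for the example, and the cold load here is synchronous, whereas the abstract describes cold loading as scheduled service work.

```python
from collections import OrderedDict


class AdapterWorkingSet:
    """Bounded LRU working set in front of a much larger addressable catalog."""

    def __init__(self, catalog: dict, capacity: int):
        self.catalog = catalog          # durable: policy_id -> adapter location
        self.capacity = capacity        # how many adapters may be resident at once
        self.resident = OrderedDict()   # policy_id -> loaded adapter (stand-in string)
        self.cold_loads = 0

    def acquire(self, policy_id: str):
        if policy_id in self.resident:
            self.resident.move_to_end(policy_id)   # warm hit: refresh recency
            return self.resident[policy_id]
        location = self.catalog[policy_id]         # KeyError if not addressable
        self.cold_loads += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict the least recently used
        self.resident[policy_id] = f"loaded:{location}"
        return self.resident[policy_id]


# Hypothetical catalog: many addressable policies, only three resident at a time.
catalog = {f"policy-{i}": f"store://adapters/{i}" for i in range(10_000)}
working_set = AdapterWorkingSet(catalog, capacity=3)
for pid in ["policy-1", "policy-2", "policy-1", "policy-3", "policy-4"]:
    working_set.acquire(pid)

print(working_set.cold_loads, list(working_set.resident))
```

The catalog size and the resident capacity are independent knobs in this sketch, which is the sense in which addressability can grow without growing the CPU/GPU working set.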

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design implies that future serving platforms could treat policy catalogs as first-class durable objects separate from any single model deployment.
It opens the possibility of treating adapter movement as scheduled background work rather than on-demand GPU transfers.
Similar separation of base and delta could be tested with other parameter-efficient methods beyond LoRA to see if the same scaling holds.
At cluster scale the approach may change how organizations allocate hardware between a few large base deployments and many lightweight policy instances.

Load-bearing premise

LoRA adapters remain effective and stable when moved independently through rollout, update, export, evaluation, serving, and rollback pipelines without requiring full model materialization or incurring unacceptable accuracy or latency penalties.

What would settle it

An end-to-end run at 1T scale in which adapter-only movement through the full pipeline produces either accuracy loss or latency increase that exceeds the gains reported for the 4B and 30B cases.

Original abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
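The "under 1% of base-model size in rank-1 settings" figure is consistent with standard LoRA arithmetic: for a weight matrix of shape d_out x d_in, a rank-r adapter adds r * (d_in + d_out) parameters. The sketch below works that out for a few layer shapes. The shapes are hypothetical and not taken from the paper, and the fraction is computed only over the adapted matrices, so it is a rough illustration rather than a reproduction of the paper's measurement.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_fraction(shapes, rank: int) -> float:
    """Adapter size as a fraction of the adapted base weights."""
    base = sum(d_in * d_out for d_in, d_out in shapes)
    adapter = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes)
    return adapter / base


# Hypothetical layer shapes (d_in, d_out); not taken from the paper.
shapes = [(4096, 4096), (4096, 11008), (11008, 4096)]
for rank in (1, 16, 64):
    print(rank, f"{adapter_fraction(shapes, rank):.4%}")
```

Because the adapter grows linearly in the rank while the base grows with the product of the dimensions, the fraction shrinks as layers get wider, which is why adapter-only movement becomes more attractive at larger base sizes.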

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MinT is a system paper on managing million-scale LoRA catalogs over shared bases with concrete speedups reported, but the 1T-scale adapter stability claims rest on unshown measurements.

MinT describes infrastructure that keeps a few large base models resident and moves only exported LoRA adapters through training, rollout, evaluation, serving, and rollback. The goal is to support many customized policies without materializing full checkpoints each time.

The paper does a solid job laying out three scaling directions with specific numbers attached. Scale-up extends LoRA RL to 1T-class dense and MoE models including MLA and DSA paths. Scale-down uses adapter-only handoff, which the abstract says cuts step time by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale-out separates durable catalogs from active working sets, claiming 10^6 addressable policies (measured to 100K on single-engine sweeps) and 8.5-8.7x faster loading via packed MoE tensors. These are practical engineering details that teams running many fine-tunes would recognize.

The soft spot is the missing evidence on whether adapters stay effective after independent movement at the claimed frontier scale. All quantitative results come from 4B and 30B models; there are no perplexity, accuracy, or stability numbers for adapters after the full pipeline at 1T parameters. The 10^6 catalog figure is extrapolated from smaller single-engine tests. Without those checks, the headline claim that this works for million-scale personalized policies on 1T bases is harder to assess.

This is for practitioners who build or operate multi-policy LLM serving systems. Readers working on similar infrastructure could extract useful ideas on scheduling and data movement. It deserves peer review because the problem is current and the architecture is described clearly enough to evaluate, even if the current numbers leave the largest-scale claims open.

Referee Report

3 major / 2 minor

Summary. The paper presents MinT, a managed infrastructure system for LoRA-based post-training and serving of many policies over a small number of shared base models (1T-class at the largest). It claims that keeping bases resident and moving only exported adapters through rollout, update, export, evaluation, serving, and rollback enables an 18.3x step-time reduction on a 4B dense model and 2.85x on a 30B MoE, concurrent multi-policy GRPO wall-time reductions of 1.77x and 1.45x, 8.5-8.7x faster loading with packed MoE LoRA tensors, and support for 10^6-scale addressable catalogs (measured to 100K), with training and serving validated beyond 1T total parameters.

Significance. If the engineering measurements and adapter-movement assumptions hold at frontier scale, MinT would materially reduce the cost of maintaining large policy catalogs for RL post-training and multi-tenant serving; the concrete speedups on 4B/30B models and the explicit separation of durable addressability from working sets are useful engineering contributions even if the 1T claims require further substantiation.

major comments (3)

[Abstract / Scale Up] Abstract and Scale Up section: the claim that training and serving are 'validated beyond 1T total parameters' is load-bearing for the million-scale catalog assertion, yet the manuscript supplies no perplexity, task accuracy, or stability metrics for adapters after export/rollout/loading at 1T scale; all quantitative results are confined to 4B and 30B models.
[Scale Down] Scale Down section: the 18.3x step-time reduction and 2.85x MoE improvement are reported only for 4B/30B models; the central claim that adapter-only handoff remains effective without unacceptable accuracy or latency penalties at 1T-class bases therefore rests on an untested extrapolation.
[Scale Out] Scale Out section: the 10^6-scale addressable catalog is supported by single-engine sweeps only through 100K; the manuscript does not provide cluster-scale measurements or error bars that would justify the extrapolation to million-scale catalogs under concurrent training and serving load.

minor comments (2)

[Abstract] The abstract and introduction would benefit from an explicit statement of the evaluation methodology (datasets, baselines, number of runs) used for the reported speedups.
[Scale Out] Notation for 'packed MoE LoRA tensors' and 'cold loading as scheduled service work' is introduced without a preceding definition or diagram reference.
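The referee's requests for error bars and a stated number of runs could be met with something as simple as a bootstrap interval over repeated step-time measurements. The sketch below is one possible way to report a speedup with uncertainty, not the paper's methodology. The function name `speedup_with_ci` is hypothetical, and the timings are synthetic values invented for illustration, not measurements from MinT.

```python
import random
import statistics


def speedup_with_ci(baseline, candidate, n_boot=2000, seed=0):
    """Mean speedup (baseline / candidate) with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    point = statistics.mean(baseline) / statistics.mean(candidate)
    samples = []
    for _ in range(n_boot):
        # Resample each set of runs with replacement and recompute the ratio.
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        samples.append(statistics.mean(b) / statistics.mean(c))
    samples.sort()
    lo = samples[int(0.025 * n_boot)]
    hi = samples[int(0.975 * n_boot) - 1]
    return point, lo, hi


# Synthetic step times in seconds, invented for illustration only.
full_checkpoint = [10.2, 9.8, 10.5, 10.1, 9.9]
adapter_only = [1.1, 0.9, 1.0, 1.05, 0.95]
point, lo, hi = speedup_with_ci(full_checkpoint, adapter_only)
print(f"speedup {point:.2f}x, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the number of runs would let a reader judge whether a headline ratio such as 18.3x is stable across repetitions.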

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the scalability claims. We address each major point below, indicating planned revisions where appropriate.

Referee: [Abstract / Scale Up] Abstract and Scale Up section: the claim that training and serving are 'validated beyond 1T total parameters' is load-bearing for the million-scale catalog assertion, yet the manuscript supplies no perplexity, task accuracy, or stability metrics for adapters after export/rollout/loading at 1T scale; all quantitative results are confined to 4B and 30B models.

Authors: We acknowledge that the manuscript reports no task-specific metrics (perplexity, accuracy, or stability) for adapters at 1T scale. The phrase 'validated beyond 1T total parameters' refers to successful system-level operation of the infrastructure with base models whose aggregate parameter count exceeds 1T, but we agree this wording is imprecise and that per-adapter metrics remain limited to the 4B/30B regime. We will revise the abstract and Scale Up section to distinguish system capacity from per-adapter empirical results and to note the absence of frontier-scale task metrics as a limitation.

revision: yes

Referee: [Scale Down] Scale Down section: the 18.3x step-time reduction and 2.85x MoE improvement are reported only for 4B/30B models; the central claim that adapter-only handoff remains effective without unacceptable accuracy or latency penalties at 1T-class bases therefore rests on an untested extrapolation.

Authors: The reported speedups and memory measurements are confined to 4B and 30B models. While the adapter-only handoff design is size-agnostic, we accept that direct evidence of accuracy or latency behavior at 1T-class bases is absent. We will add a short discussion in the Scale Down section explaining why the relative gains are expected to generalize and will explicitly flag the lack of 1T measurements as an open question for future work.

revision: yes

Referee: [Scale Out] Scale Out section: the 10^6-scale addressable catalog is supported by single-engine sweeps only through 100K; the manuscript does not provide cluster-scale measurements or error bars that would justify the extrapolation to million-scale catalogs under concurrent training and serving load.

Authors: The 100K figure comes from single-engine addressability sweeps; the 10^6 target is a design property of the durable catalog layer. We agree that cluster-scale concurrent measurements and error bars are not supplied. We will revise the Scale Out section to report any available error bars, clarify the single-engine scope of the sweeps, and state that full concurrent cluster validation at million-scale remains future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: system description with empirical measurements only

The paper describes an infrastructure system (MinT) for LoRA adapter management, scaling, and serving over shared base models. All claims rest on reported engineering measurements (e.g., 18.3x step-time reduction on 4B model, 10^6 catalog support via sweeps to 100K) rather than any mathematical derivations, equations, fitted parameters presented as predictions, or uniqueness theorems. No self-citation chains, ansatzes, or renamings of known results appear as load-bearing steps. The paper is self-contained against its own benchmarks; absence of 1T-scale accuracy numbers is a correctness/measurement-gap issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering system paper with no mathematical derivations, fitted parameters, or postulated entities; the central claims rest on the feasibility of the described adapter movement pipeline and the validity of the reported performance measurements.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
cs.LG · 2026-06 · unverdicted · novelty 5.0

PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.


[1] [1]

Anthropic

Accessed 2026-05. Anthropic. Measuring AI agent autonomy in practice. Anthropic research,

2026

[2] [2]

Asyncflow: An asynchronous streaming rl framework for efficient llm post-training

Accessed 2026-05. 22 AsyncFlow Authors. AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training.arXiv preprint arXiv:2507.01663,

work page arXiv 2026

[3] [3]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav

arXiv:2310.18547. StevenChiang, YiwenLu, QihanLiu, AndrewChen, PonyMa, andMindLab. RouterreplayR3: Whyitfailedandhow we fixed it. Mind Lab: A Lab for Experiential Intelligence, 2026a. https://macaron.im/mindlab/research/router- replay-r3-why-it-failed-and-how-we-fixed-it. Steven Chiang, Yiwen Lu, Qihan Liu, Nolan Ho, Andrew Chen, Pony Ma, and Mind Lab. Su...

work page arXiv 2026

[4] [4]

QLoRA: Efficient Finetuning of Quantized LLMs

arXiv:2305.14314. Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, and Jidong Ge. LawBench: Benchmarking legal knowledge of large language models.Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Wei Fu et al

arXiv:2309.16289. Wei Fu et al. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning.arXiv preprint arXiv:2505.24298,

work page arXiv

[6] [6]

Compress then serve: Serving thousands of lora adapters with little overhead

Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, and Justin Solomon. Compress then serve: Serving thousands of LoRA adapters with little overhead. arXiv preprint arXiv:2407.00066,

work page arXiv

[7] [7]

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team. GLM-5: From vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

FinEval: A chinese financial domain knowledge evaluation benchmark for large language models.arXiv preprint arXiv:2308.09975,

Xin Guo, Haotian Xia, Zhaowei Liu, Hanyang Cao, Zhi Yang, Zhiqiang Liu, Sizhe Wang, Jinyi Niu, Chuqi Wang, Yanhui Wang, Xiaolong Liang, Xiaoming Huang, Bing Zhu, Zhongyu Wei, Yun Chen, Weining Shen, and Liwen Zhang. FinEval: A chinese financial domain knowledge evaluation benchmark for large language models.arXiv preprint arXiv:2308.09975,

work page arXiv

[9] [9]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu et al. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework.arXiv preprint arXiv:2405.11143,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Serving heterogeneous LoRA adapters in distributed LLM inference systems.arXiv preprint arXiv:2511.22880,

Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, and Josep Torrellas. Serving heterogeneous LoRA adapters in distributed LLM inference systems.arXiv preprint arXiv:2511.22880,

work page arXiv

[11] [11]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2510.18855 , year=

Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model.arXiv preprint arXiv:2510.18855,

work page arXiv

[13] [13]

arXiv preprint arXiv:2510.18855 , year=

doi: 10.48550/arXiv.2510.18855. URLhttps://arxiv.org/abs/2510.18855. Introduces IcePop token-level discrepancy masking and clipping. Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Jianbin Chang, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, an...

work page doi:10.48550/arxiv.2510.18855

[14]

https://macaron.im/mindlab/research/building-trillion-parameter-reasoning-rl-with-10-gpus. Xiao-Yang Liu, Guoxuan Wang, Hongyang Yang, and Daochen Zha. FinGPT: Democratizing internet-scale data for financial large language models. arXiv preprint arXiv:2307.10485.

[15]

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing MoE reinforcement learning by aligning training and inference routers. arXiv preprint arXiv:2510.11370.

[16]

Mathematical Association of America. 2024 American Invitational Mathematics Examination. https://maa.org/maa-invitational-competitions/. American Mathematics Competitions.

[17]

Mind Lab. Macaron-A2UI: A model for generative UI in personal agent. https://macaron.im/mindlab/research/macaron-a2ui-generative-ui-personal-agent, April 2026a. Accessed 2026-05-24.

Mind Lab. MinT Cookbook. GitHub repository: MindLab-Research/mint-cookbook, 2026b. Recipe registry and maintained evaluation artifacts. Mini...

[18]

Kimi K2: Open Agentic Intelligence

Accessed 2026-05. Moonshot AI. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534. doi: 10.48550/arXiv.2507.20534. URL https://arxiv.org/abs/2507.20534.

[19]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In USENIX Symposium on Operating Systems Design and Impleme...

[20]

Efficient large-scale language model training on GPU clusters using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv preprint arXiv:2104.04473.

[21]

Accessed 2026-04. OpenAI. Introducing GPT-5.5. OpenAI blog. Accessed 2026-05.

[22]

OpenTinker: Separating concerns in agentic reinforcement learning

OpenTinker Authors. OpenTinker: A reinforcement learning as a service framework for agentic workflows. arXiv preprint arXiv:2601.07376.

[23]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388. doi: 10.48550/arXiv.2505.09388. URL https://arxiv.org/abs/2505.09388.

[24]

Qwen Team. Qwen3.5. Qwen blog. Accessed 2026-05.

[25]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. arXiv preprint arXiv:1910.02054.

[26]

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Relax Authors. Relax: A service-oriented asynchronous reinforcement learning engine for post-training. arXiv preprint arXiv:2604.11554.

[27]

doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/. Gerald Shen et al. NeMo-Aligner: Scalable toolkit for efficient model alignment. arXiv preprint arXiv:2405.01481.

[28]

Laminar: A scalable asynchronous RL post-training framework

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Laminar: A scalable asynchronous RL post-training framework. arXiv preprint arXiv:2510.12633, 2025a.

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, a...

doi: 10.1145/3689031.3696075

[29]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

[30]

Thinking Machines Lab. Announcing Tinker. https://thinkingmachines.ai/blog/announcing-tinker/, 2025a. Accessed 2026-04.

Thinking Machines Lab. Tinker Cookbook. GitHub repository, https://github.com/thinking-machines-lab/tinker-cookbook, 2025b. Post-training with Tinker. Accessed 2026-04.

Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin...

[31]

Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, and Ligeng Zhu. Jet-RL: Enabling on-policy FP8 reinforcement learning with unified training and rollout precision flow. arXiv preprint arXiv:2601.14243.

[32]

Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. On the rollout-training mismatch in modern RL systems. NeurIPS 2025 Workshop on Efficient Reasoning.

[33]

Accessed 2026-05. Zhengmao Ye, Dengchun Li, Zetao Hu, Tingfeng Lan, Jian Sha, Sicong Zhang, Lei Duan, Jie Zuo, Hui Lu, Yuanchun Zhou, and Mingjie Tang. mLoRA: Fine-tuning LoRA adapters via highly-efficient pipeline parallelism in multiple GPUs. arXiv preprint arXiv:2312.02515.

[34]

Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. Improving the serving performance of multi-LoRA large language models via efficient LoRA and KV cache management. arXiv preprint arXiv:2505.03756.

[35]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

arXiv:2303.10512. Yusen Zhong et al. StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation. arXiv preprint,

[36]

The repeated-hotset rows model adapter locality after routing has found a useful engine placement. The unique-adapter rows remove this locality and measure how many distinct adapters can become cached near one engine before the run stops being a clean warm-path claim. These measurements define the CPU-side tier between the durable adapter catalog and the...