pith. machine review for the scientific record.

arxiv: 2605.14217 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL · cs.SY · eess.SY

Recognition: no theorem link

PreFT: Prefill-only finetuning for efficient inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.SY · eess.SY
keywords prefill-only finetuning · PEFT · LoRA · LLM serving · multi-adapter inference · throughput optimization · personalized models · ReFT
0 comments

The pith

Applying adapters only during prefill and discarding them afterward raises serving throughput nearly twofold while keeping performance near standard PEFT levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are now personalized at scale through parameter-efficient finetuning, yet serving many user-specific adapters slows generation because decode steps suffer more overhead than prefill. This paper proposes restricting the adapter to the prefill phase only, then removing it for autoregressive decoding. The change produces 1.9 times higher throughput when handling 512 adapters on Llama 3.1 70B. On supervised finetuning tasks the evaluation loss rises but recovers when adapter rank is increased with almost no throughput penalty. On reinforcement learning tasks the prefill-only versions reach near parity with full adapters, making multi-user personalization more practical.

Core claim

Prefill-only finetuning applies the adapter exclusively to prefill tokens and discards it for decode, delivering substantially higher multi-adapter serving throughput than conventional PEFT while preserving task performance: on SFT any shortfall can be recovered by raising adapter rank, and on RL the method already approaches parity.

What carries the argument

The prefill-only adapter, which limits low-rank or representation updates to the initial context tokens and is removed during generation.
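A minimal sketch of such a position-gated adapter, assuming a standard LoRA parameterisation in PyTorch; the class name, the explicit prefill_mask argument, and the shapes are illustrative conveniences, not the paper's released kernels. The gate multiplies the low-rank delta by 1 on prefill tokens and by 0 on generated tokens, so a decode step sees exactly the frozen base layer.

import torch
import torch.nn as nn

class PrefillOnlyLoRALinear(nn.Module):
    """Linear layer whose low-rank update fires only at prefill positions (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # frozen pretrained weights
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, prefill_mask: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_in]; prefill_mask: [batch, seq], 1 on context tokens, 0 on decoded ones.
        delta = self.lora_B(self.lora_A(x)) * self.scale
        return self.base(x) + prefill_mask.unsqueeze(-1) * delta

layer = PrefillOnlyLoRALinear(nn.Linear(64, 64), rank=8)
prompt = torch.randn(1, 10, 64)
_ = layer(prompt, torch.ones(1, 10))              # prefill: adapter active
new_token = torch.randn(1, 1, 64)
_ = layer(new_token, torch.zeros(1, 1))           # decode: reduces to the base layer

In a real serving stack the zero-mask branch would simply skip the adapter matmuls altogether, which is where the decode-side savings would come from.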

If this is right

  • Serving 512 adapters on Llama 3.1 70B reaches 1.9 times the throughput of traditional PEFT.
  • Raising adapter rank on SFT tasks offsets higher evaluation loss with negligible throughput reduction.
  • PreFT reaches near parity with full PEFT on reinforcement learning tasks across model scales.
  • Open-source vLLM kernels for prefill-only LoRA and ReFT make the method immediately usable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prefill-only restriction could be applied to other PEFT families to test broader applicability.
  • Large deployments might cut per-user memory by avoiding adapter storage during generation.
  • Adapters could be redesigned specifically for prefill efficiency rather than full-sequence use.
  • Combining PreFT with quantization or speculative decoding may compound the throughput gains.

Load-bearing premise

Discarding the adapter after prefill leaves the quality of later generated tokens largely intact on downstream tasks, and any shortfall can be offset by higher rank without throughput cost.

What would settle it

An experiment showing that even high-rank PreFT versions fall short of standard PEFT accuracy on a standard RL benchmark while the reported throughput advantage holds.

Figures

Figures reproduced from arXiv: 2605.14217 by Andrew Lanpouthakoun, Aryaman Arora, Ben Keigwin, Christopher Potts, Dan Jurafsky, Dhruv Pai, Zhengxuan Wu.

Figure 1
Figure 1. PreFT adapters approach parity on a variety of tasks with much better throughput. Inference throughput vs. accuracy for PEFTs (LoRA) and PreFTs (LoRAP, DiReFTP) for a variety of tasks. Tasks besides GSM8K are on Llama 3.1 8B Base. To roughly match parameter count in these plots, LoRA/LoRAP is always rank-1 and DiReFTP is rank-16 (Tülu-3, LongBench-Write) or rank-8 otherwise. On Tülu-3 (SFT) and for RL ta… view at source ↗
Figure 2
Figure 2. PreFTs maintain throughput more effectively than traditional PEFTs as the number of adapters increases. Inference throughput (tokens/s) on the Punica microbenchmark when comparing rank-1 LoRA (prefill-only vs. all positions) and rank-8 DiReFT (prefill-only vs. all positions) with varying numbers of adapters. view at source ↗
Figure 3
Figure 3. PreFTs underperform all-position PEFTs at matched parameter counts, but scale predictably with rank. Eval loss (best-LR) for LoRAP and DiReFTP compared to their all-position counterparts, on Tülu-3 (Llama-3.2-1B and Llama-3.1-8B) and OpenThoughts (Llama-3.2-1B), as a function of trainable parameter count. view at source ↗
Figure 4
Figure 4. Figure 4: LoRAP matches LoRA on length-following without sacrificing quality; DiReFTP does not. LongWriter on Llama-3.1-8B-Instruct (rank 16), four required output-length brackets. The prefill and all-position LoRA variants are indistinguishable on both Sl (left) and Sq (middle) at every bracket. DiReFTP instead overshoots target length, writing runaway generations that saturate the 32k-token decoding cap on most ≥2… view at source ↗
Figure 5
Figure 5. SFT of Llama 3.2 1B Instruct and Llama 3.1 8B Instruct on Tülu-3, comparing old and new … view at source ↗
Figure 6
Figure 6. Architecture of the PreFT fork. The left column is the one-time load path from a trained ReftModel into the engine; the right column is the per-step training–inference sync path used during on-policy RL. Solid arrows carry adapter configuration or weights; the dashed arrow marks per-forward state (the position mask) written by the model runner outside the compiled region. Adapter parameters live at stable … view at source ↗
Figure 7
Figure 7. LoRAP on MLP-only still outperforms DiReFTP and DiReFTA at high ranks. LongWriter on Llama-3.1-8B-Instruct (rank 16), four required output-length brackets. view at source ↗
read the original abstract

Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces PreFT (Prefill-only Finetuning), a technique that applies parameter-efficient adapters (LoRA and ReFT) exclusively to the prefill phase of LLM inference and discards them for autoregressive decoding. This design targets improved multi-adapter serving throughput by avoiding adapter overhead during token generation. On Llama 3.1 70B, the authors report 1.9× higher throughput when serving 512 adapters versus standard PEFT baselines, with an efficient vLLM implementation released. Task evaluations show that SFT evaluation loss increases can be offset by increasing adapter rank with negligible throughput cost, while RL tasks exhibit near-parity with full-token PEFTs across model scales.

Significance. If the empirical results hold, PreFT provides a practical accuracy-throughput tradeoff for personalized LLM serving at scale, shifting optimization focus from parameter count to serving efficiency. The concrete throughput gains on 70B-scale models, combined with the open implementation and consistent RL parity, represent a useful engineering contribution for inference systems handling many concurrent adapters.

minor comments (2)
  1. [Implementation] The exact mechanism for baking adapter effects into the KV cache during prefill (and its interaction with vLLM's memory management) would benefit from an expanded description or pseudocode in the implementation section to aid reproducibility; an illustrative sketch of the intended flow follows this list.
  2. [Experiments] The figure or table presenting the rank-scaling throughput curves for SFT compensation should include error bars or multiple runs to strengthen the claim of 'nearly no reduction in throughput'.
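To make the reproducibility request in point 1 concrete, here is a hedged toy sketch (not the authors' released vLLM fork; the class and method names are hypothetical) of how the prefill/decode split could interact with the KV cache: the per-user adapter perturbs only the hidden states written into the cache during prefill, and every decode step runs shared base weights against that cache, so decode cost stays flat no matter how many adapters the server holds.

import torch
import torch.nn as nn

class ToyCachedLM(nn.Module):
    """Single-layer attention toy used only to illustrate the prefill/decode split."""

    def __init__(self, vocab: int = 100, d: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.out = nn.Linear(d, vocab, bias=False)

    def prefill(self, prompt_ids, adapter=None):
        h = self.embed(prompt_ids)                     # [seq, d]
        if adapter is not None:                        # per-user delta, prefill only
            h = h + adapter(h)
        # The adapter-shifted states are what get cached: their influence on every
        # later token travels through these keys/values, not through extra weights.
        cache = (self.k(h), self.v(h))
        return cache, h[-1:]

    def decode_step(self, last_h, cache):
        # Pure base-model math: no adapter term on the decode path, so this step
        # costs the same whether the server holds 1 adapter or 512.
        keys, values = cache
        attn = torch.softmax(self.q(last_h) @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        logits = self.out(attn @ values)
        next_id = logits.argmax(-1)
        new_h = self.embed(next_id)
        cache = (torch.cat([keys, self.k(new_h)]), torch.cat([values, self.v(new_h)]))
        return next_id, new_h, cache

model = ToyCachedLM()
user_adapter = nn.Sequential(nn.Linear(32, 4, bias=False), nn.Linear(4, 32, bias=False))
cache, h = model.prefill(torch.tensor([3, 17, 42]), adapter=user_adapter)
del user_adapter                                       # safe to evict: decode never touches it
for _ in range(5):
    token, h, cache = model.decode_step(h, cache)

Whether the actual implementation evicts adapters this eagerly, and how that interacts with paged KV-cache block management, is exactly what the expanded description should pin down.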

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and recommendation to accept. We appreciate the recognition of PreFT's practical contributions to multi-adapter serving efficiency and the open-source implementation.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines PreFT explicitly as applying adapters only during prefill then discarding them, then reports direct empirical measurements of throughput (e.g., 1.9× on 512 adapters) and task performance (SFT loss offset by rank, RL near-parity) against standard PEFT baselines. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. All central results are externally falsifiable via the described experiments on Llama models and vLLM implementation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard assumption that LoRA and ReFT adapters can be applied selectively to prefix tokens without breaking the autoregressive generation process; no new free parameters or invented entities are introduced beyond the usual rank hyperparameter.

axioms (1)
  • domain assumption Adapters trained on full sequences remain effective when applied only to the prefill portion of the same sequences.
    Invoked when claiming that prefill-only training approximates full-token adaptation.

pith-pipeline@v0.9.0 · 5650 in / 1172 out tokens · 34852 ms · 2026-05-15T02:06:32.961245+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 9 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openrev...

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732

  3. [3]

    How to Scale Your Model

    Jacob Austin, Sholto Douglas, Roy Frostig, Anselm Levskaya, Charlie Chen, Sharad Vikram, Federico Lebron, Peter Choy, Vinay Ramasesh, Albert Webson, and Reiner Pope. How to Scale Your Model. Google DeepMind, 2025. URL https://jax-ml.github.io/scaling-book/

  4. [4]

    Longwriter: Unleashing 10,000+ word generation from long context llms, 2024

    Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longwriter: Unleashing 10,000+ word generation from long context llms, 2024. URL https://arxiv.org/abs/2408.07055

  5. [5]

LoRA-XS: Low-rank adaptation with extremely small number of parameters

    Klaudia Balazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. LoRA-XS: Low-rank adaptation with extremely small number of parameters. In Inês Lynce, Nello Murano, Mauro Vallati, Serena Villata, Federico Chesani, Michela Milano, Andrea Omicini, and Mehdi Dastani, editors, ECAI 2025 - 28th European Conference on Artificial Intelligence, 25-30 Oct...

  6. [6]

    Punica: Multi-tenant lora serving

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant lora serving. In Phillip B. Gibbons, Gennady Pekhimenko, and Christopher De Sa, editors, Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024. mlsys.org, 2024. URL https://proce...

  7. [7]

Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

  8. [8]

    Learning rate scaling across LoRA ranks and transfer to full finetuning

    Nan Chen, Soledad Villar, and Soufiane Hayou. Learning rate scaling across LoRA ranks and transfer to full finetuning. arXiv:2602.06204, 2026. URL https://arxiv.org/abs/2602.06204

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  10. [10]

Some methods for strengthening the common χ² tests

    William G. Cochran. Some methods for strengthening the common χ² tests. Biometrics, 10(4): 417--451, 1954. URL https://www.jstor.org/stable/3001616

  11. [11]

    Split personality training: Revealing latent knowledge through alternate personalities

    Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, and Dietrich Klakow. Split personality training: Revealing latent knowledge through alternate personalities. arXiv:2602.05532, 2026. URL https://arxiv.org/abs/2602.05532

  12. [12]

    Cartridges: Lightweight and general- purpose long context representations via self-study

    Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, and Christopher Re. Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv:2506.06266, 2025. URL https://arxiv.org/abs/2506.06266

  13. [13]

Compress then serve: Serving thousands of LoRA adapters with little overhead

    Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan H. Greenewald, Mikhail Yurochkin, and Justin Solomon. Compress then serve: Serving thousands of LoRA adapters with little overhead. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, ...

  14. [14]

Activated LoRA: Fine-tuned LLMs for intrinsics

    Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, and David Cox. Activated LoRA: Fine-tuned LLMs for intrinsics. arXiv:2504.12397, 2025. URL https://arxiv.org/abs/2504.12397

  15. [15]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

  16. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ

  17. [17]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. CoRR, abs/2103.03874, 2021 b . URL https://arxiv.org/abs/2103.03874

  18. [18]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP . In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,...

  19. [19]

LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  20. [20]

A rank stabilization scaling factor for fine-tuning with LoRA

    Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA . arXiv:2312.03732, 2023. URL https://arxiv.org/abs/2312.03732

  21. [21]

VeRA: Vector-based random matrix adaptation

    Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=NjNfLdxr3A

  22. [22]

Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors, Proceedings of the 29th Symposium on Operating Syste...

  23. [23]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

  24. [24]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 45...

  25. [25]

DoRA: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machi...

  26. [26]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab . On-policy distillation. Thinking Machines Lab: Connectionism, 2025. URL https://thinkingmachines.ai/blog/on-policy-distillation

  27. [27]

    Statistical aspects of the analysis of data from retrospective studies of disease

Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4): 719--748, 1959. URL https://academic.oup.com/jnci/article-abstract/22/4/719/900746

  28. [28]

    Note on the sampling error of the difference between correlated proportions or percentages

Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2): 153--157, 1947

  29. [29]

Learning to reason in 13 parameters

    John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, and Saeed Mahloujifar. Learning to reason in 13 parameters. arXiv:2602.04118, 2026. URL https://arxiv.org/abs/2602.04118

  30. [30]

    Predictive-LoRA : A proactive and fragmentation-aware serverless inference system for LLMs

    Yinan Ni, Xiao Yang, Yuqi Tang, Zhimin Qiu, Chen Wang, and Tingzhou Yuan. Predictive-LoRA : A proactive and fragmentation-aware serverless inference system for LLMs . In Proceedings of the 2025 6th International Conference on Computer Science and Management Technology, ICCSMT '25, page 1267–1273, New York, NY, USA, 2026. Association for Computing Machiner...

  31. [31]

MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer

    Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online, November 2020. A...

  32. [32]

    Fast PEFT serving at scale, 2025

    Nihal Potdar, Megha Agarwal, Hanlin Tang, Asfandyar Qureshi, Qi Zheng, Daya Khudia, Tianrun Li, Nikunj Gupta, and James Thomas. Fast PEFT serving at scale, 2025. URL https://www.databricks.com/blog/fast-peft-serving-scale

  33. [33]

    Mooncake: A KVCache-centric disaggregated architecture for LLM serving

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. arXiv:2407.00079, 2025. URL https://arxiv.org/abs/2407.00079

  34. [34]

    Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

    Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, and Mingxing Zhang. Prefill-as-a-service: KVCache of next-generation models could go cross-datacenter. arXiv:2604.15039, 2026. URL https://arxiv.org/abs/2604.15039

  35. [35]

    LoRA without regret

    John Schulman and Thinking Machines Lab . LoRA without regret. Thinking Machines Lab: Connectionism, 2025. doi:10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

  36. [36]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

  37. [37]

S-LoRA: Serving thousands of concurrent LoRA adapters

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv:2311.03285, 2024. URL https://arxiv.org/abs/2311.03285

  38. [38]

    Command-V : Pasting LLM behaviors via activation profiles

    Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, and Daphne Ippolito. Command-V : Pasting LLM behaviors via activation profiles. arXiv:2506.19140, 2025. URL https://arxiv.org/abs/2506.19140

  39. [39]

    dLoRA : Dynamically orchestrating requests and adapters for LoRA LLM serving

    Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. dLoRA : Dynamically orchestrating requests and adapters for LoRA LLM serving. In Ada Gavrilovska and Douglas B. Terry, editors, 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024 , pages 911--927. USENIX Associatio...

  40. [40]

ReFT: Representation finetuning for language models

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual...

  41. [41]

Towards context-robust LLMs: A gated representation fine-tuning approach

    Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, and Hui Liu. Towards context-robust LLMs: A gated representation fine-tuning approach. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  42. [42]

    Improving the serving performance of multi- LoRA large language models via efficient LoRA and KV cache management

    Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. Improving the serving performance of multi- LoRA large language models via efficient LoRA and KV cache management. arXiv:2505.03756, 2025. URL https://arxiv.org/abs/2505.03756

  43. [43]

    Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Ada Gavrilovska and Douglas B. Terry, editors, 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-...

  44. [44]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv:2311.07911, 2023. URL https://arxiv.org/abs/2311.07911

  45. [45]

Cannikin: No lagger of SLO in concurrent multiple LoRA LLM serving

    Ruidong Zhu, Ziyue Jiang, Zhi Zhang, Xin Liu, Xuanzhe Liu, and Xin Jin. Cannikin: No lagger of SLO in concurrent multiple LoRA LLM serving. IEEE Trans. Parallel Distributed Syst., 36(9): 1972--1984, 2025. doi:10.1109/TPDS.2025.3590014. URL https://doi.org/10.1109/TPDS.2025.3590014

  46. [46]

LoRAFusion: Efficient LoRA fine-tuning for LLMs

    Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, and Gennady Pekhimenko. LoRAFusion: Efficient LoRA fine-tuning for LLMs. In Antonio Barbalace, Luo Mai, Roxana Geambasu, and Peter R. Pietzuch, editors, Proceedings of the 21st European Conference on Computer Systems, EuroSys 2026, McEwan Hall/The University of Edinburgh, Edinburgh, Scotland, U...