pith. machine review for the scientific record.

arxiv: 2605.14217 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL · cs.SY · eess.SY

Recognition: no theorem link

PreFT: Prefill-only finetuning for efficient inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.SY · eess.SY
keywords prefill-only finetuning · PEFT · LoRA · LLM serving · multi-adapter inference · throughput optimization · personalized models · ReFT
0 comments

The pith

Applying adapters only during prefill and discarding them afterward raises serving throughput nearly twofold while keeping performance near standard PEFT levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are now personalized at scale through parameter-efficient finetuning, yet serving many user-specific adapters slows generation because decode steps suffer more overhead than prefill. This paper proposes restricting the adapter to the prefill phase only, then removing it for autoregressive decoding. The change produces 1.9 times higher throughput when handling 512 adapters on Llama 3.1 70B. On supervised finetuning tasks the evaluation loss rises but recovers when adapter rank is increased with almost no throughput penalty. On reinforcement learning tasks the prefill-only versions reach near parity with full adapters, making multi-user personalization more practical.

Core claim

Prefill-only finetuning applies the adapter exclusively to prefill tokens and discards it for decode, delivering substantially higher multi-adapter serving throughput than conventional PEFT while preserving task performance: on SFT any shortfall can be recovered by raising adapter rank, and on RL the method already approaches parity.

What carries the argument

The prefill-only adapter, which limits low-rank or representation updates to the initial context tokens and is removed during generation.
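A minimal sketch of such a position-gated adapter, assuming a standard LoRA parameterisation in PyTorch; the class name, the explicit prefill_mask argument, and the shapes are illustrative conveniences, not the paper's released kernels. The gate multiplies the low-rank delta by 1 on prefill tokens and by 0 on generated tokens, so a decode step sees exactly the frozen base layer.

import torch
import torch.nn as nn

class PrefillOnlyLoRALinear(nn.Module):
    """Linear layer whose low-rank update fires only at prefill positions (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # frozen pretrained weights
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, prefill_mask: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_in]; prefill_mask: [batch, seq], 1 on context tokens, 0 on decoded ones.
        delta = self.lora_B(self.lora_A(x)) * self.scale
        return self.base(x) + prefill_mask.unsqueeze(-1) * delta

layer = PrefillOnlyLoRALinear(nn.Linear(64, 64), rank=8)
prompt = torch.randn(1, 10, 64)
_ = layer(prompt, torch.ones(1, 10))              # prefill: adapter active
new_token = torch.randn(1, 1, 64)
_ = layer(new_token, torch.zeros(1, 1))           # decode: reduces to the base layer

In a real serving stack the zero-mask branch would simply skip the adapter matmuls altogether, which is where the decode-side savings would come from.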

If this is right

  • Serving 512 adapters on Llama 3.1 70B reaches 1.9 times the throughput of traditional PEFT.
  • Raising adapter rank on SFT tasks offsets higher evaluation loss with negligible throughput reduction.
  • PreFT reaches near parity with full PEFT on reinforcement learning tasks across model scales.
  • Open-source vLLM kernels for prefill-only LoRA and ReFT make the method immediately usable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prefill-only restriction could be applied to other PEFT families to test broader applicability.
  • Large deployments might cut per-user memory by avoiding adapter storage during generation.
  • Adapters could be redesigned specifically for prefill efficiency rather than full-sequence use.
  • Combining PreFT with quantization or speculative decoding may compound the throughput gains.

Load-bearing premise

Discarding the adapter after prefill leaves the quality of later generated tokens largely intact on downstream tasks, and any shortfall can be offset by higher rank without throughput cost.

What would settle it

An experiment showing that even high-rank PreFT versions fall short of standard PEFT accuracy on a standard RL benchmark while the reported throughput advantage holds.

Figures

Figures reproduced from arXiv: 2605.14217 by Andrew Lanpouthakoun, Aryaman Arora, Ben Keigwin, Christopher Potts, Dan Jurafsky, Dhruv Pai, Zhengxuan Wu.

Figure 1
Figure 1. PreFT adapters approach parity on a variety of tasks with much better throughput. Inference throughput vs. accuracy for PEFTs (LoRA) and PreFTs (LoRAP, DiReFTP) for a variety of tasks. Tasks besides GSM8K are on Llama 3.1 8B Base. To roughly match parameter count in these plots, LoRA/LoRAP is always rank-1 and DiReFTP is rank-16 (Tülu-3, LongBench-Write) or rank-8 otherwise. On Tülu-3 (SFT) and for RL ta… view at source ↗
Figure 2
Figure 2. PreFTs maintain throughput more effectively than traditional PEFTs as the number of adapters increases. Inference throughput (tokens/s) on the Punica microbenchmark when comparing rank-1 LoRA (prefill-only vs. all positions) and rank-8 DiReFT (prefill-only vs. all positions) with varying numbers of adapters. view at source ↗
Figure 3
Figure 3. PreFTs underperform all-position PEFTs at matched parameter counts, but scale predictably with rank. Eval loss (best-LR) for LoRAP and DiReFTP compared to their all-position counterparts, on Tülu-3 (Llama-3.2-1B and Llama-3.1-8B) and OpenThoughts (Llama-3.2-1B), as a function of trainable parameter count. view at source ↗
Figure 4
Figure 4. Figure 4: LoRAP matches LoRA on length-following without sacrificing quality; DiReFTP does not. LongWriter on Llama-3.1-8B-Instruct (rank 16), four required output-length brackets. The prefill and all-position LoRA variants are indistinguishable on both Sl (left) and Sq (middle) at every bracket. DiReFTP instead overshoots target length, writing runaway generations that saturate the 32k-token decoding cap on most ≥2… view at source ↗
Figure 5
Figure 5. SFT of Llama 3.2 1B Instruct and Llama 3.1 8B Instruct on Tülu-3, comparing old and new … view at source ↗
Figure 6
Figure 6. Architecture of the PreFT fork. The left column is the one-time load path from a trained ReftModel into the engine; the right column is the per-step training–inference sync path used during on-policy RL. Solid arrows carry adapter configuration or weights; the dashed arrow marks per-forward state (the position mask) written by the model runner outside the compiled region. Adapter parameters live at stable … view at source ↗
Figure 7
Figure 7. LoRAP on MLP-only still outperforms DiReFTP and DiReFTA at high ranks. LongWriter on Llama-3.1-8B-Instruct (rank 16), four required output-length brackets. view at source ↗
read the original abstract

Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces PreFT (Prefill-only Finetuning), a technique that applies parameter-efficient adapters (LoRA and ReFT) exclusively to the prefill phase of LLM inference and discards them for autoregressive decoding. This design targets improved multi-adapter serving throughput by avoiding adapter overhead during token generation. On Llama 3.1 70B, the authors report 1.9× higher throughput when serving 512 adapters versus standard PEFT baselines, with an efficient vLLM implementation released. Task evaluations show that SFT evaluation loss increases can be offset by increasing adapter rank with negligible throughput cost, while RL tasks exhibit near-parity with full-token PEFTs across model scales.

Significance. If the empirical results hold, PreFT provides a practical accuracy-throughput tradeoff for personalized LLM serving at scale, shifting optimization focus from parameter count to serving efficiency. The concrete throughput gains on 70B-scale models, combined with the open implementation and consistent RL parity, represent a useful engineering contribution for inference systems handling many concurrent adapters.

minor comments (2)
  1. [Implementation] The exact mechanism for baking adapter effects into the KV cache during prefill (and its interaction with vLLM's memory management) would benefit from an expanded description or pseudocode in the implementation section to aid reproducibility; an illustrative sketch of the intended flow follows this list.
  2. [Experiments] The figure or table presenting the rank-scaling throughput curves for SFT compensation should include error bars or multiple runs to strengthen the claim of 'nearly no reduction in throughput'.
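To make the reproducibility request in point 1 concrete, here is a hedged toy sketch (not the authors' released vLLM fork; the class and method names are hypothetical) of how the prefill/decode split could interact with the KV cache: the per-user adapter perturbs only the hidden states written into the cache during prefill, and every decode step runs shared base weights against that cache, so decode cost stays flat no matter how many adapters the server holds.

import torch
import torch.nn as nn

class ToyCachedLM(nn.Module):
    """Single-layer attention toy used only to illustrate the prefill/decode split."""

    def __init__(self, vocab: int = 100, d: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.out = nn.Linear(d, vocab, bias=False)

    def prefill(self, prompt_ids, adapter=None):
        h = self.embed(prompt_ids)                     # [seq, d]
        if adapter is not None:                        # per-user delta, prefill only
            h = h + adapter(h)
        # The adapter-shifted states are what get cached: their influence on every
        # later token travels through these keys/values, not through extra weights.
        cache = (self.k(h), self.v(h))
        return cache, h[-1:]

    def decode_step(self, last_h, cache):
        # Pure base-model math: no adapter term on the decode path, so this step
        # costs the same whether the server holds 1 adapter or 512.
        keys, values = cache
        attn = torch.softmax(self.q(last_h) @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
        logits = self.out(attn @ values)
        next_id = logits.argmax(-1)
        new_h = self.embed(next_id)
        cache = (torch.cat([keys, self.k(new_h)]), torch.cat([values, self.v(new_h)]))
        return next_id, new_h, cache

model = ToyCachedLM()
user_adapter = nn.Sequential(nn.Linear(32, 4, bias=False), nn.Linear(4, 32, bias=False))
cache, h = model.prefill(torch.tensor([3, 17, 42]), adapter=user_adapter)
del user_adapter                                       # safe to evict: decode never touches it
for _ in range(5):
    token, h, cache = model.decode_step(h, cache)

Whether the actual implementation evicts adapters this eagerly, and how that interacts with paged KV-cache block management, is exactly what the expanded description should pin down.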

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and recommendation to accept. We appreciate the recognition of PreFT's practical contributions to multi-adapter serving efficiency and the open-source implementation.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines PreFT explicitly as applying adapters only during prefill then discarding them, then reports direct empirical measurements of throughput (e.g., 1.9× on 512 adapters) and task performance (SFT loss offset by rank, RL near-parity) against standard PEFT baselines. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. All central results are externally falsifiable via the described experiments on Llama models and vLLM implementation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard assumption that LoRA and ReFT adapters can be applied selectively to prefix tokens without breaking the autoregressive generation process; no new free parameters or invented entities are introduced beyond the usual rank hyperparameter.

axioms (1)
  • domain assumption Adapters trained on full sequences remain effective when applied only to the prefill portion of the same sequences.
    Invoked when claiming that prefill-only training approximates full-token adaptation.

pith-pipeline@v0.9.0 · 5650 in / 1172 out tokens · 34852 ms · 2026-05-15T02:06:32.961245+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 9 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openrev...

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732

  3. [3]

    How to Scale Your Model

    Jacob Austin, Sholto Douglas, Roy Frostig, Anselm Levskaya, Charlie Chen, Sharad Vikram, Federico Lebron, Peter Choy, Vinay Ramasesh, Albert Webson, and Reiner Pope. How to Scale Your Model. Google DeepMind, 2025. URL https://jax-ml.github.io/scaling-book/

  4. [4]

    Longwriter: Unleashing 10,000+ word generation from long context llms, 2024

    Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longwriter: Unleashing 10,000+ word generation from long context llms, 2024. URL https://arxiv.org/abs/2408.07055

  5. [5]

LoRA-XS: Low-rank adaptation with extremely small number of parameters

    Klaudia Balazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. LoRA-XS: Low-rank adaptation with extremely small number of parameters. In Inês Lynce, Nello Murano, Mauro Vallati, Serena Villata, Federico Chesani, Michela Milano, Andrea Omicini, and Mehdi Dastani, editors, ECAI 2025 - 28th European Conference on Artificial Intelligence, 25-30 Oct...

  6. [6]

    Punica: Multi-tenant lora serving

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant lora serving. In Phillip B. Gibbons, Gennady Pekhimenko, and Christopher De Sa, editors, Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024. mlsys.org, 2024. URL https://proce...

  7. [7]

Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

  8. [8]

    Learning rate scaling across LoRA ranks and transfer to full finetuning

    Nan Chen, Soledad Villar, and Soufiane Hayou. Learning rate scaling across LoRA ranks and transfer to full finetuning. arXiv:2602.06204, 2026. URL https://arxiv.org/abs/2602.06204

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  10. [10]

Some methods for strengthening the common χ² tests

    William G. Cochran. Some methods for strengthening the common χ² tests. Biometrics, 10(4): 417--451, 1954. URL https://www.jstor.org/stable/3001616

  11. [11]

    Split personality training: Revealing latent knowledge through alternate personalities

    Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, and Dietrich Klakow. Split personality training: Revealing latent knowledge through alternate personalities. arXiv:2602.05532, 2026. URL https://arxiv.org/abs/2602.05532

  12. [12]

    Cartridges: Lightweight and general- purpose long context representations via self-study

    Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, and Christopher Re. Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv:2506.06266, 2025. URL https://arxiv.org/abs/2506.06266

  13. [13]

Compress then serve: Serving thousands of LoRA adapters with little overhead

    Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan H. Greenewald, Mikhail Yurochkin, and Justin Solomon. Compress then serve: Serving thousands of LoRA adapters with little overhead. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, ...

  14. [14]

Activated LoRA: Fine-tuned LLMs for intrinsics

    Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, and David Cox. Activated LoRA: Fine-tuned LLMs for intrinsics. arXiv:2504.12397, 2025. URL https://arxiv.org/abs/2504.12397

  15. [15]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

  16. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ

  17. [17]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. CoRR, abs/2103.03874, 2021 b . URL https://arxiv.org/abs/2103.03874

  18. [18]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP . In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,...

  19. [19]

LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  20. [20]

A rank stabilization scaling factor for fine-tuning with LoRA

    Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA . arXiv:2312.03732, 2023. URL https://arxiv.org/abs/2312.03732

  21. [21]

VeRA: Vector-based random matrix adaptation

    Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=NjNfLdxr3A

  22. [22]

Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors, Proceedings of the 29th Symposium on Operating Syste...

  23. [23]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

  24. [24]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 45...

  25. [25]

DoRA: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machi...

  26. [26]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab . On-policy distillation. Thinking Machines Lab: Connectionism, 2025. URL https://thinkingmachines.ai/blog/on-policy-distillation

  27. [27]

    Statistical aspects of the analysis of data from retrospective studies of disease

Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4): 719--748, 1959. URL https://academic.oup.com/jnci/article-abstract/22/4/719/900746

  28. [28]

    Note on the sampling error of the difference between correlated proportions or percentages

Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2): 153--157, 1947

  29. [29]

Learning to reason in 13 parameters

    John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, and Saeed Mahloujifar. Learning to reason in 13 parameters. arXiv:2602.04118, 2026. URL https://arxiv.org/abs/2602.04118

  30. [30]

    Predictive-LoRA : A proactive and fragmentation-aware serverless inference system for LLMs

    Yinan Ni, Xiao Yang, Yuqi Tang, Zhimin Qiu, Chen Wang, and Tingzhou Yuan. Predictive-LoRA : A proactive and fragmentation-aware serverless inference system for LLMs . In Proceedings of the 2025 6th International Conference on Computer Science and Management Technology, ICCSMT '25, page 1267–1273, New York, NY, USA, 2026. Association for Computing Machiner...

  31. [31]

MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer

    Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online, November 2020. A...

  32. [32]

    Fast PEFT serving at scale, 2025

    Nihal Potdar, Megha Agarwal, Hanlin Tang, Asfandyar Qureshi, Qi Zheng, Daya Khudia, Tianrun Li, Nikunj Gupta, and James Thomas. Fast PEFT serving at scale, 2025. URL https://www.databricks.com/blog/fast-peft-serving-scale

  33. [33]

    Mooncake: A KVCache-centric disaggregated architecture for LLM serving

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. arXiv:2407.00079, 2025. URL https://arxiv.org/abs/2407.00079

  34. [34]

    Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

    Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, and Mingxing Zhang. Prefill-as-a-service: KVCache of next-generation models could go cross-datacenter. arXiv:2604.15039, 2026. URL https://arxiv.org/abs/2604.15039

  35. [35]

    LoRA without regret

    John Schulman and Thinking Machines Lab . LoRA without regret. Thinking Machines Lab: Connectionism, 2025. doi:10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

  36. [36]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

  37. [37]

S-LoRA: Serving thousands of concurrent LoRA adapters

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv:2311.03285, 2024. URL https://arxiv.org/abs/2311.03285

  38. [38]

    Command-V : Pasting LLM behaviors via activation profiles

    Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, and Daphne Ippolito. Command-V : Pasting LLM behaviors via activation profiles. arXiv:2506.19140, 2025. URL https://arxiv.org/abs/2506.19140

  39. [39]

    dLoRA : Dynamically orchestrating requests and adapters for LoRA LLM serving

    Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. dLoRA : Dynamically orchestrating requests and adapters for LoRA LLM serving. In Ada Gavrilovska and Douglas B. Terry, editors, 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024 , pages 911--927. USENIX Associatio...

  40. [40]

ReFT: Representation finetuning for language models

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual...

  41. [41]

Towards context-robust LLMs: A gated representation fine-tuning approach

    Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, and Hui Liu. Towards context-robust LLMs: A gated representation fine-tuning approach. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  42. [42]

    Improving the serving performance of multi- LoRA large language models via efficient LoRA and KV cache management

    Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. Improving the serving performance of multi- LoRA large language models via efficient LoRA and KV cache management. arXiv:2505.03756, 2025. URL https://arxiv.org/abs/2505.03756

  43. [43]

    Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Ada Gavrilovska and Douglas B. Terry, editors, 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-...

  44. [44]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv:2311.07911, 2023. URL https://arxiv.org/abs/2311.07911

  45. [45]

Cannikin: No lagger of SLO in concurrent multiple LoRA LLM serving

    Ruidong Zhu, Ziyue Jiang, Zhi Zhang, Xin Liu, Xuanzhe Liu, and Xin Jin. Cannikin: No lagger of SLO in concurrent multiple LoRA LLM serving. IEEE Trans. Parallel Distributed Syst., 36(9): 1972--1984, 2025. doi:10.1109/TPDS.2025.3590014. URL https://doi.org/10.1109/TPDS.2025.3590014

  46. [46]

LoRAFusion: Efficient LoRA fine-tuning for LLMs

    Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, and Gennady Pekhimenko. LoRAFusion: Efficient LoRA fine-tuning for LLMs. In Antonio Barbalace, Luo Mai, Roxana Geambasu, and Peter R. Pietzuch, editors, Proceedings of the 21st European Conference on Computer Systems, EuroSys 2026, McEwan Hall/The University of Edinburgh, Edinburgh, Scotland, U...