PreFT: Prefill-only finetuning for efficient inference
Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3
The pith
Applying adapters only during prefill and discarding them for decode raises multi-adapter serving throughput nearly twofold while keeping task performance near standard PEFT levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prefill-only finetuning applies the adapter exclusively to prefill tokens and discards it during decode, delivering substantially higher multi-adapter serving throughput than conventional PEFT. On RL the method already approaches parity with standard PEFT, and on SFT the remaining gap can be closed by raising adapter rank.
What carries the argument
The prefill-only adapter, which restricts low-rank (LoRA) or representation (ReFT) updates to the initial context tokens and is removed before generation begins.
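A minimal sketch of that mechanism as we read it (our own construction, not the authors' released kernels; the shapes, rank-16 setting, and `is_prefill` flag are assumptions for illustration):

```python
import torch
import torch.nn as nn

class PrefillOnlyLoRALinear(nn.Module):
    """Hypothetical illustration: a linear layer whose LoRA update
    is applied only while processing prefill (context) tokens."""

    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)   # frozen pretrained weight
        self.lora_a = nn.Linear(d_in, rank, bias=False)  # trainable low-rank factors
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # standard LoRA init: delta starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, is_prefill: bool) -> torch.Tensor:
        y = self.base(x)
        if is_prefill:
            # The adapter contributes only here; adapter-conditioned activations
            # flow into the KV cache and keep influencing decode indirectly.
            y = y + self.scale * self.lora_b(self.lora_a(x))
        return y  # during decode the layer is exactly the base model

layer = PrefillOnlyLoRALinear(64, 64)
prompt_h = torch.randn(1, 128, 64)
_ = layer(prompt_h, is_prefill=True)   # prefill: whole prompt at once, adapter active
tok_h = torch.randn(1, 1, 64)
_ = layer(tok_h, is_prefill=False)     # decode: one token at a time, adapter dropped
```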
If this is right
- Serving 512 adapters on Llama 3.1 70B reaches 1.9× the throughput of traditional PEFT.
- Raising adapter rank on SFT tasks offsets higher evaluation loss with negligible throughput reduction.
- PreFT reaches near parity with full PEFT on reinforcement learning tasks across model scales.
- Open-source vLLM kernels for prefill-only LoRA and ReFT make the method immediately usable.
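For concreteness, stock vLLM already exposes a multi-LoRA serving API, so serving with the released prefill-only kernels could plausibly look like the sketch below. The prefill-only switch is a hypothetical placeholder: we have not verified the fork's configuration surface, only the upstream vLLM calls.

```python
# Sketch of multi-adapter serving using stock vLLM's LoRA API. The paper's
# released kernels make the adapter prefill-only; treat the commented-out
# `enable_prefill_only_lora` flag as a hypothetical stand-in for whatever
# switch the authors' fork actually exposes.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_lora=True,
    max_loras=512,                      # many concurrent per-user adapters
    # enable_prefill_only_lora=True,    # hypothetical PreFT switch
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(
    ["Summarise my last meeting."],
    params,
    lora_request=LoRARequest("user-042", 42, "/adapters/user-042"),
)
print(out[0].outputs[0].text)
```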
Where Pith is reading between the lines
- The same prefill-only restriction could be applied to other PEFT families to test broader applicability.
- Large deployments might cut per-user memory by avoiding adapter storage during generation.
- Adapters could be redesigned specifically for prefill efficiency rather than full-sequence use.
- Combining PreFT with quantization or speculative decoding may compound the throughput gains.
Load-bearing premise
Discarding the adapter after prefill leaves the quality of later generated tokens largely intact on downstream tasks, and any shortfall can be offset by higher rank without throughput cost.
What would settle it
An experiment showing that even high-rank PreFT versions fall short of standard PEFT accuracy on a standard RL benchmark while the reported throughput advantage holds.
Figures
Original abstract
Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs (1.9× the throughput when serving 512 adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
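The prefill/decode mismatch the abstract leans on can be made concrete with a back-of-envelope model: prefill amortises each adapter's weight traffic over the whole prompt, while decode must re-read the adapter for every generated token. A toy calculation (our own nominal numbers, not the paper's):

```python
# Back-of-envelope illustration of why per-user adapters hurt decode more
# than prefill: arithmetic intensity = adapter FLOPs per byte of adapter
# weight traffic. Dimensions, rank, and fp16 storage are assumed values.

def adapter_intensity(tokens: int, d: int, rank: int, bytes_per_param: int = 2) -> float:
    flops = 4 * tokens * d * rank                    # x @ A then @ B, per token
    adapter_bytes = 2 * d * rank * bytes_per_param   # A and B read once per batch
    return flops / adapter_bytes

d, rank = 8192, 16
print(f"prefill (2048 tokens/adapter load): {adapter_intensity(2048, d, rank):.0f} FLOPs/byte")
print(f"decode  (1 token/adapter load):     {adapter_intensity(1, d, rank):.0f} FLOPs/byte")
# Prefill spreads the adapter load over thousands of tokens (compute-bound);
# decode re-pays the full memory traffic at every step (bandwidth-bound), so
# dropping the adapter at decode removes exactly the expensive part.
```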
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PreFT (Prefill-only Finetuning), a technique that applies parameter-efficient adapters (LoRA and ReFT) exclusively to the prefill phase of LLM inference and discards them for autoregressive decoding. This design targets improved multi-adapter serving throughput by avoiding adapter overhead during token generation. On Llama 3.1 70B, the authors report 1.9× the throughput of standard PEFT baselines when serving 512 adapters, with an efficient vLLM implementation released. Task evaluations show that SFT evaluation-loss increases can be offset by raising adapter rank at negligible throughput cost, while RL tasks exhibit near-parity with full-token PEFTs across model scales.
Significance. If the empirical results hold, PreFT provides a practical accuracy-throughput tradeoff for personalized LLM serving at scale, shifting optimization focus from parameter count to serving efficiency. The concrete throughput gains on 70B-scale models, combined with the open implementation and consistent RL parity, represent a useful engineering contribution for inference systems handling many concurrent adapters.
Minor comments (2)
- [Implementation] The exact mechanism for baking adapter effects into the KV cache during prefill (and its interaction with vLLM's memory management) would benefit from an expanded description or pseudocode in the implementation section to aid reproducibility; one plausible reading is sketched after this list.
- [Experiments] The figure or table presenting the rank-scaling throughput curves for SFT compensation should include error bars or results from multiple runs to strengthen the claim of 'nearly no reduction in throughput'.
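One plausible reading of the KV-cache mechanism the first comment asks about (a minimal sketch under our own assumptions, not the authors' vLLM implementation):

```python
import torch

# The adapter perturbs the key/value projections during prefill, those
# vectors are cached, and decode attends over the cache with the unmodified
# base model, so the adapter's influence persists even after its weights
# are freed. The value path works analogously; all values here are toy.
d, r, n_ctx = 64, 8, 128
Wq = torch.randn(d, d) / d ** 0.5      # frozen base projections
Wk = torch.randn(d, d) / d ** 0.5
A = torch.randn(d, r) * 0.02           # toy LoRA factors on the key projection
B = torch.randn(r, d) * 0.02

ctx = torch.randn(n_ctx, d)
k_cache = ctx @ (Wk + A @ B)           # prefill: adapter-conditioned keys, cached

# Adapter weights can be freed here; decode never touches A or B, yet
# every attention read still sees keys the adapter shaped.
q = torch.randn(1, d) @ Wq             # decode query via base weights only
attn = torch.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
```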
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and recommendation to accept. We appreciate the recognition of PreFT's practical contributions to multi-adapter serving efficiency and the open-source implementation.
Circularity Check
No significant circularity
Full rationale
The paper defines PreFT explicitly as applying adapters only during prefill then discarding them, then reports direct empirical measurements of throughput (e.g., 1.9× on 512 adapters) and task performance (SFT loss offset by rank, RL near-parity) against standard PEFT baselines. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. All central results are externally falsifiable via the described experiments on Llama models and vLLM implementation.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Adapters trained on full sequences remain effective when applied only to the prefill portion of the same sequences.
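This assumption is directly testable: take a conventionally trained adapter, gate its delta on token position at evaluation time so only the prompt segment is adapted, and compare task metrics against the all-token baseline. A minimal sketch of such a position gate (our construction; the shapes, names, and scale are assumed, not the paper's):

```python
import torch

def gated_lora_delta(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                     prompt_len: int, scale: float) -> torch.Tensor:
    """Apply a LoRA delta only to the first `prompt_len` (prefill) positions.

    x: (batch, seq, d_in); A: (d_in, r); B: (r, d_out). A test harness for
    the ledger's assumption, not the paper's implementation.
    """
    delta = scale * (x @ A @ B)                          # (batch, seq, d_out)
    positions = torch.arange(x.shape[1], device=x.device)
    mask = (positions < prompt_len).float()[None, :, None]
    return delta * mask                                  # zero past the prompt

x = torch.randn(2, 100, 64)
A, B = torch.randn(64, 8) * 0.02, torch.randn(8, 64) * 0.02
d = gated_lora_delta(x, A, B, prompt_len=40, scale=2.0)
assert d[:, 40:].abs().max() == 0    # no adapter effect after the prefill segment
```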