{"total":13,"items":[{"citing_arxiv_id":"2605.14217","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PreFT: Prefill-only finetuning for efficient inference","primary_cat":"cs.LG","submitted_at":"2026-05-14T00:19:41+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07063","ref_index":191,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-08T00:16:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dr. 
Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21571","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies","primary_cat":"cs.AI","submitted_at":"2026-04-23T11:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[16] and QLoRA [17] enable efficient adapter training, LoraHub [18] and task arithmetic [19, 20] demonstrate multi-adapter composition, and S-LoRA [21] enables serving thousands of concurrent adapters from a single base model while Punica [22] provides efficient multi-tenant batching via segmented gather-matrix-vector kernels. Activation steering methods, including Contrastive Activation Addition [23] and Inference-Time Intervention [24], show that behavioral modification without weight changes can be both effective and relatively lightweight. LLM personalization approaches, including LaMP [1], Personalized Soups [2], P-RLHF [3], and VPL [4], capture user preferences through various mechanisms. 
However, none of these approaches architecturally separates user state from shared weights,"},{"citing_arxiv_id":"2604.16583","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-04-17T14:34:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"POLAR formulates joint LoRA adapter caching and routing as a two-timescale contextual bandit, achieving sublinear regret bounds and outperforming non-adaptive baselines in experiments with real adapters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07173","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models","primary_cat":"cs.DC","submitted_at":"2026-04-08T15:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"org/abs/2508.17624 [33] Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen-tau Yih, and Mike Lewis. 2024. In-Context Pretraining: Language Modeling Beyond Document Boundaries. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=LXVswInHOo [34] Xiao Shi, Jiangsu Du, Zhiguang Chen, and Yutong Lu. 2025. AuLoRA: Fine-Grained Loading and Computation Orchestration for Efficient LoRA LLM Serving. In 2025 IEEE 43rd International Conference on Computer Design (ICCD). 
277-284. doi:10.1109/ICCD65941.2025.00046 [35] Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505."},{"citing_arxiv_id":"2604.06370","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache","primary_cat":"cs.DC","submitted_at":"2026-04-07T18:52:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"computational demands of modern agentic workflows (§2.3). 2.1 LLM Serving LLMs [2, 13, 14, 55, 64] predominantly adopt the Transformer architecture to generate text auto-regressively. During generation, tokens interact with historical context via attention mechanism [3, 51, 56], where sequential order is typically captured by applying Rotary Position Embedding (RoPE) [53] to the Query (𝑄) and Key (𝐾) representations. To avoid redundantly recomputing 𝐾 and 𝑉 tensors for historical tokens at every step, inference engines employ a KV cache. 
This optimization naturally divides the serving process into two phases: a compute-heavy prefill phase that processes the prompt to populate the initial KV cache, and a memory-bound decode phase that"},{"citing_arxiv_id":"2604.05426","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads","primary_cat":"cs.LG","submitted_at":"2026-04-07T04:40:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22911","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion","primary_cat":"cs.LG","submitted_at":"2026-02-26T11:55:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.11938","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless 
Clusters","primary_cat":"cs.DC","submitted_at":"2025-10-13T21:01:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15919","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling","primary_cat":"cs.DC","submitted_at":"2025-08-21T18:40:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00029","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing","primary_cat":"cs.LG","submitted_at":"2025-06-17T14:58:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoRA-Mixer routes modular LoRA experts into attention projection matrices with an adaptive Routing Specialization Loss to improve multi-task performance while using fewer trainable parameters than prior LoRA-MoE methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.14608","ref_index":140,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Parameter-Efficient Fine-Tuning for Large Models: 
A Comprehensive Survey","primary_cat":"cs.LG","submitted_at":"2024-03-21T17:55:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"which is very unfriendly to single-user task scheduling or multi-user workload balance. The challenging part of serving the auto-regressive paradigm is that all previous sequences have to be cached and saved for the next proceeding iteration; the cached activation generated from the previous sequences is stored as the Key-Value Cache (KV-cache). To effectively manage these challenges, S-LoRA [140] employs a Unified Paging mechanism within a unified memory pool that dynamically allocates and manages memory in a paged fashion. This sophisticated approach minimizes memory fragmentation and enhances the efficiency of KV-cache storage by allowing for flexible and efficient memory access patterns. These pages are managed such that the KV-cache associated with each adapter"},{"citing_arxiv_id":"2403.03507","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection","primary_cat":"cs.LG","submitted_at":"2024-03-06T07:29:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}