S-LoRA: Serving Thousands of Concurrent LoRA Adapters

S-LoRA: Serving Thousands of Concurrent LoRA Adapters · 2023 · arXiv 2311.03285

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 4

citation-polarity summary: background 4

representative citing papers

POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

POLAR formulates joint LoRA adapter caching and routing as a two-timescale contextual bandit, achieving sublinear regret bounds and outperforming non-adaptive baselines in experiments with real adapters.

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

cs.AR · 2026-06-04 · unverdicted · novelty 6.0

SPEAR places input-dependent error compensators at CKA-selected layers and fuses them into low-bit GEMMs to recover 56-75% of the W4-to-FP16 perplexity gap with <1% memory overhead and near-baseline latency.

PreFT: Prefill-only finetuning for efficient inference

cs.LG · 2026-05-14 · accept · novelty 6.0

Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache

cs.DC · 2026-04-07 · unverdicted · novelty 6.0

ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.

LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

cs.LG · 2025-06-17 · unverdicted · novelty 6.0

LoRA-Mixer routes modular LoRA experts into attention projection matrices with an adaptive Routing Specialization Loss to improve multi-task performance while using fewer trainable parameters than prior LoRA-MoE methods.

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

cs.DC · 2025-10-13 · unverdicted · novelty 5.0

FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.

HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

cs.DC · 2025-08-21 · unverdicted · novelty 5.0

HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

cs.LG · 2024-03-21 · accept · novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences

cs.LG · 2026-05-30 · unverdicted · novelty 3.0

Position paper advocating personalized preference learning in LLMs over aggregated approaches, grounded in social choice theory and demographic variation.

CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation with Non-linearity Retained at Inference

cs.LG · 2026-02-26
