Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
15 Pith papers cite this work.
representative citing papers
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5x lower attention latency and 6.9x lower energy than NVIDIA H100 for 1M context serving.
WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.
Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.
Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
KernelFlume presents a disaggregated decode architecture that separates core attention from projection/FFN paths to enable elastic scaling of attention nodes, reporting up to 61% lower cost per million tokens versus full-instance scaling on H100 hardware for Llama-3.1-8B under dynamic long-context w
A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.
citing papers explorer
- Efficient Remote KV Cache Reuse with GPU-native Video Codec
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.