hub

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache

Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin · 2024 · arXiv 2401.02669

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 1

citation-polarity summary

background 3 use method 1

representative citing papers

Efficient Remote KV Cache Reuse with GPU-native Video Codec

cs.DC · 2026-02-10 · conditional · novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

cs.AR · 2026-05-07 · unverdicted · novelty 6.0

MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

cs.DC · 2025-12-10 · unverdicted · novelty 6.0

WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

cs.DC · 2025-09-24 · unverdicted · novelty 6.0

Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.

Lizard: An Efficient Linearization Framework for Large Language Models

cs.CL · 2025-07-11 · unverdicted · novelty 6.0

Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

cs.DC · 2025-04-14 · unverdicted · novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.

KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding

cs.DC · 2026-06-28 · unverdicted · novelty 5.0

KernelFlume presents a disaggregated decode architecture that separates core attention from projection/FFN paths to enable elastic scaling of attention nodes, reporting up to 61% lower cost per million tokens versus full-instance scaling on H100 hardware for Llama-3.1-8B under dynamic long-context w

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

cs.AR · 2026-04-19 · unverdicted · novelty 5.0

A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

cs.AR · 2025-12-10 · unverdicted · novelty 5.0

ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

cs.LG · 2026-03-30 · unverdicted · novelty 2.0

The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Efficient Remote KV Cache Reuse with GPU-native Video Codec cs.DC · 2026-02-10 · conditional · none · ref 43
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer