InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang · 2024 · arXiv 2409.04992

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Efficient Remote KV Cache Reuse with GPU-native Video Codec

cs.DC · 2026-02-10 · conditional · novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

cs.DC · 2026-04-29 · unverdicted · novelty 6.0

DUAL-BLADE uses a dual-path KV-cache framework with NVMe-direct access to reduce prefill and decode latency by up to 33% and 42%, respectively, while improving SSD utilization by 2.2x under tight memory budgets.

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.

SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

citing papers explorer


Efficient Remote KV Cache Reuse with GPU-native Video Codec cs.DC · 2026-02-10 · conditional · none · ref 51
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference cs.DC · 2026-04-29 · unverdicted · none · ref 43
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference cs.LG · 2026-04-22 · unverdicted · none · ref 11
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models cs.LG · 2026-04-18 · unverdicted · none · ref 22
