KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
4 Pith papers cite this work.
citation-role summary: background (1)
citation-polarity summary: background (1)
years: 2026 (4)
citing papers explorer
- KV Cache Offloading for Context-Intensive Tasks
  KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.
- GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction
  GHOST is a geometry-hierarchical token eviction framework that halves the KV cache size in monocular video 3D reconstruction while maintaining quality and achieving 1.75x faster inference.
- Search Your Block Floating Point Scales!
  ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
- Omni-Flow: A Unified Workflow Orchestration and Distributed KV Cache Sharing Framework for Multimodal Inference
  Omni-Flow introduces a three-layer abstraction (Control Flow, Data Flow, Compute Flow) for unified orchestration and KV cache sharing in multimodal inference pipelines.