On H800 GPUs, MosaicKV achieves up to 16x attention speedup, 4.8x lower decode latency, 7.3x higher throughput, and 3x memory reduction with 1.76% accuracy loss through dynamic two-dimensional KV cache compression and management.
Canonical reference
In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25)
80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
years
2026: 7
roles
background: 5
representative citing papers
Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.
Four attribution methods applied to over one million Polygon blocks show that most atomic arbitrage MEV opportunities trace to single source transactions from a small set of protocols.
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.
BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwidth internet settings.
Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.
citing papers explorer
- Proxics: an efficient programming model for far memory accelerators