MosaicKV achieves up to 16x attention speedup, 4.8x lower decode latency, 7.3x higher throughput, and 3x memory reduction with 1.76% accuracy loss on H800 GPUs via dynamic two-dimensional KV cache compression and management.
Canonical reference
In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25)
80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
years: 2026 (7)
roles: background (5)
representative citing papers
Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.
Four attribution methods applied to over one million Polygon blocks show that most atomic arbitrage MEV opportunities trace to single source transactions from a small set of protocols.
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.
BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwidth internet settings.
Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.
citing papers explorer
- The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale
Four attribution methods applied to over one million Polygon blocks show that most atomic arbitrage MEV opportunities trace to single source transactions from a small set of protocols.
- Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization
BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwidth internet settings.