Canonical reference

In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25)

Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C · 2025 · arXiv 1569.37648

Canonical reference. 80% of citing Pith papers cite this work as background.

7 Pith papers citing it

Background 80% of classified citations


citation-role summary: background 5

citation-polarity summary: background 4, extend 1

representative citing papers

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

MosaicKV achieves up to 16x attention speedup, 4.8x lower decode latency, 7.3x higher throughput, and 3x memory reduction with 1.76% accuracy loss via dynamic two-dimensional KV cache compression and management on H800 GPUs.

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.

The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale

cs.DC · 2026-04-30 · unverdicted · novelty 6.0

Four attribution methods applied to over one million Polygon blocks show that most atomic arbitrage MEV opportunities trace to single source transactions from a small set of protocols.

Proxics: an efficient programming model for far memory accelerators

cs.OS · 2026-04-20 · conditional · novelty 6.0

Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.

Strait: Perceiving Priority and Interference in ML Inference Serving

cs.LG · 2026-04-30 · unverdicted · novelty 5.0

Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.

Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization

cs.DC · 2026-04-22 · unverdicted · novelty 5.0

BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwidth internet settings.

Libra: Efficient Resource Management for Agentic RL Post-Training

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and a C-MLFQ scheduler driven by tool-return causal signals, claiming up to 3.0x higher throughput and 2.5x faster reward convergence on 48 A800 GPUs.
