Mellette, Alex Forencich, Rukshani Athapathu, Alex C
14 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 14
representative citing papers
FlexiCache exploits per-head differences in the temporal stability of critical tokens to reduce GPU memory for long-context LLM requests by up to 70%, while improving throughput by 1.38-1.55x and latency by 1.6-2.1x.
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.
citing papers explorer
- RoPE-Aware Bit Allocation for KV-Cache Quantization
Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.
- Leyline: KV Cache Directives for Agentic Inference
Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.
- SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
SAGA introduces workflow-atomic scheduling for compound AI agents, achieving 1.64x lower task completion time and 1.22x better memory utilization than vLLM on a 64-GPU cluster at the cost of 30% lower peak throughput.
- RNG: Flat Datacenter Networks at Scale
RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.
- PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
- Waiting at the front door: Continuous monitoring of latency in the host network stack
netstacklat is a new low-overhead monitoring tool that records host network stack latency from early kernel processing to application delivery and was tested on 144 HTTP workload variants plus a Cloudflare deployment.
- Birkhoff Decompositions and Photonic Interconnects Wait! Don't Forget the Compute!
A greedy max-weight decomposition strategy for MoE all-to-all communication on photonic fabrics improves overlap efficiency and reduces compute overheads compared to BvN by bounding the number of matchings.
- ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
- MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference
MarginGate triggers verification only on low-margin decode steps to achieve 100% deterministic batch inference at 15-50% of the cost of always-on verification across tested models and datasets.
- KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.
- CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
CacheWeaver is a lightweight scheduling layer that orders evidence to exploit prefix caching, reducing median TTFT by 20-33% across vLLM setups while preserving answer quality.
- Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.