Mellette, Alex Forencich, Rukshani Athapathu, Alex C

William M · 2024 · arXiv 1890.367227

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3 · extension 1

citation-polarity summary

background 3 · extend 1

representative citing papers

RoPE-Aware Bit Allocation for KV-Cache Quantization

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.

Leyline: KV Cache Directives for Agentic Inference

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

cs.DC · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

SAGA introduces workflow-atomic scheduling for compound AI agents, achieving 1.64x lower task completion time and 1.22x better memory utilization than vLLM on a 64-GPU cluster at the cost of 30% lower peak throughput.

RNG: Flat Datacenter Networks at Scale

cs.NI · 2026-04-16 · unverdicted · novelty 7.0 · 3 refs

RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

cs.CR · 2026-03-11 · unverdicted · novelty 7.0

PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

Waiting at the front door: Continuous monitoring of latency in the host network stack

cs.NI · 2026-06-01 · unverdicted · novelty 6.0

netstacklat is a new low-overhead monitoring tool that records host network stack latency from early kernel processing to application delivery and was tested on 144 HTTP workload variants plus a Cloudflare deployment.

Birkhoff Decompositions and Photonic Interconnects Wait! Don't Forget the Compute!

cs.NI · 2026-05-26 · unverdicted · novelty 6.0

A greedy max-weight decomposition strategy for MoE all-to-all communication on photonic fabrics improves overlap efficiency and reduces compute overheads compared to BvN by bounding the number of matchings.

ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache

cs.DC · 2026-04-07 · unverdicted · novelty 6.0

ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.

FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

cs.LG · 2025-11-02 · unverdicted · novelty 6.0

FlexiCache reduces GPU memory for long-context LLM requests by up to 70%, raises throughput by 1.38-1.55x, and improves latency by 1.6-2.1x by exploiting per-head differences in the temporal stability of critical tokens.

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

cs.CL · 2024-12-17 · unverdicted · novelty 6.0

CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

cs.LG · 2026-05-28 · unverdicted · novelty 5.0

MarginGate triggers verification only on low-margin decode steps to achieve 100% deterministic batch inference at 15-50% of the cost of always-on verification across tested models and datasets.

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

cs.AR · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

cs.CL · 2026-06-18 · unverdicted · novelty 4.0

CacheWeaver is a lightweight scheduling layer that orders evidence to exploit prefix caching, reducing median TTFT by 20-33% across vLLM setups while preserving answer quality.

Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey

eess.SY · 2026-04-09 · unverdicted · novelty 4.0

The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.

citing papers explorer

Showing 12 of 12 citing papers after filters.

RoPE-Aware Bit Allocation for KV-Cache Quantization · cs.LG · 2026-06-23 · unverdicted · none · ref 22
Leyline: KV Cache Directives for Agentic Inference · cs.DC · 2026-05-31 · unverdicted · none · ref 35
SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters · cs.DC · 2026-05-01 · unverdicted · none · ref 41 · 2 links
RNG: Flat Datacenter Networks at Scale · cs.NI · 2026-04-16 · unverdicted · none · ref 29 · 3 links
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems · cs.CR · 2026-03-11 · unverdicted · none · ref 44
Waiting at the front door: Continuous monitoring of latency in the host network stack · cs.NI · 2026-06-01 · unverdicted · none · ref 45
Birkhoff Decompositions and Photonic Interconnects Wait! Don't Forget the Compute! · cs.NI · 2026-05-26 · unverdicted · none · ref 22
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache · cs.DC · 2026-04-07 · unverdicted · none · ref 37
MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference · cs.LG · 2026-05-28 · unverdicted · none · ref 17
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving · cs.AR · 2026-05-10 · unverdicted · none · ref 27 · 2 links
CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference · cs.CL · 2026-06-18 · unverdicted · none · ref 27
Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey · eess.SY · 2026-04-09 · unverdicted · none · ref 75
