SAGA reduces AI agent task completion time by 1.64x on 64-GPU clusters by scheduling at the full workflow level with execution graphs, affinity batching, and completion-time fairness.
Mellette, Alex Forencich, Rukshani Athapathu, Alex C
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 8representative citing papers
RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.
citing papers explorer
-
SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
SAGA reduces AI agent task completion time by 1.64x on 64-GPU clusters by scheduling at the full workflow level with execution graphs, affinity batching, and completion-time fairness.
-
RNG: Flat Datacenter Networks at Scale
RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.
-
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
-
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.
-
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.
-
Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.