Beehive: Sub-second Elasticity for Web Services with Semi-FaaS Execution
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
citing papers explorer
- Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
  GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.
- PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
  PALS adds dynamic GPU power capping to LLM serving frameworks such as vLLM, tuning it jointly with batch size via offline models and feedback control to improve energy efficiency by up to 26.3% and cut QoS violations by 4-7x on dense and MoE models.
- Flare: Leveraging Serverless Elasticity to Absorb Microservice Load Spikes
  Flare proposes selectively routing microservice spike load to serverless while keeping steady load on VMs, with claimed minimal integration changes.
- Sustainability-Constrained Workload Orchestration for Sovereign AI Infrastructure: A Joint Compute-Network Optimization Framework
  Introduces the Feasible Sovereign Operating Region (FSOR) as a construct for workloads that are sustainable under physical and regulatory limits, along with a joint compute-network optimization framework that enforces sustainability as hard constraints.