Beehive: Sub-second elasticity for web services with semi-faas execution
5 Pith papers cite this work.
Citation roles: background (1)
Citation polarities: background (1)
Representative citing papers:
Spandana decouples SLO enforcement from cost optimization via per-VM request steering between VMs and FaaS, reporting 76-86% utilization and 5-44% cost reduction versus baselines.
PALS adds dynamic GPU power capping to LLM serving frameworks such as vLLM, jointly tuning the power cap and batch size via offline models and feedback control to improve energy efficiency by up to 26.3% and reduce QoS violations by 4-7x on dense and MoE models.
Flare proposes selectively routing microservice spike load to serverless functions while keeping steady-state load on VMs, and claims this requires only minimal integration changes.
Introduces the Feasible Sovereign Operating Region (FSOR), a construct describing the workloads that remain sustainable under physical and regulatory limits, together with a joint compute-network optimization framework that enforces sustainability as a hard constraint.
Citing papers:
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.