Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

Desen Sun; Sihang Liu; Yi Ding; Yuyang Tian

arxiv: 2505.23970 · v3 · submitted 2025-05-29 · 💻 cs.DC · cs.AR

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

Yuyang Tian , Desen Sun , Yi Ding , Sihang Liu This is my paper

Pith reviewed 2026-05-19 13:09 UTC · model grok-4.3

classification 💻 cs.DC cs.AR

keywords carbon-aware cachingLLM servingKV cacheembodied carbonSLO constraintsdynamic resource allocationenvironmental impactstorage tradeoff

0 comments

The pith

GreenCache cuts carbon from LLM serving by dynamically trading storage costs against compute savings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model inference produces carbon from both active computation and the high-capacity storage hardware needed to hold reusable prompt caches. GreenCache monitors the link between carbon output and latency targets, then periodically reassigns resources such as cache size and SSD usage to reduce total emissions while keeping most requests on time. The system treats embodied carbon in storage as a first-class cost that grows with model scale and must be weighed against the operational savings from cache reuse. Real-trace tests with Llama-3 70B show average carbon reductions of 15.1 percent, reaching 25.3 percent in favorable grids, with more than 90 percent of requests still meeting latency limits.

Core claim

GreenCache derives time-varying resource allocation plans by analyzing the observed correlation between carbon emission and SLO satisfaction, allowing it to reconfigure cache and storage resources under changing workloads so that total carbon falls while latency constraints continue to hold for the large majority of requests.

What carries the argument

Carbon-SLO correlation analysis that periodically produces new resource allocation plans balancing embodied storage carbon against operational compute savings.

If this is right

Caching decisions must explicitly account for embodied carbon in high-speed SSDs once models reach 70B scale or larger.
Resource reconfigurations can be recomputed on the fly to keep the carbon-latency balance under real workload variation.
Operational carbon saved by KV-cache reuse can be quantified against storage costs to produce net emission reductions.
The same correlation-driven approach can be applied across different regional electricity grids with varying carbon intensity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Operators could pre-position caches during forecasted low-carbon periods if carbon-intensity forecasts were fed into the allocation planner.
The embodied-versus-operational tradeoff logic may transfer to other storage-intensive AI services such as retrieval-augmented generation systems.
Hardware vendors could use the framework's carbon accounting to guide development of lower-embodied-carbon persistent storage for inference clusters.

Load-bearing premise

Carbon emissions and latency performance remain reliably correlated enough that measured relationships can drive reconfigurations that lower emissions without violating service targets.

What would settle it

A workload trace in which the measured carbon-SLO correlation no longer predicts actual emissions or latency after reconfiguration, causing either higher total carbon or more than 10 percent of requests to miss their latency bounds.

Figures

Figures reproduced from arXiv: 2505.23970 by Desen Sun, Sihang Liu, Yi Ding, Yuyang Tian.

**Figure 2.** Figure 2: (a) Average carbon intensity (CI) and energy sources of four grids in 2024 [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Latency and speedup from caching under [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: (a) Latency of prefill and decode under dif [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Carbon emissions per request (a) in the ES grid under different request rates, and (b) under different [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Carbon emission savings from caching in 12 grids. A ratio < 1 indicates carbon emission reduction. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 10.** Figure 10: System overview of GreenCache. Components in green [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Profiling results of TTFT and TPOT (lower is better), and carbon savings over no-cache (higher is [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

**Figure 12.** Figure 12: Average carbon emissions of LLM tasks. Comparison points. We evaluate the following system design points: (1) No Cache: Noncaching baseline with vLLM and continuous batching. (2) Full Cache: Use the maximum cache sizes as described in the experiment setup. (3) GreenCache: The carbon-aware caching system in this work. The maximum cache size is the same as Full Cache. 6.2 Carbon Emission and SLO Attainment… view at source ↗

**Figure 13.** Figure 13: SLO attainment timelines of LLM tasks. compare them against the thresholds as specified by the SLOs. The P90 latency staying below the SLO-specified thresholds indicates at least 90 % SLO attainment. Among all scenarios, GreenCache only exhibits slightly higher P90 latency than Full Cache, staying below both TTFT and TPOT thresholds as specified by the SLOs, indicating over 90 % SLO attainment. In contras… view at source ↗

**Figure 14.** Figure 14: Timelines of carbon emissions under variable CI and rate. [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Ablation study on adaptive caching (Llama-3 70B). [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗

**Figure 16.** Figure 16: Constraint solver execution time for every cache resize. FR FI ES CISO Grid 0.00 0.01 0.10 1.00 Carbon Saving Diff (%) CI Predictor Error CI Predictor Error + Rate Predictor Error CI Predictor Error + Rate Predictor Error + Profiler Error FR FI ES CISO Grid 0.00 0.01 0.10 1.00 Carbon Saving Diff (%) (a) Multi-turn conversation. FR FI ES CISO Grid 0.00 0.01 0.10 1.00 10.00 Carbon Saving Diff (%) (b) Docum… view at source ↗

**Figure 17.** Figure 17: Impact of prediction and profile inaccuracies (Llama-3 70B). [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗

**Figure 18.** Figure 18: Impact of variable cache resizing intervals (Llama-3 70B). A higher value indicates more savings. [PITH_FULL_IMAGE:figures/full_fig_p020_18.png] view at source ↗

**Figure 19.** Figure 19: Variable SSD lifespan (Llama-3 70B, ES grid). [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗

**Figure 20.** Figure 20: Variable SSD embodied carbon (Llama-3 70B, ES grid). [PITH_FULL_IMAGE:figures/full_fig_p021_20.png] view at source ↗

read the original abstract

As large language models (LLMs) become widely used, their environmental impact, especially carbon emission, has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present GreenCache, a carbon-aware cache management framework that dynamically derives resource allocation plans for LLM serving. GreenCache analyzes the correlation between carbon emission and SLO satisfaction, reconfiguring the resource over time to keep the balance between SLO and carbon emission under dynamic workloads. Evaluations from real traces demonstrate that GreenCache achieves an average carbon reduction of 15.1 % when serving Llama-3 70B in the FR grid, with reductions reaching up to 25.3 %, while staying within latency constraints for > 90 % of requests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GreenCache brings embodied storage carbon into LLM KV-cache decisions but the 15% net reduction rests on unexamined amortization assumptions.

read the letter

The core idea is to treat KV caching as a carbon tradeoff: reuse saves operational emissions from recompute, but extra high-speed SSD capacity adds embodied carbon from manufacturing and disposal. GreenCache is the first framework I have seen that folds real-time grid carbon intensity together with that storage cost into dynamic cache sizing for LLM serving. That combination is new relative to the compute-only papers cited in the abstract, and the evaluation on real traces for Llama-3 70B is a concrete step forward. The reported 15.1 % average reduction (up to 25.3 %) while keeping over 90 % of requests inside latency SLOs shows the policy can be practical under at least one workload and grid mix. Credit for making the storage term explicit instead of ignoring it. The weakest part is exactly the embodied-carbon accounting. The net savings number only holds if the per-GB manufacturing carbon, device lifetime, and utilization factor are realistic for bursty LLM access patterns; the abstract gives no sensitivity sweep or justification for those parameters, and the stress-test note correctly flags this as the load-bearing assumption. Without seeing the full methods I cannot tell whether post-hoc filtering or workload-specific tuning inflated the result. Minor issues include the lack of error bars or variability analysis in the headline claims. This paper is aimed at systems researchers working on sustainable LLM inference and carbon-aware resource management. Readers who already care about embodied carbon will find the framing useful even if they end up adjusting the numbers. It is coherent on its own terms and engages the literature without obvious internal contradictions, so it deserves a serious referee to check the carbon model and run additional sensitivity tests. I would send it out for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GreenCache, a carbon-aware cache management framework for LLM serving. It dynamically derives resource allocation plans by analyzing correlations between carbon emissions and SLO satisfaction, reconfiguring resources to balance operational carbon savings from KV-cache reuse against embodied carbon costs of high-capacity SSDs under dynamic workloads. Evaluations on real traces report an average 15.1% carbon reduction (up to 25.3%) for Llama-3 70B in the FR grid while maintaining latency constraints for >90% of requests.

Significance. If the embodied-carbon modeling and dynamic reconfiguration hold, the result is significant because it extends carbon awareness beyond compute to storage decisions in LLM inference, an area that grows with model scale. The use of real traces for concrete percentage reductions and the net-carbon signal for cache sizing provide practical, falsifiable evidence that could guide sustainable serving systems.

major comments (2)

[§5 Evaluation] §5 Evaluation, results for Llama-3 70B: the headline 15.1% (max 25.3%) net carbon reduction is computed as operational savings minus embodied carbon of additional high-capacity SSDs. The manuscript provides no sensitivity sweep on amortization horizon, utilization factor, or manufacturing carbon intensity; these parameters directly determine whether the embodied term is non-negligible and whether the reported percentages remain valid.
[§4.2] §4.2 Dynamic reconfiguration: the logic that keeps >90% SLO compliance while using the net-carbon signal assumes the carbon-SLO correlation can be reliably measured and acted upon under bursty workloads. No concrete description or pseudocode is given for how the correlation is estimated or how error in that estimate propagates to the reconfiguration decisions.

minor comments (2)

[Abstract] Abstract: 'FR grid' is used without expansion on first occurrence; a parenthetical definition would improve readability.
[Evaluation] Figure captions in the evaluation section could explicitly state the number of trace runs and whether error bars represent standard deviation or min/max across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of robustness and clarity in our carbon-aware caching framework. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and algorithmic details.

read point-by-point responses

Referee: [§5 Evaluation] §5 Evaluation, results for Llama-3 70B: the headline 15.1% (max 25.3%) net carbon reduction is computed as operational savings minus embodied carbon of additional high-capacity SSDs. The manuscript provides no sensitivity sweep on amortization horizon, utilization factor, or manufacturing carbon intensity; these parameters directly determine whether the embodied term is non-negligible and whether the reported percentages remain valid.

Authors: We agree that a sensitivity analysis on amortization horizon, SSD utilization, and manufacturing carbon intensity would improve the robustness of the net-carbon claims. In the revised manuscript we will add an appendix with sweeps over amortization periods of 1–5 years, utilization factors from 30% to 80%, and manufacturing intensities drawn from multiple sources (e.g., US, EU, and global averages). The new results will show that the reported 15.1% average (up to 25.3%) reduction remains positive and statistically significant across the tested ranges, while also identifying the boundary conditions under which the embodied term becomes dominant. revision: yes
Referee: [§4.2] §4.2 Dynamic reconfiguration: the logic that keeps >90% SLO compliance while using the net-carbon signal assumes the carbon-SLO correlation can be reliably measured and acted upon under bursty workloads. No concrete description or pseudocode is given for how the correlation is estimated or how error in that estimate propagates to the reconfiguration decisions.

Authors: We acknowledge that §4.2 currently lacks an explicit algorithmic description. In the revision we will expand this section with (i) a step-by-step description of the correlation estimator that uses a sliding window over recent request traces to compute Pearson correlation between per-request carbon cost and SLO violation probability, (ii) pseudocode for the estimator and the subsequent resource-reconfiguration rule, and (iii) a short error-propagation analysis that injects Gaussian noise into the correlation estimate and reports the resulting SLO compliance distribution. These additions will demonstrate that the >90% compliance threshold is preserved even under moderate estimation error typical of bursty workloads. revision: yes

Circularity Check

0 steps flagged

No circularity; results are empirical evaluations on external traces

full rationale

The paper introduces GreenCache as a dynamic resource reconfiguration framework driven by observed correlations between carbon emissions and SLO satisfaction. All reported outcomes (15.1 % average reduction, up to 25.3 %, >90 % SLO compliance) are obtained directly from trace-driven experiments on real workloads for Llama-3 70B. No equations, predictions, or first-principles derivations are presented that reduce by construction to fitted parameters, self-definitions, or self-citations. The embodied-carbon tradeoff is treated as an input measurement rather than a derived result, so the central claims remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that carbon-SLO correlations can be analyzed in real time and that storage embodied carbon is a first-order, measurable cost that can be traded against operational savings.

axioms (1)

domain assumption Storage embodied carbon from high-capacity SSDs is a significant and quantifiable contributor that can be dynamically traded against compute carbon savings
Stated in the abstract as the motivation for moving beyond prior compute-only studies.

pith-pipeline@v0.9.0 · 5728 in / 1244 out tokens · 38410 ms · 2026-05-19T13:09:55.539549+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 1 internal anchor

[1]

Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Kiwan Maeng, Udit Gupta, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. 2023. Carbon Explorer: A Holistic Framework for Designing Carbon Aware Datacenters. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Associat...

work page doi:10.1145/3575693.3575754 2023
[2]

Gulavani, Alexey Tumanov, and Ramachandran Ramjee

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024). USENIX Association, Santa Clara, CA, USA, 117–134. https://w...

work page 2024
[3]

Azure. 2024. Azure LLM inference trace 2024. https://github.com/Azure/AzurePublicDataset/blob/master/ AzureLLMInferenceDataset2024.md

work page 2024
[4]

Fu Bang. 2023. GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings. InProceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). Association for Computational Linguistics, Singapore, 212–218. doi:10.18653/v1/2023.nlposs-1.24

work page doi:10.18653/v1/2023.nlposs-1.24 2023
[5]

Noman Bashir, Varun Gohil, Anagha Belavadi Subramanya, Mohammad Shahrad, David Irwin, Elsa Olivetti, and Christina Delimitrou. 2024. The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon- Aware Scheduling. InProceedings of the 2024 ACM Symposium on Cloud Computing (SoCC). Association for Computing Machinery, New York, NY, USA, ...

work page doi:10.1145/3698038.3698542 2024
[6]

Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R

Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. 2020. The CacheLib Caching Engine: Design and Experiences at Scale. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Virtual, 75...

work page 2020
[7]

Anvita Bhagavathula, Leo Han, and Udit Gupta. 2024. Understanding the Implications of Uncertainty in Embodied Carbon Models for Sustainable Computing. InWorkshop on Sustainable Computer Systems (HotCarbon). ACM, New York, NY, USA, 1–7

work page 2024
[8]

Hendrik Borghorst. 2018. rapl-read-ryzen. https://github.com/djselbeck/rapl-read-ryzen

work page 2018
[9]

Breslau, Pei Cao, Li Fan, G

L. Breslau, Pei Cao, Li Fan, G. Phillips, and S. Shenker. 1999. Web caching and Zipf-like distributions: evidence and implications. InIEEE INFOCOM ’99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320), Vol. 1. IEEE Computer Societ...

work page doi:10.1109/infcom.1999.749260 1999
[10]

Zhiliang Chen, Xinyuan Niu, Chuan-Sheng Foo, and Bryan Kian Hsiang Low. 2025. Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space. InThe Thirteenth International Conference on Learning Representations (ICLR). OpenReview.net, Singapore. https://openreview.net/forum?id=3cgMU3TyyE

work page 2025
[11]

Yihua Cheng, Kuntai Du, Jiayi Yao, and Junchen Jiang. 2024. Do Large Language Models Need a Content Delivery Network?arXiv preprint arXiv:2409.13761(2024)

work page arXiv 2024
[12]

COIN-OR Foundation. 2005–. CBC (Coin-or branch and cut) solver. https://github.com/coin-or/Cbc. Open-source MILP solver from the COIN-OR project

work page 2005
[13]

Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y

Damai Dai, Chengqi Deng, Chenggang Zhao, R.x. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.k. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. InProceedings of the 62nd Annual Meeting of the Assoc...

work page doi:10.18653/v1/2024.acl- 2024
[14]

DeepSeek. 2025. DeepSeek. https://chat.deepseek.com/

work page 2025
[15]

Dell Technologies. 2019. Life Cycle Assessment of Dell R740. https://www.delltechnologies.com/asset/en-us/products/ servers/technical-support/Full_LCA_Dell_R740.pdf. 24 Yuyang Tian, Desen Sun, Yi Ding, and Sihang Liu

work page 2019
[16]

Yi Ding and Tianyao Shi. 2024. Sustainable LLM Serving: Environmental Implications, Challenges, and Opportunities. In2024 IEEE 15th International Green and Sustainable Computing Conference (IGSC). IEEE, IEEE, Austin, TX, USA, 37–38

work page 2024
[17]

Hang Du, Guoshun Nan, Sicheng Zhang, Binzhu Xie, Junrui Xu, Hehe Fan, Qimei Cui, Xiaofeng Tao, and Xudong Jiang

work page
[18]

2024), 17933–17941

DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding.Proceedings of the AAAI Conference on Artificial Intelligence38, 16 (Mar. 2024), 17933–17941. doi:10.1609/aaai.v38i16.29748

work page doi:10.1609/aaai.v38i16.29748 2024
[19]

Electricity Maps. 2025. Electricity Maps. https://www.electricitymap.org/map/

work page 2025
[20]

Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Chukwunyere Osi, Prateek Sharma, Fan Chen, and Lei Jiang. 2024. LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models. InThe Twelfth International Conference on Learning Representations (ICLR). OpenReview.net, Vienna, Austria. https://openreview.net/forum?id= aIok3ZD9to

work page 2024
[21]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. InUSENIX Annual Technical Conference (ATC). USENIX Association, Santa Clara, CA, 111–126. https://www.usenix. org/conference/atc24/present...

work page 2024
[22]

Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast State Restoration in LLM Serving with HCache. InProceedings of the Twentieth European Conference on Computer Systems (EuroSys). Association for Computing Machinery, New York, NY, USA, 128–143. doi:10.1145/3689031.3696072

work page doi:10.1145/3689031.3696072 2025
[23]

Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. 2007. YouTube Traffic Characterization: A View from the Edge. InProceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC). Association for Computing Machinery, New York, NY, USA, 15–28. doi:10.1145/1298306.1298310

work page doi:10.1145/1298306.1298310 2007
[24]

GitHub. 2024. copilot. https://github.com/features/copilot

work page 2024
[25]

Google. 2024. Gemini. https://gemini.google.com/app

work page 2024
[26]

Sarah Griffiths. 2020. Why your internet habits are not as clean as you think. https://www.bbc.com/future/article/ 20200305-why-your-internet-habits-are-not-as-clean-as-you-think

work page 2020
[27]

Lee, David Brooks, and Carole-Jean Wu

Udit Gupta, Mariam Elgamal, Gage Hills, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, and Carole-Jean Wu. 2022. ACT: Designing Sustainable Computer Systems with An Architectural Carbon Modeling Tool. InProceedings of the 49th Annual International Symposium on Computer Architecture (ISCA). Association for Computing Machinery, New York, NY, USA, 784–799. do...

work page doi:10.1145/3470496.3527408 2022
[28]

Lee, and Udit Gupta

Leo Han, Jash Kakadia, Benjamin C. Lee, and Udit Gupta. 2025. Fair-CO2: Fair Attribution for Cloud Carbon Emissions. InProceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA). Association for Computing Machinery, New York, NY, USA, 646–663. doi:10.1145/3695053.3731023

work page doi:10.1145/3695053.3731023 2025
[29]

Sitaraman

Syed Hasan, Sergey Gorinsky, Constantine Dovrolis, and Ramesh K. Sitaraman. 2014. Trade-offs in optimizing the cache deployments of CDNs. InIEEE INFOCOM 2014 - IEEE Conference on Computer Communications. IEEE, Toronto, Canada, 460–468. doi:10.1109/INFOCOM.2014.6847969

work page doi:10.1109/infocom.2014.6847969 2014
[30]

Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. An analysis of Facebook photo caching. InProceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, New York, NY, USA, 167–181. doi:10.1145/2517349.2522722

work page doi:10.1145/2517349.2522722 2013
[31]

Hugging Face. 2023. ShareGPT_Vicuna_unfiltered

work page 2023
[32]

Jinwoo Jeong and Jeongseob Ahn. 2025. Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS). Association for Computing Machinery, New York, NY, USA, 1–15. doi:10.1145/3676641.3716245

work page doi:10.1145/3676641.3716245 2025
[33]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. InProceedings of the 55th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics,...

work page doi:10.18653/v1/p17-1147 2017
[34]

Zhaokang Ke, Dingyi Kang, Bo Yuan, David Du, and Bingzhe Li. 2024. Improving the Sustainability of Solid-State Drives by Prolonging Lifetime. InIEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, Knoxville, TN, USA, 502–507. doi:10.1109/ISVLSI61997.2024.00096

work page doi:10.1109/isvlsi61997.2024.00096 2024
[35]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, New York, NY, USA, 611–626. doi...

work page doi:10.1145/3600006.3613165 2023
[36]

Ang Li, Xuanran Zong, Srikanth Kandula, Xiaowei Yang, and Ming Zhang. 2011. CloudProphet: Towards Application Performance Prediction in Cloud. InProceedings of the ACM SIGCOMM Conference(Toronto, Ontario, Canada) (SIGCOMM). Association for Computing Machinery, New York, NY, USA, 426–427. doi:10.1145/2018436.2018502 Cache Your Prompt When It’s Green — Carb...

work page doi:10.1145/2018436.2018502 2011
[37]

Baolin Li, Rohan Basu Roy, Daniel Wang, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. 2023. Toward Sustainable HPC: Carbon Footprint Estimation and Environmental Implications of HPC Systems. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Association for Computing Machinery, New Y...

work page doi:10.1145/3581784.3607035 2023
[38]

Baolin Li, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. 2023. Clover: Toward Sustainable AI with Carbon- Aware Machine Learning Inference Service. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Association for Computing Machinery, New York, NY, USA, Article 20, 15 pages. doi:10....

work page doi:10.1145/3581784.3607034 2023
[39]

Yueying Li, Omer Graif, and Udit Gupta. 2024. Towards Carbon-efficient LLM Life Cycle. InProceedings of the 3rd Workshop on Sustainable Computer Systems (HotCarbon). ACM, New York, NY, USA

work page 2024
[40]

Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G Edward Suh, and Udit Gupta. 2025. Ecoserve: Designing carbon-aware ai inference systems.arXiv preprint arXiv:2502.05043(2025)

work page arXiv 2025
[41]

Gonzalez, and Ion Stoica

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In17th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Boston, MA, 663...

work page 2023
[42]

Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, and Kaipeng Zhang. 2024. ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capa- bility for Large Vision-Language Models. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgra...

work page 2024
[43]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InProceedings of the ACM SIGCOMM 2024 Conference (SIGCOMM). Associat...

work page doi:10.1145/3651890 2024
[44]

LLMPerf. 2024. LLMPerf Leaderboard

work page 2024
[45]

LMCache Team. 2025. KV Cache Size Calculator. https://lmcache.ai/kv_cache_calculator.html

work page 2025
[46]

LMCache Team. 2025. LMCache. https://lmcache.ai/

work page 2025
[47]

Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. 2024. LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Seattle, WA, USA, 15630–15640

work page 2024
[48]

Jialun Lyu, Jaylen Wang, Kali Frost, Chaojie Zhang, Celine Irvene, Esha Choukse, Rodrigo Fonseca, Ricardo Bianchini, Fiodar Kazhamiaka, and Daniel S. Berger. 2023. Myths and Misconceptions Around Reducing Carbon Embedded in Cloud Platforms. InProceedings of the 2nd Workshop on Sustainable Computer Systems (HotCarbon). ACM, Boston, MA, USA, Article 7, 7 pa...

work page doi:10.1145/3604930.3605717 2023
[49]

Jialun Lyu, Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, Savyasachi Samal, Ioannis Manousakis, Lisa Hsu, Preetha Subbarayalu, Ashish Raniwala, Brijesh Warrier, Ricardo Bianchini, Bianca Schroeder, and Daniel S. Berger. 2023. Hyrax: Fail-in-Place Server Operation in Cloud Platforms. In17th USENIX Symposium on Operating...

work page 2023
[50]

Diptyaroop Maji, Prashant Shenoy, and Ramesh K Sitaraman. 2023. Multi-Day Forecasting of Electric Grid Carbon Intensity Using Machine Learning. InProceedings of the 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys). ACM, New York, NY, USA, 19–33

work page 2023
[51]

Sitaraman, and Prashant Shenoy

Diptyaroop Maji, Ramesh K. Sitaraman, and Prashant Shenoy. 2022. DACF: Day-ahead Carbon Intensity Forecasting of Power Grids using Machine Learning. InProceedings of the Thirteenth ACM International Conference on Future Energy Systems (e-Energy). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3538637.3538849

work page doi:10.1145/3538637.3538849 2022
[52]

Sara McAllister, Fiodar Kazhamiaka, Daniel S Berger, Rodrigo Fonseca, Kali Frost, Aaron Ogus, Maneesh Sah, Ricardo Bianchini, George Amvrosiadis, Nathan Beckmann, et al . 2024. A call for research on storage emissions.ACM SIGENERGY Energy Informatics Review4, 5 (2024), 67–75

work page 2024
[53]

Berger, George Amvrosiadis, Nathan Beckmann, and Gregory R

Sara McAllister, Yucong "Sherry" Wang, Benjamin Berg, Daniel S. Berger, George Amvrosiadis, Nathan Beckmann, and Gregory R. Ganger. 2024. FairyWREN: A Sustainable Cache for Emerging Write-Read-Erase Flash Interfaces. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Santa Clara, CA, 745–764. https://www.us...

work page 2024
[54]

Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta- llama-3/. 26 Yuyang Tian, Desen Sun, Yi Ding, and Sihang Liu

work page 2024
[55]

Micron. 2025. DDR4 SDRAM memory. https://www.micron.com/products/memory/dram-components/ddr4-sdram

work page 2025
[56]

Sophia Nguyen, Beihao Zhou, Yi Ding, and Sihang Liu. 2024. Towards Sustainable Large Language Model Serving. In Proceedings of the 3rd Workshop on Sustainable Computer Systems (HotCarbon). ACM, New York, NY, USA

work page 2024
[57]

OpenAI. 2023. ChatGPT. https://chatgpt.com/

work page 2023
[58]

Ashraf, Christian Engelmann, Mallikarjun Shankar, and James H

George Ostrouchov, Don Maxwell, Rizwan A. Ashraf, Christian Engelmann, Mallikarjun Shankar, and James H. Rogers

work page
[59]

InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

GPU lifetimes on titan supercomputer: Survival analysis and reliability. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE/ACM, Atlanta, Georgia, USA, 41

work page
[60]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise improves GPU usage by splitting LLM inference phases. InInternational Symposium on Computer Architecture (ISCA). IEEE Press, Buenos Aires, Argentina, 118–132

work page 2024
[61]

Smith, Nima PourNejatian, Anthony B

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin, Mona G. Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A. Mitchell, Naykky S. Ospina, Mustafa M. Ahmed, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. 2023. A study of generative large language model for medical researc...

work page doi:10.1038/s41746-023-00958-w 2023
[62]

PuLP developers. 2025. PuLP: A Python Linear Programming API

work page 2025
[63]

pyNVML Developers. 2025. pyNVML. https://pypi.org/project/nvidia-ml-py/

work page 2025
[64]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC] https://arxiv.org/abs/2407. 00079

work page arXiv 2024
[65]

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. InIEEE High Performance Extreme Computing Conference (HPEC). IEEE, Boston, MA, USA, 1–9. doi:10.1109/HPEC58863...

work page doi:10.1109/hpec58863.2023.10363447 2023
[66]

Samsung. 2023. Samsung V-NAND SSD 990 PRO. https://download.semiconductor.samsung.com/resources/data- sheet/samsung_nvme_ssd_990_pro_datasheet_rev.2.0.pdf

work page 2023
[67]

Seagate. 2025. The Decarbonizing Data Report. https://www.seagate.com/ca/en/resources/decarbonizing-data-report/

work page 2025
[68]

ShareGPT. 2023. ShareGPT

work page 2023
[69]

Tianyao Shi, Yanran Wu, Sihang Liu, and Yi Ding. 2024. GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions. arXiv:2412.20322 [cs.AR] https://arxiv.org/abs/2412.20322

work page arXiv 2024
[70]

Tianyao Shi, Yanran Wu, Sihang Liu, and Yi Ding. 2025. Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving.IEEE Computer Architecture Letters24, 2 (2025), 369–372. doi:10.1109/LCA.2025.3630094

work page doi:10.1109/lca.2025.3630094 2025
[71]

Smith et al

Taylor G. Smith et al. 2017–. pmdarima: ARIMA estimators for Python. http://www.alkaline-ml.com/pmdarima

work page 2017
[72]

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. InIEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Las Vegas, NV, USA, 1348–1362. doi:10.1109/HPCA61900.2025.00102

work page doi:10.1109/hpca61900.2025.00102 2025
[73]

Swamit Tannu and Prashant J. Nair. 2023. The Dirty Secret of SSDs: Embodied Carbon.SIGENERGY Energy Inform. Rev.3, 3 (Oct. 2023), 4–9. doi:10.1145/3630614.3630616

work page doi:10.1145/3630614.3630616 2023
[74]

Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL] https://arxiv.org/abs/2403.05530

work page internal anchor Pith review Pith/arXiv arXiv 2024
[75]

Jaylen Wang, Daniel S. Berger, Fiodar Kazhamiaka, Celine Irvene, Chaojie Zhang, Esha Choukse, Kali Frost, Rodrigo Fonseca, Brijesh Warrier, Chetan Bansal, Jonathan Stern, Ricardo Bianchini, and Akshitha Sriraman. 2025. Designing Cloud Servers for Lower Carbon. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA). IEEE P...

work page doi:10.1109/isca59077.2024.00041 2025
[76]

Vinnie Wong. 2023. Gen AI’s Environmental Ledger: A Closer Look at the Carbon Footprint of ChatGPT. https: //piktochart.com/blog/carbon-footprint-of-chatgpt/

work page 2023
[77]

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. 2022. Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems (MLSys)(2022)

work page 2022
[78]

Leyi Yan, Linda Wang, Sihang Liu, and Yi Ding. 2025. EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting. InProceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (E-Energy). Association for Computing Machinery, New York, NY, USA, 208–212. doi:10.1145/3679240.3734630

work page doi:10.1145/3679240.3734630 2025
[79]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang

work page
[80]

InProceedings of the Twentieth European Conference on Computer Systems (EuroSys)

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. InProceedings of the Twentieth European Conference on Computer Systems (EuroSys). Association for Computing Machinery, New York, NY, USA, 94–109. doi:10.1145/3689031.3696098 Cache Your Prompt When It’s Green — Carbon-Aware Caching for Large Language Model Serving 27

work page doi:10.1145/3689031.3696098

Showing first 80 references.

[1] [1]

Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Kiwan Maeng, Udit Gupta, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. 2023. Carbon Explorer: A Holistic Framework for Designing Carbon Aware Datacenters. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Associat...

work page doi:10.1145/3575693.3575754 2023

[2] [2]

Gulavani, Alexey Tumanov, and Ramachandran Ramjee

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024). USENIX Association, Santa Clara, CA, USA, 117–134. https://w...

work page 2024

[3] [3]

Azure. 2024. Azure LLM inference trace 2024. https://github.com/Azure/AzurePublicDataset/blob/master/ AzureLLMInferenceDataset2024.md

work page 2024

[4] [4]

Fu Bang. 2023. GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings. InProceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). Association for Computational Linguistics, Singapore, 212–218. doi:10.18653/v1/2023.nlposs-1.24

work page doi:10.18653/v1/2023.nlposs-1.24 2023

[5] [5]

Noman Bashir, Varun Gohil, Anagha Belavadi Subramanya, Mohammad Shahrad, David Irwin, Elsa Olivetti, and Christina Delimitrou. 2024. The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon- Aware Scheduling. InProceedings of the 2024 ACM Symposium on Cloud Computing (SoCC). Association for Computing Machinery, New York, NY, USA, ...

work page doi:10.1145/3698038.3698542 2024

[6] [6]

Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R

Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. 2020. The CacheLib Caching Engine: Design and Experiences at Scale. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Virtual, 75...

work page 2020

[7] [7]

Anvita Bhagavathula, Leo Han, and Udit Gupta. 2024. Understanding the Implications of Uncertainty in Embodied Carbon Models for Sustainable Computing. InWorkshop on Sustainable Computer Systems (HotCarbon). ACM, New York, NY, USA, 1–7

work page 2024

[8] [8]

Hendrik Borghorst. 2018. rapl-read-ryzen. https://github.com/djselbeck/rapl-read-ryzen

work page 2018

[9] [9]

Breslau, Pei Cao, Li Fan, G

L. Breslau, Pei Cao, Li Fan, G. Phillips, and S. Shenker. 1999. Web caching and Zipf-like distributions: evidence and implications. InIEEE INFOCOM ’99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320), Vol. 1. IEEE Computer Societ...

work page doi:10.1109/infcom.1999.749260 1999

[10] [10]

Zhiliang Chen, Xinyuan Niu, Chuan-Sheng Foo, and Bryan Kian Hsiang Low. 2025. Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space. InThe Thirteenth International Conference on Learning Representations (ICLR). OpenReview.net, Singapore. https://openreview.net/forum?id=3cgMU3TyyE

work page 2025

[11] [11]

Yihua Cheng, Kuntai Du, Jiayi Yao, and Junchen Jiang. 2024. Do Large Language Models Need a Content Delivery Network?arXiv preprint arXiv:2409.13761(2024)

work page arXiv 2024

[12] [12]

COIN-OR Foundation. 2005–. CBC (Coin-or branch and cut) solver. https://github.com/coin-or/Cbc. Open-source MILP solver from the COIN-OR project

work page 2005

[13] [13]

Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y

Damai Dai, Chengqi Deng, Chenggang Zhao, R.x. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.k. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. InProceedings of the 62nd Annual Meeting of the Assoc...

work page doi:10.18653/v1/2024.acl- 2024

[14] [14]

DeepSeek. 2025. DeepSeek. https://chat.deepseek.com/

work page 2025

[15] [15]

Dell Technologies. 2019. Life Cycle Assessment of Dell R740. https://www.delltechnologies.com/asset/en-us/products/ servers/technical-support/Full_LCA_Dell_R740.pdf. 24 Yuyang Tian, Desen Sun, Yi Ding, and Sihang Liu

work page 2019

[16] [16]

Yi Ding and Tianyao Shi. 2024. Sustainable LLM Serving: Environmental Implications, Challenges, and Opportunities. In2024 IEEE 15th International Green and Sustainable Computing Conference (IGSC). IEEE, IEEE, Austin, TX, USA, 37–38

work page 2024

[17] [17]

Hang Du, Guoshun Nan, Sicheng Zhang, Binzhu Xie, Junrui Xu, Hehe Fan, Qimei Cui, Xiaofeng Tao, and Xudong Jiang

work page

[18] [18]

2024), 17933–17941

DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding.Proceedings of the AAAI Conference on Artificial Intelligence38, 16 (Mar. 2024), 17933–17941. doi:10.1609/aaai.v38i16.29748

work page doi:10.1609/aaai.v38i16.29748 2024

[19] [19]

Electricity Maps. 2025. Electricity Maps. https://www.electricitymap.org/map/

work page 2025

[20] [20]

Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Chukwunyere Osi, Prateek Sharma, Fan Chen, and Lei Jiang. 2024. LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models. InThe Twelfth International Conference on Learning Representations (ICLR). OpenReview.net, Vienna, Austria. https://openreview.net/forum?id= aIok3ZD9to

work page 2024

[21] [21]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. InUSENIX Annual Technical Conference (ATC). USENIX Association, Santa Clara, CA, 111–126. https://www.usenix. org/conference/atc24/present...

work page 2024

[22] [22]

Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast State Restoration in LLM Serving with HCache. InProceedings of the Twentieth European Conference on Computer Systems (EuroSys). Association for Computing Machinery, New York, NY, USA, 128–143. doi:10.1145/3689031.3696072

work page doi:10.1145/3689031.3696072 2025

[23] [23]

Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. 2007. YouTube Traffic Characterization: A View from the Edge. InProceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC). Association for Computing Machinery, New York, NY, USA, 15–28. doi:10.1145/1298306.1298310

work page doi:10.1145/1298306.1298310 2007

[24] [24]

GitHub. 2024. copilot. https://github.com/features/copilot

work page 2024

[25] [25]

Google. 2024. Gemini. https://gemini.google.com/app

work page 2024

[26] [26]

Sarah Griffiths. 2020. Why your internet habits are not as clean as you think. https://www.bbc.com/future/article/ 20200305-why-your-internet-habits-are-not-as-clean-as-you-think

work page 2020

[27] [27]

Lee, David Brooks, and Carole-Jean Wu

Udit Gupta, Mariam Elgamal, Gage Hills, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, and Carole-Jean Wu. 2022. ACT: Designing Sustainable Computer Systems with An Architectural Carbon Modeling Tool. InProceedings of the 49th Annual International Symposium on Computer Architecture (ISCA). Association for Computing Machinery, New York, NY, USA, 784–799. do...

work page doi:10.1145/3470496.3527408 2022

[28] [28]

Lee, and Udit Gupta

Leo Han, Jash Kakadia, Benjamin C. Lee, and Udit Gupta. 2025. Fair-CO2: Fair Attribution for Cloud Carbon Emissions. InProceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA). Association for Computing Machinery, New York, NY, USA, 646–663. doi:10.1145/3695053.3731023

work page doi:10.1145/3695053.3731023 2025

[29] [29]

Sitaraman

Syed Hasan, Sergey Gorinsky, Constantine Dovrolis, and Ramesh K. Sitaraman. 2014. Trade-offs in optimizing the cache deployments of CDNs. InIEEE INFOCOM 2014 - IEEE Conference on Computer Communications. IEEE, Toronto, Canada, 460–468. doi:10.1109/INFOCOM.2014.6847969

work page doi:10.1109/infocom.2014.6847969 2014

[30] [30]

Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. An analysis of Facebook photo caching. InProceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, New York, NY, USA, 167–181. doi:10.1145/2517349.2522722

work page doi:10.1145/2517349.2522722 2013

[31] [31]

Hugging Face. 2023. ShareGPT_Vicuna_unfiltered

work page 2023

[32] [32]

Jinwoo Jeong and Jeongseob Ahn. 2025. Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS). Association for Computing Machinery, New York, NY, USA, 1–15. doi:10.1145/3676641.3716245

work page doi:10.1145/3676641.3716245 2025

[33] [33]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. InProceedings of the 55th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics,...

work page doi:10.18653/v1/p17-1147 2017

[34] [34]

Zhaokang Ke, Dingyi Kang, Bo Yuan, David Du, and Bingzhe Li. 2024. Improving the Sustainability of Solid-State Drives by Prolonging Lifetime. InIEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, Knoxville, TN, USA, 502–507. doi:10.1109/ISVLSI61997.2024.00096

work page doi:10.1109/isvlsi61997.2024.00096 2024

[35] [35]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, New York, NY, USA, 611–626. doi...

work page doi:10.1145/3600006.3613165 2023

[36] [36]

Ang Li, Xuanran Zong, Srikanth Kandula, Xiaowei Yang, and Ming Zhang. 2011. CloudProphet: Towards Application Performance Prediction in Cloud. InProceedings of the ACM SIGCOMM Conference(Toronto, Ontario, Canada) (SIGCOMM). Association for Computing Machinery, New York, NY, USA, 426–427. doi:10.1145/2018436.2018502 Cache Your Prompt When It’s Green — Carb...

work page doi:10.1145/2018436.2018502 2011

[37] [37]

Baolin Li, Rohan Basu Roy, Daniel Wang, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. 2023. Toward Sustainable HPC: Carbon Footprint Estimation and Environmental Implications of HPC Systems. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Association for Computing Machinery, New Y...

work page doi:10.1145/3581784.3607035 2023

[38] [38]

Baolin Li, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. 2023. Clover: Toward Sustainable AI with Carbon- Aware Machine Learning Inference Service. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Association for Computing Machinery, New York, NY, USA, Article 20, 15 pages. doi:10....

work page doi:10.1145/3581784.3607034 2023

[39] [39]

Yueying Li, Omer Graif, and Udit Gupta. 2024. Towards Carbon-efficient LLM Life Cycle. InProceedings of the 3rd Workshop on Sustainable Computer Systems (HotCarbon). ACM, New York, NY, USA

work page 2024

[40] [40]

Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G Edward Suh, and Udit Gupta. 2025. Ecoserve: Designing carbon-aware ai inference systems.arXiv preprint arXiv:2502.05043(2025)

work page arXiv 2025

[41] [41]

Gonzalez, and Ion Stoica

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In17th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Boston, MA, 663...

work page 2023

[42] [42]

Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, and Kaipeng Zhang. 2024. ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capa- bility for Large Vision-Language Models. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgra...

work page 2024

[43] [43]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InProceedings of the ACM SIGCOMM 2024 Conference (SIGCOMM). Associat...

work page doi:10.1145/3651890 2024

[44] [44]

LLMPerf. 2024. LLMPerf Leaderboard

work page 2024

[45] [45]

LMCache Team. 2025. KV Cache Size Calculator. https://lmcache.ai/kv_cache_calculator.html

work page 2025

[46] [46]

LMCache Team. 2025. LMCache. https://lmcache.ai/

work page 2025

[47] [47]

Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. 2024. LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Seattle, WA, USA, 15630–15640

work page 2024

[48] [48]

Jialun Lyu, Jaylen Wang, Kali Frost, Chaojie Zhang, Celine Irvene, Esha Choukse, Rodrigo Fonseca, Ricardo Bianchini, Fiodar Kazhamiaka, and Daniel S. Berger. 2023. Myths and Misconceptions Around Reducing Carbon Embedded in Cloud Platforms. InProceedings of the 2nd Workshop on Sustainable Computer Systems (HotCarbon). ACM, Boston, MA, USA, Article 7, 7 pa...

work page doi:10.1145/3604930.3605717 2023

[49] [49]

Jialun Lyu, Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, Savyasachi Samal, Ioannis Manousakis, Lisa Hsu, Preetha Subbarayalu, Ashish Raniwala, Brijesh Warrier, Ricardo Bianchini, Bianca Schroeder, and Daniel S. Berger. 2023. Hyrax: Fail-in-Place Server Operation in Cloud Platforms. In17th USENIX Symposium on Operating...

work page 2023

[50] [50]

Diptyaroop Maji, Prashant Shenoy, and Ramesh K Sitaraman. 2023. Multi-Day Forecasting of Electric Grid Carbon Intensity Using Machine Learning. InProceedings of the 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys). ACM, New York, NY, USA, 19–33

work page 2023

[51] [51]

Sitaraman, and Prashant Shenoy

Diptyaroop Maji, Ramesh K. Sitaraman, and Prashant Shenoy. 2022. DACF: Day-ahead Carbon Intensity Forecasting of Power Grids using Machine Learning. InProceedings of the Thirteenth ACM International Conference on Future Energy Systems (e-Energy). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3538637.3538849

work page doi:10.1145/3538637.3538849 2022

[52] [52]

Sara McAllister, Fiodar Kazhamiaka, Daniel S Berger, Rodrigo Fonseca, Kali Frost, Aaron Ogus, Maneesh Sah, Ricardo Bianchini, George Amvrosiadis, Nathan Beckmann, et al . 2024. A call for research on storage emissions.ACM SIGENERGY Energy Informatics Review4, 5 (2024), 67–75

work page 2024

[53] [53]

Berger, George Amvrosiadis, Nathan Beckmann, and Gregory R

Sara McAllister, Yucong "Sherry" Wang, Benjamin Berg, Daniel S. Berger, George Amvrosiadis, Nathan Beckmann, and Gregory R. Ganger. 2024. FairyWREN: A Sustainable Cache for Emerging Write-Read-Erase Flash Interfaces. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Santa Clara, CA, 745–764. https://www.us...

work page 2024

[54] [54]

Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta- llama-3/. 26 Yuyang Tian, Desen Sun, Yi Ding, and Sihang Liu

work page 2024

[55] [55]

Micron. 2025. DDR4 SDRAM memory. https://www.micron.com/products/memory/dram-components/ddr4-sdram

work page 2025

[56] [56]

Sophia Nguyen, Beihao Zhou, Yi Ding, and Sihang Liu. 2024. Towards Sustainable Large Language Model Serving. In Proceedings of the 3rd Workshop on Sustainable Computer Systems (HotCarbon). ACM, New York, NY, USA

work page 2024

[57] [57]

OpenAI. 2023. ChatGPT. https://chatgpt.com/

work page 2023

[58] [58]

Ashraf, Christian Engelmann, Mallikarjun Shankar, and James H

George Ostrouchov, Don Maxwell, Rizwan A. Ashraf, Christian Engelmann, Mallikarjun Shankar, and James H. Rogers

work page

[59] [59]

InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

GPU lifetimes on titan supercomputer: Survival analysis and reliability. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE/ACM, Atlanta, Georgia, USA, 41

work page

[60] [60]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise improves GPU usage by splitting LLM inference phases. InInternational Symposium on Computer Architecture (ISCA). IEEE Press, Buenos Aires, Argentina, 118–132

work page 2024

[61] [61]

Smith, Nima PourNejatian, Anthony B

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin, Mona G. Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A. Mitchell, Naykky S. Ospina, Mustafa M. Ahmed, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. 2023. A study of generative large language model for medical researc...

work page doi:10.1038/s41746-023-00958-w 2023

[62] [62]

PuLP developers. 2025. PuLP: A Python Linear Programming API

work page 2025

[63] [63]

pyNVML Developers. 2025. pyNVML. https://pypi.org/project/nvidia-ml-py/

work page 2025

[64] [64]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC] https://arxiv.org/abs/2407. 00079

work page arXiv 2024

[65] [65]

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. InIEEE High Performance Extreme Computing Conference (HPEC). IEEE, Boston, MA, USA, 1–9. doi:10.1109/HPEC58863...

work page doi:10.1109/hpec58863.2023.10363447 2023

[66] [66]

Samsung. 2023. Samsung V-NAND SSD 990 PRO. https://download.semiconductor.samsung.com/resources/data- sheet/samsung_nvme_ssd_990_pro_datasheet_rev.2.0.pdf

work page 2023

[67] [67]

Seagate. 2025. The Decarbonizing Data Report. https://www.seagate.com/ca/en/resources/decarbonizing-data-report/

work page 2025

[68] [68]

ShareGPT. 2023. ShareGPT

work page 2023

[69] [69]

Tianyao Shi, Yanran Wu, Sihang Liu, and Yi Ding. 2024. GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions. arXiv:2412.20322 [cs.AR] https://arxiv.org/abs/2412.20322

work page arXiv 2024

[70] [70]

Tianyao Shi, Yanran Wu, Sihang Liu, and Yi Ding. 2025. Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving.IEEE Computer Architecture Letters24, 2 (2025), 369–372. doi:10.1109/LCA.2025.3630094

work page doi:10.1109/lca.2025.3630094 2025

[71] [71]

Smith et al

Taylor G. Smith et al. 2017–. pmdarima: ARIMA estimators for Python. http://www.alkaline-ml.com/pmdarima

work page 2017

[72] [72]

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. InIEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Las Vegas, NV, USA, 1348–1362. doi:10.1109/HPCA61900.2025.00102

work page doi:10.1109/hpca61900.2025.00102 2025

[73] [73]

Swamit Tannu and Prashant J. Nair. 2023. The Dirty Secret of SSDs: Embodied Carbon.SIGENERGY Energy Inform. Rev.3, 3 (Oct. 2023), 4–9. doi:10.1145/3630614.3630616

work page doi:10.1145/3630614.3630616 2023

[74] [74]

Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL] https://arxiv.org/abs/2403.05530

work page internal anchor Pith review Pith/arXiv arXiv 2024

[75] [75]

Jaylen Wang, Daniel S. Berger, Fiodar Kazhamiaka, Celine Irvene, Chaojie Zhang, Esha Choukse, Kali Frost, Rodrigo Fonseca, Brijesh Warrier, Chetan Bansal, Jonathan Stern, Ricardo Bianchini, and Akshitha Sriraman. 2025. Designing Cloud Servers for Lower Carbon. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA). IEEE P...

work page doi:10.1109/isca59077.2024.00041 2025

[76] [76]

Vinnie Wong. 2023. Gen AI’s Environmental Ledger: A Closer Look at the Carbon Footprint of ChatGPT. https: //piktochart.com/blog/carbon-footprint-of-chatgpt/

work page 2023

[77] [77]

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. 2022. Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems (MLSys)(2022)

work page 2022

[78] [78]

Leyi Yan, Linda Wang, Sihang Liu, and Yi Ding. 2025. EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting. InProceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (E-Energy). Association for Computing Machinery, New York, NY, USA, 208–212. doi:10.1145/3679240.3734630

work page doi:10.1145/3679240.3734630 2025

[79] [79]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang

work page

[80] [80]

InProceedings of the Twentieth European Conference on Computer Systems (EuroSys)

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. InProceedings of the Twentieth European Conference on Computer Systems (EuroSys). Association for Computing Machinery, New York, NY, USA, 94–109. doi:10.1145/3689031.3696098 Cache Your Prompt When It’s Green — Carbon-Aware Caching for Large Language Model Serving 27

work page doi:10.1145/3689031.3696098