GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3
The pith
GAR routes each LLM request to minimize carbon emissions while enforcing accuracy floors and p95 latency bounds across heterogeneous model pools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAR is a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives. It employs adaptive constraint optimization through per-dataset floor tuning together with lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. GAR-PD provides a practical online primal-dual routing algorithm for rolling carbon budgets, while heuristic variants maintain high feasibility coverage with limited accuracy degradation. Experiments on standard NLP benchmarks demonstrate substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees.
What carries the argument
The constrained multi-objective optimization that places carbon emissions as the objective and accuracy plus p95 latency as enforceable constraints, solved with per-dataset adaptive tuning and an online primal-dual algorithm.
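The paper's GAR-PD algorithm is not reproduced here, but the primal-dual idea it names can be sketched: score each candidate model by estimated carbon plus dual-weighted constraint violations, route to the minimizer, then nudge the dual variables toward any violated constraint. All model names, estimates, and step sizes below are invented for illustration.

```python
# Hypothetical sketch of a primal-dual routing step in the GAR-PD style.
# The model pool and its (accuracy, p95 latency, CO2) estimates are invented.
MODELS = {
    # model: (est_accuracy, est_p95_latency_s, est_co2_g)
    "7b":  (0.72, 0.8, 0.4),
    "13b": (0.78, 1.3, 0.9),
    "70b": (0.86, 2.9, 3.5),
}

def route(acc_floor, p95_slo, lam):
    """Pick the model minimizing carbon plus dual-penalized violations.

    lam = (lam_acc, lam_lat): dual variables for the accuracy-floor
    and latency-SLO constraints.
    """
    lam_acc, lam_lat = lam

    def penalized_cost(stats):
        acc, lat, co2 = stats
        return (co2
                + lam_acc * max(0.0, acc_floor - acc)
                + lam_lat * max(0.0, lat - p95_slo))

    return min(MODELS, key=lambda m: penalized_cost(MODELS[m]))

def dual_update(lam, chosen_stats, acc_floor, p95_slo, eta=0.1):
    """Projected gradient-ascent step on the duals after observing the choice."""
    acc, lat, _ = chosen_stats
    lam_acc = max(0.0, lam[0] + eta * (acc_floor - acc))
    lam_lat = max(0.0, lam[1] + eta * (lat - p95_slo))
    return (lam_acc, lam_lat)
```

With small duals the router tolerates a slight accuracy shortfall for a large carbon saving; as repeated violations push the accuracy dual up, the same request flips to a larger model.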
If this is right
- Requests can be steered to smaller or more efficient models when grid carbon intensity rises, provided the accuracy floor for that task remains satisfied.
- Per-dataset tuning lets the same framework adapt accuracy requirements to different benchmarks without manual retuning of every constraint.
- Rolling carbon budgets become enforceable over time windows rather than single requests.
- Heuristic approximations offer practical fallbacks that still respect the latency bound when exact optimization is computationally expensive.
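The first bullet can be made concrete: if per-request CO2 is modeled as energy per request times current grid intensity, the cheapest model that still satisfies the accuracy floor shifts as the grid gets cleaner or dirtier. A minimal sketch, with invented energy and accuracy numbers:

```python
# Illustrative sketch (not the paper's code): per-request CO2 is estimated
# as energy-per-request times grid carbon intensity, so the lowest-carbon
# feasible model can change with grid conditions. Numbers are assumed.
ENERGY_WH = {"7b": 1.0, "70b": 8.0}    # assumed energy per request (Wh)
ACCURACY  = {"7b": 0.72, "70b": 0.86}  # assumed per-task accuracy

def co2_grams(model, grid_g_per_kwh):
    """Estimated grams of CO2 for one request at the given grid intensity."""
    return ENERGY_WH[model] / 1000.0 * grid_g_per_kwh

def feasible_min_carbon(acc_floor, grid_g_per_kwh):
    """Among models meeting the accuracy floor, pick the lowest-CO2 one."""
    ok = [m for m in ENERGY_WH if ACCURACY[m] >= acc_floor]
    if not ok:
        return None  # infeasible: no model meets the floor
    return min(ok, key=lambda m: co2_grams(m, grid_g_per_kwh))
```

A loose floor lets the small model absorb traffic regardless of grid intensity; a tight floor forces the large model, and an unattainable floor surfaces infeasibility rather than silently degrading accuracy.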
Where Pith is reading between the lines
- The same constraint-based framing could be applied to other scarce resources such as memory bandwidth or specialized accelerator time.
- If the estimators generalize, the router could incorporate real-time regional carbon intensity data to prefer models located in cleaner grids.
- The approach points toward treating sustainability metrics as first-class constraints in any multi-model serving system rather than post-hoc filters.
Load-bearing premise
Lightweight estimators for accuracy, tail latency, and emissions can be trained to stay accurate enough for real-time decisions without extra model runs, and per-dataset floor tuning keeps the constraints feasible across different model sizes.
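A hedged illustration of what "lightweight estimator" could mean in practice: a per-model logistic predictor over cheap prompt features, evaluable before any inference pass. The features and functional form below are assumptions for illustration, not the paper's design.

```python
# Sketch of a lightweight correctness estimator: logistic regression over
# features computable from the prompt alone. Feature choices are invented.
import math

def features(prompt):
    toks = prompt.split()
    return [
        1.0,                                          # bias term
        len(toks) / 100.0,                            # prompt length
        sum(t.isdigit() for t in toks) / max(1, len(toks)),  # numeric density
    ]

def predict_correct_prob(prompt, weights):
    """Logistic model: estimated p(correct | prompt) for one candidate model."""
    z = sum(w * x for w, x in zip(weights, features(prompt)))
    return 1.0 / (1.0 + math.exp(-z))
```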
What would settle it
Run the router live on a new dataset with measured carbon intensity traces and check whether measured accuracy drops below the tuned floor or p95 latency exceeds the target on a statistically significant fraction of requests.
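The proposed check can be sketched as an offline audit over logged requests: compare empirical accuracy to the tuned floor and the nearest-rank p95 latency to the SLO (the significance test on the violation fraction is omitted here for brevity).

```python
# Hedged sketch of the validation described above; thresholds are inputs,
# not values from the paper.
import math

def p95(latencies):
    """Nearest-rank 95th percentile."""
    xs = sorted(latencies)
    k = max(0, math.ceil(0.95 * len(xs)) - 1)
    return xs[k]

def audit(correct_flags, latencies, acc_floor, p95_slo):
    """Check logged outcomes against the tuned floor and latency SLO."""
    acc = sum(correct_flags) / len(correct_flags)
    return {
        "accuracy": acc,
        "accuracy_ok": acc >= acc_floor,
        "p95_latency": p95(latencies),
        "latency_ok": p95(latencies) <= p95_slo,
    }
```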
Original abstract
The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions for LLM inference subject to accuracy floors and p95-latency SLOs. It employs lightweight estimators for correctness, tail latency, and carbon emissions to enable real-time routing without additional inference passes. The authors present GAR-PD, an online primal-dual algorithm for rolling carbon budgets, and heuristic variants. Experiments on standard NLP benchmarks with heterogeneous 7B-70B model pools are reported to achieve substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees.
Significance. If the experimental results and estimator accuracies hold, this work would be significant for sustainable AI and LLM serving. It fills a gap by incorporating variable grid carbon intensity and model energy differences into routing via constrained optimization, offering a practical method beyond cost/latency-focused approaches. The primal-dual formulation and per-dataset tuning provide theoretical grounding for online decisions under carbon budgets.
Major comments (2)
- [Abstract] The abstract asserts experimental success on NLP benchmarks with 'substantial carbon reductions' and 'competitive accuracy' but supplies no quantitative results, error bars, baseline comparisons, or details on estimator training and constraint satisfaction rates, leaving the central claim unsupported by visible evidence.
- [Framework description] The constrained optimization claims depend on lightweight estimators for p95 latency and carbon emissions achieving low prediction error to avoid SLO violations. No validation is provided of estimator fidelity across 7B-70B model sizes, prompt characteristics, or hardware variations, nor of how per-dataset accuracy-floor tuning maps to online feasibility without hidden gaps.
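The breakdown the second comment asks for could be reported as simply as estimator error grouped by model size. A sketch with hypothetical record fields:

```python
# Sketch of the requested validation: mean absolute estimator error
# broken down by model size. The record layout is hypothetical.
from collections import defaultdict

def mae_by_group(records):
    """records: iterable of (model_size, predicted, actual) tuples."""
    errs = defaultdict(list)
    for size, pred, actual in records:
        errs[size].append(abs(pred - actual))
    return {size: sum(v) / len(v) for size, v in errs.items()}
```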
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the abstract and provide more explicit validation details. We address each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Abstract] The abstract asserts experimental success on NLP benchmarks with 'substantial carbon reductions' and 'competitive accuracy' but supplies no quantitative results, error bars, baseline comparisons, or details on estimator training and constraint satisfaction rates, leaving the central claim unsupported by visible evidence.
  Authors: We agree that the abstract would be improved by including concrete quantitative results to support the claims. In the revised version, we will incorporate specific metrics drawn from the experimental results, such as average per-request CO2 reductions (with ranges across benchmarks), accuracy maintenance relative to baselines, p95 latency SLO satisfaction rates, estimator prediction errors, and constraint feasibility percentages. This will make the central contributions more evident while maintaining the abstract's length and focus.
  Revision: yes
- Referee: [Framework description] The constrained optimization claims depend on lightweight estimators for p95 latency and carbon emissions achieving low prediction error to avoid SLO violations. No validation is provided of estimator fidelity across 7B-70B model sizes, prompt characteristics, or hardware variations, nor of how per-dataset accuracy-floor tuning maps to online feasibility without hidden gaps.
  Authors: The manuscript reports estimator performance and experimental outcomes in the evaluation section, including prediction errors for the latency and carbon models. To address the request for more explicit validation, we will add a dedicated paragraph in the framework section summarizing estimator fidelity results broken down by model size (7B-70B), prompt characteristics, and constraint satisfaction rates under per-dataset accuracy-floor tuning. This will clarify the mapping to online feasibility. We note that hardware-variation analysis is limited to the tested setups and can be expanded with a limitations statement if needed.
  Revision: partial
Circularity Check
No significant circularity in GAR's optimization framework or claims
Full rationale
The paper defines GAR as a constrained multi-objective optimization that minimizes CO2 subject to accuracy floors and p95 SLOs, using standard primal-dual methods and lightweight estimators trained separately for correctness, latency, and emissions. No quoted equations or steps show a prediction reducing to its own fitted inputs by construction; no load-bearing self-citation underpins the central result; and no ansatz or uniqueness claim is imported from the authors' prior work. Experiments on NLP benchmarks with 7B-70B pools provide external validation that reductions are achieved while constraints are met, keeping the derivation self-contained with respect to its stated inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Lightweight estimators can predict per-model accuracy, p95 latency, and carbon emissions with sufficient fidelity for real-time constrained optimization.
Reference graph
Works this paper leans on
- [1] Ahmed Barrak, Ahmed Abdelsalam, Karan Jain, et al. Cargo: A framework for confidence-aware routing of LLM queries. arXiv preprint arXiv:2509.14899.
- [2] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.
- [3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- [4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [5] Jesse Dodge, Taylor Prewitt, Remi Tachet des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A. Smith, Nicole DeCario, and Will Buchanan. Measuring the carbon intensity of AI in cloud instances. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1877–1894, 2022.
- [6] Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A graph-based router for LLM selections. arXiv preprint arXiv:2410.03834.
- [7] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing systems. arXiv preprint arXiv:2403.12031.
- [8] Baolin Li, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. Clover: Toward sustainable AI with carbon-aware machine learning inference service. arXiv preprint arXiv:2304.09781.
- [9] Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Sprout: Green generative AI with carbon-efficient LLM inference. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22074–22086, 2024.
- [10] Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G. Edward Suh, and Udit Gupta. EcoServe: Designing carbon-aware AI inference systems. arXiv preprint arXiv:2502.05043.
- [11] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665.
- [12] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.