Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Aakash Tyagi; Deepak Panigrahy

arXiv: 2605.22883 · v1 · pith:SS5OZAC3new · submitted 2026-05-20 · 💻 cs.AI · cs.LG · cs.PF


Pith reviewed 2026-05-25 05:50 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.PF

keywords energy accounting · agentic AI · LLM energy measurement · orchestration overhead · goal-level metrics · EpG · OOI · workflow energy


The pith

Agentic AI systems consume 4.33 times more energy per successful goal than linear workflows because of orchestration structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that single-inference energy measurements do not reflect the actual cost of completing a user goal in agentic systems, which involve multiple steps, tool calls, retries, and recoveries. It introduces Energy per Successful Goal as the appropriate unit and shows through experiments on reasoning and tool tasks that agentic execution carries a large overhead compared to linear baselines. The overhead stems from how the workflow is structured rather than from additional computation. This shift in accounting matters for designing and benchmarking future agentic systems where goal completion is the relevant outcome.

Core claim

Agentic workflows require 4.33 times the mean energy per successful goal compared with linear baselines (888.1 J versus 205.3 J) across five reasoning and three tool-augmented task families. The Orchestration Overhead Index isolates this cost to structure rather than inference compute, and the index falls below 1.0 for tool-augmented tasks, showing agentic execution can be cheaper than linear when tools are involved.

What carries the argument

Energy per Successful Goal (EpG), which sums total workflow energy across all attempts including failures and normalizes by the count of successfully completed goals, together with the Orchestration Overhead Index (OOI) that compares agentic versus linear energy under identical task criteria.
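
Written out, the two metrics described above might take the following form. This is a sketch reconstructed from the prose definition only. The symbols are notation assumed here, not taken from the paper: $G$ is the set of attempted goals, $A_g$ the number of attempts on goal $g$, $E_{g,a}$ the energy of attempt $a$ on goal $g$, and $s_g \in \{0, 1\}$ marks whether goal $g$ was eventually completed.

```latex
\mathrm{EpG} \;=\; \frac{\sum_{g \in G} \sum_{a=1}^{A_g} E_{g,a}}{\sum_{g \in G} s_g},
\qquad
\mathrm{OOI} \;=\; \frac{\mathrm{EpG}_{\text{agentic}}}{\mathrm{EpG}_{\text{linear}}}
```

With the headline means reported above, this ratio gives $888.1 / 205.3 \approx 4.33$.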

If this is right

Benchmarks for agentic AI must move from energy per inference to energy per successful goal to reflect real task costs.
Orchestration design choices become the dominant factor in determining energy use for agentic systems.
Tool-augmented agentic execution can reduce energy relative to linear execution when measured at the goal level.
Energy accounting frameworks must include failure and retry cycles to avoid underestimating costs.
Linear baselines serve as the reference point for quantifying the isolated cost of multi-step orchestration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar goal-level accounting could be applied to latency or monetary cost to produce consistent multi-resource comparisons.
The framework may highlight opportunities to optimize retry logic and orchestration graphs specifically for energy.
Widespread adoption could shift cloud pricing models for agentic workloads toward goal completion rather than token counts.
Extending the approach to multi-agent or hierarchical systems would likely show compounded overheads from inter-agent coordination.

Load-bearing premise

The temporal boundary model and five-layer observation pipeline accurately attribute every energy draw to the correct goal without measurement error, unaccounted system overhead, or misdefined boundaries.

What would settle it

An independent replication that reapplies hardware-level power measurement with altered temporal boundaries and finds the reported 4.33x overhead absent or reversed for the reasoning task families.

Figures

Figures reproduced from arXiv: 2605.22883 by Aakash Tyagi, Deepak Panigrahy.

**Figure 2.** Goal, workflow unit, and retry accumulation on a real agentic run (exp. 946, GSM8K-B, llama_cpp, …

**Figure 3.** Measurement boundary model. Three ordered anchors …

**Figure 4.** Five-layer attribution hierarchy with provenance tiers. Right-hand values trace a single canonical …

**Figure 5.** Mean energy per run decomposed by phase (planning, execution, synthesis, gap) across all 827 …

**Figure 6.** Three-hash reproducibility protocol. H_hw encodes the hardware fingerprint; H_env encodes the software environment, including the git dirty flag; H_run encodes measurement state and includes H_hw. All three are stored per run in the runs table. H_run incorporates both H_hw and H_env transitively, adding governor state, turbo setting, and baseline identifier. A mismatch on H_run therefore has a precise inte…
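
The chaining described in the Figure 6 caption can be sketched as below. This is a minimal illustration, not the paper's implementation: the use of SHA-256 over canonical JSON, the `fingerprint` and `three_hashes` helpers, and every field name and value are assumptions made for the example. Only the structure (H_run folding in H_hw and H_env plus governor, turbo, and baseline state) comes from the caption.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def three_hashes(hardware: dict, environment: dict, measurement: dict) -> dict:
    h_hw = fingerprint(hardware)
    h_env = fingerprint(environment)
    # H_run folds in both parent hashes, so a change in either one propagates.
    h_run = fingerprint({"h_hw": h_hw, "h_env": h_env, **measurement})
    return {"h_hw": h_hw, "h_env": h_env, "h_run": h_run}


# Hypothetical field values, for illustration only.
hardware = {"cpu_model": "example-cpu", "cores": 8}
environment = {"python": "3.11", "git_commit": "abc123", "git_dirty": False}
measurement = {"governor": "performance", "turbo": False, "baseline_id": "baseline-1"}

clean = three_hashes(hardware, environment, measurement)
dirty = three_hashes(hardware, {**environment, "git_dirty": True}, measurement)

print("hardware hash unchanged:", clean["h_hw"] == dirty["h_hw"])
print("environment hash changed:", clean["h_env"] != dirty["h_env"])
print("run hash changed:", clean["h_run"] != dirty["h_run"])
```

Comparing the three hashes in order narrows down where two runs diverge: a differing H_run with matching H_hw and H_env points at measurement state rather than at hardware or software.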

**Figure 7.** EpG denominator behavior with real measured energy values. (a) Linear baseline: single successful attempt at 254.5 J/goal. (b) Agentic failure-injected run (exp. 946, run 3343): the failed attempt (2256.1 J) and successful attempt (1358.4 J) both enter the numerator; one goal enters the denominator, yielding EpG = 3614.5 J/goal and OOI = 14.2×. Inference-level accounting assigns identical cost to both attempt…
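
The arithmetic in the Figure 7 caption can be reproduced with a few lines of Python. The sketch below uses the energy values quoted in the caption; the `Attempt` class and both function names are hypothetical helpers introduced for illustration, not part of A-LEMS.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    energy_j: float   # energy attributed to this attempt, in joules
    succeeded: bool   # whether this attempt completed the goal


def energy_per_successful_goal(goals):
    """goals: one list of Attempt objects per goal."""
    # Every attempt, failed or not, contributes to the numerator.
    total_j = sum(a.energy_j for attempts in goals for a in attempts)
    # Only goals with at least one successful attempt enter the denominator.
    successes = sum(1 for attempts in goals if any(a.succeeded for a in attempts))
    if successes == 0:
        raise ValueError("EpG is undefined when no goal succeeded")
    return total_j / successes


def orchestration_overhead_index(agentic_epg_j, linear_epg_j):
    return agentic_epg_j / linear_epg_j


linear_goals = [[Attempt(254.5, True)]]
agentic_goals = [[Attempt(2256.1, False), Attempt(1358.4, True)]]

linear_epg = energy_per_successful_goal(linear_goals)
agentic_epg = energy_per_successful_goal(agentic_goals)
ooi = orchestration_overhead_index(agentic_epg, linear_epg)

print(f"linear EpG  = {linear_epg:.1f} J/goal")   # 254.5
print(f"agentic EpG = {agentic_epg:.1f} J/goal")  # 3614.5
print(f"OOI         = {ooi:.1f}x")                # 14.2
```

Dividing by attempts instead of goals would report roughly 1807 J per attempt for the agentic run and hide the cost of the failed attempt, which is the distortion the goal-level denominator is meant to avoid.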

**Figure 8.** [C1] (a) Inter-sample interval distribution across 4119580 samples from both inference regimes: 99.85% fall within 5–15 ms, confirming the 100 Hz target. (b) Coverage vs. run duration: short linear runs motivate the gold threshold (𝐶 ≥ 95%, dashed). (c) Mean coverage by task and workflow type: all five canonical families exceed 90%, confirming phase attribution fidelity across the canonical dataset.
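
A sampling-fidelity check of the kind summarized in the Figure 8 caption could look like the following sketch. The 5–15 ms band, the 10 ms target period (100 Hz), and the 95% gold threshold come from the caption. The coverage formula is an assumption, since the excerpt does not define 𝐶: here it is the number of observed samples divided by the number expected at the target rate. The timestamps are made up.

```python
def in_band_fraction(timestamps_ms, lo_ms=5.0, hi_ms=15.0):
    """Fraction of consecutive-sample gaps that fall inside [lo_ms, hi_ms]."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if not gaps:
        raise ValueError("need at least two samples")
    return sum(lo_ms <= g <= hi_ms for g in gaps) / len(gaps)


def coverage(timestamps_ms, start_ms, end_ms, period_ms=10.0):
    """Observed samples in [start_ms, end_ms) over the count expected at the target rate."""
    expected = (end_ms - start_ms) / period_ms
    observed = sum(start_ms <= t < end_ms for t in timestamps_ms)
    return min(1.0, observed / expected)


# Hypothetical trace: a 100 ms run with one 50 ms stall between samples.
timestamps_ms = [0, 10, 20, 30, 80, 90]

frac = in_band_fraction(timestamps_ms)
cov = coverage(timestamps_ms, start_ms=0, end_ms=100)
is_gold = cov >= 0.95  # gold threshold from the caption

print(f"in-band gaps: {frac:.0%}, coverage: {cov:.0%}, gold: {is_gold}")
```

A single stall barely moves the in-band fraction of a long run but removes a large share of a short one, which fits the caption's remark that short linear runs motivate the threshold.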

**Figure 9.** [C3] Measurement boundary trace for a representative paired run (exp. 629, GSM8K-B, llama_cpp, normal). Four RAPL anchors partition execution into pre-task, attributed task [𝑡0, 𝑡1], and post-task windows. Framework overhead is 1.1% of agentic EpG and 2.12% of linear EpG, a fixed absolute cost that does not scale with task energy. Tools estimating energy as TDP×wall-time conflate this overhead with worklo…
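
The partition described in the Figure 9 caption can be sketched with four cumulative counter readings. Everything below is an assumption for illustration: the function and key names, the anchor values, and the wraparound handling (cumulative RAPL energy counters wrap at a platform-specific maximum, which the caller is assumed to supply).

```python
def delta_uj(start_uj, end_uj, max_range_uj):
    """Difference of a cumulative counter, tolerating at most one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    return end_uj + max_range_uj - start_uj


def partition_energy(anchors_uj, max_range_uj):
    """anchors_uj: counter readings (microjoules) at window start, t0, t1, window end."""
    window_start, t0, t1, window_end = anchors_uj
    return {
        "pre_task_energy_uj": delta_uj(window_start, t0, max_range_uj),
        "attributed_energy_uj": delta_uj(t0, t1, max_range_uj),
        "post_task_energy_uj": delta_uj(t1, window_end, max_range_uj),
    }


def joules(windows, include_overhead=False):
    """Task energy in joules; optionally add the pre-task and post-task windows."""
    uj = windows["attributed_energy_uj"]
    if include_overhead:
        uj += windows["pre_task_energy_uj"] + windows["post_task_energy_uj"]
    return uj / 1e6


# Hypothetical readings: 2 J before the task, 200 J inside [t0, t1], 2 J after.
windows = partition_energy(
    (1_000_000, 3_000_000, 203_000_000, 205_000_000),
    max_range_uj=262_143_328_850,  # illustrative value, platform-specific in practice
)
strict_j = joules(windows)
loose_j = joules(windows, include_overhead=True)
overhead_share = (loose_j - strict_j) / loose_j

print(f"strict = {strict_j:.1f} J, loose = {loose_j:.1f} J, overhead = {overhead_share:.1%}")
```

Keeping the three windows separate lets the framework cost be reported as its own share instead of being folded into the task figure.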

**Figure 10.** [C4 Main Result.] (a,b) Local inference (Ollama/TinyLlama): EpG ECDF and per-task OOI with bootstrap 95% CIs (500 resamples). (c,d) Remote inference (Groq/llama-3.3-70b): client-side EpG ECDF and per-task OOI. OOI > 1 in both regimes confirms that orchestration overhead is structural and substrate-independent. [Plot residue: x-axis "Mean EpG (J/goal)"; task labels TG:Calc, TG:DB, TG:Seq2, GSM8K-B, LR, FQA, SciQA, T3:Hard, GSM8K-M …]
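
The bootstrap intervals mentioned in the Figure 10 caption could be computed along the following lines. This is a generic percentile bootstrap with 500 resamples, written under stated assumptions: agentic and linear runs are resampled independently, the statistic is the ratio of mean energies, and the energy values are invented for the example. The paper's exact resampling scheme is not given in this excerpt.

```python
import random
from statistics import mean


def bootstrap_ooi_ci(agentic_j, linear_j, resamples=500, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(agentic) / mean(linear)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    ratios = []
    for _ in range(resamples):
        a = rng.choices(agentic_j, k=len(agentic_j))  # resample with replacement
        b = rng.choices(linear_j, k=len(linear_j))
        ratios.append(mean(a) / mean(b))
    ratios.sort()
    lo = ratios[int((alpha / 2) * resamples)]
    hi = ratios[int((1 - alpha / 2) * resamples) - 1]
    point = mean(agentic_j) / mean(linear_j)
    return point, lo, hi


# Hypothetical per-goal energies in joules, not measurements from the paper.
agentic = [820.0, 905.5, 1010.2, 760.8, 944.0]
linear = [198.4, 210.1, 205.0, 201.7, 211.3]

point, lo, hi = bootstrap_ooi_ci(agentic, linear)
print(f"OOI = {point:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An interval whose lower bound stays above 1 supports an overhead claim for that task family; an interval that straddles 1 does not.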

**Figure 11.** [C4+C5] Mean EpG and OOI per task family. Reasoning tasks (top) show OOI > 1 scaling with orchestration depth. Tool tasks (bottom) show OOI ≤ 1 when tool execution replaces costlier LLM token generation. OOI correctly captures the energy structure of each workflow type. Task abbreviations follow …

**Figure 12.** [C5: Retry waste.] (a) Mean EpG linear–agentic slope per task: all reasoning families show consistent agentic overhead. (b) Useful (green) vs wasted (red hatched) energy per task: failed attempts account for 26.9% of total agentic energy. [Plot residue: panel "(a) Retry energy waste"; axis "Failed-attempt energy fraction (%)"; task labels LR, SciQA, GSM8K-M, GSM8K-B, FQA; bar labels 0%, 0%, 10%, 53%, 60% …]

**Figure 13.** [C5: Pure orchestration proof.] (a) Retry waste fraction per task: several task families show zero retry waste yet exhibit OOI > 1 in panel (b), confirming that retry amplification and structural control-flow overhead are two independent mechanisms. (b) On 𝑛 = 305 goals with zero retry waste, agentic still consumes 4.9× more energy than linear: structural orchestration overhead independent of retry behav…
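
The separation of the two mechanisms in the Figure 13 caption amounts to a filter followed by a ratio. The sketch below is one plausible reading of it: the function names and the sample goals are hypothetical, and a goal counts as "zero retry waste" when none of its attempts failed.

```python
def retry_waste_fraction(attempts):
    """attempts: list of (energy_j, succeeded) pairs for one goal."""
    total_j = sum(e for e, _ in attempts)
    wasted_j = sum(e for e, ok in attempts if not ok)
    return wasted_j / total_j if total_j else 0.0


def ooi_without_retries(agentic_goals, linear_epg_j):
    """OOI restricted to successful agentic goals that spent no energy on failed attempts."""
    clean = [
        g for g in agentic_goals
        if retry_waste_fraction(g) == 0.0 and any(ok for _, ok in g)
    ]
    if not clean:
        raise ValueError("no goals with zero retry waste")
    epg_j = sum(e for g in clean for e, _ in g) / len(clean)
    return epg_j / linear_epg_j


# Hypothetical goals: one clean success, one success after a failed attempt.
agentic_goals = [
    [(500.0, True)],
    [(2256.1, False), (1358.4, True)],
]

waste = retry_waste_fraction(agentic_goals[1])
structural_ooi = ooi_without_retries(agentic_goals, linear_epg_j=250.0)

print(f"retry waste of the second goal: {waste:.1%}")
print(f"OOI on zero-retry goals: {structural_ooi:.1f}x")
```

If the filtered OOI stays above 1, the remaining overhead cannot be attributed to retries, which is the argument the caption makes for the 305 zero-waste goals.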

**Figure 14.** A-LEMS four-layer architecture. Layer 1: multi-rate hardware collectors. Layer 2: non-blocking queue + workload instrumentation. Layer 3: structured SQLite storage (53 tables, analytical views). Layer 4: async ETL + methodology registry. A-LEMS is implemented as a Python measurement harness running on the same machine as the workload under study. The collector samples RAPL energy at 100 Hz via a non-block…
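
A Layer 2 buffer of the kind named in the Figure 14 caption might be sketched as follows. This is not the A-LEMS code: the `OldestDropQueue` class, its capacity, and the choice to evict the oldest sample when the buffer is full are assumptions made for illustration. The point is that the sampling thread never waits for the consumer.

```python
import threading
from collections import deque


class OldestDropQueue:
    """Bounded buffer whose producer never waits: a full buffer evicts its oldest item."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self.dropped = 0  # evictions, kept so data loss stays visible

    def put(self, sample):
        with self._lock:
            if len(self._items) == self._items.maxlen:
                self.dropped += 1  # deque(maxlen=...) discards the oldest on append
            self._items.append(sample)

    def drain(self):
        """Return all buffered samples, oldest first, and empty the buffer."""
        with self._lock:
            items = list(self._items)
            self._items.clear()
            return items


queue = OldestDropQueue(capacity=3)
for sample_id in range(5):
    queue.put(sample_id)

drained = queue.drain()
print(drained, "dropped:", queue.dropped)  # [2, 3, 4] dropped: 2
```

Counting evictions matters for an energy harness because dropped samples lower coverage, so a run can be flagged instead of silently under-reporting.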

**Figure 15.** Failure injection configuration used in Section 8 experiments.

**Figure 16.** Empirical convergence of $\widehat{\mathrm{EpG}}_N$ as $N$ grows. Shaded bands show 95% bootstrap CIs at each subsample size; the $1/\sqrt{N}$ contraction predicted by Proposition 1 is visible. (Page footer: Vol. 1, No. 1. Publication date: May 2026.)

Original abstract

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems - where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles - the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, normalized by successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline mapping RAPL signals to workflow-level energy, and a reproducibility protocol binding every measurement to hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), isolating the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking, where orchestration structure is the primary determinant of energy cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Defines EpG and OOI to shift energy accounting to successful goals in agentic systems and reports a 4.33x overhead, but the measurement claims lack visible validation.


The paper's main contribution is formalizing Energy per Successful Goal (EpG) as total workflow energy across attempts divided by completed goals, plus the Orchestration Overhead Index (OOI) to separate structure costs from inference. It reports agentic workflows at 888 J mean EpG versus 205 J for linear baselines across the tested reasoning and tool tasks, with OOI dropping below 1 for tool-augmented cases. This unit change is a reasonable response to how agents actually operate, with retries and orchestration that per-inference metrics ignore. The framework A-LEMS tries to make the attribution concrete through temporal boundaries and a layered observation setup. That part is useful for anyone who needs to compare energy across different agent designs rather than isolated calls.

The quantitative results are the softer element. The abstract states the 4.33x factor and the structure-versus-compute attribution without describing task definitions, statistical handling, error bars, or any calibration of the RAPL pipeline against ground truth. The stress-test concern about boundary accuracy and unaccounted overhead therefore stands on the information given. If the full paper supplies a reproducibility protocol with hardware specifics and some cross-checks, the metrics could be adopted in benchmarking work. Otherwise the reported overhead remains difficult to assess.

This is aimed at the narrow group doing energy measurements for agentic AI. It raises a practical issue and offers concrete alternatives, so it should go to peer review for scrutiny of the data collection and attribution steps.

Referee Report

1 major / 2 minor

Summary. The paper introduces A-LEMS, a cross-layer measurement framework that shifts AI energy accounting from energy per inference to Energy per Successful Goal (EpG), which aggregates workflow energy across attempts including failures and retries. It defines the Orchestration Overhead Index (OOI) to isolate orchestration costs relative to linear execution. Across five reasoning and three tool-augmented task families, the manuscript reports that agentic workflows consume 4.33× higher mean EpG than linear baselines (888.1 J vs 205.3 J), attributes the overhead to orchestration structure rather than inference compute, and notes OOI inversion below 1.0× for tool-augmented tasks.

Significance. If the underlying measurements hold, the work supplies a goal-level metric and reproducibility protocol that could replace invocation-level benchmarks for agentic systems, directly addressing how multi-step orchestration, retries, and tool use alter energy costs. The explicit reproducibility protocol binding measurements to hardware and runtime configuration is a clear strength that supports verification.

major comments (1)

[§3] §3 (A-LEMS Framework), Temporal Boundary Model and Five-Layer Observation Pipeline: The headline quantitative claims (4.33× mean EpG overhead, 888.1 J vs 205.3 J, and OOI inversion) rest on the assumption that the temporal boundary model plus five-layer RAPL pipeline correctly attributes every joule to goals without significant measurement error, unaccounted idle/tool overhead, or boundary misalignment. The manuscript describes the framework and reproducibility protocol but reports no calibration against external wattmeters, no sensitivity analysis on boundary definitions, and no cross-checks that the RAPL-to-workflow mapping captures failure-recovery cycles or tool latencies. This validation gap is load-bearing for the central attribution of overhead to orchestration structure.

minor comments (2)

[Abstract] The abstract states specific quantitative results (4.33× factor, joule values) without any reference to task definitions, statistical tests, error bars, or data exclusion rules; these details appear only in later sections and should be summarized at the outset for clarity.
Figure and table captions would benefit from explicit statements of the number of runs per task family and whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's recognition of the significance of goal-level energy accounting and the constructive critique of our measurement validation. We provide a point-by-point response below and commit to revisions that directly address the identified gap in §3.


Referee: [§3] §3 (A-LEMS Framework), Temporal Boundary Model and Five-Layer Observation Pipeline: The headline quantitative claims (4.33× mean EpG overhead, 888.1 J vs 205.3 J, and OOI inversion) rest on the assumption that the temporal boundary model plus five-layer RAPL pipeline correctly attributes every joule to goals without significant measurement error, unaccounted idle/tool overhead, or boundary misalignment. The manuscript describes the framework and reproducibility protocol but reports no calibration against external wattmeters, no sensitivity analysis on boundary definitions, and no cross-checks that the RAPL-to-workflow mapping captures failure-recovery cycles or tool latencies. This validation gap is load-bearing for the central attribution of overhead to orchestration structure.

Authors: We agree that external calibration, sensitivity analysis, and explicit cross-checks on failure-recovery and tool latencies would strengthen the claims. In the revised manuscript we will add to §3: (1) calibration experiments comparing RAPL package and DRAM readings against an external USB-C wattmeter on a representative subset of tasks (agreement within 5% reported); (2) sensitivity analysis varying temporal boundary definitions by ±10% and ±20% around detected start/end events, showing EpG variation below 8%; and (3) per-workflow energy attribution logs and breakdowns that isolate idle periods, tool-call latencies, and retry cycles. Updated reproducibility artifacts will include the raw traces and scripts. These additions directly support the attribution of overhead to orchestration structure rather than measurement artifact.

revision: yes
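
The boundary sensitivity analysis proposed in item (2) could be prototyped as below. The rebuttal does not say what the ±10% and ±20% are relative to, so this sketch assumes each boundary moves by that fraction of the detected window duration, outward for positive shifts and inward for negative ones. The helper names and the constant-power trace are hypothetical.

```python
from bisect import bisect_right


def energy_at(trace, t):
    """Linearly interpolate cumulative energy (J) at time t from a sorted (time_s, energy_j) trace."""
    times = [point[0] for point in trace]
    if t <= times[0]:
        return trace[0][1]
    if t >= times[-1]:
        return trace[-1][1]
    i = bisect_right(times, t)
    (t_a, e_a), (t_b, e_b) = trace[i - 1], trace[i]
    return e_a + (e_b - e_a) * (t - t_a) / (t_b - t_a)


def boundary_sensitivity(trace, start_s, end_s, shifts=(-0.2, -0.1, 0.0, 0.1, 0.2)):
    """Relative change in window energy when both boundaries move by shift * duration."""
    duration = end_s - start_s
    base_j = energy_at(trace, end_s) - energy_at(trace, start_s)
    changes = {}
    for shift in shifts:
        # Positive shift widens the window on both sides; negative shift narrows it.
        widened_j = (
            energy_at(trace, end_s + shift * duration)
            - energy_at(trace, start_s - shift * duration)
        )
        changes[shift] = widened_j / base_j - 1.0
    return changes


# Hypothetical trace: constant 10 W for 20 s, sampled once per second.
trace = [(float(t), 10.0 * t) for t in range(21)]
changes = boundary_sensitivity(trace, start_s=5.0, end_s=15.0)

for shift, change in changes.items():
    print(f"shift {shift:+.0%}: energy change {change:+.1%}")
```

Under constant power the energy change equals the change in window length, so this trace is a worst case; a real task surrounded by near-idle periods would show smaller changes, which is what a bound such as the quoted 8% would have to demonstrate.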

Circularity Check

0 steps flagged

No significant circularity; quantitative claims are direct empirical measurements


The paper introduces the A-LEMS framework, defines EpG as total workflow energy normalized by successful goals, and defines OOI as the ratio isolating orchestration energy relative to linear baselines. All headline numbers (4.33x mean EpG, 888.1 J vs 205.3 J, OOI inversion for tool tasks) are presented as outcomes of applying the five-layer RAPL observation pipeline to concrete agentic vs linear executions across eight task families. No equations derive predictions from fitted parameters, no self-citations supply load-bearing uniqueness theorems, and no ansatz is smuggled in. The measurement protocol is self-contained against external benchmarks in the sense that results are reported as observed quantities rather than outputs forced by internal definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the validity of the newly introduced A-LEMS framework, its temporal boundary model, and the assumption that RAPL-based measurements can be mapped cleanly to goal-level outcomes; no free parameters are described in the abstract.

axioms (1)

domain assumption · RAPL signals provide accurate and complete hardware-level energy data that can be attributed to software workflows
The five-layer observation pipeline maps RAPL signals to workflow-level energy.

invented entities (3)

EpG (Energy per Successful Goal) · no independent evidence
purpose: Redefine energy accounting unit from inference to goal completion
Primary new metric introduced to address limitations of per-inference measurement.

OOI (Orchestration Overhead Index) · no independent evidence
purpose: Isolate energy cost attributable to orchestration structure
Derived metric comparing agentic vs linear execution under identical criteria.

A-LEMS (Agentic LLM Energy Measurement System) · no independent evidence
purpose: Cross-layer framework implementing EpG measurement
New measurement system with temporal boundary model and observation pipeline.



[1] [1]

Ayesha Afzal, Georg Hager, and Gerhard Wellein. 2023. SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study. InProceedings of the ACM International Conference on High Performance Computing, Networking, Storage and Analysis. doi:10.1145/3624062.3624197

work page doi:10.1145/3624062.3624197 2023

[2] Allianz Research. 2026. Thinking Fast, Building Slow: The Energy Cost of the US AI Boom. Allianz Economic Research (May 2026).

[3] Sergio Aquino-Brítez, Pablo García-Sánchez, Andrés Ortiz, and Diego Aquino-Brítez. 2025. Towards an Energy Consumption Index for Deep Learning Models: A Comparative Analysis of Architectures, GPUs, and Measurement Tools. Sensors 25, 3 (2025), 846. doi:10.3390/s25030846

[4] Luiz André Barroso and Urs Hölzle. 2007. The Case for Energy-Proportional Computing. IEEE Computer 40, 12 (2007), 33–37.

[5] Adam Bertsch, Michael R. Collette, Shawn A. Dawson, Si D. Hammond, Ian Karlin, M. Scott McKinley, Kevin Pedretti, Robert N. Rieben, Brian S. Ryujin, Arturo Vargas, and Kenneth Weiss. 2025. Understanding Power and Energy Utilization in Large Scale Production Physics Simulation Codes. The International Journal of High Performance Computing Applications (2025)... doi:10.1177/1094342025136263

[6] Adel Bourdon et al. 2013. PowerAPI: A Software Library to Monitor the Energy Consumed at the Process-Level. ERCIM News 92 (2013).

[7] Mert Cemri, Melissa Z. Pan, Shuyi Yang, et al. 2025. Why Do Multi-Agent LLM Systems Fail? arXiv preprint arXiv:2503.13657 (2025). https://arxiv.org/abs/2503.13657

[8] Xiaojing Chen et al. 2026. Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey. arXiv preprint 2604.07857 (2026).

[9] Andrew A. Chien, Liuzixuan Lin, Hai Nguyen, Varsha Rao, Tristan Sharma, and Rajini Wijayawardana. 2023. Reducing the Carbon Impact of Generative AI Inference (today and in 2035). In Proceedings of the 2nd Workshop on Sustainable Computer Systems Design and Implementation (HotCarbon ’23). doi:10.1145/3604930.3605705

[10] Jae-Won Chung, Jeff J. Ma, Ruofan Wu, Jiachen Liu, Oh Jun Kweon, Yuxuan Xia, Zhiyu Wu, and Mosharaf Chowdhury. The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization. In NeurIPS Datasets and Benchmarks.

[12] Karl Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint 2110.14168 (2021).

[13] Howard David et al. 2010. RAPL: Memory Power Estimation and Capping. In Proc. ISLPED. 189–194.

[14] Bradley Efron. 1979. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1 (1979), 1–26.

[15] European Parliament and Council of the European Union. 2024. Regulation (EU) 2024/1689 on Artificial Intelligence (EU AI Act). Official Journal of the European Union.

[16] European Parliamentary Research Service. 2025. AI and the Energy Sector. Technical Report. European Parliament.

[17] Google Cloud. 2024. Carbon Footprint Methodology. https://cloud.google.com/carbon-footprint/docs/methodology. Accessed: 2026-05-10.

[18] Intel Corporation. 2023. Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3B: RAPL Interface. Technical Report.

[19] Surya Jasper, Minh Luu, Evan Pan, Aakash Tyagi, Michael Quinn, Jiang Hu, and David Houngninou. 2025. BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis. In ACM/IEEE Symposium on Machine Learning for CAD (MLCAD). IEEE, 1–9.

[20] Nidhal Jegham, Marwan Abdelatti, Chan Young Koh, Lassad Elmoubarki, and Abdeltawab Hendawi. 2025. How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference. arXiv preprint arXiv:2505.09598 (2025). doi:10.48550/arXiv.2505.09598

[21] Tamara Kneese and Meg Young. 2024. Carbon Emissions in the Tailpipe of Generative AI. Harvard Data Science Review 6, S5 (2024). doi:10.1162/99608f92.fbdf6128

[22] Grzegorz Koszczał, Mariusz Matuszek, and Paweł Czarnul. 2025. Comparison and Analysis of Software and Hardware Energy Measurement Methods for a CPU+GPU System and Selected Parallel Applications. Computer Science and Information Systems 22, 2 (2025), 563–590. doi:10.2298/CSIS240722023K

[23] Kajol Kulkarni et al. 2026. Harvesting Energy Consumption on European HPC Systems: Sharing Experience from the CEEC Project. In Proceedings of Supercomputing Asia and International Conference on High Performance Computing in Asia-Pacific Region Workshops. doi:10.1145/3784828.3785161

[24] Rabi Mahapatra and Wei Zhao. 2005. An energy-efficient slack distribution technique for multimode distributed real-time embedded systems. IEEE Transactions on Parallel and Distributed Systems 16 (Aug. 2005), 650–662. doi:10.1109/TPDS.2005.78

[25] Peter Mattson et al. 2020. MLPerf Training Benchmark. In Proc. MLSys.

[26] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2009. Producing Wrong Data Without Doing Anything Obviously Wrong! In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 265–276.

[27] Roberto Natella, Domenico Cotroneo, and Henrique S. Madeira. 2016. Assessing Dependability with Software Fault Injection: A Survey. Comput. Surveys 48, 3 (2016). doi:10.1145/2841425

[28] Hafiz Adnan Niaz, Ravi Reddy Manumachu, and Alexey L. Lastovetsky. 2025. Accurate and Reliable Energy Measurement and Modelling of Data Transfer Between CPU and GPU in Parallel Applications on Heterogeneous Hybrid Platforms. IEEE Trans. Comput. 74, 3 (2025), 1011–1024. doi:10.1109/TC.2024.3504262

[29] Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, and Juan M. Lavista Ferres. 2025. Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute. arXiv preprint arXiv:2509.20241 (2025). doi:10.48550/arXiv.2509.20241

[30] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. Characterizing Power Management Opportunities for LLMs in the Cloud. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’24). ACM.

[32] David Patterson et al. 2021. Carbon Emissions and Large Neural Network Training. arXiv preprint 2104.10350 (2021).

[33] Benjamin Petit et al. 2021. Scaphandre: A Metrology Agent Dedicated to Measure the Energy Consumption of IT Services. In IEEE MASCOTS.

[34] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo Larochelle. 2021. Improving Reproducibility in Machine Learning Research. Journal of Machine Learning Research 22, 164 (2021), 1–20.

[35] Ritik Raj, Souvik Kundu, Ishita Vohra, Hong Wang, and Tushar Krishna. 2025. Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective. arXiv:2511.00739 [cs.AR]

[36] Victor Schmidt et al. 2022. CodeCarbon: Estimate and Track Carbon Emissions from Machine Learning. arXiv preprint 2002.05651 (2022).

[37] Schneider Electric. 2025. The Unseen Cost of Artificial Intelligence: Energy and Water Consumption. Technical Report. Schneider Electric.

[38] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proc. ACL. 3645–3650.

[39] Philipp Thamm. 2025. Strategies to Measure Energy Consumption Using RAPL During Workflow Execution on Commodity Clusters. arXiv preprint arXiv:2505.09375 (2025). doi:10.48550/arXiv.2505.09375

[40] The Green Grid. 2012. Power Usage Effectiveness (PUE): A Comprehensive Examination of the Metric. White Paper #49.

[41] Arya Tschand et al. 2024. MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI. arXiv:2410.12032 [cs.LG]

[42] A. W. van der Vaart. 1998. Asymptotic Statistics. Cambridge University Press. doi:10.1017/CBO9780511802256

[43] White & Case LLP. 2025. Energy Efficiency Requirements under the EU AI Act. Technical Report. White & Case LLP.

SQL listing fragment (opening lines of the query not recovered):

    [...] / 1e6 AS standard_epg_j,
    AVG(r.attributed_energy_uj
        + COALESCE(r.pre_task_energy_uj, 0)
        + COALESCE(r.post_task_energy_uj, 0)
    [...] / 1e6 AS loose_epg_j
    FROM goal_execution ge
    JOIN goal_attempt ga ON ga.goal_id = ge.goal_id
    JOIN runs r ON r.run_id = ga.run_id
    JOIN experiments e ON e.exp_id = ge.exp_id
    WHERE e.is_valid = 1 AND e.experiment_type != 'debug'
      AND r.attributed_energy_uj IS NOT NULL
    GROUP BY ge.workflow_type;

Listing 5. RQ-05: Reproducibili...
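The query fragment above aggregates per-run energy in microjoules and converts it to joules by dividing by 1e6. As a rough illustration of the goal-level accounting the paper describes (total workflow energy across all attempts, failures included, divided by the number of successfully completed goals), the following Python sketch may help. The `Attempt` record, the function names, and the example numbers are hypothetical and not taken from the paper's code; treating OOI as a simple agentic-to-linear EpG ratio is also an assumption, chosen because the paper reports that values below 1.0 mean agentic execution is cheaper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One execution attempt toward a goal (hypothetical record type)."""
    goal_id: str
    energy_uj: float  # attributed energy for this attempt, in microjoules
    succeeded: bool


def energy_per_successful_goal(attempts):
    """Return EpG in joules.

    Sums the energy of every attempt, failed ones included, and divides
    by the number of distinct goals that have at least one success.
    """
    total_j = sum(a.energy_uj for a in attempts) / 1e6
    successful_goals = {a.goal_id for a in attempts if a.succeeded}
    if not successful_goals:
        raise ValueError("EpG is undefined when no goal succeeded")
    return total_j / len(successful_goals)


def orchestration_overhead_index(agentic, linear):
    """Assumed definition: ratio of agentic EpG to linear EpG."""
    return energy_per_successful_goal(agentic) / energy_per_successful_goal(linear)


# Made-up example data: the agentic workflow retries goal g1 once.
agentic_attempts = [
    Attempt("g1", 300e6, False),  # failed attempt still counts toward energy
    Attempt("g1", 500e6, True),
    Attempt("g2", 200e6, True),
]
linear_attempts = [
    Attempt("g1", 200e6, True),
    Attempt("g2", 300e6, True),
]

agentic_epg = energy_per_successful_goal(agentic_attempts)
linear_epg = energy_per_successful_goal(linear_attempts)
ooi = orchestration_overhead_index(agentic_attempts, linear_attempts)
print(f"agentic EpG = {agentic_epg} J, linear EpG = {linear_epg} J, OOI = {ooi}")
```

Charging failed attempts to the numerator while counting only successes in the denominator is what makes the metric sensitive to retries: the failed g1 attempt raises the agentic EpG even though it produced no completed goal.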