EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

Eliseo Curcio

arxiv: 2605.15230 · v1 · pith:CYBVPCPJnew · submitted 2026-05-13 · 💰 econ.EM

Pith reviewed 2026-05-19 17:38 UTC · model grok-4.3

classification 💰 econ.EM

keywords LLM agents · energy markets · datacenter siting · benchmark · electricity prices · carbon intensity · agent evaluation · live data

The pith

EnergyAgentBench tests LLM agents on live electricity market data to select optimal sites for AI datacenters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates EnergyAgentBench to measure how LLM agents perform when they must reason step by step over real-time electricity prices, carbon intensity, and technology cost trends. This evaluation matters because placing large AI facilities involves repeated decisions across changing market and grid conditions that static tests cannot capture. The benchmark organizes 70 tasks into five families that each demand between three and 48 calls to live data sources. Model runs reveal clear differences in capability, with some smaller models matching or exceeding larger ones on specific long-horizon problems.

Core claim

EnergyAgentBench supplies the first set of agent tasks grounded directly in live electricity market endpoints, EIA reports, and NREL data, with ground truth produced by XGBoost cost-surface models and the NREL Annual Technology Baseline 2024. Evaluation across nine models shows that performance varies sharply by task family, while overall scores remain high for several frontier systems.

What carries the argument

EnergyAgentBench, the benchmark of 70 task variants across datacenter siting, long-horizon portfolio planning, lifetime LCOE ranking, 30-year optimization, and causal grid diagnosis that each require sequential tool calls to live endpoints.
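
As a rough illustration of how the five families and their tool-call budgets might be represented, here is a minimal sketch. The names `TaskFamily`, `TaskVariant`, and `within_budget` are hypothetical and do not come from the released benchmark code; only the family labels and the 3-to-48 call range are taken from the abstract, and the example call count of 12 is made up.

```python
from dataclasses import dataclass
from enum import Enum


class TaskFamily(Enum):
    # Labels follow the abstract; the descriptions are paraphrased.
    F1 = "datacenter siting under cost-carbon trade-offs"
    F1_LH = "long-horizon portfolio siting"
    F2 = "lifetime LCOE ranking"
    F2_LH = "30-year portfolio optimization"
    F3 = "causal grid diagnosis"


# The abstract states that tasks require 3 to 48 sequential tool calls.
MIN_TOOL_CALLS = 3
MAX_TOOL_CALLS = 48


@dataclass(frozen=True)
class TaskVariant:
    # Hypothetical record; the real task schema is not described on this page.
    task_id: str
    family: TaskFamily
    expected_tool_calls: int

    def within_budget(self) -> bool:
        """True if the expected call count lies in the stated 3-48 range."""
        return MIN_TOOL_CALLS <= self.expected_tool_calls <= MAX_TOOL_CALLS


# Illustrative instance with a made-up call count.
example = TaskVariant("f3-demo", TaskFamily.F3, expected_tool_calls=12)
print(example.family.value, example.within_budget())
```

A frozen dataclass is used here only so that a task definition cannot be mutated mid-run; the actual benchmark may organize its tasks quite differently.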

If this is right

Different model families exhibit distinct strengths, with some leading on short-term cost-carbon trade-offs while others excel on extended planning horizons.
Causal diagnosis tasks produce the widest performance gaps among models and therefore serve as the strongest filter for agent capability.
The most frequent errors involve failure to integrate missing values from technology trajectory data, followed by premature commitment on causal problems.
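
To make the dominant failure mode more concrete, the sketch below contrasts two ways of averaging a cost trajectory that contains missing years. This is one plausible reading of "null-value integration", not a description of the paper's coded failures. The trajectory values are made up and are not NREL ATB data, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def mean_treating_null_as_zero(values: Sequence[Optional[float]]) -> float:
    """Naive handling: a missing year is counted as a cost of zero."""
    return sum(v if v is not None else 0.0 for v in values) / len(values)


def interpolate_nulls(values: Sequence[Optional[float]]) -> list[float]:
    """Fill interior gaps linearly; copy the nearest known value at the edges."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("trajectory has no known values")
    filled: list[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(values[right])
        elif right is None:
            filled.append(values[left])
        else:
            frac = (i - left) / (right - left)
            filled.append(values[left] + frac * (values[right] - values[left]))
    return filled


# Made-up trajectory (cost per unit by year) with two missing entries.
trajectory = [100.0, None, 80.0, None, 60.0]
naive = mean_treating_null_as_zero(trajectory)
filled = interpolate_nulls(trajectory)
careful = sum(filled) / len(filled)
print(naive, filled, careful)
```

In this toy case the naive average is well below the interpolated one, which is the kind of silent bias that could flip a cost ranking. Linear interpolation is only one possible repair; skipping the missing years or flagging the gap to the user are equally defensible choices.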

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same live-data agent evaluation approach could be applied to other infrastructure domains such as water allocation or transportation network design.
Emphasis on handling incomplete trajectory data during training might raise scores on the most discriminating task families.
Organizations choosing models for energy-related automation may achieve better results by matching model size to the dominant failure modes rather than defaulting to the largest available system.

Load-bearing premise

Ground truth labels produced by the trained XGBoost cost-surface models and the NREL Annual Technology Baseline accurately represent real-world cost-carbon trade-offs and causal relationships in the electricity grid.

What would settle it

If independent expert review or actual historical placement outcomes in the tested regions show that the benchmark ground truth consistently selects inferior sites on a substantial fraction of the 70 tasks, the reported model rankings would lose their meaning.

Figures

Figures reproduced from arXiv: 2605.15230 by Eliseo Curcio.

**Figure 2.** Score heatmap by model and task family. Cell color = mean composite score on diverging …

**Figure 3.** Base versus adversarial performance by model. Blue bars = mean composite on non …

**Figure 4.** Failure mode distribution by model. Each bar = one model; segments = failure category counts

Original abstract

Selecting the right electricity market region for a hyperscale AI datacenter requires reasoning across live electricity prices, grid carbon intensity, technology cost trajectories, and causal grid dynamics -- a multi-step, multi-source analytical task that static knowledge benchmarks cannot evaluate. We introduce EnergyAgentBench, the first agentic benchmark grounded in live electricity market data for this problem class. The benchmark comprises 70 task variants across five families: datacenter siting under cost-carbon trade-offs (F1), long-horizon portfolio siting (F1-LH), lifetime LCOE ranking over multi-decade cost trajectories (F2), 30-year portfolio optimization (F2-LH), and causal grid diagnosis (F3). Tasks require 3 to 48 sequential tool calls against live endpoints from the QuarluxAI infrastructure platform, the U.S. Energy Information Administration (EIA), and the National Renewable Energy Laboratory (NREL) with ground truth derived from trained XGBoost cost-surface models (R^2 0.967--0.995) and the NREL Annual Technology Baseline 2024. We evaluate nine models across Anthropic, OpenAI, and HuggingFace over 1,414 runs at three random seeds. Claude Sonnet 4.6 achieves the highest overall score (0.900) at one-quarter the cost of Claude Opus 4.7 (0.889). Claude Haiku 4.5 leads on long-horizon procedural siting (0.986), outperforming all frontier models including those costing 16x more per run. F3 Causal is the most discriminating family, with a 30.7-point spread between Sonnet (0.793) and Llama 3.3 70B (0.486), versus a 6.6-point spread on F1 Siting. A failure taxonomy of 135 coded failures identifies null-value integration in NREL ATB trajectories as the dominant failure mode (70%), followed by premature commitment on causal tasks (20%) and adversarial injection blindness (6%). Benchmark code, run trajectories, and the failure taxonomy dataset are publicly released.
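
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the failure counts are approximations implied by the rounded percentages, not counts reported by the paper.

```python
# Figures quoted in the abstract.
sonnet_f3 = 0.793
llama_f3 = 0.486
total_failures = 135
share_percent = {
    "null-value integration": 70,
    "premature commitment": 20,
    "adversarial injection blindness": 6,
}

# Spread between the two F3 scores, expressed in points on a 0-100 scale.
f3_spread_points = round((sonnet_f3 - llama_f3) * 100, 1)

# Approximate failure counts implied by the rounded percentages.
approx_counts = {
    name: round(total_failures * pct / 100) for name, pct in share_percent.items()
}

# Share of failures not covered by the three named categories.
unlisted_percent = 100 - sum(share_percent.values())

print(f3_spread_points)
print(approx_counts)
print(unlisted_percent)
```

The spread matches the 30.7 points stated in the abstract, and the three named categories leave a small remainder of failures that the abstract does not itemize.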

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EnergyAgentBench adds a live-data agent benchmark for energy tasks with public runs and a failure taxonomy, but the causal diagnosis ground truth relies on predictive XGBoost surfaces rather than explicit causal structures.

The paper's core contribution is EnergyAgentBench: 70 task variants across cost-carbon siting, long-horizon planning, LCOE ranking, portfolio optimization, and causal grid diagnosis, all pulling from live QuarluxAI, EIA, and NREL endpoints. It evaluates nine models over 1,414 runs, releases code and trajectories, and codes 135 failures with null-value handling as the top issue. That scale and openness is the useful part; it gives a concrete way to compare agent performance on multi-step tool use against real infrastructure data rather than static questions. Claude Sonnet 4.6 tops the overall score while Haiku 4.5 does well on long-horizon procedural tasks at lower cost, and the spread on F3 tasks is larger than on simpler siting ones. The evaluation setup looks reproducible on the numbers given.

The main soft spot is the F3 causal diagnosis family. Ground truth there comes from trained XGBoost cost-surface models plus NREL baselines, which deliver high predictive R-squared but are not built from causal discovery, structural equations, or do-calculus. If the labels for diagnosis tasks are outputs of the same correlational surfaces used elsewhere, then agent scores on F3 may reflect pattern matching to the model rather than reasoning about how constraints or outages actually propagate. The abstract does not describe any separate causal labeling step, so this assumption needs checking in the full text.

This work is aimed at people building or testing LLM agents for applied infrastructure problems. Readers who want benchmarks with live endpoints, failure breakdowns, and public artifacts will find it worth reading. It is coherent enough and brings enough new artifacts that it deserves a serious referee rather than a desk reject, mainly to press on the causal ground-truth construction and task representativeness.

Referee Report

2 major / 2 minor

Summary. The paper introduces EnergyAgentBench, the first agentic benchmark for LLM agents on live electricity market data for hyperscale datacenter siting and related tasks. It defines 70 task variants across five families (F1 datacenter siting under cost-carbon trade-offs, F1-LH long-horizon portfolio siting, F2 lifetime LCOE ranking, F2-LH 30-year portfolio optimization, and F3 causal grid diagnosis). Tasks require 3–48 sequential tool calls against live endpoints from QuarluxAI, EIA, and NREL, with ground truth from XGBoost cost-surface models (R² 0.967–0.995) plus NREL ATB 2024 baselines. Nine models are evaluated over 1,414 runs at three seeds; Claude Sonnet 4.6 scores highest overall (0.900) at one-quarter the cost of Claude Opus 4.7 (0.889), while Claude Haiku 4.5 leads on long-horizon tasks. F3 shows the largest performance spread (30.7 points). A failure taxonomy from 135 coded failures identifies null-value integration as the dominant mode (70%). Code, trajectories, and taxonomy are publicly released.

Significance. If the ground truth labels validly capture real-world dynamics, the benchmark offers a useful contribution to evaluating multi-step agentic reasoning on live energy data, with clear strengths in scale (1,414 runs), reproducibility via public code release, and the detailed failure taxonomy. The finding that F3 is far more discriminating than F1 could help prioritize agent improvements for causal-style tasks in energy economics. The public artifacts directly support follow-on work.

major comments (2)

[Abstract / F3 family description] The claim that F3 evaluates causal grid diagnosis rests on ground truth derived from trained XGBoost cost-surface models (R² 0.967–0.995) and NREL ATB 2024. These are supervised predictive models; high predictive accuracy does not establish that they encode directed causal dependencies (e.g., how a transmission constraint or generator outage propagates). No mention of causal discovery, do-calculus, or structural equations appears in the ground-truth construction. If F3 labels are outputs of the same correlational surfaces used for F1/F2, then F3 performance measures pattern-matching to the model rather than causal reasoning over live data. This directly affects the central claim that F3 is the most discriminating family.
[Task construction paragraph] The 70 task variants are stated to be grounded in live endpoints, yet the manuscript provides no explicit validation that the XGBoost surfaces and NREL baselines reproduce observed causal grid relationships (e.g., via intervention tests or comparison to known outage events). Without such checks, the benchmark's suitability for diagnosing agent causal reasoning remains open.
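
The general point behind both major comments, that a high R² does not by itself certify a causal relationship, can be shown with a toy simulation. The sketch below is synthetic and has nothing to do with the paper's XGBoost surfaces or with real grid data; the variable names and the hidden-driver setup are assumptions chosen purely for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


rng = random.Random(0)

# Synthetic world: a hidden driver z moves both x and y; x has NO effect on y.
z = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x = [zi + rng.gauss(0.0, 0.1) for zi in z]
y = [2.0 * zi + rng.gauss(0.0, 0.1) for zi in z]

_, slope_observed, r2_observed = fit_line(x, y)

# Intervention: set x independently of z; y is generated exactly as before.
x_forced = [rng.gauss(0.0, 1.0) for _ in z]
_, slope_intervened, _ = fit_line(x_forced, y)

print(f"observational fit: slope={slope_observed:.2f}, R^2={r2_observed:.3f}")
print(f"under intervention: slope={slope_intervened:.2f}")
```

The observational fit predicts y from x very well, yet forcing x to new values leaves y unchanged, so the fitted slope says nothing about what an intervention would do. Whether the benchmark's ground truth has this problem is exactly what the requested intervention tests would establish.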

minor comments (2)

[Results section] The 30.7-point spread on F3 versus 6.6-point spread on F1 is highlighted, but a single consolidated table reporting all nine models across all five families would improve readability and allow direct comparison.
[Failure taxonomy] The 70%/20%/6% breakdown of failure modes is useful, but the taxonomy would benefit from one or two concrete trajectory examples for the dominant null-value integration failure to illustrate the precise agent error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the causal framing of the F3 family. We agree that the ground-truth construction relies on supervised predictive models rather than explicit causal structures, and we have revised the manuscript to clarify this distinction, update task descriptions, and add a limitations discussion. These changes preserve the empirical results while avoiding overstatement. We respond to each major comment below.

Referee: [Abstract / F3 family description] The claim that F3 evaluates causal grid diagnosis rests on ground truth derived from trained XGBoost cost-surface models (R² 0.967–0.995) and NREL ATB 2024. These are supervised predictive models; high predictive accuracy does not establish that they encode directed causal dependencies (e.g., how a transmission constraint or generator outage propagates). No mention of causal discovery, do-calculus, or structural equations appears in the ground-truth construction. If F3 labels are outputs of the same correlational surfaces used for F1/F2, then F3 performance measures pattern-matching to the model rather than causal reasoning over live data. This directly affects the central claim that F3 is the most discriminating family.

Authors: We agree that the XGBoost surfaces are supervised predictive models trained on historical data and that we do not employ causal discovery, do-calculus, or structural causal models in their construction. The F3 tasks require agents to integrate live multi-source data to produce outputs that align with these high-accuracy empirical surfaces; the larger performance spread on F3 appears to reflect the demands of sequential live-data reasoning rather than pure correlational pattern matching. To address the concern, we have revised the abstract, F3 description, and related sections to refer to 'empirical grid diagnosis' tasks grounded in predictive cost surfaces, while retaining the reported performance differences as an empirical observation.
revision: yes

Referee: [Task construction paragraph] The 70 task variants are stated to be grounded in live endpoints, yet the manuscript provides no explicit validation that the XGBoost surfaces and NREL baselines reproduce observed causal grid relationships (e.g., via intervention tests or comparison to known outage events). Without such checks, the benchmark's suitability for diagnosing agent causal reasoning remains open.

Authors: The current manuscript does not include intervention tests or direct comparisons against specific historical outage events. The XGBoost models were validated on held-out real-world data with high R², and NREL ATB 2024 serves as an established industry reference. We have added text in the task construction section and a new limitations subsection that explicitly notes the correlational basis of the ground truth and clarifies that the benchmark evaluates agent performance against these empirical models rather than testing strict causal inference. This revision addresses the referee's point without requiring new primary data collection.
revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external live data and pre-trained models

The paper presents EnergyAgentBench as an evaluation framework for LLM agents using live endpoints from QuarluxAI, EIA, and NREL, with ground truth labels generated from separately trained XGBoost models (R^2 0.967-0.995) and NREL ATB 2024 baselines. No equations, derivations, or self-citations are shown that reduce any result to its own inputs by construction. The central claims concern benchmark design and model performance rankings, which rest on external data sources rather than fitted parameters renamed as predictions or load-bearing self-references. This is a standard benchmark introduction paper whose evaluation chain is independent of the target results.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The benchmark rests on live external data feeds and pre-trained models for ground truth rather than new derivations; no new physical entities or ad-hoc constants are introduced beyond standard data access assumptions.

free parameters (1)

XGBoost cost-surface models
Ground truth for tasks derived from trained models with reported R^2 0.967-0.995; these are fitted to data.

axioms (2)

domain assumption: Live data from the QuarluxAI infrastructure platform, EIA, and NREL endpoints is accurate, up-to-date, and accessible via tool calls.
All tasks require 3 to 48 sequential tool calls against these endpoints.
domain assumption: NREL Annual Technology Baseline 2024 provides valid multi-decade cost trajectories.
Used for lifetime LCOE ranking and 30-year portfolio tasks.
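
For readers unfamiliar with the quantity being ranked, the sketch below computes a levelized cost of energy (LCOE) in its common textbook form: discounted lifetime cost divided by discounted lifetime energy. The paper's exact formulation is not given on this page, so treat this as an assumption. The function name `lcoe`, the technology labels, and all numbers are made up for illustration.

```python
from typing import Sequence


def lcoe(costs: Sequence[float], energy_mwh: Sequence[float], discount_rate: float) -> float:
    """Discounted lifetime cost divided by discounted lifetime energy.

    costs[t] and energy_mwh[t] are totals for year t, with t = 0 as the build year.
    """
    if len(costs) != len(energy_mwh):
        raise ValueError("costs and energy_mwh must cover the same years")
    disc_cost = sum(c / (1.0 + discount_rate) ** t for t, c in enumerate(costs))
    disc_energy = sum(e / (1.0 + discount_rate) ** t for t, e in enumerate(energy_mwh))
    if disc_energy <= 0:
        raise ValueError("discounted energy must be positive")
    return disc_cost / disc_energy


# Made-up numbers for two hypothetical technologies over a 3-year toy horizon.
options = {
    "tech_a": ([1000.0, 50.0, 50.0], [0.0, 20.0, 20.0]),
    "tech_b": ([600.0, 200.0, 200.0], [0.0, 20.0, 20.0]),
}

# Rank from cheapest to most expensive at a zero discount rate.
ranking = sorted(options, key=lambda name: lcoe(*options[name], discount_rate=0.0))
print(ranking)
```

A multi-decade ranking of the kind described would feed year-by-year cost trajectories into the `costs` argument, which is where missing values in the trajectory data would need to be handled before the sums are taken.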

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

We introduce EnergyAgentBench, the first agentic benchmark grounded in live electricity market data... 70 task variants across five families: datacenter siting... lifetime LCOE ranking... causal grid diagnosis

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Ground truth derived from trained XGBoost cost-surface models (R^2 0.967--0.995) and the NREL Annual Technology Baseline 2024

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Energy and AI

International Energy Agency. Energy and AI. IEA, Paris, 2025. Available: https://www.iea.org/reports/energy-and-ai

work page 2025

[2] [2]

Energy and AI: Executive Summary

International Energy Agency. Energy and AI: Executive Summary. IEA, Paris, 2025. Available: https://www.iea.org/reports/energy-and-ai/executive-summary

work page 2025

[3] [3]

Data Centre Electricity Use Surged in 2025

International Energy Agency. Data Centre Electricity Use Surged in 2025. IEA News, April 2026. Available: https://www.iea.org/news/data-centre-electricity-use-surged-in-2025

work page 2025

[4] [4]

How Much Electricity Does a Data Center Use? Complete 2025 Analysis

IAEI Magazine. How Much Electricity Does a Data Center Use? Complete 2025 Analysis. IAEI Magazine, 2025. Available: https://iaeimagazine.org/electrical -fundamentals/how-much- electricity-does-a-data-center-use-complete-2025-analysis

work page 2025

[5] [5]

Curcio, Curcio, Eliseo, Risk-Aware AI-Driven Design Optimization of Grid-Connected Hydrogen Systems Under Stochastic Operating Conditions (March 23, 2026)

E. Curcio, Curcio, Eliseo, Risk-Aware AI-Driven Design Optimization of Grid-Connected Hydrogen Systems Under Stochastic Operating Conditions (March 23, 2026). Available at SSRN: https://ssrn.com/abstract=6560319

work page 2026

[6] [6]

Curcio, Benchmarking Reasoning Reliability in Artificial Intelligence Models for Energy-System Analysis (October 10, 2025)

E. Curcio, Benchmarking Reasoning Reliability in Artificial Intelligence Models for Energy-System Analysis (October 10, 2025). Available at SSRN: https://ssrn.com/abstract=5608973 or http://dx.doi.org/10.2139/ssrn.5608973

work page doi:10.2139/ssrn.5608973 2025

[7] [7]

GAIA: a benchmark for General AI Assistants

G. Mialon, C. Fourrier, C. Swift, et al., "GAIA: A Benchmark for General AI Assistants," arXiv preprint arXiv:2311.12983, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

AgentBench: Evaluating LLMs as Agents

X. Liu, H. Yu, H. Zhang, et al., "AgentBench: Evaluating LLMs as Agents," arXiv preprint arXiv:2308.03688, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Beyond Binary Correctness: Scaling Evaluation of Long -Horizon Agents on Subjective Enterprise Tasks,

W. Pipatsakulroj et al., "Beyond Binary Correctness: Scaling Evaluation of Long -Horizon Agents on Subjective Enterprise Tasks," arXiv preprint arXiv:2603.22744, 2026

work page arXiv 2026

[10] [10]

2024 Annual Technology Baseline (ATB) Cost and Performance Data for Electricity Generation Technologies,

B. Mirletz, L. Vimmerstedt, G. Avery, et al., "2024 Annual Technology Baseline (ATB) Cost and Performance Data for Electricity Generation Technologies," National Renewable Energy Laboratory, 2024. doi: 10.25984/2377191

work page doi:10.25984/2377191 2024

[11] [11]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, et al., "Measuring Massive Multitask Language Understanding," arXiv preprint arXiv:2009.03300, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2009

[12] [12]

Holistic Evaluation of Language Models

P. Liang, R. Bommasani, T. Lee, et al., "Holistic Evaluation of Language Models," arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

A. Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models," arXiv preprint arXiv:2206.04615, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

SciEval: A Multi -Level Large Language Model Evaluation Benchmark for Scientific Research,

L. Sun et al., "SciEval: A Multi -Level Large Language Model Evaluation Benchmark for Scientific Research," in Proc. AAAI Conf. Artificial Intelligence, 2024

work page 2024

[15] N. Webersinke, M. Kraus, J. Bingler, and M. Leippold, "ClimateNLP: Analyzing Current Discourse of Climate Change using Natural Language Processing," arXiv preprint arXiv:2209.11333, 2022

[16] C. Jimenez, J. Yang, A. Wettig, et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770, 2024

[17] S. Zhou, F. Xu, H. Zhu, et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents," arXiv preprint arXiv:2307.13854, 2024

[18] AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts. arXiv preprint arXiv:2601.11044, 2025

[19] OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows. arXiv preprint arXiv:2508.09124, 2025

[20] YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution. arXiv preprint arXiv:2604.01212, 2026

[21] AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications. arXiv preprint arXiv:2602.22769, 2026

[22] T. Hornek, A. Sartipi, I. Tchappi, and G. Fridgen, "Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting," arXiv preprint arXiv:2506.08113, 2025

[23] S. C. Gupta, "An Optimized Machine Learning Approach for Electricity Price Prediction in Cloud Data Centers," International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2025. doi: 10.22214/ijraset.2025.74382

[24] "Evaluation of Electrical Load Demand Forecasting Using Various Machine Learning Algorithms," Frontiers in Energy Research, vol. 12, 2024. doi: 10.3389/fenrg.2024.1408119

[25] J. Demšar, "Statistical Comparisons of Classifiers over Multiple Data Sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006. Available: https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf

[26] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD), pp. 785–794, 2016. doi: 10.1145/2939672.2939785

[27] U.S. Energy Information Administration, "EIA Open Data API v2," U.S. Department of Energy, 2022. Available: https://www.eia.gov/opendata/

[28] C. Spearman, "The Proof and Measurement of Association between Two Things," American Journal of Psychology, vol. 15, pp. 72–101, 1904

[29] E. Perez, S. Huang, F. Song, et al., "Red Teaming Language Models with Language Models," arXiv preprint arXiv:2202.03286, 2022

[30] Anthropic, "The Claude 4 Model Family," Anthropic Technical Documentation, 2025. Available: https://www.anthropic.com/claude

[31] OpenAI, "GPT-5 Technical Report," OpenAI, 2025. Available: https://openai.com

[32] A. Dubey, A. Jauhri, A. Pandey, et al. (Meta AI), "The Llama 3 Herd of Models," arXiv preprint arXiv:2407.21783, 2024. doi: 10.48550/arXiv.2407.21783

[33] T. Wolf, L. Debut, V. Sanh, et al., "Transformers: State-of-the-Art Natural Language Processing," in Proc. 2020 Conf. Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 38–45, 2020. arXiv:1910.03771

[34] Qwen Team (Alibaba), "Qwen2.5 Technical Report," arXiv preprint arXiv:2412.15115, 2024. doi: 10.48550/arXiv.2412.15115

[35] DeepSeek AI, "DeepSeek-V3 Technical Report," arXiv preprint arXiv:2412.19437, 2024. doi: 10.48550/arXiv.2412.19437

[36] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. New York: Chapman & Hall/CRC. doi: 10.1201/9780429246593