Sustainability via LLM Right-sizing

Finn Klessascheck; Jan Mendling; Jennifer Haase; Sebastian Pokutta

arXiv: 2504.13217 · v3 · pith:EKKFTHQ7new · submitted 2025-04-17 · cs.CL · cs.AI

Pith reviewed 2026-05-22 19:19 UTC · model grok-4.3

keywords: LLM evaluation · model right-sizing · sustainability · task performance · cost efficiency · open-weight models · dual evaluation

The pith

Smaller LLMs like Gemma-3 and Phi-4 match larger models on most workplace tasks while using far less energy and cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests eleven large language models on ten common occupational tasks such as summarizing documents, creating schedules, and writing emails and proposals. It uses an automated dual-LLM evaluator to score outputs on quality, factual accuracy, and ethical responsibility. Results show GPT-4o leads in performance but at much higher resource use, while smaller models often produce reliable enough results for practical purposes. Task type matters, with conceptual work proving harder than simple aggregation or transformation steps. The authors conclude that organizations should assess models for sufficiency in their specific context rather than defaulting to maximum capability.

Core claim

An evaluation of eleven proprietary and open-weight LLMs across ten everyday work tasks, scored by a dual-LLM evaluator, finds that compact models such as Gemma-3 and Phi-4 deliver strong and consistent results on most tasks. This supports their use where cost, local deployment, or data control are priorities, while larger models like GPT-4o offer superior but more expensive performance. A cluster analysis further groups the models into premium all-rounders, competent generalists, and limited but safe performers, and task category strongly affects outcomes.

What carries the argument

Dual-LLM-based evaluation framework that automates task execution and applies standardized scoring across ten criteria for output quality, factual accuracy, and ethical responsibility.
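
One way to picture this framework is the sketch below. It assumes that "dual-LLM" means one model executes each task and a second model scores the output. The criterion names, the 1-to-5 scale, the model labels, and both stub functions are hypothetical stand-ins, not the authors' prompts or code.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical criterion names; the paper's ten criteria are not listed in this summary.
CRITERIA: List[str] = ["clarity", "factual_accuracy", "ethical_responsibility"]


def execute_task(model: str, task_prompt: str) -> str:
    """Stub for the task-execution LLM; a real version would call a model API."""
    return f"[{model}] draft answer to: {task_prompt}"


def evaluate_output(task_prompt: str, output: str) -> Dict[str, float]:
    """Stub for the evaluator LLM; returns fixed scores on an assumed 1-5 scale."""
    return {"clarity": 4.0, "factual_accuracy": 5.0, "ethical_responsibility": 3.0}


def run_pipeline(models: List[str], tasks: List[str]) -> Dict[str, float]:
    """Run every model on every task and return each model's mean score."""
    summary: Dict[str, float] = {}
    for model in models:
        task_means = []
        for task in tasks:
            output = execute_task(model, task)
            scores = evaluate_output(task, output)
            task_means.append(mean(scores[c] for c in CRITERIA))
        summary[model] = mean(task_means)
    return summary


summary = run_pipeline(
    ["small-model", "large-model"],
    ["Summarize the memo.", "Draft an email."],
)
print(summary)
```

Because the evaluator stub returns fixed scores, both models end up with the same mean here; swapping in real model calls is what would make the comparison informative.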

If this is right

Organizations can reduce energy consumption and operating costs by selecting smaller models for routine tasks without major quality loss.
Local deployment of compact models improves data sovereignty and privacy for sensitive workflows.
Task category should guide model choice, since conceptual tasks expose weaknesses that aggregation tasks do not.
Evaluation shifts from pure performance maximization to context-specific sufficiency checks that reflect real organizational constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be applied to domain-specific tasks such as legal drafting or medical summarization to test whether the same size-performance trade-offs hold.
Hybrid systems that route easy tasks to small models and hard tasks to large ones become more attractive once sufficiency thresholds are quantified.
Wider use of this evaluation style would create demand for even more efficient small models tailored to common workplace patterns.
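
The routing idea in the list above could be sketched roughly as follows: pick the cheapest model whose measured score for a task category clears a sufficiency threshold. The model names, relative costs, scores, and thresholds are invented for illustration and do not come from the paper.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical entries: (model name, relative cost, mean score per task category).
MODELS: List[Tuple[str, float, Dict[str, float]]] = [
    ("small-local", 1.0, {"aggregation": 4.2, "conceptual": 3.1}),
    ("mid-size", 3.0, {"aggregation": 4.4, "conceptual": 3.8}),
    ("large-hosted", 10.0, {"aggregation": 4.7, "conceptual": 4.5}),
]


def pick_model(category: str, threshold: float) -> Optional[str]:
    """Return the cheapest model whose score meets the threshold, or None."""
    candidates = [
        (cost, name)
        for name, cost, scores in MODELS
        if scores.get(category, 0.0) >= threshold
    ]
    return min(candidates)[1] if candidates else None


print(pick_model("aggregation", 4.0))  # small-local
print(pick_model("conceptual", 4.0))   # large-hosted
print(pick_model("conceptual", 4.9))   # None
```

Returning `None` when no model is sufficient keeps the "escalate to a human or decline" decision outside the router, which is one possible design choice among several.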

Load-bearing premise

The automated dual-LLM evaluator gives unbiased and accurate scores for quality, accuracy, and ethics that match what human judges would conclude.

What would settle it

A direct human rating study on the same ten tasks and ten criteria that produces substantially different performance rankings or sufficiency thresholds for the smaller models compared with the automated results.
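
A comparison of this kind is often summarized with a rank correlation between human and automated scores. The sketch below computes Spearman's correlation in plain Python; the choice of Spearman and the score values are assumptions for illustration, not results or methods from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j share the mean 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up mean scores for five models, listed in the same model order.
automated = [4.6, 4.1, 3.9, 3.2, 2.8]
human = [4.4, 3.7, 4.0, 3.0, 2.9]
rho = spearman(automated, human)
print(round(rho, 2))
```

In this toy data only two adjacent models swap places between the two rankings, so the correlation stays high; a much lower value on real data would be the kind of evidence that challenges the automated scores.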

Figures

Figures reproduced from arXiv: 2504.13217 by Finn Klessascheck, Jan Mendling, Jennifer Haase, Sebastian Pokutta.

**Figure 2.** LLM Performance vs. Sum of Input + Output Cost (by Marker Shape)

Original abstract

Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Smaller models look viable for routine office tasks in this comparison, but the dual-LLM scoring method has no reported validation against humans or other checks.

Hi, the main point is that this paper finds smaller models like Gemma-3 and Phi-4 producing usable output on most of the ten occupational tasks while using less energy and keeping data local, in contrast to GPT-4o which scores higher but costs more. They group the eleven models into three clusters and note that task type matters, with conceptual work proving harder than aggregation or transformation steps. That framing around sufficiency instead of peak performance is the practical angle they push for organizations.

What they do well is lay out a direct side-by-side on real-sounding tasks such as summarizing, scheduling, and drafting, then tie the results to sustainability and deployment choices. The cluster view makes the trade-offs easy to see without drowning in single-model tables.

The soft spot is the evaluation. All quality, accuracy, and ethics scores come from their dual-LLM judge, yet the abstract gives no human correlation, inter-rater numbers, or bias checks. LLM judges can favor familiar styles and miss quiet errors, so the claim that the smaller models are strong and reliable sits on an untested proxy. Prompt details and any statistical tests are also missing, which leaves the patterns harder to verify.

This is aimed at applied teams picking models under cost or privacy constraints rather than benchmark researchers. A reader who needs concrete task-level comparisons for internal use would find it worth a look. The work shows straightforward empirical thinking and engages the right questions, so it deserves referee time even if the methods need tightening.

Recommendation: send it for review and ask for evaluator calibration and more method transparency.

Referee Report

3 major / 2 minor

Summary. The paper empirically evaluates eleven proprietary and open-weight LLMs across ten occupational tasks (e.g., text summarization, schedule generation, email drafting) using a dual-LLM evaluation framework to score outputs on ten criteria covering quality, factual accuracy, and ethical responsibility. It reports that GPT-4o achieves the highest performance but with greater cost and environmental impact, while smaller models such as Gemma-3 and Phi-4 deliver strong and reliable results on most tasks. A cluster analysis identifies three model groups (premium all-rounders, competent generalists, limited but safe performers), task type modulates effectiveness, and the work advocates shifting from performance-maximizing benchmarks to task- and context-aware sufficiency assessments for sustainable deployment.

Significance. If the evaluation framework holds, the study supplies actionable, organization-relevant evidence on right-sizing LLMs to balance performance against energy, financial, and sovereignty costs. The broad model coverage and task set, together with the proposed scalable dual-LLM method, could help move the field from leaderboard-style maximization toward practical sufficiency criteria. The purely empirical design avoids circularity and supplies falsifiable, reproducible comparisons across models and tasks.

major comments (3)

[Abstract / Evaluation Framework] The headline finding that Gemma-3 and Phi-4 produce 'strong and reliable results on most tasks' rests entirely on the dual-LLM judge; the manuscript provides no human correlation, inter-rater reliability statistics, ablation against alternative judges, or calibration for known LLM-judge biases (stylistic favoritism, under-detection of subtle factual errors). This is load-bearing for all performance, cluster, and sufficiency claims.
[Methods / Evaluation Framework] No details are reported on prompt engineering for either the task-execution or the evaluation LLMs, statistical significance tests across models/tasks, or handling of evaluator bias. These omissions prevent verification of the central empirical patterns cited in the abstract.
[Results / Cluster Analysis] The cluster analysis that yields the three model groups and the claim that 'task type influenced model effectiveness' would require explicit description of the clustering algorithm, feature set, and robustness checks; without them the trade-off conclusions remain under-supported.

minor comments (2)

[Abstract] The abstract states GPT-4o has a 'significantly higher cost and environmental footprint' but supplies no quantitative deltas; adding concrete metrics (e.g., tokens, energy estimates) would make the sustainability comparison more precise.
[Throughout] Notation for the ten evaluation criteria and the three cluster labels should be defined once and used consistently to improve readability.
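
A quantitative delta of the kind requested in the first minor comment could be as simple as a per-request cost computed from token counts. The sketch below sums input and output cost, following the axis label of Figure 2; the prices and token counts are placeholders, not figures from the paper or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Sum of input and output cost for one request."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Placeholder prices chosen only to make the arithmetic easy to follow.
large = Pricing(input_per_million=5.0, output_per_million=15.0)
small = Pricing(input_per_million=0.5, output_per_million=1.5)

large_cost = request_cost(large, input_tokens=2_000, output_tokens=500)
small_cost = request_cost(small, input_tokens=2_000, output_tokens=500)
print(large_cost, small_cost, large_cost / small_cost)
```

Reporting the ratio alongside the absolute costs makes the comparison independent of the assumed request size.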

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve the manuscript.

Referee: [Abstract / Evaluation Framework] The headline finding that Gemma-3 and Phi-4 produce 'strong and reliable results on most tasks' rests entirely on the dual-LLM judge; the manuscript provides no human correlation, inter-rater reliability statistics, ablation against alternative judges, or calibration for known LLM-judge biases (stylistic favoritism, under-detection of subtle factual errors). This is load-bearing for all performance, cluster, and sufficiency claims.

Authors: We agree that the absence of human validation for the dual-LLM judge is a significant limitation that affects the strength of the performance and sufficiency claims. The manuscript describes the dual-LLM framework but does not report human correlation, reliability statistics, or bias calibration. In the revised version we will add a dedicated subsection that includes correlation results between the LLM judge and human raters on a held-out sample of outputs, inter-rater agreement metrics, and an explicit discussion of known LLM-judge biases with any mitigation steps taken. revision: yes
Referee: [Methods / Evaluation Framework] No details are reported on prompt engineering for either the task-execution or the evaluation LLMs, statistical significance tests across models/tasks, or handling of evaluator bias. These omissions prevent verification of the central empirical patterns cited in the abstract.

Authors: We acknowledge that the current Methods section is insufficiently detailed on these points. Although the prompts are referenced in the appendix, we will expand the main text to include the complete prompt templates for both task execution and evaluation, describe the statistical procedures used (including the specific tests, multiple-comparison corrections, and effect-size reporting), and explain the steps taken to reduce evaluator bias such as fixed prompt wording and consistent judge model selection. revision: yes
Referee: [Results / Cluster Analysis] The cluster analysis that yields the three model groups and the claim that 'task type influenced model effectiveness' would require explicit description of the clustering algorithm, feature set, and robustness checks; without them the trade-off conclusions remain under-supported.

Authors: We agree that the cluster analysis requires more explicit documentation to support the reported groupings and task-type observations. The manuscript currently states the existence of three clusters and notes task-type effects but omits algorithmic details. In the revision we will specify the clustering algorithm and number of clusters chosen, list the exact feature set (normalized scores across the ten criteria), and add robustness checks including silhouette analysis and results from alternative clustering approaches. revision: yes
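
The silhouette analysis mentioned in the last response can be illustrated with a small plain-Python sketch. It scores how well a given cluster assignment separates a set of points; the two-dimensional vectors and both labelings are made up, whereas the rebuttal speaks of normalized scores across the ten criteria as features.

```python
from math import dist
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Made-up normalized scores for four models on two criteria.
points = [(0.9, 0.9), (0.8, 0.9), (0.2, 0.1), (0.1, 0.1)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_score(points, [0, 1, 0, 1])   # mixes distant points
print(round(good, 2), round(bad, 2))
```

Values near 1 indicate compact, well-separated clusters, while negative values suggest points sit closer to another cluster than to their own.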

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with direct results from model outputs and automated scoring

The paper performs an empirical evaluation of eleven LLMs on ten occupational tasks using a dual-LLM evaluation framework to score outputs on quality, accuracy, and ethical criteria. Results, including cluster analysis and sufficiency conclusions for models like Gemma-3 and Phi-4, derive directly from the observed task executions and automated scores rather than from any fitted parameters, self-referential predictions, mathematical derivations, or self-citation chains that reduce to inputs by construction. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the derivation chain. The study is self-contained against external benchmarks via direct measurement, qualifying for the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that automated LLM-as-judge evaluation serves as a valid proxy for human assessment of quality and ethics; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: LLM-generated outputs can be reliably and objectively scored for quality, factual accuracy, and ethical responsibility by another LLM using fixed criteria.
Invoked via the dual-LLM evaluation framework described in the abstract as the basis for all performance comparisons.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria... cluster analysis revealed three model groups

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost... smaller models like Gemma-3 and Phi-4 achieved strong and reliable results

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Situational Awareness. Tech. rep. situational-awareness.ai, 2024, p

work page 2024

[2] [2]

Why the Carbon Footprint of Generative Large Language Models Alone Will Not Help Us Assess Their Sustainability,

“Why the Carbon Footprint of Generative Large Language Models Alone Will Not Help Us Assess Their Sustainability,”Nature Machine Intelligence(7:2) 2025, pp. 164–165. Both, C., Hoover, B., Strobelt, H., Krotov, D., Weidele, D. K. I., Martino, M., and Dehmamy, N

work page 2025

[3] [3]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference,

“Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference,” in: Forty-First International Conference on Machine Learning,2024. Chkirbene, Z., Hamila, R., Gouissem, A., and Devrim, U

work page 2024

[4] [4]

Large Language Models (LLM) in Industry: A Survey of Applications, Challenges, and Trends,

“Large Language Models (LLM) in Industry: A Survey of Applications, Challenges, and Trends,” in:2024 IEEE 21st International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET),2024, pp. 229–

work page 2024

[5] [5]

TheAIGambit:LeveragingArtificialIntelligence to Combat Climate Change—Opportunities, Challenges, and Recommendations,

Cowls,J.,Tsamados,A.,Taddeo,M.,andFloridi,L. 2023.“TheAIGambit:LeveragingArtificialIntelligence to Combat Climate Change—Opportunities, Challenges, and Recommendations,”AI & SOCIETY(38:1) 2023, pp. 283–307. del Valle, J. I. and Lara, F

work page 2023

[6] [6]

AI-powered Recommender Systems and the Preservation of Personal Autonomy,

“AI-powered Recommender Systems and the Preservation of Personal Autonomy,”AI & SOCIETY(39:5) 2024, pp. 2479–2491. Ferreira, F., Bailey, K. G., and Ferraro, V

work page 2024

[7] [7]

Good-Enough Representations in Language Comprehen- sion,

“Good-Enough Representations in Language Comprehen- sion,”Current Directions in Psychological Science(11:1) 2002, pp. 11–15. Fioravante, R

work page 2002

[8] [8]

Beyond the Business Case for Responsible Artificial Intelligence: Strategic CSR in Light of Digital Washing and the Moral Human Argument,

“Beyond the Business Case for Responsible Artificial Intelligence: Strategic CSR in Light of Digital Washing and the Moral Human Argument,”Sustainability (16:3) 2024, p

work page 2024

[9] [9]

Does Using Multiple Computer Monitors for Office Tasks Affect User Experience?: A Systematic Review,

“Does Using Multiple Computer Monitors for Office Tasks Affect User Experience?: A Systematic Review,”Human Factors: The Journal of the Human Factors and Ergonomics Society(63:3) 2021, pp. 433–449. Grover, R., Vats, A., Moorman, N., Agrawal, A., and Gombolay, M

work page 2021

[10] [10]

Haase, J., Kremser, W., Leopold, H., Mendling, J., Onnasch, L., and Plattfaut, R

arXiv: 2502.14632[cs]. Haase, J., Kremser, W., Leopold, H., Mendling, J., Onnasch, L., and Plattfaut, R

work page arXiv

[11] [11]

Hoffmann, J. et al. 2022.Training Compute-Optimal Large Language Models

work page 2022

[12] [12]

Training Compute-Optimal Large Language Models

arXiv: 2203.15556[cs]. Hogan, M

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

When Service Quality Is Enhanced by Human–Artificial Intelligence Interaction: An Examination of Anthropomorphism, Responsiveness from the Perspectives of Employees and Customers,

“When Service Quality Is Enhanced by Human–Artificial Intelligence Interaction: An Examination of Anthropomorphism, Responsiveness from the Perspectives of Employees and Customers,”International Journal of Human–Computer Interaction(40:22) 2024, pp. 7546–7561. Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., Chen, B., Sun, R., Wang, Y., and Yang, Y

work page 2024

[14] [14]

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset,

“BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset,”Advances in Neural Information Processing Systems(36) 2023, pp. 24678–24704. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. 2020.Scaling Laws for Neural Language Models

[15]

Scaling Laws for Neural Language Models

arXiv: 2001.08361 [cs]. Kirchner-Krath, J., Morschheuser, B., Sicevic, N., Xi, N., Von Korflesch, H. F., and Hamari, J

[16]

“Challenges in the Adoption of Sustainability Information Systems: A Study on Green IS in Organizations,” International Journal of Information Management (77) 2024, p. 102754. Klessascheck, F., Weber, I., and Pufahl, L

[17]

“A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research,” Journal of Chiropractic Medicine (15:2) 2016, pp. 155–163. Kotlarsky, J., Oshri, I., and Sekulic, N

[18]

“Digital Sustainability in Information Systems Research: Conceptual Foundations and Future Directions,” Journal of the Association for Information Systems (24:4), pp. 936–952. Kurtić, E., Marques, A., Kurtz, M., and Alistarh, D. 2024. We Ran over Half a Million Evaluations on Quantized LLMs—Here’s What We Found

[19]

Lazar, S. and Manuali, L. 2024. Can LLMs Advance Democratic Values? 2024. arXiv: 2410.08418 [cs]. Leon, M

[20]

“The Escalating AI’s Energy Demands and the Imperative Need for Sustainable Solutions,” WSEAS TRANSACTIONS ON SYSTEMS (23) 2024, pp. 444–457. Li, B., Jiang, Y., Gadepally, V., and Tiwari, D. 2024a. “SPROUT: Green Generative AI with Carbon-Efficient LLM Inference,” arXiv preprint arXiv:2403.12900 (). Li, P., Yang, J., Islam, M. A., and Ren, S

[21]

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

arXiv: 2406.11939 [cs]. Long, L., Wang, R., Xiao, R., Zhao, J., Ding, X., Chen, G., and Wang, H. 2024. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

[22]

arXiv: 2406.15126 [cs]. Luccioni, A. S., Viguier, S., and Ligozat, A.-L

[23]

“Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model,” arXiv preprint arXiv:2211.02001 (). Nayak, S. P., Pasumarthi, S., Rajagopal, B., and Verma, A. K

[24]

“GDPR Compliant ChatGPT Playground,” in: 2024 International Conference on Emerging Technologies in Computer Science for Interdisciplinary Applications (ICETCS), 2024, pp. 1–6. Petrov, I., Dekoninck, J., Baltadzhiev, L., Drencheva, M., Minchev, K., Balunović, M., Jovanović, N., and Vechev, M

[25]

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad. 2025. arXiv: 2503.21934 [cs]. Purvis, B., Mao, Y., and Robinson, D

[26]

“Combining Human and Artificial Intelligence: Hybrid Problem-Solving in Organizations,” Academy of Management Review () 2024, amr.2021.0421. Samsi, S., Yuen, S., Sundar, M. V., Bates, N., Morrow, J., Elliot, J., et al

[27]

“From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference,” arXiv preprint arXiv:2310.03003 (). Sarker, S., Susarla, A., Gopal, R., and Thatcher, J. B

[28]

“A Survey of Sustainability in Large Language Models: Applications, Economics, and Challenges,” in: 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), 2025, pp. 00008–00014. Team, G. et al. 2025. Gemma 3 Technical Report

[29]

Gemma 3 Technical Report

arXiv: 2503.19786 [cs]. Tripathi, A. and Kumar, V

[30]

“Enterprise Green IT Strategy,” in: Harnessing Green IT, S. Murugesan and G. R. Gangadharan (eds.). 1st ed. Wiley, 2012, pp. 149–165. Vaccaro, M., Almaatouq, A., and Malone, T

[31]

“When Combinations of Humans and AI Are Useful: A Systematic Review and Meta-Analysis,” Nature Human Behaviour (8:12) 2024, pp. 2293–2303. Veit, D. J. and Thatcher, J. B

[32]

“Digitalization as a Problem or Solution? Charting the Path for Research on Sustainable Information Systems,” Journal of Business Economics (93:6-7) 2023, pp. 1231–

[33]

“A systematic review of Green AI,” WIREs Data Mining and Knowledge Discovery (13:4), e1507. Virk, Y., Devanbu, P., and Ahmed, T. 2024. Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores

[34]

arXiv: 2404.19318 [cs]. Watson, Boudreau, and Chen

[35]

Weber, I., Linka, H., Mertens, D., Muryshkin, T., Opgenoorth, H., and Langer, S. 2024. FhGenie: A Custom, Confidentiality-Preserving Chat AI for Corporate and Scientific Use. White, C. et al. 2024. LiveBench: A Challenging, Contamination-Free LLM Benchmark

[36]

“Sustainable AI: Environmental Implications, Challenges and Opportunities,” Proceedings of Machine Learning and Systems (4) 2022, pp. 795–813. Yao, Z., Wu, X., Li, C., Youn, S., and He, Y

[37]

“Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation,” Proceedings of the AAAI Conference on Artificial Intelligence (38:17) 2024, pp. 19377–19385. You, J
