arxiv: 2504.13217 · v3 · pith:EKKFTHQ7new · submitted 2025-04-17 · 💻 cs.CL · cs.AI

Sustainability via LLM Right-sizing

Pith reviewed 2026-05-22 19:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM evaluation · model right-sizing · sustainability · task performance · cost efficiency · open-weight models · dual evaluation

The pith

Smaller LLMs like Gemma-3 and Phi-4 match larger models on most workplace tasks while using far less energy and cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests eleven large language models on ten common occupational tasks such as summarizing documents, creating schedules, and writing emails and proposals. It uses an automated dual-LLM evaluator to score outputs on quality, factual accuracy, and ethical responsibility. Results show GPT-4o leads in performance but at much higher resource use, while smaller models often produce reliable enough results for practical purposes. Task type matters, with conceptual work proving harder than simple aggregation or transformation steps. The authors conclude that organizations should assess models for sufficiency in their specific context rather than defaulting to maximum capability.

Core claim

An evaluation of eleven proprietary and open-weight LLMs across ten everyday work tasks, scored by a dual-LLM evaluator, finds that compact models such as Gemma-3 and Phi-4 deliver strong and consistent results on most tasks. This supports their use where cost, local deployment, or data control are priorities, while larger models like GPT-4o offer superior but more expensive performance. A cluster analysis further groups the models into premium all-rounders, competent generalists, and limited but safe performers, and task category strongly affects outcomes.

What carries the argument

Dual-LLM-based evaluation framework that automates task execution and applies standardized scoring across ten criteria for output quality, factual accuracy, and ethical responsibility.
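
As a rough illustration of the shape such a framework could take, the sketch below separates a task-executing model from a scoring model and averages per-criterion scores. It is one plausible reading of "dual-LLM", not the authors' implementation. The callables `toy_executor` and `toy_judge`, the three criterion names, and the 1 to 5 scale are hypothetical stand-ins; the paper uses ten criteria, which this page does not enumerate.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical criterion names (assumption): the paper's ten criteria are not listed here.
CRITERIA: List[str] = ["quality", "factual_accuracy", "ethical_responsibility"]

Executor = Callable[[str], str]            # task prompt -> model output
Judge = Callable[[str, str, str], float]   # (task, output, criterion) -> score


def evaluate(task: str, executor: Executor, judge: Judge) -> Dict[str, float]:
    """Run one task through the executor, then score the output per criterion."""
    output = executor(task)
    scores = {c: judge(task, output, c) for c in CRITERIA}
    scores["mean"] = mean(scores[c] for c in CRITERIA)
    return scores


# Stand-in callables so the sketch runs without any API access.
def toy_executor(task: str) -> str:
    return f"Draft response for: {task}"


def toy_judge(task: str, output: str, criterion: str) -> float:
    # Fixed placeholder scores on an assumed 1-5 scale; a real judge would call an LLM.
    placeholder = {"quality": 4.0, "factual_accuracy": 3.0, "ethical_responsibility": 5.0}
    return placeholder[criterion]


result = evaluate("Summarize the attached meeting notes.", toy_executor, toy_judge)
print(result)
```

Keeping the executor and the judge behind plain callables means either side can be swapped (a different judge model, or a human rater) without touching the aggregation step.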

If this is right

  • Organizations can reduce energy consumption and operating costs by selecting smaller models for routine tasks without major quality loss.
  • Local deployment of compact models improves data sovereignty and privacy for sensitive workflows.
  • Task category should guide model choice, since conceptual tasks expose weaknesses that aggregation tasks do not.
  • Evaluation shifts from pure performance maximization to context-specific sufficiency checks that reflect real organizational constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to domain-specific tasks such as legal drafting or medical summarization to test whether the same size-performance trade-offs hold.
  • Hybrid systems that route easy tasks to small models and hard tasks to large ones become more attractive once sufficiency thresholds are quantified.
  • Wider use of this evaluation style would create demand for even more efficient small models tailored to common workplace patterns.
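
The routing idea in the list above can be sketched as a threshold rule: pick the cheapest model whose score for the task category clears a sufficiency threshold. This is an editorial illustration under stated assumptions. The `ModelProfile` type, the model names, the cost units, and every number are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple, Optional


class ModelProfile(NamedTuple):
    name: str
    cost_per_task: float                 # hypothetical relative cost units
    score_by_category: Dict[str, float]  # hypothetical mean judge score per task category


def route(category: str, threshold: float, models: List[ModelProfile]) -> Optional[str]:
    """Return the cheapest model whose score for the category meets the threshold."""
    sufficient = [m for m in models if m.score_by_category.get(category, 0.0) >= threshold]
    if not sufficient:
        return None  # no model is good enough; the caller decides how to escalate
    return min(sufficient, key=lambda m: m.cost_per_task).name


# Illustrative numbers only; they are not results from the paper.
MODELS = [
    ModelProfile("small-local", 1.0, {"aggregation": 4.2, "conceptual": 3.1}),
    ModelProfile("large-hosted", 10.0, {"aggregation": 4.6, "conceptual": 4.3}),
]

easy_choice = route("aggregation", 4.0, MODELS)
hard_choice = route("conceptual", 4.0, MODELS)
print(easy_choice, hard_choice)
```

The rule is only as good as the per-category scores feeding it, which is why the sufficiency thresholds need to be quantified first.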

Load-bearing premise

The automated dual-LLM evaluator gives unbiased and accurate scores for quality, accuracy, and ethics that match what human judges would conclude.

What would settle it

A direct human rating study on the same ten tasks and ten criteria that produces substantially different performance rankings or sufficiency thresholds for the smaller models compared with the automated results.
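
One way to compare rankings from such a study against the automated scores is a Spearman rank correlation. The sketch below implements it with the standard library only. The score lists are made up for illustration, and the choice of Spearman's rho (rather than, say, an intraclass correlation) is an assumption, not something the paper specifies.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up mean scores for five hypothetical models (not data from the paper).
human_scores = [4.5, 4.0, 3.5, 3.0, 2.0]
judge_scores = [4.8, 3.9, 4.1, 3.2, 2.5]
rho = spearman(human_scores, judge_scores)
print(round(rho, 3))
```

A rho near 1 would indicate that the automated judge orders the models much as human raters do; a substantially lower value would be the kind of disagreement described above.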

Figures

Figures reproduced from arXiv: 2504.13217 by Finn Klessascheck, Jan Mendling, Jennifer Haase, Sebastian Pokutta.

Figure 1: Overview of the applied Testing and Evaluation Framework
Figure 2: LLM Performance vs. Sum of Input + Output Cost (by Marker Shape)

Original abstract

Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper empirically evaluates eleven proprietary and open-weight LLMs across ten occupational tasks (e.g., text summarization, schedule generation, email drafting) using a dual-LLM evaluation framework to score outputs on ten criteria covering quality, factual accuracy, and ethical responsibility. It reports that GPT-4o achieves the highest performance but with greater cost and environmental impact, while smaller models such as Gemma-3 and Phi-4 deliver strong and reliable results on most tasks. A cluster analysis identifies three model groups (premium all-rounders, competent generalists, limited but safe performers), task type modulates effectiveness, and the work advocates shifting from performance-maximizing benchmarks to task- and context-aware sufficiency assessments for sustainable deployment.

Significance. If the evaluation framework holds, the study supplies actionable, organization-relevant evidence on right-sizing LLMs to balance performance against energy, financial, and sovereignty costs. The broad model coverage and task set, together with the proposed scalable dual-LLM method, could help move the field from leaderboard-style maximization toward practical sufficiency criteria. The purely empirical design avoids circularity and supplies falsifiable, reproducible comparisons across models and tasks.

major comments (3)
  1. [Abstract / Evaluation Framework] The headline finding that Gemma-3 and Phi-4 produce 'strong and reliable results on most tasks' rests entirely on the dual-LLM judge; the manuscript provides no human correlation, inter-rater reliability statistics, ablation against alternative judges, or calibration for known LLM-judge biases (stylistic favoritism, under-detection of subtle factual errors). This is load-bearing for all performance, cluster, and sufficiency claims.
  2. [Methods / Evaluation Framework] No details are reported on prompt engineering for either the task-execution or the evaluation LLMs, statistical significance tests across models/tasks, or handling of evaluator bias. These omissions prevent verification of the central empirical patterns cited in the abstract.
  3. [Results / Cluster Analysis] The cluster analysis that yields the three model groups and the claim that 'task type influenced model effectiveness' would require explicit description of the clustering algorithm, feature set, and robustness checks; without them the trade-off conclusions remain under-supported.
minor comments (2)
  1. [Abstract] The abstract states GPT-4o has a 'significantly higher cost and environmental footprint' but supplies no quantitative deltas; adding concrete metrics (e.g., tokens, energy estimates) would make the sustainability comparison more precise.
  2. [Throughout] Notation for the ten evaluation criteria and the three cluster labels should be defined once and used consistently to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve the manuscript.

  1. Referee: [Abstract / Evaluation Framework] The headline finding that Gemma-3 and Phi-4 produce 'strong and reliable results on most tasks' rests entirely on the dual-LLM judge; the manuscript provides no human correlation, inter-rater reliability statistics, ablation against alternative judges, or calibration for known LLM-judge biases (stylistic favoritism, under-detection of subtle factual errors). This is load-bearing for all performance, cluster, and sufficiency claims.

    Authors: We agree that the absence of human validation for the dual-LLM judge is a significant limitation that affects the strength of the performance and sufficiency claims. The manuscript describes the dual-LLM framework but does not report human correlation, reliability statistics, or bias calibration. In the revised version we will add a dedicated subsection that includes correlation results between the LLM judge and human raters on a held-out sample of outputs, inter-rater agreement metrics, and an explicit discussion of known LLM-judge biases with any mitigation steps taken. revision: yes

  2. Referee: [Methods / Evaluation Framework] No details are reported on prompt engineering for either the task-execution or the evaluation LLMs, statistical significance tests across models/tasks, or handling of evaluator bias. These omissions prevent verification of the central empirical patterns cited in the abstract.

    Authors: We acknowledge that the current Methods section is insufficiently detailed on these points. Although the prompts are referenced in the appendix, we will expand the main text to include the complete prompt templates for both task execution and evaluation, describe the statistical procedures used (including the specific tests, multiple-comparison corrections, and effect-size reporting), and explain the steps taken to reduce evaluator bias such as fixed prompt wording and consistent judge model selection. revision: yes

  3. Referee: [Results / Cluster Analysis] The cluster analysis that yields the three model groups and the claim that 'task type influenced model effectiveness' would require explicit description of the clustering algorithm, feature set, and robustness checks; without them the trade-off conclusions remain under-supported.

    Authors: We agree that the cluster analysis requires more explicit documentation to support the reported groupings and task-type observations. The manuscript currently states the existence of three clusters and notes task-type effects but omits algorithmic details. In the revision we will specify the clustering algorithm and number of clusters chosen, list the exact feature set (normalized scores across the ten criteria), and add robustness checks including silhouette analysis and results from alternative clustering approaches. revision: yes
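
The silhouette analysis mentioned in the third response can be illustrated with a small standard-library sketch. It computes the mean silhouette coefficient for a given cluster assignment using Euclidean distance. The two-dimensional score vectors and both labelings are hypothetical; the rebuttal describes the feature set as normalized scores across ten criteria, and the clustering algorithm itself is not specified on this page.

```python
from math import dist
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: its silhouette is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical score vectors for four models (not data from the paper).
points = [[4.5, 4.4], [4.4, 4.6], [3.0, 3.1], [3.1, 2.9]]
good = silhouette_score(points, [0, 0, 1, 1])   # groups nearby models together
bad = silhouette_score(points, [0, 1, 0, 1])    # mixes distant models
print(round(good, 3), round(bad, 3))
```

Comparing the score across different numbers of clusters, or across alternative algorithms, is one common way to check whether a three-group solution is well separated.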

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with direct results from model outputs and automated scoring

The paper performs an empirical evaluation of eleven LLMs on ten occupational tasks using a dual-LLM evaluation framework to score outputs on quality, accuracy, and ethical criteria. Results, including cluster analysis and sufficiency conclusions for models like Gemma-3 and Phi-4, derive directly from the observed task executions and automated scores rather than from any fitted parameters, self-referential predictions, mathematical derivations, or self-citation chains that reduce to inputs by construction. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the derivation chain. The study is self-contained against external benchmarks via direct measurement, qualifying for the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that automated LLM-as-judge evaluation serves as a valid proxy for human assessment of quality and ethics; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: LLM-generated outputs can be reliably and objectively scored for quality, factual accuracy, and ethical responsibility by another LLM using fixed criteria.
    Invoked via the dual-LLM evaluation framework described in the abstract as the basis for all performance comparisons.

pith-pipeline@v0.9.0 · 5792 in / 1262 out tokens · 71328 ms · 2026-05-22T19:19:20.752571+00:00 · methodology
