Sustainability via LLM Right-sizing
Pith reviewed 2026-05-22 19:19 UTC · model grok-4.3
The pith
Smaller LLMs like Gemma-3 and Phi-4 match larger models on most workplace tasks while using far less energy and cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An evaluation of eleven proprietary and open-weight LLMs across ten everyday work tasks, scored by a dual-LLM evaluator, finds that compact models such as Gemma-3 and Phi-4 deliver strong and consistent results on most tasks. This supports their use where cost, local deployment, or data control are priorities, while larger models like GPT-4o offer superior but more expensive performance. Cluster analysis further groups models into premium all-rounders, competent generalists, and limited but safe performers, and task category strongly affects outcomes.
What carries the argument
A dual-LLM-based evaluation framework that automates task execution and applies standardized scoring across ten criteria covering output quality, factual accuracy, and ethical responsibility.
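The shape of such a framework can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it assumes "dual-LLM" means one model executes the task and a second model scores the output, and the names `evaluate`, `stub_executor`, `stub_judge`, and the three criterion labels are hypothetical (the paper uses ten criteria that are not enumerated here). The stubs stand in for real model calls so the sketch runs offline.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical criterion labels; the paper's ten criteria are not listed in this review.
CRITERIA: List[str] = ["quality", "factual_accuracy", "ethical_responsibility"]


def evaluate(
    task_prompt: str,
    executor: Callable[[str], str],
    judge: Callable[[str, str, str], float],
) -> Dict[str, float]:
    """Run one task with the executor, then score the output per criterion with the judge."""
    output = executor(task_prompt)
    scores = {c: judge(task_prompt, output, c) for c in CRITERIA}
    scores["mean"] = mean(scores[c] for c in CRITERIA)
    return scores


def stub_executor(prompt: str) -> str:
    """Stand-in for the task-execution model (assumption: a real call would go here)."""
    return f"Draft response to: {prompt}"


def stub_judge(prompt: str, output: str, criterion: str) -> float:
    """Stand-in for the judge model; toy rule, not a real scoring rubric."""
    return 4.0 if prompt in output else 2.0


result = evaluate("Summarize the meeting notes.", stub_executor, stub_judge)
print(result)
```

Keeping the executor and judge as plain callables makes it easy to swap the judge for a different model or for human ratings, which is the comparison the load-bearing premise depends on.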
If this is right
- Organizations can reduce energy consumption and operating costs by selecting smaller models for routine tasks without major quality loss.
- Local deployment of compact models improves data sovereignty and privacy for sensitive workflows.
- Task category should guide model choice, since conceptual tasks expose weaknesses that aggregation tasks do not.
- Evaluation shifts from pure performance maximization to context-specific sufficiency checks that reflect real organizational constraints.
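A sufficiency check of this kind can be sketched as "pick the cheapest model that clears a task-specific threshold". The sketch below is illustrative only: the model names, scores, costs, and the 1-to-5 scale are invented placeholders, not results from the paper, and `cheapest_sufficient` is a hypothetical helper.

```python
from typing import Dict, Optional

# Invented placeholder scores (assumed 1-5 scale) per model and task category.
SCORES: Dict[str, Dict[str, float]] = {
    "small-model": {"aggregation": 4.3, "conceptual": 3.1},
    "large-model": {"aggregation": 4.6, "conceptual": 4.2},
}

# Invented relative cost per request.
COST: Dict[str, float] = {"small-model": 1.0, "large-model": 12.0}


def cheapest_sufficient(category: str, threshold: float) -> Optional[str]:
    """Return the lowest-cost model whose score meets the threshold, or None if none does."""
    candidates = [
        model for model, by_task in SCORES.items()
        if by_task.get(category, 0.0) >= threshold
    ]
    if not candidates:
        return None
    return min(candidates, key=COST.__getitem__)


print(cheapest_sufficient("aggregation", 4.0))
print(cheapest_sufficient("conceptual", 4.0))
```

With these placeholder numbers the aggregation task goes to the small model and the conceptual task to the large one, which mirrors the task-category effect described in the bullets above.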
Where Pith is reading between the lines
- The method could be applied to domain-specific tasks such as legal drafting or medical summarization to test whether the same size-performance trade-offs hold.
- Hybrid systems that route easy tasks to small models and hard tasks to large ones become more attractive once sufficiency thresholds are quantified.
- Wider use of this evaluation style would create demand for even more efficient small models tailored to common workplace patterns.
Load-bearing premise
The automated dual-LLM evaluator gives unbiased and accurate scores for quality, accuracy, and ethics that match what human judges would conclude.
What would settle it
A direct human rating study on the same ten tasks and ten criteria that produces substantially different performance rankings or sufficiency thresholds for the smaller models compared with the automated results.
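One way to quantify "substantially different rankings" is a rank correlation between automated and human scores per model. The sketch below computes Spearman's rho in plain Python, using average ranks for ties. The score lists are invented placeholders, not data from the paper, and the choice of Spearman's rho is an assumption rather than a method the paper specifies.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (vx * vy) ** 0.5


# Invented placeholder scores for four models, one list per rater type.
judge_scores = [4.6, 4.3, 4.1, 3.2]
human_scores = [4.5, 3.9, 4.0, 3.0]
rho = spearman(judge_scores, human_scores)
print(round(rho, 3))
```

A rho near 1 would indicate that human judges order the models much as the automated evaluator does; a low or negative value would be the kind of disagreement that undermines the sufficiency claims.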
Original abstract
Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically evaluates eleven proprietary and open-weight LLMs across ten occupational tasks (e.g., text summarization, schedule generation, email drafting) using a dual-LLM evaluation framework to score outputs on ten criteria covering quality, factual accuracy, and ethical responsibility. It reports that GPT-4o achieves the highest performance but with greater cost and environmental impact, while smaller models such as Gemma-3 and Phi-4 deliver strong and reliable results on most tasks. A cluster analysis identifies three model groups (premium all-rounders, competent generalists, limited but safe performers), task type modulates effectiveness, and the work advocates shifting from performance-maximizing benchmarks to task- and context-aware sufficiency assessments for sustainable deployment.
Significance. If the evaluation framework holds, the study supplies actionable, organization-relevant evidence on right-sizing LLMs to balance performance against energy, financial, and sovereignty costs. The broad model coverage and task set, together with the proposed scalable dual-LLM method, could help move the field from leaderboard-style maximization toward practical sufficiency criteria. The purely empirical design avoids circularity and supplies falsifiable, reproducible comparisons across models and tasks.
major comments (3)
- [Abstract / Evaluation Framework] The headline finding that Gemma-3 and Phi-4 produce 'strong and reliable results on most tasks' rests entirely on the dual-LLM judge; the manuscript provides no human correlation, inter-rater reliability statistics, ablation against alternative judges, or calibration for known LLM-judge biases (stylistic favoritism, under-detection of subtle factual errors). This is load-bearing for all performance, cluster, and sufficiency claims.
- [Methods / Evaluation Framework] No details are reported on prompt engineering for either the task-execution or the evaluation LLMs, statistical significance tests across models/tasks, or handling of evaluator bias. These omissions prevent verification of the central empirical patterns cited in the abstract.
- [Results / Cluster Analysis] The cluster analysis that yields the three model groups and the claim that 'task type influenced model effectiveness' would require explicit description of the clustering algorithm, feature set, and robustness checks; without them the trade-off conclusions remain under-supported.
minor comments (2)
- [Abstract] The abstract states GPT-4o has a 'significantly higher cost and environmental footprint' but supplies no quantitative deltas; adding concrete metrics (e.g., tokens, energy estimates) would make the sustainability comparison more precise.
- [Throughout] Notation for the ten evaluation criteria and the three cluster labels should be defined once and used consistently to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve the manuscript.
- Referee: [Abstract / Evaluation Framework] The headline finding that Gemma-3 and Phi-4 produce 'strong and reliable results on most tasks' rests entirely on the dual-LLM judge; the manuscript provides no human correlation, inter-rater reliability statistics, ablation against alternative judges, or calibration for known LLM-judge biases (stylistic favoritism, under-detection of subtle factual errors). This is load-bearing for all performance, cluster, and sufficiency claims.
  Authors: We agree that the absence of human validation for the dual-LLM judge is a significant limitation that affects the strength of the performance and sufficiency claims. The manuscript describes the dual-LLM framework but does not report human correlation, reliability statistics, or bias calibration. In the revised version we will add a dedicated subsection that includes correlation results between the LLM judge and human raters on a held-out sample of outputs, inter-rater agreement metrics, and an explicit discussion of known LLM-judge biases with any mitigation steps taken.
  revision: yes
- Referee: [Methods / Evaluation Framework] No details are reported on prompt engineering for either the task-execution or the evaluation LLMs, statistical significance tests across models/tasks, or handling of evaluator bias. These omissions prevent verification of the central empirical patterns cited in the abstract.
  Authors: We acknowledge that the current Methods section is insufficiently detailed on these points. Although the prompts are referenced in the appendix, we will expand the main text to include the complete prompt templates for both task execution and evaluation, describe the statistical procedures used (including the specific tests, multiple-comparison corrections, and effect-size reporting), and explain the steps taken to reduce evaluator bias such as fixed prompt wording and consistent judge model selection.
  revision: yes
- Referee: [Results / Cluster Analysis] The cluster analysis that yields the three model groups and the claim that 'task type influenced model effectiveness' would require explicit description of the clustering algorithm, feature set, and robustness checks; without them the trade-off conclusions remain under-supported.
  Authors: We agree that the cluster analysis requires more explicit documentation to support the reported groupings and task-type observations. The manuscript currently states the existence of three clusters and notes task-type effects but omits algorithmic details. In the revision we will specify the clustering algorithm and number of clusters chosen, list the exact feature set (normalized scores across the ten criteria), and add robustness checks including silhouette analysis and results from alternative clustering approaches.
  revision: yes
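The silhouette analysis mentioned in the last response can be illustrated with a small, dependency-free sketch. It computes the mean silhouette coefficient for a given assignment of points to clusters. The 2-D points and labels are toy values chosen for illustration (the rebuttal describes the real features as normalized scores across the ten criteria), and `silhouette_score` here is a hand-written helper, not the authors' code or a library call.

```python
from math import dist
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy feature vectors: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # grouping that matches the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # grouping that cuts across it
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate well-separated clusters, while values near or below 0 suggest the grouping is not supported by the features, which is the kind of robustness evidence the referee asks for.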
Circularity Check
No circularity: purely empirical comparison with direct results from model outputs and automated scoring
The paper performs an empirical evaluation of eleven LLMs on ten occupational tasks using a dual-LLM evaluation framework to score outputs on quality, accuracy, and ethical criteria. Results, including cluster analysis and sufficiency conclusions for models like Gemma-3 and Phi-4, derive directly from the observed task executions and automated scores rather than from any fitted parameters, self-referential predictions, mathematical derivations, or self-citation chains that reduce to inputs by construction. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the derivation chain. The study is self-contained against external benchmarks via direct measurement, qualifying for the default non-circularity outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-generated outputs can be reliably and objectively scored for quality, factual accuracy, and ethical responsibility by another LLM using fixed criteria.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria... cluster analysis revealed three model groups
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost... smaller models like Gemma-3 and Phi-4 achieved strong and reliable results
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.