The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining
Pith reviewed 2026-05-09 17:53 UTC · model grok-4.3
The pith
For the Olmo 3 family, development costs (experimentation, failed runs, and ablations) account for 82.2% of total compute, and reasoning models require 17x more post-training energy than their instruction-tuned counterparts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the Olmo 3 models, development costs including experimentation, failed runs, and ablations account for 82.2% of total compute, a roughly 65% increase over the approximately 50% reported for pretraining-focused pipelines. Reasoning models are 17 times more expensive to post-train than instruction-tuned counterparts, driven by reinforcement learning rollout generation. The entire process consumed approximately 12.3 GWh of datacenter energy, emitted 4,251 tCO2eq, and used 15,887 kL of water, with water consumption tied entirely to power generation infrastructure rather than direct data center cooling.
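As a rough consistency check on these totals, the conversion factors they imply can be back-calculated; the factors below are derived here from the reported numbers and are not stated by the paper.

```python
# Back-calculate the conversion factors implied by the reported totals
# (12.3 GWh, 4,251 tCO2eq, 15,887 kL). These factors are derived here
# for illustration; the paper does not state them directly.

energy_kwh = 12.3e6        # 12.3 GWh of datacenter energy, in kWh
carbon_g = 4_251e6         # 4,251 tCO2eq, in grams
water_l = 15_887e3         # 15,887 kL of water, in liters

carbon_intensity = carbon_g / energy_kwh   # gCO2eq per kWh
water_intensity = water_l / energy_kwh     # liters per kWh

print(f"implied grid carbon intensity: {carbon_intensity:.0f} gCO2eq/kWh")
print(f"implied water-use factor:      {water_intensity:.2f} L/kWh")
```

Both implied values fall in a plausible range for US grid electricity, which is a weak sanity check rather than a validation of the underlying measurements.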
What carries the argument
The stage-by-stage environmental accounting of the complete language model development pipeline, including pretraining, supervised fine-tuning, preference optimization, reinforcement learning, and all associated experimentation and failed runs.
If this is right
- Reasoning models incur substantially higher post-training energy costs due to reinforcement learning rollouts.
- Development overheads significantly increase the overall environmental footprint beyond pretraining alone.
- Environmental reporting standards must incorporate full pipeline costs to accurately reflect impacts.
- Unreported costs will likely grow rapidly as post-training pipelines become more complex.
- Efforts to reduce AI's environmental impact should address the entire development process rather than isolated stages.
Where Pith is reading between the lines
- Developers may need to adopt more efficient experimentation strategies to lower the share of failed runs.
- Disclosure of full development impacts could raise public awareness of AI's true environmental footprint.
- Optimization efforts might target reinforcement learning stages and trial-and-error workflows for greater efficiency gains.
Load-bearing premise
The energy, carbon, and water figures for each pipeline stage including reinforcement learning rollouts and failed experiments are complete and accurately measured without significant undercounting or reliance on unverified conversion factors.
What would settle it
An independent audit that directly measures the total datacenter energy and carbon emissions for a comparable full language model development pipeline and finds totals or proportions substantially different from the reported 12.3 GWh or 82.2% development overhead.
Figures
Original abstract
Modern language model development extends far beyond pretraining, yet environmental reporting remains narrowly focused on the cost of training a single final model. In this work, we provide the first detailed breakdown of the environmental impact of a full model development pipeline, from pretraining through supervised fine-tuning, preference optimization, and reinforcement learning, for Olmo 3, a family of 7 billion and 32 billion parameter models in both instruction-following and reasoning variants. We find that reasoning models are 17x more expensive to post-train than their instruction-tuned counterparts in terms of datacenter energy, driven by reinforcement learning rollout generation. Development costs (including experimentation, failed runs, and ablations) account for 82.2% of total compute, a roughly 65% increase over the ~50% reported for pretraining-focused pipelines in prior work. In total, we estimate our model development process consumed ~12.3 GWh of datacenter energy, emitted 4,251 tCO2eq, and consumed 15,887 kL of water, with water consumption driven entirely by power generation infrastructure rather than data center cooling. These costs, which are almost entirely unreported by model developers, are growing rapidly as post-training pipelines become more complex, and must be accounted for in environmental reporting standards and by the research community working to reduce AI's environmental impact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides the first detailed environmental impact analysis of a complete LM development pipeline for the Olmo 3 family (7B and 32B parameter models in instruction-tuned and reasoning variants), covering pretraining through supervised fine-tuning, preference optimization, and reinforcement learning. It reports that development costs (experimentation, failed runs, ablations) comprise 82.2% of total compute (65% above prior pretraining-focused estimates), that reasoning models require 17x more post-training datacenter energy than instruction-tuned counterparts due to RL rollout generation, and that the full process consumed approximately 12.3 GWh of energy, emitted 4,251 tCO2eq, and used 15,887 kL of water (driven by power generation rather than cooling).
Significance. If the underlying measurements hold, the work is significant for demonstrating that post-training and hidden development overheads dominate the environmental footprint of modern LM pipelines far beyond pretraining, providing concrete numbers for a real model family that prior studies lacked. It strengthens the case for expanded reporting standards and could guide efficiency research, with credit due for the comprehensive stage-by-stage breakdown and inclusion of water metrics alongside energy and emissions.
major comments (3)
- [Abstract] The headline figures (82.2% development share, 17x reasoning multiplier, 12.3 GWh / 4,251 tCO2eq / 15,887 kL totals) are presented without error bars, sensitivity ranges, or explicit sources for the energy-to-carbon and energy-to-water conversion factors, which undermines verification of the central quantitative claims.
- [Methods] The accounting for development costs (including all failed runs and ablations) is described only at a high level, with no specifics on cluster logging completeness, post-hoc estimation procedures for unlogged experiments, or how RL rollout energy was isolated and scaled, even though these details directly support the 82.2% and 17x claims.
- [Results] The 17x energy multiplier for reasoning vs. instruction-tuned models is attributed to RL rollouts, yet no details are given on rollout counts, sampling volumes, or per-rollout measurement methodology, making it impossible to assess whether the factor is robust or generalizes beyond this pipeline.
minor comments (2)
- [Introduction] The reference to '~50% reported for pretraining-focused pipelines in prior work' would benefit from explicit citations to those studies for direct comparison.
- [Results] Figure or table presenting the stage-by-stage breakdown should include units and conversion assumptions explicitly in the caption for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the significance of providing a full-pipeline environmental analysis. The comments correctly identify areas where greater methodological transparency is needed. We address each point below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Abstract] The headline figures (82.2% development share, 17x reasoning multiplier, 12.3 GWh / 4,251 tCO2eq / 15,887 kL totals) are presented without error bars, sensitivity ranges, or explicit sources for the energy-to-carbon and energy-to-water conversion factors, which undermines verification of the central quantitative claims.
Authors: We agree that the absence of uncertainty quantification and explicit conversion-factor sources limits verifiability. In the revised manuscript we will (i) cite the precise regional grid-intensity and water-use factors used (with references to the underlying EPA and utility data), (ii) add a sensitivity table showing how totals vary under ±20% changes in carbon intensity and water-use rates, and (iii) report approximate uncertainty ranges derived from hardware-utilization variance and logging gaps. These additions will appear in both the Results section and a new subsection of the Methods; the abstract will be updated to reference the sensitivity analysis. revision: yes
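A minimal sketch of the promised sensitivity table, assuming the baseline conversion factors are those implied by the paper's reported totals; the ±20% band is the range named in the response, not a measured uncertainty.

```python
# Sketch of the promised sensitivity table: headline totals under
# +/-20% shifts in the conversion factors. Baseline factors are the
# ones implied by the reported totals; the +/-20% band is the range
# named in the rebuttal, not a measured uncertainty.

ENERGY_KWH = 12.3e6                     # total datacenter energy, kWh
BASE_CARBON = 4_251e6 / ENERGY_KWH      # implied gCO2eq/kWh
BASE_WATER = 15_887e3 / ENERGY_KWH      # implied L/kWh

for delta in (-0.20, 0.0, 0.20):
    carbon_t = ENERGY_KWH * BASE_CARBON * (1 + delta) / 1e6  # tCO2eq
    water_kl = ENERGY_KWH * BASE_WATER * (1 + delta) / 1e3   # kL
    print(f"{delta:+.0%}: {carbon_t:8,.0f} tCO2eq  {water_kl:8,.0f} kL")
```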
- Referee: [Methods] The accounting for development costs (including all failed runs and ablations) is described only at a high level, with no specifics on cluster logging completeness, post-hoc estimation procedures for unlogged experiments, or how RL rollout energy was isolated and scaled, even though these details directly support the 82.2% and 17x claims.
Authors: We acknowledge that the current description is insufficient for independent assessment. The revision will expand the Methods section with: (a) the exact coverage of the cluster logging system (percentage of jobs captured by the power-monitoring daemon), (b) the post-hoc estimation protocol for unlogged runs (linear regression on GPU-hours and model size calibrated against logged counterparts), and (c) the precise procedure used to isolate RL rollout energy (per-token power draw measured on the inference cluster, multiplied by total tokens generated across all rollouts). These details will directly underpin the 82.2% and 17x figures. revision: yes
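The rollout-energy isolation procedure described in (c) can be sketched as follows; the per-token energy and token count plugged in below are hypothetical placeholders, not the paper's measurements.

```python
# Minimal sketch of the rollout-energy isolation procedure: a per-token
# energy figure measured on the inference cluster, scaled by the total
# tokens generated across all RL rollouts. The numbers plugged in below
# are hypothetical placeholders, not the paper's data.

def rollout_energy_kwh(joules_per_token: float, total_tokens: int) -> float:
    """Estimate total RL rollout energy from a per-token measurement."""
    return joules_per_token * total_tokens / 3.6e6  # 1 kWh = 3.6e6 J

# Hypothetical example: 0.5 J per generated token over 1e12 tokens.
energy = rollout_energy_kwh(0.5, 10**12)
print(f"{energy:,.0f} kWh")  # ~139,000 kWh under these made-up inputs
```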
- Referee: [Results] The 17x energy multiplier for reasoning vs. instruction-tuned models is attributed to RL rollouts, yet no details are given on rollout counts, sampling volumes, or per-rollout measurement methodology, making it impossible to assess whether the factor is robust or generalizes beyond this pipeline.
Authors: We agree that the 17x claim requires supporting quantitative detail. The revised Results section will report: the total number of RL rollouts performed for each model variant, the average number of sampled responses per prompt, and the per-rollout energy measurement method (direct power-meter readings on the inference nodes, averaged over representative batches). We will also add a short discussion of how these volumes compare with other published RL post-training pipelines, while noting that the exact multiplier is pipeline-specific. revision: yes
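To illustrate why the multiplier is pipeline-specific, a toy decomposition, assuming (purely for illustration) that both variants share the SFT and preference-optimization stages and only the reasoning variant pays for RL rollouts; all stage energies below are made up.

```python
# Toy decomposition of the post-training multiplier, assuming (purely
# for illustration) that both variants share the SFT and preference
# stages and only the reasoning variant pays for RL rollouts. All
# stage energies below are made-up placeholders, not the paper's data.

def post_train_multiplier(sft_kwh: float, pref_kwh: float,
                          rl_rollout_kwh: float) -> float:
    """Reasoning-variant post-training energy over instruct-variant energy."""
    instruct = sft_kwh + pref_kwh
    reasoning = instruct + rl_rollout_kwh
    return reasoning / instruct

# Rollouts at 16x the shared stages give a 17x multiplier; halving the
# rollout volume drops it to 9x, which is why the factor does not
# transfer directly to other pipelines.
print(post_train_multiplier(60.0, 40.0, 1600.0))  # 17.0
print(post_train_multiplier(60.0, 40.0, 800.0))   # 9.0
```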
Circularity Check
No circularity: claims rest on direct empirical measurements of pipeline energy use.
Full rationale
The paper reports observed totals and breakdowns (12.3 GWh, 82.2% development share, 17x post-training gap) obtained by instrumenting their own Olmo 3 training runs, including failed experiments, ablations, SFT, preference optimization, and RL rollouts. These quantities are presented as measured outputs rather than derived via equations, fitted parameters, or self-referential definitions that would make the reported figures equivalent to the inputs by construction. External comparisons (e.g., to the ~50% figure in prior pretraining-focused work) cite independent literature and do not rely on load-bearing self-citations or uniqueness theorems from the authors' prior papers. The derivation chain is therefore a straightforward accounting exercise with no reduction of predictions to fitted inputs or ansatzes smuggled through citations.
Axiom & Free-Parameter Ledger
free parameters (2)
- energy per RL rollout
- development overhead fraction
axioms (1)
- domain assumption: Datacenter energy measurements and carbon/water conversion factors from power infrastructure are accurate and complete.