arxiv: 2602.00056 · v3 · submitted 2026-01-20 · 💻 cs.CY · cs.AI

How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords: hyper-datafication · AI sustainability · data labor · Global South · environmental costs · frontier AI · dataset analysis · representational harms

The pith

Hyper-datafication in frontier AI redistributes environmental burdens, labor risks, and representational harms toward the Global South and precarious workers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that frontier AI has moved from using available data to actively generating new data tailored for model training, a process called hyper-datafication. Analysis of roughly 550,000 datasets reveals rapid growth in storage needs and associated energy use, while interviews highlight exploitative labor conditions. This shift does not just consume more resources overall but concentrates the downsides on specific groups and regions. Readers should care because these dynamics affect the equity and long-term viability of AI technologies that influence daily life worldwide. The authors offer practical recommendations to reduce these overlooked costs.

Core claim

The transition to hyper-datafication in AI does not just scale up resource use but systematically shifts environmental burdens, labour risks, and representational harms to the Global South, precarious data workers, and under-represented cultures, as evidenced by dataset analyses and qualitative data from Kenya. The authors propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs.

What carries the argument

Hyper-datafication: the active creation of data for building AI models instead of relying on existing data. In the paper's account, this shift is the mechanism through which costs are redistributed.

If this is right

  • Increased dataset growth drives higher storage energy consumption and carbon emissions.
  • Data labor exposes workers in the Global South to graphic content and precarious employment.
  • Under-represented languages and cultures face continued representational harms in AI outputs.
  • Disparities in data infrastructure amplify environmental impacts in certain regions.
  • Following the Data PROOFS framework could reduce these redistributed burdens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If accurate, AI governance should mandate transparent data sourcing to prevent burden shifting across borders.
  • This pattern may apply to other emerging technologies reliant on massive data collection.
  • A testable extension would involve mapping data worker conditions across multiple countries.
  • Connections to digital colonialism suggest broader geopolitical implications for data control.

Load-bearing premise

The sample of Hugging Face Hub datasets and Kenyan data worker responses sufficiently captures global data practices and impacts for frontier AI.

What would settle it

A global study showing that data-related environmental burdens, labor risks, and harms are not disproportionately shifted to the Global South would falsify the central redistribution claim.

Figures

Figures reproduced from arXiv: 2602.00056 by Erik B. Dam, Janin Koch, Mophat Okinyi, Raghavendra Selvan, Sebastian Mair, Sophia N. Wilson.

Figure 1. Growth of datasets and data volume over time and download concentration on the Hugging Face Hub. Left: Monthly counts ...
Figure 2. Left: Estimated provider-side storage energy (GWh). Right: Estimated user-side storage energy (TWh), assuming that 10 ...
Figure 3. Left: Historical (2022–2024) and projected (2024–2034) electricity use for all data centres worldwide under two scenarios: a base ...
Figure 4. Left: Distribution of respondents across salary bands by weekly working hours. Centre: Distribution of respondents across ...
Figure 5. Gender-disaggregated distributions of weekly working hours, monthly salary, experience, data work types, and exposure to ...
Figure 6. Representation and demand for the ten largest language groups on the Hugging Face Hub. Left: A depiction of each group’s ...
Figure 7. Left: Historical (2015–2024) and projected (2024–2030) global annual investment in data centres in the base case reflecting ...
Figure 8. Distribution of dataset modalities and task categories on the Hugging Face Hub. Left: The fifteen most common dataset ...
Figure 9. Distribution of dataset sizes and downloads on the Hugging Face Hub by modality and task. Left: Violin plots of dataset sizes ...
Figure 10. Mobile and fixed broadband traffic for the Asia-Pacific region, America, Europe, the Arab States, and the Commonwealth of ...
Original abstract

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that frontier AI is transitioning from using existing data to actively creating data for models, a shift termed 'hyper-datafication' that increases sustainability costs and systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. This is supported by analysis of ~550,000 Hugging Face Hub datasets on growth, storage energy consumption, carbon footprint, and language representation; qualitative responses from data workers in Kenya on labour conditions including exposure to graphic content; and external data on global data centre infrastructure disparities. The paper concludes by proposing Data PROOFS recommendations (provenance, resource awareness, ownership, openness, frugality, standards) to mitigate these costs.

Significance. If the redistribution claim is substantiated through added comparative baselines and provenance tracing, the work would contribute to AI sustainability literature by highlighting data-related burdens and labour issues beyond model training energy costs. The mixed-methods design combining large-scale dataset metrics with worker interviews is a positive feature that broadens the scope, though the current evidence base limits the strength of the global claims.

major comments (3)
  1. [Abstract] The central claim that hyper-datafication 'systematically redistributes' environmental burdens, labour risks, and representational harms toward the Global South is presented as a direct finding from the analyses, yet the HF Hub sample (curated and English-dominant) and Kenya-only interviews provide no region-stratified impact accounting or explicit provenance tracing linking dataset characteristics to localized geographic origins or energy metrics.
  2. [Dataset analysis section] No details are given on methods for the ~550k HF Hub analysis, including data exclusion criteria, error handling for storage energy and carbon calculations, or how language representation data quantitatively maps to 'under-represented cultures' or Global South burdens, leaving the quantitative support for redistribution interpretive rather than demonstrated.
  3. [Qualitative section] The Kenya data worker responses supply valuable local detail on employment and content exposure but include no comparative baseline from other regions or quantitative linkage to HF Hub dataset metrics, which undermines the systematic global redistribution conclusion.
minor comments (1)
  1. [Recommendations] The Data PROOFS recommendations are introduced in the abstract and conclusion but lack expanded definitions or concrete implementation examples that would strengthen their utility.
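
The second major comment asks how language representation could be mapped quantitatively to under-representation. One possible way to make such a mapping explicit is a representation ratio: a language's share of datasets divided by its share of speakers. The sketch below is an illustration only and is not the authors' method. The function name, the input dictionaries, and the choice of speaker counts as the demographic baseline are all assumptions introduced here.

```python
def representation_ratios(dataset_counts, speaker_counts):
    """Return dataset share divided by speaker share for each language.

    Hypothetical helper, not taken from the paper. A ratio below 1 means the
    language has fewer datasets than its speaker share would suggest; a ratio
    above 1 means it has more.
    """
    total_datasets = sum(dataset_counts.values())
    total_speakers = sum(speaker_counts.values())
    if total_datasets <= 0 or total_speakers <= 0:
        raise ValueError("both inputs need a positive total")

    ratios = {}
    for lang, speakers in speaker_counts.items():
        if speakers <= 0:
            # A language without a speaker estimate cannot be normalised.
            continue
        dataset_share = dataset_counts.get(lang, 0) / total_datasets
        speaker_share = speakers / total_speakers
        ratios[lang] = dataset_share / speaker_share
    return ratios


if __name__ == "__main__":
    # Placeholder labels and counts, chosen only to exercise the function.
    toy_datasets = {"lang_a": 90, "lang_b": 10}
    toy_speakers = {"lang_a": 50, "lang_b": 50}
    for lang, ratio in representation_ratios(toy_datasets, toy_speakers).items():
        print(lang, round(ratio, 2))
```

A ratio of this kind only captures dataset counts. Weighting by dataset size or downloads, or using a different demographic baseline, would be equally defensible and could change the ranking.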

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification and strengthening. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that hyper-datafication 'systematically redistributes' environmental burdens, labour risks, and representational harms toward the Global South is presented as a direct finding from the analyses, yet the HF Hub sample (curated and English-dominant) and Kenya-only interviews provide no region-stratified impact accounting or explicit provenance tracing linking dataset characteristics to localized geographic origins or energy metrics.

    Authors: We agree that the abstract presents the redistribution claim too directly. The HF Hub analysis reveals growth and representation patterns that disproportionately involve English-dominant and curated datasets, while the Kenya interviews illustrate labor conditions typical of Global South data work. However, we lack explicit region-stratified accounting or full provenance tracing. We will revise the abstract to state that the analyses indicate patterns consistent with a redistribution of burdens, supported by the combined quantitative and qualitative evidence, and will add explicit discussion of these limitations in the text.

    revision: yes

  2. Referee: [Dataset analysis section] No details are given on methods for the ~550k HF Hub analysis, including data exclusion criteria, error handling for storage energy and carbon calculations, or how language representation data quantitatively maps to 'under-represented cultures' or Global South burdens, leaving the quantitative support for redistribution interpretive rather than demonstrated.

    Authors: We acknowledge the omission of methodological details. In the revised manuscript, we will add a dedicated methods subsection describing: data collection via the Hugging Face Hub API, exclusion criteria (e.g., datasets with missing metadata or non-public status), the formulas and assumptions used for storage energy and carbon calculations with error handling and sensitivity analysis, and the quantitative mapping of language codes to cultural representation using external demographic sources. These additions will render the quantitative support more explicit.

    revision: yes

  3. Referee: [Qualitative section] The Kenya data worker responses supply valuable local detail on employment and content exposure but include no comparative baseline from other regions or quantitative linkage to HF Hub dataset metrics, which undermines the systematic global redistribution conclusion.

    Authors: The Kenya responses are presented as a case study of data labor conditions in a key Global South location. We accept that no comparative baselines or direct quantitative linkages are provided. We will revise the section to frame the findings as illustrative of documented trends in data work, add references to studies from other regions (e.g., India and the Philippines), and include a new limitations paragraph clarifying that the global redistribution claim is inferred from the combined evidence rather than directly measured. This will prevent overstatement while retaining the contribution of the qualitative data.

    revision: partial
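
The second response above promises explicit formulas and a sensitivity analysis for the storage energy and carbon calculations. As a rough sketch of what such a calculation could look like, the code below converts a stored volume into energy and emissions. It is not the paper's formula: the `StorageAssumptions` fields, the function names, and every parameter value are assumptions introduced for illustration and would need to be replaced with sourced figures.

```python
from dataclasses import dataclass

BYTES_PER_TB = 1e12
HOURS_PER_YEAR = 8760


@dataclass(frozen=True)
class StorageAssumptions:
    """Hypothetical parameter set; none of these values come from the paper."""
    watts_per_tb: float        # average power drawn per stored terabyte (W/TB)
    pue: float                 # power usage effectiveness of the facility (>= 1)
    replication_factor: float  # number of stored copies of each byte
    grid_gco2_per_kwh: float   # carbon intensity of electricity (gCO2e/kWh)


def storage_energy_kwh(size_bytes, years, a):
    """Energy in kWh to keep size_bytes stored for the given number of years."""
    if size_bytes < 0 or years < 0:
        raise ValueError("size_bytes and years must be non-negative")
    stored_tb = size_bytes / BYTES_PER_TB * a.replication_factor
    power_kw = stored_tb * a.watts_per_tb / 1000
    return power_kw * HOURS_PER_YEAR * years * a.pue


def storage_carbon_kg(size_bytes, years, a):
    """Emissions in kg CO2e implied by storage_energy_kwh and the grid intensity."""
    energy = storage_energy_kwh(size_bytes, years, a)
    return energy * a.grid_gco2_per_kwh / 1000


if __name__ == "__main__":
    # Simple sensitivity sweep over two placeholder scenarios.
    scenarios = {
        "low": StorageAssumptions(0.5, 1.1, 2, 100),
        "high": StorageAssumptions(2.0, 1.6, 3, 700),
    }
    for name, assumptions in scenarios.items():
        kwh = storage_energy_kwh(1e15, 1, assumptions)
        kg = storage_carbon_kg(1e15, 1, assumptions)
        print(name, round(kwh), "kWh", round(kg), "kg CO2e")
```

Reporting a low and a high scenario side by side, as in the sweep above, is one simple form a sensitivity analysis could take, since each of the four parameters is uncertain.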

standing simulated objections not resolved
  • Full region-stratified impact accounting and explicit provenance tracing across all ~550,000 HF Hub datasets remain unaddressed, because they would require proprietary data access and new empirical collection beyond the current study's scope.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper performs an empirical analysis of ~550k external Hugging Face Hub datasets for growth, energy, and representation metrics, supplemented by Kenya interviews and external data-centre statistics. No internal equations, fitted parameters, or self-citations are present that reduce the redistribution conclusion to the inputs by construction. The central claim is an interpretive synthesis of independent data sources rather than a self-referential derivation. This matches the default expectation for non-circular empirical work; the provided skeptic concerns address evidence sufficiency, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the representativeness of the Hugging Face sample for frontier AI data and the generalizability of Kenyan worker experiences, plus interpretive framing of costs as systematic redistribution without explicit global benchmarks.

axioms (2)
  • domain assumption: Hugging Face Hub datasets are representative of data used in frontier AI development.
    The quantitative analysis is performed exclusively on this platform without stated justification for its coverage of all relevant data sources.
  • domain assumption: Responses from data workers in Kenya capture key labour conditions and risks in AI data work globally.
    Qualitative findings are drawn from this specific group to support broader claims about labour risks and exposure to graphic content.
invented entities (1)
  • hyper-datafication (no independent evidence)
    purpose: To name and frame the transition from using existing data to actively creating data for AI models.
    New conceptual label introduced to organize the described shift and its consequences.

pith-pipeline@v0.9.0 · 5597 in / 1561 out tokens · 52118 ms · 2026-05-16T13:17:14.677358+00:00 · methodology

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 4 internal anchors

  1. [1]

    [n. d.]. DataCenterMap. https://www.datacentermap.com/datacenters/. Accessed: 2025-24-11

  2. [2]

    Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318

  3. [3]

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report.arXiv preprint arXiv:2412.08905(2024)

  4. [4]

    Nur Ahmed and Muntasir Wahed. 2020. The De-democratization of AI: Deep learning and the compute divide in artificial intelligence research. arXiv preprint arXiv:2010.15581(2020)

  5. [5]

    AI@Meta. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md

  6. [6]

    Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan

    Lasse F. Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. 2020. Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models. InICML Workshop on Challenges in Deploying and Monitoring Machine Learning Systems

  7. [7]

    Pedram Bakhtiarifard, Christian Igel, and Raghavendra Selvan. 2024. EC-NAS: Energy Consumption Aware Tabular Benchmarks for Neural Architecture Search. InICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5660–5664. doi:10.1109/icassp48485.2024.10448303

  8. [8]

    Pedram Bakhtiarifard, Pınar Tözün, Christian Igel, and Raghavendra Selvan. 2025. Climate And Resource Awareness is Imperative to Achieving Sustainable AI (and Preventing a Global AI Arms Race).arXiv preprint arXiv:2502.20016(2025)

  9. [9]

    2023.Fairness and machine learning: Limitations and opportunities

    Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2023.Fairness and machine learning: Limitations and opportunities. MIT press

  10. [10]

    Martin Brandt, Compton J Tucker, Ankit Kariryaa, Kjeld Rasmussen, Christin Abel, Jennifer Small, Jerome Chave, Laura Vang Rasmussen, Pierre Hiernaux, Abdoul Aziz Diouf, et al. 2020. An unexpectedly large count of trees in the West African Sahara and Sahel.Nature587, 7832 (2020), 78–82

  11. [11]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

  12. [12]

    Thomas Bruckner, Igor Alexeyevich Bashmakov, Yacob Mulugetta, Helen Chum, Angel De la Vega Navarro, James Edmonds, Andre Faaij, Bundit Fungtammasan, Amit Garg, Edgar Hertwich, et al. 2014. Energy systems.Climate Change 2014: Mitigation of Climate Change. Contribution of Working Group III to the Fifth Assessment Report of the Intergovernmental Panel on Cli...

  13. [13]

    Miranda Bryant. 2025. Denmark to tackle deepfakes by giving people copyright to their own features. https://www.theguardian.com/technology/ 2025/jun/27/deepfakes-denmark-copyright-law-artificial-intelligence. The Guardian Article

  14. [14]

    2024.Feeding the machine: The hidden human labor powering AI

    Callum Cant, James Muldoon, and Mark Graham. 2024.Feeding the machine: The hidden human labor powering AI. Bloomsbury Publishing USA

  15. [15]

    Joel Castaño, Silverio Martínez-Fernández, Xavier Franch, and Justus Bogner. 2024. Analyzing the evolution and maintenance of ml models on hugging face. InProceedings of the 21st International Conference on Mining Software Repositories. 607–618

  16. [16]

    CEIC Data. 2024. Kenya Monthly Earnings. https://www.ceicdata.com/en/indicator/kenya/monthly-earnings Accessed: 2025-12-08

  17. [17]

    Srravya Chandhiramowuli, Alex S Taylor, Sara Heitlinger, and Ding Wang. 2024. Making data work count.Proceedings of the ACM on Human- Computer Interaction8, CSCW1 (2024), 1–26

  18. [18]

    Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. 2019. Certified adversarial robustness via randomized smoothing. Ininternational conference on machine learning. PMLR, 1310–1320

  19. [19]

    Data Center Watch. 2025. Data Center Watch Report. https://static1.squarespace.com/static/67819031da098341c45ac84a/t/ 6849bcfe640a951f79e00715/1749662975141/Data+Center+Watch+Report+.pdf Accessed: 2025-11-30

  20. [20]

    Francisco J Doblas-Reyes, Jenni Kontkanen, Irina Sandu, Mario Acosta, Mohammed Hussam Al Turjmam, Ivan Alsina-Ferrer, Miguel Andrés- Martínez, Leo Arriola, Marvin Axness, Marc Batlle Martín, et al. 2025. The Destination Earth digital twin for climate change adaptation.EGUsphere 2025 (2025), 1–41

  21. [21]

    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. InTheory of cryptography conference. Springer, 265–284

  22. [22]

    Environmental and Energy Study Institute. 2024. Data Centers and Water Consumption. https://www.eesi.org/articles/view/data-centers-and- water-consumption Accessed: 2025-11-13

  23. [23]

    Frontier Data Centers

    Epoch AI. 2025. “Frontier Data Centers”. https://epoch.ai/data/data-centers Accessed: 2026-01-10

  24. [24]

    European Commission. 2023. European Virtual Human Twins (VHT) Initiative. https://digital-strategy.ec.europa.eu/en/policies/virtual-human- twins. EU Project

  25. [25]

    Sophia Falk, David Ekchajzer, Thibault Pirson, Etienne Lees-Perasso, Augustin Wattiez, Lisa Biber-Freudenberger, Sasha Luccioni, and Aimee van Wynsberghe. 2025. More than Carbon: Cradle-to-Grave environmental impacts of GenAI training on the Nvidia A100 GPU.arXiv preprint arXiv:2509.00093(2025)

  26. [26]

    Federation of Kenya Employers. [n. d.].Youth Employment. https://www.fke-kenya.org/policy-issues/youth-employment

  27. [27]

    Charlotte Freitag, Mike Berners-Lee, Kelly Widdicks, Bran Knowles, Gordon S Blair, and Adrian Friday. 2021. The real climate and transformative impact of ICT: A critique of estimates, trends, and regulations.Patterns(2021)

  28. [28]

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. 2021. Datasheets for datasets.Commun. ACM64, 12 (2021), 86–92. 16 Wilson et al

  29. [29]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  30. [30]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18995–19012

  31. [31]

    Bargagli-Stoffi

    Gianluca Guidi, Francesca Dominici, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, and Falco J. Bargagli-Stoffi. 2024. Environmental Burden of United States Data Centers in the Artificial Intelligence Era. arXiv:2411.09786 [cs.CY] https://arxiv.org/abs/2411.09786

  32. [32]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al . 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature645, 8081 (2025), 633–638

  33. [33]

    Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards the systematic reporting of the energy and carbon footprints of machine learning.Journal of Machine Learning Research(2020)

  34. [34]

    2016.Social machines: the coming collision of artificial intelligence, social networking, and humanity

    James Hendler and Alice M Mulvehill. 2016.Social machines: the coming collision of artificial intelligence, social networking, and humanity. Apress

  35. [35]

    Highland Economics Council. 2025. Measuring the Environmental Cost of Artificial Intelligence and Their Data Centers. https://www.hecweb.org/ 2025/06/28/measuring-the-environmental-cost-of-artificial-intelligence-and-their-data-centers/ Accessed: 2025-11-13

  36. [36]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems. 30016–30030

  37. [37]

    Lars Hornuf and Daniel Vrankar. 2022. Hourly wages in crowdworking: A meta-analysis.Business & Information Systems Engineering64, 5 (2022), 553–573

  38. [38]

    International Energy Agency. 2024. Energy Demand from AI. https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai. Accessed: 2025-11-28

  39. [39]

    International Energy Agency. 2024. World Energy Outlook 2024. https://iea.blob.core.windows.net/assets/140a0470-5b90-4922-a0e9-838b3ac6918c/ WorldEnergyOutlook2024.pdf Accessed: 2025-11-30

  40. [40]

    International Energy Agency. 2025. Electricity 2025. https://www.iea.org/reports/electricity-2025. Accessed: 2025-12-11

  41. [41]

    International Energy Agency. 2025. Energy and AI. https://www.iea.org/reports/energy-and-ai Licence: CC BY 4.0, Accessed: 2025-11-13

  42. [42]

    International Energy Agency. 2025. Global Energy Review. https://www.iea.org/reports/global-energy-review-2025. Accessed: 2025-06-10

  43. [43]

    International Telecommunication Union. 2024. Measuring Digital Development: Facts and Figures 2024. https://digitallibrary.un.org/record/4074377

  44. [44]

    Wang, Feiyang Kang, and Dawn Song

    Ruoxi Jia, Luis Oala, Wenjie Xiong, Suqin Ge, Jiachen T. Wang, Feiyang Kang, and Dawn Song. 2025. A Sustainable AI Economy Needs Data Deals That Work for Generators. InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track. https://openreview.net/forum?id=mdKzkjY1dM

  45. [45]

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold.nature596, 7873 (2021), 583–589

  46. [46]

    Lynn H Kaack, Priya L Donti, Emma Strubell, George Kamiya, Felix Creutzig, and David Rolnick. 2022. Aligning artificial intelligence with climate change mitigation.Nature Climate Change12, 6 (2022), 518–527

  47. [47]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models.arXiv preprint arXiv:2001.08361(2020)

  48. [48]

    Angelos Katharopoulos and François Fleuret. 2018. Not all samples are created equal: Deep learning with importance sampling. InInternational conference on machine learning. PMLR, 2525–2534

  49. [49]

    Simon Kemp. 2025. Digital 2025 Facebook Q1 Report. https://datareportal.com/essential-facebook-stats. Accessed on 2026-01-13

  50. [50]

    Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. 2021. GRAD-MATCH: Gradient matching based data subset selection for efficient deep model training. InInternational Conference on Machine Learning. PMLR, 5464–5474

  51. [51]

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. 2023. Learning skillful medium-range global weather forecasting.Science382, 6677 (2023), 1416–1421

  52. [52]

    Andreas D Lauritzen, Alejandro Rodríguez-Ruiz, My Catarina von Euler-Chelpin, Elsebeth Lynge, Ilse Vejborg, Mads Nielsen, Nico Karssemeijer, and Martin Lillholm. 2022. An artificial intelligence–based mammography screening protocol for breast cancer: outcome and radiologist workload. Radiology304, 1 (2022), 41–49

  53. [53]

    Clément Le Ludec, Maxime Cornet, and Antonio A Casilli. 2023. The problem with annotation. Human labour and outsourcing between France and Madagascar.Big Data & Society10, 2 (2023), 20539517231188723

  54. [54]

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning.nature521, 7553 (2015), 436–444

  55. [55]

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. 2024. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems 37 (2024), 14200–14282

  56. [56]

    Pengfei Li, Jianyi Yang, Mohammad A Islam, and Shaolei Ren. 2025. Making AI less’ thirsty’: Uncovering and addressing the secret water footprint of AI models.Commun. ACM68, 7 (2025), 54–61

  57. [57]

    Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2024. A large-scale audit of dataset licensing and attribution in AI.Nature Machine Intelligence6, 8 (2024), 975–987. How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI 17

  58. [58]

    Alexandra Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2024. Power hungry processing: Watts driving the cost of AI deployment?. In Proceedings of the 2024 ACM conference on fairness, accountability, and transparency. 85–99

  59. [59]

    Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2023. Estimating the carbon footprint of bloom, a 176b parameter language model.Journal of machine learning research24, 253 (2023), 1–15

  60. [60]

    Jens Malmodin, Dag Lundén, Åsa Moberg, Greger Andersson, and Mikael Nilsson. 2014. Life cycle assessment of ICT: carbon footprint and operational electricity use from the operator, national, and subscriber perspective in Sweden.Journal of Industrial Ecology18, 6 (2014), 829–845

  61. [61]

    M Mardani, N Brenowitz, Y Cohen, J Pathak, CY Chen, CC Liu, A Vahdat, K Kashinath, J Kautz, and M Pritchard. 2023. Residual Diffusion Modeling for Km-scale Atmospheric Downscaling. arXiv 2023.arXiv preprint arXiv:2309.15214(2023)

  62. [62]

    Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, and Jack Clark. 2025. The 2025 AI Index Report. Online Resource. https://hai.stanford.edu/ai-index/2025-ai-index-report

  63. [63]

    Ulises A Mejias and Nick Couldry. 2024. Data grab: The new colonialism of big tech and how to fight back. In Data Grab. University of Chicago Press

  64. [64]

    Gabriel Mersy and Sanjay Krishnan. 2024. Toward a life cycle assessment for the carbon footprint of data. ACM SIGENERGY Energy Informatics Review 4, 5 (2024), 25–33

  65. [65]

    Shakir Mohamed, Marie-Therese Png, and William Isaac. 2020. Decolonial AI: Decolonial theory as sociotechnical foresight in artificial intelligence. Philosophy & Technology 33, 4 (2020), 659–684

  66. [66]

    Matteo Pasquinelli. 2023. The eye of the master: A social history of artificial intelligence. Verso Books

  67. [67]

    Paul Mozur, Adam Satariano, and Emiliano Rodríguez Mega. 2025. From Mexico to Ireland, Fury Mounts Over a Global A.I. Frenzy. https://www.nytimes.com/2025/10/20/technology/ai-data-center-backlash-mexico-ireland.html. New York Times Article

  68. [68]

    Ben Purvis, Yong Mao, and Darren Robinson. 2019. Three pillars of sustainability: in search of conceptual origins. Sustainability science 14, 3 (2019), 681–695

  69. [69]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Online Technical Report

  70. [70]

    Hannah Ritchie, Pablo Rosado, and Max Roser. 2023. Data Page: Carbon intensity of electricity generation. https://archive.ourworldindata.org/20251014-145858/grapher/carbon-intensity-electricity.html. Data adapted from Ember and the Energy Institute. Part of Ritchie, Rosado and Roser (2023) Energy. Archived on 14 October 2025

  71. [71]

    Hannah Ritchie, Pablo Rosado, and Max Roser. 2023. Data Page: CO₂ emissions per capita. https://archive.ourworldindata.org/20251113-170236/grapher/co-emissions-per-capita.html. Part of the publication: CO₂ and Greenhouse Gas Emissions. Data adapted from the Global Carbon Project and other sources. Online resource archived on 13 November 2025

  72. [72]

    Andrea Rosales and Sara Suárez-Gonzalo. 2024. Chapter 3 Peter’s Problem. An Analysis of the Imaginaries about Automated Futures Portrayed in QualityLand. De Gruyter, Berlin, Boston, 37–54. doi:10.1515/9783110792256-003

  73. [73]

    Jürgen Rudolph. 2025. The hidden labour in AI: Big Tech’s dirty secret and the need for critical AI literacy in higher education. Handbook of AI and higher education. Edward Elgar (2025)

  74. [74]

    Germans Savcisens, Tina Eliassi-Rad, Lars Kai Hansen, Laust Hvas Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, and Sune Lehmann. 2024. Using sequences of life-events to predict human lives. Nature Computational Science 4, 1 (2024), 43–56

  75. [75]

    Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural networks 61 (2015), 85–117

  76. [76]

    Heike Schweitzer, Jacques Crémer, and Yves-Alexandre de Montjoye. 2019. Competition Policy for the Digital Era. https://www.rewi.hu-berlin.de/de/lf/oe/rdt/pub/working-paper-no-6

  77. [77]

    Raghavendra Selvan. 2025. Sustainable AI. O’Reilly

  78. [78]

    Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. 2022. Compute trends across three eras of machine learning. In 2022 international joint conference on neural networks (IJCNN). IEEE, 1–8

  79. [79]

    Wilington Shitawa. 2024. Click Captives: The Unseen Struggle of Data Workers. Data Workers’ Inquiry. https://data-workers.org/wilington/

  80. [80]

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature 550, 7676 (2017), 354–359
