
arxiv: 2602.00056 · v4 · pith:HJFXGWJTnew · submitted 2026-01-20 · 💻 cs.CY · cs.AI

How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords hyper-datafication · AI sustainability · data labor · Global South · environmental costs · frontier AI · dataset analysis · representational harms

The pith

Hyper-datafication in frontier AI redistributes environmental burdens, labor risks, and representational harms toward the Global South and precarious workers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that frontier AI has moved from using available data to actively generating new data tailored for model training, a process called hyper-datafication. Analysis of roughly 550,000 datasets reveals rapid growth in storage needs and associated energy use, while interviews highlight exploitative labor conditions. This shift does not just consume more resources overall but concentrates the downsides on specific groups and regions. Readers should care because these dynamics affect the equity and long-term viability of AI technologies that influence daily life worldwide. The authors offer practical recommendations to reduce these overlooked costs.

Core claim

The transition to hyper-datafication in AI does not just scale up resource use but systematically shifts environmental burdens, labour risks, and representational harms to the Global South, precarious data workers, and under-represented cultures, as evidenced by dataset analyses and qualitative data from Kenya. The authors propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs.

What carries the argument

Hyper-datafication, the active creation of data for building AI models instead of relying on existing data, which carries the redistribution of costs.

If this is right

  • Increased dataset growth drives higher storage energy consumption and carbon emissions.
  • Data labor exposes workers in the Global South to graphic content and precarious employment.
  • Under-represented languages and cultures face continued representational harms in AI outputs.
  • Disparities in data infrastructure amplify environmental impacts in certain regions.
  • Following the Data PROOFS framework could reduce these redistributed burdens.
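The first implication above, that dataset growth drives storage energy and carbon emissions, can be made concrete with a back-of-the-envelope calculation. The sketch below is not the paper's method: the function names, the per-terabyte power draw, the PUE (power usage effectiveness), the replica count, and the grid carbon intensity are all illustrative assumptions chosen to show the shape of the arithmetic.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def storage_energy_kwh(size_tb, watts_per_tb, pue=1.0, replicas=1,
                       hours=HOURS_PER_YEAR):
    """Energy in kWh to keep `size_tb` terabytes online for `hours`.

    All parameters are hypothetical inputs, not values from the paper:
    watts_per_tb is the device power draw per stored terabyte,
    pue scales device power up to facility power, and
    replicas counts how many copies of the data are kept.
    """
    if size_tb < 0 or watts_per_tb < 0 or pue < 1.0 or replicas < 1:
        raise ValueError("invalid storage parameters")
    watt_hours = size_tb * replicas * watts_per_tb * pue * hours
    return watt_hours / 1000.0  # Wh -> kWh


def carbon_kg(energy_kwh, grid_kg_per_kwh):
    """Carbon footprint in kg CO2e for a given grid carbon intensity."""
    if energy_kwh < 0 or grid_kg_per_kwh < 0:
        raise ValueError("invalid carbon parameters")
    return energy_kwh * grid_kg_per_kwh


if __name__ == "__main__":
    # Placeholder inputs: 1000 TB, 1 W/TB, PUE 1.5, two replicas.
    energy = storage_energy_kwh(1000, 1.0, pue=1.5, replicas=2)
    print(f"energy: {energy:.0f} kWh per year")
    print(f"carbon: {carbon_kg(energy, 0.4):.0f} kg CO2e per year")
```

Because the grid carbon intensity enters as a simple multiplier, the same stored volume yields different footprints in different regions, which is one way infrastructure disparities could show up in such an estimate.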

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If accurate, AI governance should mandate transparent data sourcing to prevent burden shifting across borders.
  • This pattern may apply to other emerging technologies reliant on massive data collection.
  • A testable extension would involve mapping data worker conditions across multiple countries.
  • Connections to digital colonialism suggest broader geopolitical implications for data control.

Load-bearing premise

The sample of Hugging Face Hub datasets and Kenyan data worker responses sufficiently captures global data practices and impacts for frontier AI.

What would settle it

A global study showing that data-related environmental burdens, labor risks, and harms are not disproportionately shifted to the Global South would falsify the central redistribution claim.

Figures

Figures reproduced from arXiv: 2602.00056 by Erik B. Dam, Janin Koch, Mophat Okinyi, Raghavendra Selvan, Sebastian Mair, Sophia N. Wilson.

Figure 1: Growth of datasets and data volume over time and download concentration on the Hugging Face Hub. Left: Monthly counts …
Figure 2: Left: Estimated provider-side storage energy (GWh). Right: Estimated user-side storage energy (TWh), assuming that 10 …
Figure 3: Left: Historical (2022–2024) and projected (2024–2034) electricity use for all data centres worldwide under two scenarios: a base …
Figure 4: Left: Distribution of respondents across salary bands by weekly working hours. Centre: Distribution of respondents across …
Figure 5: Gender-disaggregated distributions of weekly working hours, monthly salary, experience, data work types, and exposure to …
Figure 6: Representation and demand for the ten largest language groups on the Hugging Face Hub. Left: A depiction of each group’s …
Figure 7: Left: Historical (2015-2024) and projected (2024-2030) global annual investment in data centres in the base case reflecting …
Figure 8: Distribution of dataset modalities and task categories on the Hugging Face Hub. Left: The fifteen most common dataset …
Figure 9: Distribution of dataset sizes and downloads on the Hugging Face Hub by modality and task. Left: Violin plots of dataset sizes …
Figure 10: Mobile and fixed broadband traffic for the Asia-Pacific region, America, Europe, the Arab States, and the Commonwealth of …
Original abstract

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication drives substantial and growing environmental costs while systematically redistributing labour risks and representational harms toward the Global South. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that frontier AI is transitioning from using existing data to actively creating data for models, a shift termed 'hyper-datafication' that increases sustainability costs and systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. This is supported by analysis of ~550,000 Hugging Face Hub datasets on growth, storage energy consumption, carbon footprint, and language representation; qualitative responses from data workers in Kenya on labour conditions including exposure to graphic content; and external data on global data centre infrastructure disparities. The paper concludes by proposing Data PROOFS recommendations (provenance, resource awareness, ownership, openness, frugality, standards) to mitigate these costs.

Significance. If the redistribution claim is substantiated through added comparative baselines and provenance tracing, the work would contribute to AI sustainability literature by highlighting data-related burdens and labour issues beyond model training energy costs. The mixed-methods design combining large-scale dataset metrics with worker interviews is a positive feature that broadens the scope, though the current evidence base limits the strength of the global claims.

major comments (3)
  1. [Abstract] The central claim that hyper-datafication 'systematically redistributes' environmental burdens, labour risks, and representational harms toward the Global South is presented as a direct finding from the analyses, yet the HF Hub sample (curated and English-dominant) and Kenya-only interviews provide no region-stratified impact accounting or explicit provenance tracing linking dataset characteristics to localized geographic origins or energy metrics.
  2. [Dataset analysis section] No details are given on methods for the ~550k HF Hub analysis, including data exclusion criteria, error handling for storage energy and carbon calculations, or how language representation data quantitatively maps to 'under-represented cultures' or Global South burdens, leaving the quantitative support for redistribution interpretive rather than demonstrated.
  3. [Qualitative section] The Kenya data worker responses supply valuable local detail on employment and content exposure but include no comparative baseline from other regions or quantitative linkage to HF Hub dataset metrics, which undermines the systematic global redistribution conclusion.
minor comments (1)
  1. [Recommendations] The Data PROOFS recommendations are introduced in the abstract and conclusion but lack expanded definitions or concrete implementation examples that would strengthen their utility.
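Major comment 2 asks how language representation data could map quantitatively onto under-representation. One possible measure, offered here only as a sketch and not as anything the paper or the referee specifies, is a representation ratio: a language's share of datasets divided by its share of speakers. The function name, the language labels, and the toy counts below are all hypothetical.

```python
def representation_ratio(dataset_counts, speaker_counts):
    """Return {language: dataset share / speaker share}.

    A ratio above 1 means the language holds a larger share of datasets
    than of speakers; below 1 means it is under-represented by this
    (assumed, simplified) measure. Shares are computed only over the
    languages listed in `speaker_counts`.
    """
    langs = list(speaker_counts)
    total_datasets = sum(dataset_counts.get(lang, 0) for lang in langs)
    total_speakers = sum(speaker_counts[lang] for lang in langs)
    if total_datasets <= 0 or total_speakers <= 0:
        raise ValueError("need positive dataset and speaker totals")

    ratios = {}
    for lang in langs:
        dataset_share = dataset_counts.get(lang, 0) / total_datasets
        speaker_share = speaker_counts[lang] / total_speakers
        # A language with zero listed speakers has no defined ratio.
        ratios[lang] = (dataset_share / speaker_share
                        if speaker_share > 0 else float("nan"))
    return ratios


if __name__ == "__main__":
    # Toy numbers for illustration only; not data from the paper.
    datasets = {"lang_a": 80, "lang_b": 20}
    speakers = {"lang_a": 50, "lang_b": 50}
    for lang, ratio in representation_ratio(datasets, speakers).items():
        print(lang, round(ratio, 2))
```

A single ratio of this kind says nothing about dataset quality or provenance, so it would complement rather than replace the provenance tracing requested in major comment 1.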

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification and strengthening. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract] The central claim that hyper-datafication 'systematically redistributes' environmental burdens, labour risks, and representational harms toward the Global South is presented as a direct finding from the analyses, yet the HF Hub sample (curated and English-dominant) and Kenya-only interviews provide no region-stratified impact accounting or explicit provenance tracing linking dataset characteristics to localized geographic origins or energy metrics.

    Authors: We agree that the abstract presents the redistribution claim too directly. The HF Hub analysis reveals growth and representation patterns that disproportionately involve English-dominant and curated datasets, while the Kenya interviews illustrate labor conditions typical of Global South data work. However, we lack explicit region-stratified accounting or full provenance tracing. We will revise the abstract to state that the analyses indicate patterns consistent with a redistribution of burdens, supported by the combined quantitative and qualitative evidence, and will add explicit discussion of these limitations in the text. revision: yes

  2. Referee: [Dataset analysis section] No details are given on methods for the ~550k HF Hub analysis, including data exclusion criteria, error handling for storage energy and carbon calculations, or how language representation data quantitatively maps to 'under-represented cultures' or Global South burdens, leaving the quantitative support for redistribution interpretive rather than demonstrated.

    Authors: We acknowledge the omission of methodological details. In the revised manuscript, we will add a dedicated methods subsection describing: data collection via the Hugging Face Hub API, exclusion criteria (e.g., datasets with missing metadata or non-public status), the formulas and assumptions used for storage energy and carbon calculations with error handling and sensitivity analysis, and the quantitative mapping of language codes to cultural representation using external demographic sources. These additions will render the quantitative support more explicit. revision: yes

  3. Referee: [Qualitative section] The Kenya data worker responses supply valuable local detail on employment and content exposure but include no comparative baseline from other regions or quantitative linkage to HF Hub dataset metrics, which undermines the systematic global redistribution conclusion.

    Authors: The Kenya responses are presented as a case study of data labor conditions in a key Global South location. We accept that no comparative baselines or direct quantitative linkages are provided. We will revise the section to frame the findings as illustrative of documented trends in data work, add references to studies from other regions (e.g., India and the Philippines), and include a new limitations paragraph clarifying that the global redistribution claim is inferred from the combined evidence rather than directly measured. This will prevent overstatement while retaining the contribution of the qualitative data. revision: partial
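The methods additions promised in response 2 (exclusion criteria and a sensitivity analysis for the storage energy estimate) could look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' pipeline: the record fields `size_bytes` and `private`, the exclusion rule, and the parameter grids are all hypothetical.

```python
from itertools import product

HOURS_PER_YEAR = 24 * 365


def keep_record(rec):
    """Hypothetical exclusion rule: drop private datasets and any
    record whose size metadata is missing or non-positive."""
    size = rec.get("size_bytes")
    has_size = isinstance(size, (int, float)) and size > 0
    return has_size and not rec.get("private", False)


def total_size_tb(records):
    """Total size in terabytes (1 TB = 1e12 bytes) of retained records."""
    return sum(r["size_bytes"] for r in records if keep_record(r)) / 1e12


def energy_sensitivity(size_tb, watts_per_tb_values, pue_values,
                       hours=HOURS_PER_YEAR):
    """Annual storage energy (kWh) for every combination of the assumed
    per-terabyte power draw and PUE, keyed by (watts_per_tb, pue)."""
    return {
        (watts, pue): size_tb * watts * pue * hours / 1000.0
        for watts, pue in product(watts_per_tb_values, pue_values)
    }


if __name__ == "__main__":
    # Toy metadata records; sizes and flags are made up for illustration.
    records = [
        {"size_bytes": 2e12},
        {"size_bytes": None},                    # excluded: missing size
        {"size_bytes": 1e12, "private": True},   # excluded: non-public
        {"size_bytes": 3e12, "private": False},
    ]
    size = total_size_tb(records)
    sweep = energy_sensitivity(size, [0.5, 1.0], [1.2, 1.6])
    print(f"retained size: {size} TB")
    print(f"energy range: {min(sweep.values()):.2f} to "
          f"{max(sweep.values()):.2f} kWh per year")
```

Reporting the full range from such a sweep, rather than a single point estimate, is one way to make the dependence on assumed hardware and facility parameters visible to readers.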

standing simulated objections not resolved
  • Full region-stratified impact accounting and explicit provenance tracing across all ~550,000 HF Hub datasets, as this would require proprietary data access and new empirical collection beyond the current study's scope.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper performs an empirical analysis of ~550k external Hugging Face Hub datasets for growth, energy, and representation metrics, supplemented by Kenya interviews and external data-centre statistics. No internal equations, fitted parameters, or self-citations are present that reduce the redistribution conclusion to the inputs by construction. The central claim is an interpretive synthesis of independent data sources rather than a self-referential derivation. This matches the default expectation for non-circular empirical work; the provided skeptic concerns address evidence sufficiency, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the representativeness of the Hugging Face sample for frontier AI data and the generalizability of Kenyan worker experiences, plus interpretive framing of costs as systematic redistribution without explicit global benchmarks.

axioms (2)
  • [domain assumption] Hugging Face Hub datasets are representative of data used in frontier AI development
    The quantitative analysis is performed exclusively on this platform without stated justification for its coverage of all relevant data sources.
  • [domain assumption] Responses from data workers in Kenya capture key labour conditions and risks in AI data work globally
    Qualitative findings are drawn from this specific group to support broader claims about labour risks and exposure to graphic content.
invented entities (1)
  • hyper-datafication [no independent evidence]
    purpose: To name and frame the transition from using existing data to actively creating data for AI models
    New conceptual label introduced to organize the described shift and its consequences.

pith-pipeline@v0.9.0 · 5597 in / 1561 out tokens · 52118 ms · 2026-05-16T13:17:14.677358+00:00 · methodology

