The Costs of Pretending That There Are Data-Generating Probability Distributions in the Social World

Benedikt Höltgen; Robert C. Williamson

arxiv: 2407.17395 · v5 · submitted 2024-07-24 · 💻 cs.LG

Pith reviewed 2026-05-23 22:32 UTC · model grok-4.3

keywords: machine learning, probability distributions, social data, data-generating processes, learning theory, populations, fairness

The pith

True data-generating probability distributions do not exist for social data, and assuming they do harms machine learning practice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine learning research routinely presumes that observed data points are sampled from some underlying probability distribution that also governs future data. The paper contends that no such true distributions exist when the subject matter is social, and that continued use of this language is actively harmful. It proposes that one can instead refer directly to the relevant population of interest. This substitution preserves the technical content of classical learning theory while making the actual modeling choices more visible. A reader would care because many claims about generalization, prediction, and algorithmic fairness in social domains rest on the disputed language.

Core claim

Machine learning routinely assumes data are sampled from a true data-generating probability distribution that also governs future observations. The paper maintains that no such distributions exist in social domains and that the assumption is harmful. Frameworks that refer directly to relevant populations instead of distributions are available and leave classical learning theory essentially intact. The distribution language can obscure modeling choices and pursued goals. The paper therefore recommends avoiding the assumption of data-generating probability distributions when working with social data.

What carries the argument

The assumption of a data-generating probability distribution. The paper argues that replacing it with direct reference to relevant populations preserves existing learning theory while clarifying modeling decisions.

If this is right

Machine learning models for social phenomena can be stated and justified without invoking nonexistent distributions.
Fairness and equity arguments become more transparent once modeling choices are no longer hidden behind distribution talk.
Generalization claims must be anchored to concrete populations rather than hypothetical sampling mechanisms.
Existing technical results in learning theory remain applicable under the population-based framing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The argument invites empirical comparisons of models trained under each framing on the same social datasets to check for practical differences.
It suggests that dataset construction and scope definition deserve more explicit attention than is typical when distributions are assumed to exist.
The view may prompt re-examination of how fairness metrics are defined when the underlying population is treated as finite and observable rather than drawn from an abstract distribution.

Load-bearing premise

That alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged.

What would settle it

A worked social-data example in which switching from distribution language to explicit population language forces material changes to the statements or guarantees of classical learning theory.

Original abstract

Machine Learning research, including work promoting fair or equitable algorithms, often relies on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which are also drawn from it. We argue, however, that such true probability distributions do not exist and that the rhetoric around them is harmful in social settings. We show that alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged. Furthermore, we argue that the assumption of true probabilities or data-generating distributions can be misleading and obscure both the choices made and the goals pursued in machine learning practice. Based on these considerations, we suggest avoiding the assumption of data-generating probability distributions in the social world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper usefully flags how distribution talk can obscure choices in social ML but asserts without showing that population-based alternatives preserve classical learning theory.

The main takeaway is that assuming a true data-generating distribution in social applications is not neutral language; it can hide decisions about which groups count and what the goals are. The authors make this case in the context of fairness work and suggest shifting focus to concrete populations instead. That part of the argument is straightforward and worth saying plainly. It draws on the fact that social data often involves finite, specific groups rather than repeatable draws from some stable measure. The paper does a reasonable job connecting this to how ML rhetoric can mislead about generalization and responsibility.

What is less developed is the claim that population-centered frameworks are already available and leave classical learning theory almost unchanged. Results like PAC bounds or uniform convergence are defined in terms of expectations and high-probability statements over a measure. The paper does not sketch how even one such result would be restated as a direct counting argument over populations while keeping non-vacuous guarantees. Without that, the preservation claim stays at the level of assertion.

This is a position paper aimed at readers who work on ML for social domains or who write about its conceptual foundations. It is coherent internally and engages the literature on its own terms, so it deserves peer review to test whether the learning-theory part can be made more concrete or scoped more narrowly.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that true data-generating probability distributions do not exist in the social world, that rhetoric around them is harmful, and that population-focused alternatives are available which leave classical learning theory almost unchanged; it concludes by recommending that ML practice avoid assuming such distributions.

Significance. If the argument holds, the paper would prompt re-examination of foundational probabilistic assumptions in ML applications to social domains such as fairness research. The work is conceptual rather than empirical or formal and offers no machine-checked proofs or reproducible code.

major comments (1)

[Abstract] The assertion that population-focused alternatives 'leave classical learning theory almost unchanged' is load-bearing for the positive proposal, yet the manuscript provides no concrete reformulation of any standard result (e.g., PAC learnability, VC-dimension bounds, or uniform convergence) as a purely population-based counting argument that yields comparable non-vacuous guarantees.
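To make the request concrete, here is one way such a reformulation could look for a finite hypothesis class. This is an editorial sketch, not a result taken from the manuscript. The symbols $U$, $N$, $S$, $n$, $\mathcal{H}$, $\ell$, $R_U$ and $\hat{R}_S$ are notation introduced here for illustration. The only probability involved is that of the sampling procedure, which is assumed to select $n$ members of the population uniformly at random without replacement; whether real social data are collected that way is a separate question. The inequality used is Hoeffding's, which is known to hold for sampling without replacement from a finite population.

```latex
% Finite population and its risk (a plain average, no distribution needed)
U = \{z_1, \dots, z_N\}, \qquad
\ell(h, z) \in [0, 1], \qquad
R_U(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h, z_i).

% Sample S \subseteq U of size n, drawn uniformly without replacement
\hat{R}_S(h) = \frac{1}{n} \sum_{z \in S} \ell(h, z).

% Hoeffding's inequality for a single hypothesis h
\Pr_{S}\bigl( \lvert \hat{R}_S(h) - R_U(h) \rvert \ge \epsilon \bigr)
  \le 2 \exp\bigl(-2 n \epsilon^{2}\bigr).

% Union bound over a finite class \mathcal{H}: with probability at least
% 1 - \delta over the draw of S,
\max_{h \in \mathcal{H}} \lvert \hat{R}_S(h) - R_U(h) \rvert
  \le \sqrt{\frac{\ln\bigl(2 \lvert \mathcal{H} \rvert / \delta\bigr)}{2n}}.
```

The guarantee has the same form as the textbook i.i.d. bound, but the target quantity $R_U(h)$ is a proportion over a named population rather than an expectation under a posited distribution. Extending this to infinite classes via VC-type arguments would need more care and is not attempted here.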

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful comments on the manuscript. The major comment raises an important point about the need for concrete examples to support our claim regarding classical learning theory. We respond to this below.

Referee: [Abstract] The assertion that population-focused alternatives 'leave classical learning theory almost unchanged' is load-bearing for the positive proposal, yet the manuscript provides no concrete reformulation of any standard result (e.g., PAC learnability, VC-dimension bounds, or uniform convergence) as a purely population-based counting argument that yields comparable non-vacuous guarantees.

Authors: The referee correctly notes that the manuscript does not provide explicit reformulations of results such as PAC learnability or uniform convergence bounds in purely population-based terms. Our intention was to argue at a conceptual level that the mathematical structure of these results does not depend on the existence of a data-generating distribution and can be reinterpreted in terms of finite populations. However, we agree that this would be strengthened by a concrete illustration. In the revised manuscript, we will include a brief example demonstrating how a basic uniform convergence argument can be expressed using proportions in a finite population, yielding similar guarantees. This revision will be made to address the concern directly.

Revision: yes
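The kind of illustration promised in this response can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' example: the population, the threshold classifiers, and all function names are hypothetical. It builds a fixed finite population, draws samples uniformly without replacement, and counts how often the largest gap between sample and population error proportions exceeds sqrt(ln(2|H|/delta) / (2n)), the Hoeffding-plus-union bound, which is known to remain valid for sampling without replacement.

```python
import math
import random


def error_proportion(threshold, records):
    """Proportion of records misclassified by the rule 'predict True if x >= threshold'."""
    errors = sum((x >= threshold) != y for x, y in records)
    return errors / len(records)


def hoeffding_union_bound(n, num_hypotheses, delta):
    """Uniform deviation bound for a finite class, sample size n, confidence 1 - delta."""
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))


def violation_rate(population, thresholds, n, delta, trials, rng):
    """Fraction of repeated samples whose worst-case gap exceeds the bound."""
    bound = hoeffding_union_bound(n, len(thresholds), delta)
    population_errors = [error_proportion(t, population) for t in thresholds]
    violations = 0
    for _ in range(trials):
        # random.Random.sample draws without replacement from the population list.
        sample = rng.sample(population, n)
        worst_gap = max(
            abs(error_proportion(t, sample) - p)
            for t, p in zip(thresholds, population_errors)
        )
        if worst_gap > bound:
            violations += 1
    return violations / trials


rng = random.Random(0)

# Hypothetical population of 5000 (feature, label) records. Randomness is used
# only to fabricate the list; afterwards it is a fixed, enumerable population.
population = []
for _ in range(5000):
    x = rng.random()
    population.append((x, rng.random() < x))

thresholds = [i / 10 for i in range(11)]  # finite hypothesis class, |H| = 11
n, delta = 200, 0.05

bound = hoeffding_union_bound(n, len(thresholds), delta)
rate = violation_rate(population, thresholds, n, delta, trials=200, rng=rng)
print(f"bound = {bound:.4f}, observed violation rate = {rate:.3f}")
```

Every quantity in the sketch is a proportion over either the full list or a subset of it. The only stochastic element is the selection of the subset, which is a design choice that has to be justified for real social data rather than a property of the world.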

Circularity Check

0 steps flagged

No circularity: conceptual critique without self-referential derivations

The paper advances a philosophical argument against assuming data-generating distributions in social ML settings and claims that population-focused alternatives preserve classical learning theory. No equations, fitted parameters, or derivation chains appear in the provided text. The central claim that alternatives 'leave classical learning theory almost unchanged' is presented as an assertion rather than derived from any self-citation chain or input-by-construction reduction. Self-citations, if present, are not load-bearing for the core thesis. This is a standard non-finding for a non-mathematical critique paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5674 in / 948 out tokens · 13049 ms · 2026-05-23T22:32:46.866795+00:00 · methodology
