The Costs of Pretending That There Are Data-Generating Probability Distributions in the Social World
Pith reviewed 2026-05-23 22:32 UTC · model grok-4.3
The pith
True data-generating probability distributions do not exist for social data, and assuming they do harms machine learning practice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Machine learning routinely assumes data are sampled from a true data-generating probability distribution that also governs future observations. The paper maintains that no such distributions exist in social domains and that the assumption is harmful. Frameworks that refer directly to relevant populations instead of distributions are available and leave classical learning theory essentially intact. The distribution language can obscure modeling choices and pursued goals. The paper therefore recommends avoiding the assumption of data-generating probability distributions when working with social data.
What carries the argument
The assumption of a data-generating probability distribution, whose replacement by direct reference to relevant populations preserves existing learning theory while clarifying modeling decisions.
If this is right
- Machine learning models for social phenomena can be stated and justified without invoking nonexistent distributions.
- Fairness and equity arguments become more transparent once modeling choices are no longer hidden behind distribution talk.
- Generalization claims must be anchored to concrete populations rather than hypothetical sampling mechanisms.
- Existing technical results in learning theory remain applicable under the population-based framing.
Where Pith is reading between the lines
- The argument invites empirical comparisons of models trained under each framing on the same social datasets to check for practical differences.
- It suggests that dataset construction and scope definition deserve more explicit attention than is typical when distributions are assumed to exist.
- The view may prompt re-examination of how fairness metrics are defined when the underlying population is treated as finite and observable rather than drawn from an abstract distribution.
Load-bearing premise
That alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged.
What would settle it
A worked social-data example in which switching from distribution language to explicit population language forces material changes to the statements or guarantees of classical learning theory.
Original abstract
Machine Learning research, including work promoting fair or equitable algorithms, often relies on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which are also drawn from it. We argue, however, that such true probability distributions do not exist and that the rhetoric around them is harmful in social settings. We show that alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged. Furthermore, we argue that the assumption of true probabilities or data-generating distributions can be misleading and obscure both the choices made and the goals pursued in machine learning practice. Based on these considerations, we suggest avoiding the assumption of data-generating probability distributions in the social world.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that true data-generating probability distributions do not exist in the social world, that rhetoric around them is harmful, and that population-focused alternatives are available which leave classical learning theory almost unchanged; it concludes by recommending that ML practice avoid assuming such distributions.
Significance. If the argument holds, the paper would prompt re-examination of foundational probabilistic assumptions in ML applications to social domains such as fairness research. The work is conceptual rather than empirical or formal and offers no machine-checked proofs or reproducible code.
major comments (1)
- [Abstract] The assertion that population-focused alternatives 'leave classical learning theory almost unchanged' is load-bearing for the positive proposal, yet the manuscript provides no concrete reformulation of any standard result (e.g., PAC learnability, VC-dimension bounds, or uniform convergence) as a purely population-based counting argument that yields comparable non-vacuous guarantees.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on the manuscript. The major comment raises an important point about the need for concrete examples to support our claim regarding classical learning theory. We respond to this below.
- Referee: [Abstract] The assertion that population-focused alternatives 'leave classical learning theory almost unchanged' is load-bearing for the positive proposal, yet the manuscript provides no concrete reformulation of any standard result (e.g., PAC learnability, VC-dimension bounds, or uniform convergence) as a purely population-based counting argument that yields comparable non-vacuous guarantees.
Authors: The referee correctly notes that the manuscript does not provide explicit reformulations of results such as PAC learnability or uniform convergence bounds in purely population-based terms. Our intention was to argue at a conceptual level that the mathematical structure of these results does not depend on the existence of a data-generating distribution and can be reinterpreted in terms of finite populations. However, we agree that the claim would be strengthened by a concrete illustration. In the revised manuscript, we will include a brief example demonstrating how a basic uniform convergence argument can be expressed using proportions in a finite population, yielding similar guarantees.
Revision: yes
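To make concrete the kind of illustration the rebuttal promises, here is a minimal sketch. It is not the authors' example. It assumes a synthetic finite population of (score, label) records, a small class of threshold rules, and a sample drawn without replacement; every name in it (`make_population`, `error_rate`, `hoeffding_radius`, `max_deviation`, `THRESHOLDS`) is hypothetical. The only probability in the sketch comes from the sampling step the analyst performs. The quantity being estimated is a plain proportion in the population. The radius is the usual Hoeffding-plus-union-bound expression, which is known to remain valid when sampling without replacement from a finite population.

```python
import math
import random


def make_population(size, seed=0):
    """Build a hypothetical finite population of (score, label) records."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        score = rng.random()
        # The label is noisily related to the score; purely illustrative.
        label = 1 if score + rng.gauss(0, 0.2) > 0.5 else 0
        population.append((score, label))
    return population


def error_rate(records, threshold):
    """Proportion of records misclassified by 'predict 1 if score > threshold'."""
    mistakes = sum(
        1 for score, label in records if (score > threshold) != (label == 1)
    )
    return mistakes / len(records)


def hoeffding_radius(n, num_rules, delta):
    """Deviation radius from Hoeffding's inequality plus a union bound
    over a finite class of num_rules rules, at confidence 1 - delta."""
    return math.sqrt(math.log(2 * num_rules / delta) / (2 * n))


def max_deviation(population, n, thresholds, seed=1):
    """Largest gap, over all rules, between the error proportion in a sample
    of size n drawn without replacement and the error proportion in the
    whole population."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement
    return max(
        abs(error_rate(sample, t) - error_rate(population, t))
        for t in thresholds
    )


# A small finite rule class: thresholds 0.00, 0.05, ..., 1.00.
THRESHOLDS = [i / 20 for i in range(21)]

if __name__ == "__main__":
    population = make_population(10_000)
    n, delta = 500, 0.05
    print("observed max deviation:", max_deviation(population, n, THRESHOLDS))
    print("Hoeffding radius:", hoeffding_radius(n, len(THRESHOLDS), delta))
```

One consequence of the population framing shows up directly in the sketch: when the sample is the entire population, the deviation is exactly zero, because the target is a proportion in a concrete set of records and not a parameter of a hypothetical distribution. Whether richer results such as VC-dimension bounds carry over as cleanly is the question the referee raises, and this sketch does not settle it.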
Circularity Check
No circularity: conceptual critique without self-referential derivations
The paper advances a philosophical argument against assuming data-generating distributions in social ML settings and claims that population-focused alternatives preserve classical learning theory. No equations, fitted parameters, or derivation chains appear in the provided text. The central claim that alternatives 'leave classical learning theory almost unchanged' is presented as an assertion rather than derived from any self-citation chain or input-by-construction reduction. Self-citations, if present, are not load-bearing for the core thesis. This is a standard non-finding for a non-mathematical critique paper.