The Pragmatic Frames of Spurious Correlations in Machine Learning: Interpreting How and Why They Matter
Pith reviewed 2026-05-23 17:11 UTC · model grok-4.3
The pith
Researchers judge spurious correlations in machine learning by their practical effects rather than statistical definitions alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rather than relying solely on formal definitions, researchers assess spuriousness through pragmatic frames: judgments based on what a correlation does in practice, that is, how it affects model behavior, supports or impedes task performance, or aligns with broader normative goals. Drawing on a broad survey of ML literature, four such frames are identified: relevance (models should use correlations relevant to the task), generalizability (models should use correlations that generalize to unseen data), human-likeness (models should use correlations that a human would use), and harmfulness (models should use correlations that are not socially or ethically harmful). These representations reveal that a key ...
What carries the argument
Pragmatic frames: judgments of a correlation's desirability based on its practical effects in model behavior, task performance, and alignment with normative goals.
If this is right
- Correlation desirability is treated as a context-dependent judgment rather than a fixed statistical property.
- Assessments of spuriousness incorporate ethical and normative considerations alongside performance metrics.
- Different research contexts may prioritize different frames when evaluating the same correlation.
- Operationalizing spuriousness requires explicit negotiation among technical, epistemic, and ethical factors.
Where Pith is reading between the lines
- Subfields such as computer vision and natural language processing may invoke the frames at different rates or with different weightings.
- Simultaneous optimization across multiple frames could create trade-offs that current training procedures do not explicitly address.
- The frames could be used as an explicit design checklist when auditing deployed models for unintended correlations.
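The checklist idea in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Frame` and `CorrelationAudit`, and the example feature and notes, are hypothetical and do not come from the paper, which does not propose any tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Frame(Enum):
    """The four pragmatic frames named in the paper."""
    RELEVANCE = "relevance"
    GENERALIZABILITY = "generalizability"
    HUMAN_LIKENESS = "human-likeness"
    HARMFULNESS = "harmfulness"


@dataclass
class CorrelationAudit:
    """Hypothetical record of how one learned correlation fares under each frame."""
    feature: str
    target: str
    # Maps Frame -> (passes, note). A frame absent from the dict is unassessed.
    verdicts: dict = field(default_factory=dict)

    def record(self, frame, passes, note=""):
        self.verdicts[frame] = (passes, note)

    def flagged(self):
        # Frames that were assessed and judged undesirable.
        return [f for f in Frame if f in self.verdicts and not self.verdicts[f][0]]

    def unassessed(self):
        # Frames nobody has made an explicit judgment about yet.
        return [f for f in Frame if f not in self.verdicts]


# Illustrative usage with made-up judgments.
audit = CorrelationAudit(feature="image background", target="bird species")
audit.record(Frame.RELEVANCE, False, "background is not part of the bird")
audit.record(Frame.HARMFULNESS, True, "no obvious social harm")
print([f.value for f in audit.flagged()])
print([f.value for f in audit.unassessed()])
```

Keeping unassessed frames separate from flagged ones reflects the point that different contexts prioritize different frames: a missing judgment is not the same as a passing one.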
Load-bearing premise
A broad survey of ML literature is sufficient to identify and comprehensively categorize the dominant frames used to judge spurious correlations.
What would settle it
A review of ML papers that identifies a consistent additional frame or a substantially different set of judgment criteria not covered by relevance, generalizability, human-likeness, or harmfulness.
Original abstract
Learning correlations from data forms the foundation of today's machine learning (ML) and artificial intelligence research. While contemporary methods enable the automatic discovery of complex patterns, they are prone to failure when unintended correlations are captured. This vulnerability has spurred a growing interest in interrogating spuriousness, which is often seen as a threat to model performance, fairness, and robustness. In this article, we trace departures from the conventional statistical definition of spuriousness (which denotes a non-causal relationship arising from coincidence or confounding) to examine how its meaning is negotiated in ML research. Rather than relying solely on formal definitions, researchers assess spuriousness through what we call pragmatic frames: judgments based on what a correlation does in practice, that is, how it affects model behavior, supports or impedes task performance, or aligns with broader normative goals. Drawing on a broad survey of ML literature, we identify four such frames: relevance (models should use correlations that are relevant to the task), generalizability (models should use correlations that generalize to unseen data), human-likeness (models should use correlations that a human would use to perform the same task), and harmfulness (models should use correlations that are not socially or ethically harmful). These representations reveal that correlation desirability is not a fixed statistical property but a situated judgment informed by technical, epistemic, and ethical considerations. By examining how a foundational ML conundrum is problematized in research literature, we contribute to broader conversations on the contingent practices through which technical concepts like spuriousness are defined and operationalized.
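The generalizability frame described in the abstract can be made concrete with a toy simulation. The sketch below is not from the paper; the feature names, agreement rates (0.80, 0.95, 0.50), and sample sizes are all assumptions chosen for illustration. A "shortcut" feature agrees with the label more often than the "core" feature in the training environment, but its agreement drops to chance after a distribution shift.

```python
import random


def make_split(n, p_shortcut_agrees, seed):
    """Generate (core, shortcut, label) rows for one synthetic environment.

    The core feature matches the label 80% of the time in every environment;
    the shortcut feature matches it with probability p_shortcut_agrees.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        core = label if rng.random() < 0.80 else 1 - label
        shortcut = label if rng.random() < p_shortcut_agrees else 1 - label
        rows.append((core, shortcut, label))
    return rows


def accuracy(rows, feature_index):
    """Accuracy of the rule 'predict the label to equal this feature'."""
    hits = sum(1 for row in rows if row[feature_index] == row[2])
    return hits / len(rows)


train = make_split(10_000, p_shortcut_agrees=0.95, seed=0)
shifted = make_split(10_000, p_shortcut_agrees=0.50, seed=1)

results = {
    "core_train": accuracy(train, 0),
    "core_shifted": accuracy(shifted, 0),
    "shortcut_train": accuracy(train, 1),
    "shortcut_shifted": accuracy(shifted, 1),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In this setup the shortcut looks better than the core feature on training data alone, which is why a judgment of spuriousness under the generalizability frame depends on which unseen data one has in mind rather than on the training correlation itself.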
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ML researchers assess spurious correlations not via the conventional statistical definition (non-causal relationships from coincidence or confounding) but through four pragmatic frames—relevance (task-relevant correlations), generalizability (correlations that hold on unseen data), human-likeness (correlations humans would use), and harmfulness (non-harmful correlations)—identified via a broad survey of ML literature. These frames show that judgments of correlation desirability are situated, incorporating technical, epistemic, and ethical considerations rather than fixed statistical properties.
Significance. If supported, the work offers a useful taxonomy for how the ML community negotiates a core concept, contributing to discussions on the contingent operationalization of technical terms. It explicitly credits the survey for mapping observed practices to frames and could help researchers reflect on implicit assumptions in robustness, fairness, and generalization work.
major comments (1)
- Survey description (abstract and §2): No details are provided on search strategy, inclusion criteria, number of papers examined, coding procedure, or validation against counterexamples. This directly undermines the central claim that the four frames 'comprehensively capture the situated judgments in the field,' as the taxonomy cannot be evaluated for completeness or selection bias without this information.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. The primary concern is the absence of methodological details on the literature survey. We address this point directly below and commit to revisions that enhance transparency while preserving the paper's conceptual focus.
Point-by-point responses
- Referee: Survey description (abstract and §2): No details are provided on search strategy, inclusion criteria, number of papers examined, coding procedure, or validation against counterexamples. This directly undermines the central claim that the four frames 'comprehensively capture the situated judgments in the field,' as the taxonomy cannot be evaluated for completeness or selection bias without this information.
Authors: We agree that the current manuscript lacks sufficient detail on the survey process, which limits readers' ability to assess how the four frames were derived and whether the taxonomy is comprehensive. The abstract and §2 describe the frames as emerging from a broad survey of ML literature but provide no information on search strategy, inclusion criteria, scale, coding, or counterexample consideration. In the revised version we will expand §2 with a methods subsection specifying: the databases and keywords used (e.g., arXiv, conference proceedings, terms such as 'spurious correlation' and 'shortcut learning'); inclusion criteria focused on papers that explicitly discuss or operationalize spuriousness in ML; the approximate number of papers reviewed; the thematic coding procedure that led to the four frames; and any steps taken to test the frames against alternative interpretations. These additions will make the situated, non-exhaustive character of the analysis explicit and allow evaluation of selection bias, thereby strengthening rather than altering the central claim.
Revision: yes
Circularity Check
No circularity; taxonomy derived from external literature without self-referential reduction
Full rationale
The paper derives its four pragmatic frames (relevance, generalizability, human-likeness, harmfulness) via a broad survey of ML literature, as stated in the abstract: 'Drawing on a broad survey of ML literature, we identify four such frames...' This process relies on external sources rather than any equations, fitted parameters, self-citations, or definitional loops that would reduce the output to the inputs by construction. None of the six enumerated circularity patterns apply, as there are no mathematical derivations, uniqueness theorems, or ansatzes smuggled via self-citation. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Spuriousness judgments in ML are negotiated through pragmatic, context-dependent frames rather than fixed statistical definitions alone.