The Pragmatic Frames of Spurious Correlations in Machine Learning: Interpreting How and Why They Matter

Samuel J. Bell; Skyler Wang

arXiv: 2411.04696 · v5 · submitted 2024-11-07 · cs.LG · cs.AI

Pith reviewed 2026-05-23 17:11 UTC · model grok-4.3

keywords: spurious correlations, machine learning, pragmatic frames, correlation desirability, model robustness, AI ethics, generalizability, fairness

The pith

Researchers judge spurious correlations in machine learning by their practical effects rather than statistical definitions alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine learning models depend on correlations discovered in data, yet many of these turn out to be unintended and problematic. The paper shows that researchers move beyond formal statistical notions of non-causality when deciding what counts as spurious. Instead, they apply situated judgments that weigh how a correlation influences model behavior, task success, and wider goals. A survey of the literature surfaces four recurring frames that shape these judgments. This matters because it demonstrates that decisions about which correlations to trust are shaped by technical, epistemic, and ethical considerations rather than fixed rules.

Core claim

Rather than relying solely on formal definitions, researchers assess spuriousness through pragmatic frames: judgments based on what a correlation does in practice, namely how it affects model behavior, supports or impedes task performance, or aligns with broader normative goals. Drawing on a broad survey of ML literature, four such frames are identified: relevance (models should use correlations relevant to the task), generalizability (models should use correlations that generalize to unseen data), human-likeness (models should use correlations that a human would use), and harmfulness (models should use correlations that are not socially or ethically harmful). These representations reveal that a key ...

What carries the argument

Pragmatic frames: judgments of a correlation's desirability based on its practical effects in model behavior, task performance, and alignment with normative goals.
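To make one of these practical effects concrete, the sketch below illustrates the generalizability frame on synthetic data. It is not taken from the paper: the feature names (`core`, `shortcut`), the agreement rates, and the helper functions are all assumptions chosen for illustration. A "shortcut" feature agrees with the label 95% of the time in training but only 50% of the time in a shifted test set, so a model that leans on it looks accurate in-distribution and degrades under shift.

```python
import numpy as np

def make_split(n, shortcut_agreement, rng):
    """Binary labels with a stable 'core' feature and a 'shortcut' feature.

    shortcut_agreement is the probability that the shortcut equals the label.
    All names and numbers here are illustrative assumptions.
    """
    y = rng.integers(0, 2, size=n)
    # Core feature: noisy, but related to the label in the same way in every split.
    core = y + rng.normal(0.0, 1.0, size=n)
    # Shortcut feature: matches the label only with the given probability.
    agree = rng.random(n) < shortcut_agreement
    shortcut = np.where(agree, y, 1 - y).astype(float)
    return np.column_stack([core, shortcut]), y

def fit_logistic(X, y, steps=2000, lr=0.5):
    """Unregularized logistic regression trained by plain gradient descent."""
    Xb = np.column_stack([X, np.ones(len(X))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((Xb @ w > 0) == (y == 1)))

rng = np.random.default_rng(0)
X_train, y_train = make_split(5000, 0.95, rng)  # shortcut is highly predictive
X_iid, y_iid = make_split(5000, 0.95, rng)      # same distribution as training
X_shift, y_shift = make_split(5000, 0.50, rng)  # shortcut is uninformative

w = fit_logistic(X_train, y_train)
acc_iid = accuracy(w, X_iid, y_iid)
acc_shift = accuracy(w, X_shift, y_shift)
print(f"in-distribution accuracy: {acc_iid:.3f}")
print(f"shifted accuracy:         {acc_shift:.3f}")
```

Under the generalizability frame, the shortcut would be judged spurious because it fails on unseen data, independently of whether it is causal in the statistical sense. Whether the other three frames would agree depends on what the shortcut represents, which the synthetic setup leaves open.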

If this is right

Correlation desirability is treated as a context-dependent judgment rather than a fixed statistical property.
Assessments of spuriousness incorporate ethical and normative considerations alongside performance metrics.
Different research contexts may prioritize different frames when evaluating the same correlation.
Operationalizing spuriousness requires explicit negotiation among technical, epistemic, and ethical factors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Subfields such as computer vision and natural language processing may invoke the frames at different rates or with different weightings.
Simultaneous optimization across multiple frames could create trade-offs that current training procedures do not explicitly address.
The frames could be used as an explicit design checklist when auditing deployed models for unintended correlations.
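As a rough sketch of the checklist idea, the four frames can be written down as a small data structure and queried in a context-dependent order. Everything here is hypothetical: the `Frame`, `FrameJudgment`, and `first_objection` names, the example correlation, and the judgments attached to it are illustrative assumptions, not material from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Frame(Enum):
    """The four pragmatic frames named in the paper's abstract."""
    RELEVANCE = "relevance"
    GENERALIZABILITY = "generalizability"
    HUMAN_LIKENESS = "human-likeness"
    HARMFULNESS = "harmfulness"

@dataclass
class FrameJudgment:
    """One auditor's judgment of a correlation under a single frame."""
    frame: Frame
    acceptable: bool
    note: str = ""

def first_objection(judgments: list[FrameJudgment],
                    priority: list[Frame]) -> Optional[Frame]:
    """Return the highest-priority frame under which the correlation is rejected.

    Frames absent from `priority` are ignored, which models a research
    context that does not consider them.
    """
    by_frame = {j.frame: j for j in judgments}
    for frame in priority:
        judgment = by_frame.get(frame)
        if judgment is not None and not judgment.acceptable:
            return frame
    return None

# Illustrative judgments for a hypothetical correlation:
# "background texture predicts object class".
judgments = [
    FrameJudgment(Frame.RELEVANCE, False, "background is not the object"),
    FrameJudgment(Frame.GENERALIZABILITY, False, "fails on new backgrounds"),
    FrameJudgment(Frame.HUMAN_LIKENESS, False, "people look at the object"),
    FrameJudgment(Frame.HARMFULNESS, True, "no obvious social harm"),
]

# Two hypothetical contexts that weight the frames differently.
robustness_priority = [Frame.GENERALIZABILITY, Frame.RELEVANCE]
fairness_priority = [Frame.HARMFULNESS]

print(first_objection(judgments, robustness_priority))
print(first_objection(judgments, fairness_priority))
```

The same correlation is flagged in the first context and passes in the second, which is one way to picture how different research contexts can reach different verdicts from the same set of judgments.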

Load-bearing premise

A broad survey of ML literature is sufficient to identify and comprehensively categorize the dominant frames used to judge spurious correlations.

What would settle it

A review of ML papers that identifies a consistent additional frame or a substantially different set of judgment criteria not covered by relevance, generalizability, human-likeness, or harmfulness.

Original abstract

Learning correlations from data forms the foundation of today's machine learning (ML) and artificial intelligence research. While contemporary methods enable the automatic discovery of complex patterns, they are prone to failure when unintended correlations are captured. This vulnerability has spurred a growing interest in interrogating spuriousness, which is often seen as a threat to model performance, fairness, and robustness. In this article, we trace departures from the conventional statistical definition of spuriousness, which denotes a non-causal relationship arising from coincidence or confounding, to examine how its meaning is negotiated in ML research. Rather than relying solely on formal definitions, researchers assess spuriousness through what we call pragmatic frames: judgments based on what a correlation does in practice, namely how it affects model behavior, supports or impedes task performance, or aligns with broader normative goals. Drawing on a broad survey of ML literature, we identify four such frames: relevance (models should use correlations that are relevant to the task), generalizability (models should use correlations that generalize to unseen data), human-likeness (models should use correlations that a human would use to perform the same task), and harmfulness (models should use correlations that are not socially or ethically harmful). These representations reveal that correlation desirability is not a fixed statistical property but a situated judgment informed by technical, epistemic, and ethical considerations. By examining how a foundational ML conundrum is problematized in research literature, we contribute to broader conversations on the contingent practices through which technical concepts like spuriousness are defined and operationalized.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper names four pragmatic frames for judging spurious correlations but the literature survey has no visible method, so the taxonomy's coverage is hard to assess.

The main point is that this paper claims ML researchers decide whether a correlation counts as spurious by using four practical frames—relevance to the task, generalizability to new data, human-likeness, and harmfulness—rather than sticking to the statistical definition of non-causality. That reframing is the actual new piece: it turns a familiar complaint into a short list of situated judgments that researchers already apply in robustness and fairness work. The abstract shows they drew this from existing papers, and the organization into named frames gives a compact way to talk about why some correlations get kept and others dropped. That part is useful for anyone who has to explain or defend a modeling choice in those areas. The weakness is exactly what the stress-test note flags. No search strategy, inclusion rules, paper count, or coding steps are described, so there is no way to check whether these four frames are the main ones or just the ones the authors noticed. Without that, the claim that they comprehensively capture how the field negotiates spuriousness stays interpretive rather than documented. The paper stays conceptual; it does not test the frames against new data or run any experiments. Readers who already work on ML robustness or fairness will find the vocabulary handy for discussion. It is worth sending to referees because the core observation is clear and the gap in method is fixable with a methods section, but it is not yet strong enough to stand on its own as a finished contribution.

Referee Report

1 major / 0 minor

Summary. The paper claims that ML researchers assess spurious correlations not via the conventional statistical definition (non-causal relationships from coincidence or confounding) but through four pragmatic frames—relevance (task-relevant correlations), generalizability (correlations that hold on unseen data), human-likeness (correlations humans would use), and harmfulness (non-harmful correlations)—identified via a broad survey of ML literature. These frames show that judgments of correlation desirability are situated, incorporating technical, epistemic, and ethical considerations rather than fixed statistical properties.

Significance. If supported, the work offers a useful taxonomy for how the ML community negotiates a core concept, contributing to discussions on the contingent operationalization of technical terms. It explicitly credits the survey for mapping observed practices to frames and could help researchers reflect on implicit assumptions in robustness, fairness, and generalization work.

major comments (1)

Survey description (abstract and §2): No details are provided on search strategy, inclusion criteria, number of papers examined, coding procedure, or validation against counterexamples. This directly undermines the central claim that the four frames 'comprehensively capture the situated judgments in the field,' as the taxonomy cannot be evaluated for completeness or selection bias without this information.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The primary concern is the absence of methodological details on the literature survey. We address this point directly below and commit to revisions that enhance transparency while preserving the paper's conceptual focus.

Referee: Survey description (abstract and §2): No details are provided on search strategy, inclusion criteria, number of papers examined, coding procedure, or validation against counterexamples. This directly undermines the central claim that the four frames 'comprehensively capture the situated judgments in the field,' as the taxonomy cannot be evaluated for completeness or selection bias without this information.

Authors: We agree that the current manuscript lacks sufficient detail on the survey process, which limits readers' ability to assess how the four frames were derived and whether the taxonomy is comprehensive. The abstract and §2 describe the frames as emerging from a broad survey of ML literature but provide no information on search strategy, inclusion criteria, scale, coding, or counterexample consideration. In the revised version we will expand §2 with a methods subsection specifying: the databases and keywords used (e.g., arXiv, conference proceedings, terms such as 'spurious correlation' and 'shortcut learning'); inclusion criteria focused on papers that explicitly discuss or operationalize spuriousness in ML; the approximate number of papers reviewed; the thematic coding procedure that led to the four frames; and any steps taken to test the frames against alternative interpretations. These additions will make the situated, non-exhaustive character of the analysis explicit and allow evaluation of selection bias, thereby strengthening rather than altering the central claim.

Revision: yes

Circularity Check

0 steps flagged

No circularity; taxonomy derived from external literature without self-referential reduction

The paper derives its four pragmatic frames (relevance, generalizability, human-likeness, harmfulness) via a broad survey of ML literature, as stated in the abstract: 'Drawing on a broad survey of ML literature, we identify four such frames...' This process relies on external sources rather than any equations, fitted parameters, self-citations, or definitional loops that would reduce the output to the inputs by construction. None of the six enumerated circularity patterns apply, as there are no mathematical derivations, uniqueness theorems, or ansatzes smuggled via self-citation. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution rests on the domain assumption that pragmatic frames, rather than formal statistical criteria, are the operative mechanism by which ML researchers decide correlation desirability; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Spuriousness judgments in ML are negotiated through pragmatic, context-dependent frames rather than fixed statistical definitions alone.
This premise is invoked in the abstract to motivate the survey and the identification of the four frames.
