arxiv: 2511.21931 · v2 · submitted 2025-11-26 · 💻 cs.LG · cs.AI

Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment

Henry Salgado, Meagan R. Kendall, Martine Ceberio

Pith reviewed 2026-05-17 04:01 UTC · model grok-4.3

keywords model-data alignment · feature importance · potential outcomes · binary classification · model explanations · interpretability · data structure

The pith

A data-derived feature ranking from outcome separation provides a baseline to check if model explanations match the data's structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a straightforward way to test whether a machine learning model has learned from the real patterns present in its training data. It creates a ranking of features by estimating how strongly each one separates the two groups in a binary outcome using the potential outcomes framework. These rankings come straight from the data and serve as a reference point. Practitioners then compare them to the feature importance scores that come from the model's explanation tools. Agreement between the two indicates the model is aligned with the data; disagreement flags possible reliance on irrelevant or spurious signals.

Core claim

For binary classification, each feature's effect on separating the two outcome groups can be quantified directly from the data via the potential outcomes framework. The resulting ranking acts as a data-native baseline. Comparing it to rankings from standard model explanation methods yields an interpretable, model-agnostic test of whether the model reflects the data's underlying structure.
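The comparison step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the page does not say which rank-agreement statistic the paper uses, so Spearman's rank correlation is an assumption here, as are the function names `ranks_from_scores` and `spearman_rho` and the invented feature scores.

```python
def ranks_from_scores(scores):
    """Map feature -> rank, where rank 1 has the largest absolute score."""
    ordered = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two tie-free rankings of the same features."""
    n = len(rank_a)
    squared_diffs = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical scores, invented for illustration only.
data_scores = {"feat_a": 1.20, "feat_b": -0.60, "feat_c": 0.30, "feat_d": 0.01}
model_scores = {"feat_a": 0.40, "feat_b": 0.15, "feat_c": 0.10, "feat_d": 0.35}

data_rank = ranks_from_scores(data_scores)
model_rank = ranks_from_scores(model_scores)
rho = spearman_rho(data_rank, model_rank)

# Positive gap: the model ranks the feature higher than the data baseline does.
rank_gap = {name: data_rank[name] - model_rank[name] for name in data_rank}
most_promoted = max(rank_gap, key=rank_gap.get)

print(f"Spearman rho = {rho:.2f}")
print("Largest promotion by the model:", most_promoted)
```

In this made-up example the model promotes `feat_d`, which the data baseline ranks last. A per-feature gap of that kind is what a practitioner would inspect first when the two rankings disagree.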

What carries the argument

Data-derived feature ranking obtained by quantifying each feature's separation strength between outcome groups via the potential outcomes framework.

If this is right

Practitioners obtain a concrete, side-by-side comparison that reveals when a model explanation rests on features the data itself does not strongly support.
The test applies to any model because the data baseline does not depend on the model's internal structure.
The procedure is computationally light and can be run as a quick sanity check before deploying a model.
Mismatches can guide targeted data review or feature selection to improve alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation idea might be adapted to regression settings by replacing binary group separation with a measure of outcome variance explained.
Systematic misalignment across many models on the same dataset could serve as a signal of hidden data quality problems.
Pairing the ranking with causal discovery algorithms could strengthen the claim that the baseline truly reflects causal structure rather than correlation.
In regulated domains the method could supply a documented, auditable step that links model behavior back to observable data properties.
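The first extension above, moving to regression, could look roughly like the following. This is a speculative sketch of the editorial suggestion, not something the paper proposes: it assumes that "outcome variance explained" is measured per feature by the squared Pearson correlation, and the names `variance_explained` and `rank_for_regression` are hypothetical.

```python
from statistics import mean


def variance_explained(x, y):
    """Squared Pearson correlation between one feature and a continuous outcome.

    Raises ZeroDivisionError if either input is constant.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def rank_for_regression(columns, y):
    """Rank features by descending share of outcome variance explained."""
    scores = {name: variance_explained(values, y) for name, values in columns.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data, invented for illustration only.
outcome = [1.0, 2.1, 2.9, 4.2, 5.0]
features = {
    "weak": [3, 1, 4, 1, 5],
    "linear": [1, 2, 3, 4, 5],
}
ranking, scores = rank_for_regression(features, outcome)
print(ranking)
```

A univariate score like this ignores interactions and nonlinear effects, so it would be a rough baseline at best.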

Load-bearing premise

That quantifying each feature's separation of outcome groups via the potential outcomes framework produces a valid and sufficient representation of the data's underlying structure against which model explanations should be compared.

What would settle it

Build a controlled binary dataset in which the true separating power of each feature is known in advance, apply the method, and check whether its data ranking recovers the known order while model explanations deviate from it.
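The data-ranking half of that test can be sketched as below. Everything in it is an assumption made for illustration: Gaussian features with unit variance, known mean shifts between the two outcome groups, and a pooled-standard-deviation effect size as the separation measure. The model-explanation half would additionally require training a model and is not shown.

```python
import math
import random
from statistics import mean, variance


def effect_size(x1, x0):
    """Standardized mean difference between two groups, using a pooled SD."""
    n1, n0 = len(x1), len(x0)
    pooled = ((n1 - 1) * variance(x1) + (n0 - 1) * variance(x0)) / (n1 + n0 - 2)
    return (mean(x1) - mean(x0)) / math.sqrt(pooled)


def simulate(shifts, n_per_group=2000, seed=0):
    """Draw each feature for both groups and return its absolute effect size."""
    rng = random.Random(seed)
    estimated = {}
    for name, shift in shifts.items():
        x0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]    # outcome 0
        x1 = [rng.gauss(shift, 1.0) for _ in range(n_per_group)]  # outcome 1
        estimated[name] = abs(effect_size(x1, x0))
    return estimated


# Ground truth chosen in advance: larger shift means stronger separation.
true_shifts = {"noise": 0.0, "strong": 2.0, "medium": 1.0}
estimated = simulate(true_shifts)

recovered = sorted(estimated, key=estimated.get, reverse=True)
expected = sorted(true_shifts, key=lambda name: abs(true_shifts[name]), reverse=True)
print("recovered order:", recovered)
print("matches ground truth:", recovered == expected)
```

The shifts here are far apart relative to sampling noise, so this is an easy case. A more demanding version would use closely spaced shifts, correlated features, or non-Gaussian distributions.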

Figures

Figures reproduced from arXiv: 2511.21931 by Henry Salgado, Martine Ceberio, Meagan R. Kendall.

**Figure 2.** Rank comparison scatter plots for the Diabetes dataset. Both comparisons ...

Original abstract

In this work, we propose a simple and computationally efficient framework for evaluating whether machine learning models align with the structure of the data they learn from; that is, whether the model says what the data says. Unlike existing interpretability methods that focus exclusively on explaining model behavior, our approach establishes a baseline derived directly from the data itself. Drawing inspiration from Rubin's Potential Outcomes Framework, we quantify how strongly each feature separates the two outcome groups in a binary classification task, moving beyond traditional descriptive statistics to estimate each feature's effect on the outcome. By comparing these data-derived feature rankings with model-based explanations, we provide practitioners with an interpretable and model-agnostic method for assessing model-data alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a heuristic that builds a data baseline from potential outcomes feature rankings and compares it to model explanations for binary classification alignment, but the abstract alone gives no methods, equations, or results to evaluate it.

The main point is a proposed check for whether model explanations match the data's own signals in binary classification. The authors suggest using Rubin's potential outcomes framework to estimate how strongly each feature separates the outcome groups, then ranking those effects as an independent baseline and lining it up against whatever attributions the model produces. This keeps the baseline outside the model itself, which is a reasonable way to avoid the circularity that shows up in some other interpretability work. If the comparison turns out to be stable and informative, it could give practitioners a lightweight extra diagnostic without needing to retrain or add heavy computation.

That framing is clear and targets a practical need in applied settings where people want some reassurance that the model is not inventing importance the data does not support. The idea of moving past simple correlations or descriptive statistics to effect estimates is a small but concrete step beyond what many current alignment checks do.

On the other hand, the abstract supplies none of the details required to judge whether this actually works. There are no equations for the effect estimation, no discussion of how the authors handle confounding or selection bias in the potential outcomes step, and no experiments showing that the method flags real misalignments or improves on simpler baselines. Without those pieces it is impossible to tell whether the heuristic is robust or whether the comparison is even fair across different model types. The central assumption, that the potential outcomes ranking gives a sufficient and valid picture of what the data says, remains untested here.

This would mainly interest people already working on model interpretability and trust for deployed binary classifiers, such as in healthcare or risk modeling. A reader looking for new practical heuristics could find the direction useful once the full paper appears with implementation and validation.

I would send a completed version to peer review because the core proposal is not obviously broken and it addresses a real gap, even though it will need substantial added evidence and testing to become convincing.

Referee Report

2 major / 1 minor

Summary. The paper proposes a simple, model-agnostic heuristic for assessing whether machine learning models align with the underlying structure of their training data in binary classification tasks. It derives a data-only baseline by applying Rubin's Potential Outcomes Framework to quantify each feature's effect on separating the two outcome groups, produces feature rankings from this baseline, and compares them to model-based explanations.

Significance. If the framework can be implemented with valid causal estimators, proper bias correction, and empirical validation against existing alignment checks, it could supply practitioners with an interpretable, data-derived reference point that is independent of any particular model. The abstract, however, contains no equations, algorithms, experiments, or handling of estimation biases, so the practical significance cannot yet be determined.

major comments (2)

[Abstract] The central claim that the Potential Outcomes Framework produces a valid baseline for model-data alignment rests on estimating each feature's effect on the outcome, yet no estimator, identification assumptions (e.g., ignorability, positivity), or bias-correction procedure is stated; without these the comparison to model explanations cannot be evaluated for correctness.
[Abstract] The method is asserted to be 'computationally efficient' and to move 'beyond traditional descriptive statistics,' but no algorithm, complexity statement, or explicit contrast with simpler statistics (e.g., mutual information or standardized mean differences) is supplied, leaving the claimed advantages unsubstantiated.

minor comments (1)

[Abstract] The phrase 'model-based explanations' is used without indicating which post-hoc methods (SHAP, LIME, etc.) or intrinsic feature importances are intended, which affects how the alignment metric would be defined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We agree that the abstract could benefit from additional details to substantiate the claims. We address the two major comments point by point and indicate where revisions to the manuscript will be made.

Referee: [Abstract] The central claim that the Potential Outcomes Framework produces a valid baseline for model-data alignment rests on estimating each feature's effect on the outcome, yet no estimator, identification assumptions (e.g., ignorability, positivity), or bias-correction procedure is stated; without these the comparison to model explanations cannot be evaluated for correctness.

Authors: The referee is right that these details are absent from the abstract. Abstracts are constrained in length and typically omit such specifics. The body of the paper describes the estimator derived from the Potential Outcomes Framework for quantifying feature effects on the binary outcome, along with the relevant identification assumptions and any bias considerations. We will revise the abstract to include a short description of the estimator and assumptions to make the central claim more evaluable.

Revision: partial

Referee: [Abstract] The method is asserted to be 'computationally efficient' and to move 'beyond traditional descriptive statistics,' but no algorithm, complexity statement, or explicit contrast with simpler statistics (e.g., mutual information or standardized mean differences) is supplied, leaving the claimed advantages unsubstantiated.

Authors: We acknowledge that the abstract does not provide an explicit algorithm or direct comparisons. The proposed method involves a direct computation of feature-wise effects, which is efficient and scales linearly with the number of features. It goes beyond descriptive statistics by providing effect estimates rather than simple associations. We will update the abstract to mention the computational efficiency and contrast it briefly with traditional statistics like mean differences.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

The abstract describes a data-derived baseline for feature rankings obtained via the Potential Outcomes Framework, constructed independently of any model, followed by a comparison to model explanations. No equations, fitted parameters, self-citations, or derivation steps are provided that would reduce the claimed result to its own inputs by construction. The approach is explicitly positioned as model-agnostic with the baseline drawn directly from the data, rendering the central claim self-contained against external benchmarks without load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility into details; the approach rests on the applicability of Rubin's Potential Outcomes Framework to feature effect estimation in classification, treated here as a domain assumption rather than derived within the paper.

axioms (1)

Domain assumption: Rubin's Potential Outcomes Framework can be directly applied to quantify each feature's effect on binary outcomes by estimating separation strength between groups.
The abstract states that the method draws inspiration from this framework to move beyond descriptive statistics to effect estimation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We define … $\Delta_j = (\mu_{1j} - \mu_{0j}) / s_{p,j}$. The absolute value $|\Delta_j|$ represents the standardized effect size of feature $j$. We rank features by descending $|\Delta_j|$.
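The quoted passage can be turned into a small worked example. The sketch below is hedged in one important way: the passage as quoted does not define s_p,j, so the code assumes it is the classic pooled standard deviation of feature j across the two outcome groups. The function names `standardized_effect` and `rank_by_effect` and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_effect(x1, x0):
    """(mean(x1) - mean(x0)) / pooled SD for one feature.

    x1 and x0 hold the feature's values in outcome groups 1 and 0.
    The pooled-SD form is an assumption; it fails if both groups are constant.
    """
    n1, n0 = len(x1), len(x0)
    pooled_var = ((n1 - 1) * variance(x1) + (n0 - 1) * variance(x0)) / (n1 + n0 - 2)
    return (mean(x1) - mean(x0)) / math.sqrt(pooled_var)


def rank_by_effect(columns, y):
    """Rank features by descending absolute standardized effect.

    columns maps feature name -> list of values; y is a list of 0/1 labels.
    """
    deltas = {}
    for name, values in columns.items():
        x1 = [v for v, label in zip(values, y) if label == 1]
        x0 = [v for v, label in zip(values, y) if label == 0]
        deltas[name] = standardized_effect(x1, x0)
    ranking = sorted(deltas, key=lambda name: abs(deltas[name]), reverse=True)
    return ranking, deltas


# Toy data, invented for illustration only.
labels = [1, 0, 1, 0, 1, 0]
table = {
    "a": [2, 1, 4, 3, 6, 5],      # group means 4 vs 3, pooled SD 2
    "b": [10, 0, 12, 1, 11, 2],   # group means 11 vs 1, pooled SD 1
}
ranking, deltas = rank_by_effect(table, labels)
print(ranking, deltas)
```

Each feature is processed once and independently of the others, which is consistent with the linear scaling in the number of features that the simulated rebuttal mentions.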

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
