"Faithful to What?" On the Limits of Fidelity-Based Explanations

Jackson Eshbaugh

arxiv: 2506.12176 · v5 · submitted 2025-06-13 · 💻 cs.LG · stat.ML

"Faithful to What?" On the Limits of Fidelity-Based Explanations

Jackson Eshbaugh This is my paper

Pith reviewed 2026-05-19 09:03 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords explainable AIsurrogate modelsfidelitylinearity scoreneural networksregressionpredictive performance

0 comments

The pith

Fidelity to a neural network's predictions does not ensure surrogates recover the data patterns that give the network its edge over simpler models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how surrogate models are typically judged in explainable AI by how closely they match a neural network's outputs. It points out that this fidelity measure only checks copying of the model's decisions rather than alignment with the actual signals in the data that drive better predictions. The authors introduce the linearity score as a separate diagnostic to quantify how much of the network's input-output behavior can be captured linearly. Tests on synthetic and real regression datasets show that surrogates with high fidelity can still miss the performance advantages the network holds over linear baselines and may even fall short of those baselines. This highlights that accounting for a model's outputs is distinct from accounting for the task structure in the data.

Core claim

Across synthetic and real-world regression datasets, surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.

What carries the argument

The linearity score λ(f), an R² measure of how well a linear surrogate fits the regression network's input-output mapping, used as a diagnostic distinct from fidelity.

If this is right

Surrogates may match network outputs closely without capturing the network's advantages over linear models.
High-fidelity surrogates can sometimes perform worse than linear baselines trained on the same data.
Fidelity-based evaluation alone is insufficient for assessing whether an explanation accounts for predictive performance gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distinction between model alignment and data alignment suggests that other explanation techniques relying on output matching may need similar diagnostics.
Testing the linearity score on classification tasks or deeper architectures could show whether the gap between fidelity and data recovery appears more broadly.

Load-bearing premise

The premise that alignment to the data-generating signal rather than to the learned model is the appropriate target for explanations of predictive performance.

What would settle it

A finding that high-fidelity surrogates consistently recover the performance gains of neural networks over linear models across the synthetic and real-world datasets would contradict the reported observations.

Figures

Figures reproduced from arXiv: 2506.12176 by Jackson Eshbaugh.

**Figure 1.** Figure 1: Synthetic dataset y = x · sin(x) + ϵ with Gaussian noise ϵ ∼ N (0, 0.2 2 ). The dataset demonstrates a smooth nonlinear relationship. 4.1 Synthetic The first experiment uses a synthetic regression task defined by y = x · sin(x) + ϵ, where ϵ ∼ N (0, 0.2 2 ) and x ∈ [−4, 4] This function exhibits a continuous, nonlinear structure, providing a testbed for analyzing the expressiveness of linear approximations… view at source ↗

**Figure 2.** Figure 2: Predictions from the baseline linear model, neural network, and linear surrogate on the [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Predictions from the baseline linear model, neural network, and linear surrogate on the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Predictions from the baseline linear model, neural network, and linear surrogate on the [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Predictions on the California Housing dataset. The linear surrogate tracks the neural [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

read the original abstract

In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network's predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score $\lambda(f)$, a diagnostic that quantifies the extent to which a regression network's input--output behavior is linearly decodable. $\lambda(f)$ is defined as an $R^2$ measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper usefully flags that high-fidelity surrogates need not recover a network's edge over linear models, but the reported reversals look constrained by basic error decomposition.

read the letter

The main point worth knowing is that fidelity to a neural net's outputs does not automatically mean the surrogate has recovered the structure that gives the net its predictive advantage over simpler models. The authors define a linearity score λ(f) as the R² of a linear surrogate fit to the network predictions and report that, across synthetic and real regression sets, some high-fidelity surrogates still fall below linear baselines trained directly on the labels even when the network itself beats those baselines. This is a concrete way to separate alignment with the learned model from alignment with the underlying signal. The work does a clean job of stating the distinction and running the same diagnostic on both controlled synthetic data and real datasets. The linearity score itself is a simple, reproducible addition that others could adopt without much overhead. The experiments are at least described at the level of the abstract, which is better than many XAI papers that stay purely conceptual. The soft spot is the central empirical claim. When fidelity is high, the surrogate's mean squared error on the true targets is bounded by the network's error plus the fidelity residual plus a covariance term. Large reversals relative to a linear baseline are therefore possible only if the network-linear gap is itself modest or if the fidelity numbers are not uniformly high across the test distribution. The abstract gives no numerical fidelity values, no effect sizes, and no error bars, so it is difficult to judge whether the observed underperformance is large enough to matter or whether it survives tighter controls on how the surrogates were fit. Without those details the result reads more like a cautionary observation than a strong demonstration of a fundamental limit. This paper is for people who build or evaluate post-hoc explanations for regression models and who already worry about metric choices in XAI. A reader who wants a quick diagnostic for whether an explanation is mostly copying the model or actually tracking task structure will get something usable from it. It deserves a serious referee because the question is legitimate and the proposed score is lightweight, even though the current write-up would benefit from expanded results sections and explicit checks against the decomposition bound. I would send it out for review after asking for the missing numbers and sensitivity checks.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the linearity score λ(f), defined as an R² measure of how well a surrogate fits a regression neural network f, to argue that fidelity to the network's predictions does not imply recovery of the network's predictive gains over simpler models. Across synthetic and real-world regression datasets, the authors report that high-fidelity surrogates can underperform even linear baselines trained directly on the data, while the NN itself outperforms those baselines. This is presented as evidence that fidelity-based explanations are limited when reasoning about task performance, because they align to the learned model rather than the data-generating signal.

Significance. If the empirical findings hold after addressing the mathematical bounding issue, the work would usefully highlight a distinction between model fidelity and task-relevant structure in XAI for regression. The linearity score provides a concrete diagnostic that could be adopted more broadly. The paper earns credit for direct empirical comparisons on both synthetic and real datasets and for separating fidelity from downstream task performance, though the evidential strength depends on controls and gap sizes not fully detailed in the provided sections.

major comments (1)

[Empirical results and definition of fidelity] The central empirical claim—that high-fidelity surrogates can underperform linear baselines while the NN outperforms them—requires clarification in light of the error decomposition. Expanding MSE(ŝ, y) = MSE(f, y) + MSE(ŝ, f) + 2 Cov(f-y, ŝ-f) shows that small MSE(ŝ, f) (high fidelity) bounds the performance gap between ŝ and f; adverse covariance can widen it only modestly. The reported reversals relative to a linear baseline therefore depend on either narrow NN-linear gaps or fidelity that is high only in a weak/in-sample sense. Please cite the specific fidelity R² values, test MSE/R² gaps, and covariance terms from the synthetic and real-world experiments (e.g., in the results tables or § on empirical evaluation) to demonstrate that the findings are not precluded by this identity.

minor comments (2)

[Introduction] The abstract and introduction would benefit from an explicit statement of how λ(f) differs numerically from standard fidelity metrics used for the surrogates.
[Experimental setup] Missing details on surrogate implementations (exact architectures, training procedures, hyperparameter choices) and whether error bars or multiple runs are reported would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, particularly the observation regarding the error decomposition and its implications for interpreting the empirical reversals. We agree that explicitly reporting the quantitative details will strengthen the clarity of the central claim and address potential concerns about whether the findings are consistent with the identity. We will revise the manuscript to incorporate these elements in the empirical evaluation section.

read point-by-point responses

Referee: [Empirical results and definition of fidelity] The central empirical claim—that high-fidelity surrogates can underperform linear baselines while the NN outperforms them—requires clarification in light of the error decomposition. Expanding MSE(ŝ, y) = MSE(f, y) + MSE(ŝ, f) + 2 Cov(f-y, ŝ-f) shows that small MSE(ŝ, f) (high fidelity) bounds the performance gap between ŝ and f; adverse covariance can widen it only modestly. The reported reversals relative to a linear baseline therefore depend on either narrow NN-linear gaps or fidelity that is high only in a weak/in-sample sense. Please cite the specific fidelity R² values, test MSE/R² gaps, and covariance terms from the synthetic and real-world experiments (e.g., in the results tables or § on empirical evaluation) to demonstrate that the findings are not precluded by this identity.

Authors: We appreciate this rigorous point and the explicit decomposition, which correctly shows that MSE(ŝ, y) − MSE(f, y) = MSE(ŝ, f) + 2 Cov(f − y, ŝ − f). Our claim, however, centers on the comparison between the surrogate ŝ and a linear model trained directly on the target y, rather than on the absolute gap between ŝ and f. When the NN’s advantage over the linear baseline stems from non-linear structure that a surrogate approximates only in an average (high-R²) sense without recovering the same task-relevant interactions, it is possible for ŝ to fall below the linear baseline even while remaining close to f. We will revise the empirical evaluation section (and associated tables) to report: (i) the fidelity R² values λ(f) achieved by the surrogates on both synthetic and real-world datasets, (ii) the test MSE and R² gaps between the NN, the linear baseline, and each surrogate, and (iii) the covariance terms from the decomposition computed on the test sets. These additions will allow readers to verify that the observed reversals are compatible with the identity while still illustrating the distinction between fidelity to the learned model and recovery of predictive structure in the data. If any of these quantities are not already tabulated, they will be added in the revision. revision: yes

Circularity Check

0 steps flagged

Empirical comparisons of surrogates vs. networks and baselines show no circular reduction

full rationale

The paper's central claims rest on direct empirical measurements: surrogates are trained to match neural network outputs (via the defined λ(f) as R²), then evaluated for task performance (R² or MSE against ground-truth y) on held-out data, with comparisons to linear models trained directly on the same data. No load-bearing step reduces a 'prediction' or result to a quantity defined by the paper's own fitted parameters or self-citations. The linearity score is introduced as an explicit diagnostic rather than smuggled in, and the reported findings (high-fidelity surrogates sometimes underperforming linear baselines) are presented as observations from synthetic and real-world datasets rather than derived by construction from the definition of fidelity. This is self-contained against external benchmarks and receives a low circularity score consistent with normal empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests primarily on the distinction between model alignment and data-generating signal alignment, plus empirical observations; the linearity score is introduced as a new diagnostic without additional free parameters or invented physical entities.

axioms (1)

domain assumption Fidelity measures alignment to a learned model rather than alignment to the data-generating signal underlying the task.
This distinction is presented in the abstract as the core motivation for the work and the linearity score.

invented entities (1)

linearity score λ(f) no independent evidence
purpose: Quantifies the extent to which a regression network's input-output behavior is linearly decodable as an R² measure of surrogate fit to the network.
Newly defined in this paper as a diagnostic tool to reveal limitations of fidelity.

pith-pipeline@v0.9.0 · 5687 in / 1433 out tokens · 62877 ms · 2026-05-19T09:03:32.307757+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Understanding intermediate layers using linear classifier probes, November 2018

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, November 2018

work page 2018
[2]

Medical cost personal dataset, 2017

Miri Choi. Medical cost personal dataset, 2017

work page 2017
[3]

Craven and Jude W

Mark W. Craven and Jude W. Shavlik. Extracting tree-structured representations of trained networks. In Proceedings of the 9th International Conference on Neural Information Process- ing Systems, NIPS’95, pages 24–30, Denver, Colorado and Cambridge, MA, USA, 1995. MIT Press

work page 1995
[4]

Costa-juss `a

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-juss `a. A Primer on the Inner Workings of Transformer-based Language Models, October 2024

work page 2024
[5]

Complexity of Linear Regions in Deep Networks, June 2019

Boris Hanin and David Rolnick. Complexity of Linear Regions in Deep Networks, June 2019

work page 2019
[6]

John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North , pages 4129–4138, Min- neapolis, Minnesota, 2019. Association for Computational Linguistics

work page 2019
[7]

Kelley Pace and Ronald Barry

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, May 1997

work page 1997
[8]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems , NIPS’17, pages 4768–4777, Long Beach, California, USA and Red Hook, NY , USA, 2017. Curran Associates Inc

work page 2017
[9]

Progress measures for grokking via mechanistic interpretability, October 2023

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, October 2023

work page 2023
[10]

Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein. Sensitivity and Generalization in Neural Networks: An Empirical Study, June 2018

work page 2018
[11]

Distill , year =

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 5(3):10.23915/distill.00024.001, March 2020

work page doi:10.23915/distill.00024.001 2020
[12]

SVCCA: Singu- lar Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, November 2017

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singu- lar Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, November 2017

work page 2017
[13]

”Why Should I Trust You?”: Ex- plaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ”Why Should I Trust You?”: Ex- plaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD In- ternational Conference on Knowledge Discovery and Data Mining , pages 1135–1144, San Francisco California USA, August 2016. ACM. 9

work page 2016
[14]

Understanding Learning Dynamics Of Language Models with

Naomi Saphra and Adam Lopez. Understanding Learning Dynamics Of Language Models with. In Proceedings of the 2019 Conference of the North , pages 3257–3267, Minneapolis, Minnesota, 2019. Association for Computational Linguistics

work page 2019
[15]

Sato and H

M. Sato and H. Tsukimoto. Rule extraction from neural networks via decision tree induc- tion. In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), volume 3, pages 1870–1875, Washington, DC, USA, 2001. IEEE

work page 2001
[16]

I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, December 1998. 10

work page 1998

[1] [1]

Understanding intermediate layers using linear classifier probes, November 2018

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, November 2018

work page 2018

[2] [2]

Medical cost personal dataset, 2017

Miri Choi. Medical cost personal dataset, 2017

work page 2017

[3] [3]

Craven and Jude W

Mark W. Craven and Jude W. Shavlik. Extracting tree-structured representations of trained networks. In Proceedings of the 9th International Conference on Neural Information Process- ing Systems, NIPS’95, pages 24–30, Denver, Colorado and Cambridge, MA, USA, 1995. MIT Press

work page 1995

[4] [4]

Costa-juss `a

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-juss `a. A Primer on the Inner Workings of Transformer-based Language Models, October 2024

work page 2024

[5] [5]

Complexity of Linear Regions in Deep Networks, June 2019

Boris Hanin and David Rolnick. Complexity of Linear Regions in Deep Networks, June 2019

work page 2019

[6] [6]

John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North , pages 4129–4138, Min- neapolis, Minnesota, 2019. Association for Computational Linguistics

work page 2019

[7] [7]

Kelley Pace and Ronald Barry

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, May 1997

work page 1997

[8] [8]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems , NIPS’17, pages 4768–4777, Long Beach, California, USA and Red Hook, NY , USA, 2017. Curran Associates Inc

work page 2017

[9] [9]

Progress measures for grokking via mechanistic interpretability, October 2023

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, October 2023

work page 2023

[10] [10]

Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein. Sensitivity and Generalization in Neural Networks: An Empirical Study, June 2018

work page 2018

[11] [11]

Distill , year =

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 5(3):10.23915/distill.00024.001, March 2020

work page doi:10.23915/distill.00024.001 2020

[12] [12]

SVCCA: Singu- lar Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, November 2017

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singu- lar Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, November 2017

work page 2017

[13] [13]

”Why Should I Trust You?”: Ex- plaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ”Why Should I Trust You?”: Ex- plaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD In- ternational Conference on Knowledge Discovery and Data Mining , pages 1135–1144, San Francisco California USA, August 2016. ACM. 9

work page 2016

[14] [14]

Understanding Learning Dynamics Of Language Models with

Naomi Saphra and Adam Lopez. Understanding Learning Dynamics Of Language Models with. In Proceedings of the 2019 Conference of the North , pages 3257–3267, Minneapolis, Minnesota, 2019. Association for Computational Linguistics

work page 2019

[15] [15]

Sato and H

M. Sato and H. Tsukimoto. Rule extraction from neural networks via decision tree induc- tion. In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), volume 3, pages 1870–1875, Washington, DC, USA, 2001. IEEE

work page 2001

[16] [16]

I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, December 1998. 10

work page 1998