The Neglected Baseline in Model Interpretation

Xiaohui Fan; Yongjin Cui

arxiv: 2605.22417 · v3 · pith:CTEJSPWUnew · submitted 2026-05-21 · 💻 cs.CV · cs.SE

Pith reviewed 2026-05-22 07:19 UTC · model grok-4.3

classification 💻 cs.CV cs.SE

keywords model interpretation · integrated gradients · baseline · attribution error · layer-wise features · neural network explainability · computer vision

The pith

Revising Integrated Gradients with an explicit baseline yields more accurate attributions from any network layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing interpretation methods for neural networks routinely omit or mishandle a baseline reference when tracing a prediction back to input features, which distorts the resulting explanations. The paper reformulates the interpretation task around this missing reference point and shows how gradient-based techniques, Integrated Gradients, and Taylor expansions are all incomplete without it. By supplying a clear baseline, the authors revise Integrated Gradients so that the summed attributions match the model's output more closely. The same revision works for feature maps taken from any layer, because each layer simply reflects a different stage of processing. Quality is judged directly by how small the gap is between the attributed values and the actual target output rather than by indirect tests.

Core claim

Model interpretation requires an explicit baseline as the reference from which feature contributions are measured; without it, methods such as standard Integrated Gradients, LayerCAM, and ODAM produce attributions whose sum deviates from the model's output. Revising Integrated Gradients to use a reasonable baseline removes this deviation, supports attribution at every layer, and makes the differences across layers interpretable as successive stages of feature extraction.
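For orientation, the standard Integrated Gradients definition and its completeness property can be written with the baseline made explicit. This is the textbook formulation, not the paper's revised one; the symbols F (model output), x (input), and x' (baseline) are assumed notation, chosen to agree with the x_i and x'_i that appear in the quoted passage further down this page.

```latex
% Standard Integrated Gradients along the straight path from baseline x' to input x
\mathrm{IG}_i(x; x') = (x_i - x'_i) \int_0^1
  \frac{\partial F\bigl(x' + \alpha (x - x')\bigr)}{\partial x_i} \, d\alpha

% Completeness: the attributions sum to the output difference, not to F(x) itself
\sum_i \mathrm{IG}_i(x; x') = F(x) - F(x')
```

Read this way, an attribution map explains F(x) - F(x') rather than F(x) alone, which is why a method that leaves x' implicit can produce a sum that deviates from the quantity it is taken to explain.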

What carries the argument

The revised Integrated Gradients path that integrates gradients from an explicit baseline value rather than an implicit or zero reference, allowing the attributions to be computed for features extracted at any chosen layer.
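A minimal sketch of what integrating from an explicit baseline looks like in practice, assuming a toy logistic unit in place of a real network. The weights, the input, and both baselines are hypothetical, and the code implements standard Integrated Gradients with a midpoint Riemann sum, not the paper's revised method.

```python
import math

# Hypothetical toy "network": one logistic unit over three input features.
W = [1.5, -2.0, 1.0]
B = 0.3

def model(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def grad(x):
    """Analytic gradient of the logistic unit with respect to its inputs."""
    s = model(x)
    return [s * (1.0 - s) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann sum of gradients along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, -0.5, 2.0]
baselines = {"zero": [0.0, 0.0, 0.0], "reference": [0.2, 0.2, 0.2]}
results = {}
for name, baseline in baselines.items():
    attr = integrated_gradients(x, baseline)
    target = model(x) - model(baseline)  # quantity the attributions should sum to
    results[name] = (attr, target)
    print(name, [round(a, 4) for a in attr], "gap:", abs(sum(attr) - target))
```

Both runs close the gap to their own target, yet the per-feature attributions differ between them. The baseline is therefore part of what is being explained, not a tuning detail.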

If this is right

Attributions taken from early layers emphasize low-level patterns while later layers emphasize higher-level concepts, and both are valid once a baseline is fixed.
Any gradient-based method can be completed by inserting the same explicit baseline, unifying them under one evaluation criterion.
Indirect checks such as marginal-effect removal or perfect-model assumptions become unnecessary once attribution error is measured directly.
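The layer-wise point can be illustrated with a two-stage toy model, assuming the baseline for a hidden layer is obtained by pushing the baseline input through the earlier stage. That choice, the network weights, and the finite-difference gradient are all assumptions made for this sketch; the page does not state how the paper constructs per-layer baselines.

```python
import math

def features(x):
    """Stage 1 (hypothetical): two tanh hidden units over a 3-d input."""
    rows = [[0.8, -0.4, 0.3], [-0.5, 0.9, 0.2]]
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in rows]

def head(h):
    """Stage 2 (hypothetical): a smooth scalar head on the hidden features."""
    return math.tanh(1.2 * h[0] - 0.7 * h[1] + 0.1)

def model(x):
    return head(features(x))

def num_grad(f, p, eps=1e-5):
    """Central-difference gradient of a scalar function f at point p."""
    g = []
    for i in range(len(p)):
        hi = p[:i] + [p[i] + eps] + p[i + 1:]
        lo = p[:i] + [p[i] - eps] + p[i + 1:]
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

def ig(f, point, base, steps=200):
    """Integrated Gradients of f from base to point (midpoint Riemann sum)."""
    total = [0.0] * len(point)
    for k in range(steps):
        a = (k + 0.5) / steps
        mid = [b + a * (p - b) for p, b in zip(point, base)]
        total = [t + g for t, g in zip(total, num_grad(f, mid))]
    return [(p - b) * t / steps for p, b, t in zip(point, base, total)]

x, x_base = [1.0, -0.5, 2.0], [0.1, 0.1, 0.1]
target = model(x) - model(x_base)            # same target for every layer

attr_input = ig(model, x, x_base)             # attribution over input features
attr_hidden = ig(head, features(x), features(x_base))  # over hidden features

err_input = abs(sum(attr_input) - target)
err_hidden = abs(sum(attr_hidden) - target)
print("input layer :", attr_input, err_input)
print("hidden layer:", attr_hidden, err_hidden)
```

Both attribution vectors sum to the same output difference while distributing it over different feature sets, which is one concrete reading of the claim that each layer gives a valid view of the same decision.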

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Choosing the baseline according to domain knowledge (for example, a neutral image in vision tasks) could further reduce error without changing the algorithm.
The layer-wise view suggests that ensemble attributions across several layers might capture complementary information about a single decision.
The same baseline correction could be tested on transformer attention maps to check whether the same precision gain appears outside convolutional networks.

Load-bearing premise

That the size of the difference between the attribution map and the model's actual output is the most reliable measure of interpretation quality.

What would settle it

On a held-out image, compute the attribution error of the revised method versus the original Integrated Gradients; if the revised error is not consistently smaller while still highlighting the features that drive the prediction, the improvement does not hold.
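A comparison of this kind could be set up as below. The sketch assumes a toy logistic model and uses two stand-in attribution methods, a first-order gradient-times-difference rule and standard Integrated Gradients; neither is the paper's revised method, and the weights, samples, and baseline are hypothetical.

```python
import math

W, B = [1.5, -2.0, 1.0], 0.3  # hypothetical toy model parameters

def model(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def grad(x):
    s = model(x)
    return [s * (1.0 - s) * w for w in W]

def grad_times_delta(x, baseline):
    """First-order Taylor attribution: gradient at x times (x - baseline)."""
    return [g * (xi - b) for g, xi, b in zip(grad(x), x, baseline)]

def integrated_gradients(x, baseline, steps=200):
    """Standard IG along the straight path baseline -> x (midpoint rule)."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

def attribution_error(attr, x, baseline):
    """Gap between summed attributions and the output difference to explain."""
    return abs(sum(attr) - (model(x) - model(baseline)))

baseline = [0.2, 0.2, 0.2]
samples = [[1.0, -0.5, 2.0], [-1.0, 0.5, 0.0], [2.0, 1.0, -1.0]]
errors = []
for x in samples:
    e_taylor = attribution_error(grad_times_delta(x, baseline), x, baseline)
    e_ig = attribution_error(integrated_gradients(x, baseline), x, baseline)
    errors.append((e_taylor, e_ig))
    print(x, "first-order:", e_taylor, "IG:", e_ig)
```

Swapping in any two attribution functions with the same signature turns this into the test described above. A small summed error alone does not show that the right features were highlighted, so the qualitative check in that test is still needed.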

Figures

Figures reproduced from arXiv: 2605.22417 by Xiaohui Fan, Yongjin Cui.

**Figure 2.** GAE coordinate interpretation. … reasoning behind this is that when considering the feature region as the center, the expansion or contraction of the feature region has opposite effects on x1, y1 and x2, y2. However, this line of thinking is erroneous. When viewed through the lens of baseline analysis methods, what we are actually interpreting is the difference between the current output and the baseline out…

**Figure 3.** Interpretation of category outputs (logits and probabilities) of the DETR_demo model by …

**Figure 4.** Interpretation of category outputs (logits and probabilities) of the DETR model by ODAM.

**Figure 5.** Interpretation of category outputs (logits and probabilities) of the DETR model.

**Figure 6.** Interpretation of category outputs (logits and probabilities) in the DETR model by ODAM.

**Figure 7.** VGG category logits interpretation.

Original abstract

We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretation. In this paper, we reformulate the task of model interpretation and the interpretation principles for model interpretation results to demonstrate the importance of the baseline. We further unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion, clarifying the connections among them and explicitly identifying the baseline for each method. On this basis, we analyze the flaws and errors in related model interpretation methods (IG, LayerCAM, ODAM, Difference Map). We advocate evaluating the quality of model interpretation results precisely through the attribution error between the attribution result and the attribution target, rather than adopting flawed evaluation methods, such as those based on marginal-effect or the assumption of perfect model performance. We revise IG and develope a model interpretation method with a clear and reasonable baseline, achieving better results. Our method supports model interpretation based on features from any layer. Interpretation based on features from different layers are all reasonable, and the differences among these results reflect varying degrees of feature extraction at different feature extraction stages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper usefully unifies some attribution methods around explicit baselines, but its core evaluation metric risks circularity, and the abstract offers no derivations or numbers to back the claimed improvements.

The main thing to know is that this paper argues existing interpretation methods like IG, LayerCAM, and ODAM neglect baselines and therefore produce imprecise results, then offers a revised IG with a clear baseline that works for features from any layer. They unify gradient methods, IG, and Taylor expansions by naming the baseline in each, critique the flaws in prior work, and push for judging quality via attribution error to a target rather than marginal effects or perfect-model assumptions. The revised method is said to deliver better results and treat different layers as all reasonable but reflecting different extraction stages.

That unification step is a modest but real clarification that could help readers see connections across techniques they might otherwise treat separately. The stance against weak evaluation shortcuts is also fair and directly stated.

The soft spots are more noticeable. The abstract contains no derivations, examples, or quantitative results, so it is impossible to check whether the identified flaws are load-bearing or whether the new baseline actually reduces error in a meaningful way. The stress-test concern about circularity lands: if the attribution target is defined using the model's own logit differences or activations, then showing lower error for the revised method could be partly tautological rather than an external check. The claim that all layer interpretations are equally reasonable also needs more than assertion to hold.

This work is for researchers already focused on feature attribution and explainable AI in computer vision who are looking for ways to tighten baseline choices. A reader deep in that literature might extract some useful framing, but the paper does not look like a field-changer. It shows clear engagement with the cited methods and tries to address a practical gap, so it deserves a serious referee to examine the full math, experiments, and whether the metric avoids the circularity issue.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing model interpretation methods (IG, LayerCAM, ODAM, Difference Map) neglect the baseline, leading to imprecise results. It reformulates the interpretation task and principles to demonstrate the baseline's importance, unifies gradient-based, IG, and Taylor methods while explicitly identifying baselines for each, analyzes flaws in prior approaches, and advocates evaluating quality via attribution error to an attribution target rather than marginal-effect or perfect-model assumptions. The authors revise IG with a clear baseline, report superior results, and extend the approach to support feature-based interpretation from any layer, with layer differences reflecting varying feature extraction stages.

Significance. If the revised IG method delivers non-circular improvements under the attribution error metric and the unification is rigorous, the work could strengthen principled feature attribution in computer vision by addressing a commonly overlooked aspect of baseline selection. The layer-wise flexibility is a constructive extension. The paper receives credit for attempting to unify methods and for rejecting flawed evaluation assumptions, but significance remains provisional without independent validation of the metric or quantitative results.

major comments (3)

[Evaluation of interpretation quality] Evaluation via attribution error (section on evaluation metrics and the revised method): The attribution target is not shown to be derived independently of the model's output assumptions that the paper critiques elsewhere; if the target is defined via logit differences or feature activations (as in standard IG setups), lower error for the revised method risks being tautological rather than externally validated. A concrete derivation or external benchmark for the target is required to support the central superiority claim.
[Analysis of flaws and errors in related model interpretation methods] Analysis of flaws in IG, LayerCAM, ODAM (section analyzing flaws and errors): The identified errors due to baseline neglect are described qualitatively but lack specific quantitative examples, error calculations, or comparisons to the proposed baseline choice, making it difficult to assess whether the flaws are load-bearing or merely presentational.
[Revised IG and layer-wise interpretation] Revised IG formulation (section on the proposed method): The manuscript states a 'clear and reasonable baseline' but does not provide the explicit equation or parameter choice relative to the standard zero baseline in IG (Eq. for integrated gradients), preventing verification that the revision avoids the circularity or imprecision issues raised for prior methods.

minor comments (2)

The abstract contains a typo ('develope' instead of 'develop').
Notation for the attribution target and error could be formalized with an equation to improve clarity when comparing across layers.
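One way such notation might look is sketched below. The symbols are assumptions introduced for illustration (g_l for the network up to layer l, f_l for the remainder, a_j^{(l)} for the attribution assigned to feature j at layer l), and the target follows the forward-pass definition given in the simulated rebuttal on this page rather than an equation from the paper.

```latex
% Split the model at layer l:  F = f_l \circ g_l,  h = g_l(x),  h' = g_l(x')
% Attribution target: output difference between input x and baseline x'
T = F(x) - F(x') = f_l(h) - f_l(h')

% Attribution error at layer l, for attributions a_j^{(l)} over the features of h
E_l = \Bigl| \sum_j a_j^{(l)} - T \Bigr|
```

Because T does not depend on l under this split, errors computed at different layers are measured against the same quantity and can be compared directly.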

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report, which highlights several areas where the presentation and supporting evidence can be strengthened. We address each major comment below and will incorporate revisions to clarify derivations, add quantitative support, and provide explicit formulations as needed.

Referee: [Evaluation of interpretation quality] Evaluation via attribution error (section on evaluation metrics and the revised method): The attribution target is not shown to be derived independently of the model's output assumptions that the paper critiques elsewhere; if the target is defined via logit differences or feature activations (as in standard IG setups), lower error for the revised method risks being tautological rather than externally validated. A concrete derivation or external benchmark for the target is required to support the central superiority claim.

Authors: The attribution target is computed directly as the difference in model output (logit or feature activation) between the original input and the chosen baseline input via forward passes alone, without invoking any attribution method. This total effect serves as the quantity to be explained, while attribution methods distribute it; the evaluation therefore measures how completely and accurately each method recovers the target. To address the concern, the revised manuscript will include an explicit step-by-step derivation of the target from the forward pass and will report results on a synthetic dataset with known ground-truth attributions, providing an external check independent of the methods under test. revision: yes
Referee: [Analysis of flaws and errors in related model interpretation methods] Analysis of flaws in IG, LayerCAM, ODAM (section analyzing flaws and errors): The identified errors due to baseline neglect are described qualitatively but lack specific quantitative examples, error calculations, or comparisons to the proposed baseline choice, making it difficult to assess whether the flaws are load-bearing or merely presentational.

Authors: We agree that quantitative illustrations would make the impact of baseline neglect more concrete. The current manuscript identifies the conceptual errors arising from implicit or zero baselines, but the revised version will add explicit numerical examples: we will compute the attribution error (under the proposed metric) for IG, LayerCAM, and ODAM on representative inputs and compare these values to the error obtained with the revised baseline, thereby quantifying the practical consequences of the identified flaws. revision: yes
Referee: [Revised IG and layer-wise interpretation] Revised IG formulation (section on the proposed method): The manuscript states a 'clear and reasonable baseline' but does not provide the explicit equation or parameter choice relative to the standard zero baseline in IG (Eq. for integrated gradients), preventing verification that the revision avoids the circularity or imprecision issues raised for prior methods.

Authors: We apologize for the omission of the explicit formulation. The revised IG replaces the conventional zero baseline with a baseline defined as the element-wise average of a small set of reference images (or a fixed neutral value chosen per layer to represent feature absence). In the revised manuscript we will insert the precise equation for this baseline, show its substitution into the integrated-gradients path integral, and contrast it directly with the standard zero-baseline formulation to demonstrate how the choice eliminates the imprecision previously identified. revision: yes
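The averaged-reference baseline described in this response could be sketched as follows, assuming flattened four-pixel "images" and a toy tanh model. The reference values, weights, and input are hypothetical, and the integration is standard Integrated Gradients with the averaged baseline substituted in; the precise equation the authors promise is not available on this page.

```python
import math

def mean_baseline(references):
    """Element-wise average of equally sized reference inputs."""
    n = len(references)
    return [sum(pixels) / n for pixels in zip(*references)]

W = [0.6, -0.3, 0.9, 0.2]  # hypothetical weights of a toy scalar model

def model(x):
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))

def grad(x):
    y = model(x)
    return [(1.0 - y * y) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Standard IG with the given baseline substituted into the path integral."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

# Three hypothetical reference "images", each flattened to four pixels.
references = [
    [0.0, 0.2, 0.4, 0.6],
    [0.2, 0.4, 0.6, 0.8],
    [0.4, 0.0, 0.2, 0.4],
]
baseline = mean_baseline(references)
x = [0.9, 0.1, 0.8, 0.3]

attr = integrated_gradients(x, baseline)
target = model(x) - model(baseline)  # obtained from two forward passes only
gap = abs(sum(attr) - target)
print("baseline:", baseline)
print("attributions:", attr, "gap:", gap)
```

The target here comes from forward passes alone, as the first response states, so the gap measures how completely the attributions recover it for this particular baseline choice.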

Circularity Check

0 steps flagged

No significant circularity detected

The paper observes baseline neglect in prior methods and reformulates the interpretation task and principles to emphasize the baseline's importance. It unifies gradient-based, IG, and Taylor approaches while identifying the baseline for each, critiques flaws in IG, LayerCAM, ODAM, and Difference Map, and advocates attribution error against an attribution target as the precise evaluation, rejecting marginal-effect and perfect-model assumptions. It then presents a revised IG with an explicit baseline that supports features from any layer and yields better results under the advocated metric. No equations, self-citations, or definitions are quoted that reduce a central claim, prediction, or result to a fit or self-definition by construction. The attribution target is referenced as an independent benchmark, with no evidence here that it is defined tautologically from the revised method's own assumptions. The derivation chain remains self-contained, with external grounding in prior methods' documented issues.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view prevents identification of specific free parameters or axioms; the approach rests on standard domain assumptions in gradient-based attribution that baselines exist and can be chosen reasonably.

pith-pipeline@v0.9.0 · 5714 in / 971 out tokens · 63599 ms · 2026-05-22T07:19:44.084298+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We reformulate the task of model interpretation ... explicitly identifying the baseline for each method ... evaluate ... through the attribution error between the attribution result and the attribution target

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion ... IGapprox_i(x) = (x_i - x'_i) * sum ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.