Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution
Pith reviewed 2026-05-20 23:47 UTC · model grok-4.3
The pith
Manifold-Aligned Guided Integrated Gradients improves attribution reliability by keeping integration paths close to the data manifold using a pre-trained VAE.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MA-GIG constructs attribution paths by sampling points in the latent space of a pre-trained VAE, decoding them to input space, and applying guided updates that keep features proximal to the original input; this alignment with the learned generative manifold reduces exposure to implausible regions and produces more faithful explanations than standard IG or Guided IG.
What carries the argument
Manifold-Aligned Guided Integrated Gradients (MA-GIG), which builds the integration path in VAE latent space and decodes intermediates to enforce proximity to the data manifold.
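The path construction can be sketched in a few lines. This is a hedged illustration, not the authors' code. `decode` is a hypothetical linear stand-in for the pre-trained VAE decoder, `model` is a toy logistic classifier with an analytic gradient, the latent path is a straight line, and MA-GIG's guided feature updates are omitted. The attribution is a Riemann sum of input-space gradients times decoded path increments, so it should approximately satisfy completeness with respect to the decoded endpoints D(z_base) and D(z_input). Note that D(z_input) equals the original input only if the VAE reconstructs it well, which is the load-bearing premise discussed in this review.

```python
import numpy as np


def decode(z, A, b):
    """Hypothetical stand-in for a pre-trained VAE decoder (a linear map here)."""
    return A @ z + b


def model(x, w):
    """Toy classifier score f(x) = sigmoid(w . x)."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))


def model_grad(x, w):
    """Analytic input gradient of sigmoid(w . x)."""
    s = model(x, w)
    return s * (1.0 - s) * w


def latent_path_attribution(z_base, z_input, A, b, w, steps=200):
    """Sum grad f(x_k) * (x_{k+1} - x_k) over a decoded straight-line latent path."""
    ts = np.linspace(0.0, 1.0, steps + 1)
    # Every intermediate point is a decoder output, i.e. it lies on the decoder's image.
    xs = [decode((1.0 - t) * z_base + t * z_input, A, b) for t in ts]
    attr = np.zeros_like(xs[0])
    for x_k, x_next in zip(xs[:-1], xs[1:]):
        attr += model_grad(x_k, w) * (x_next - x_k)  # left Riemann sum
    return attr, xs[0], xs[-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=8)
    attr, x0, x1 = latent_path_attribution(np.zeros(3), rng.normal(size=3), A, b, w)
    # Completeness: attributions sum to roughly f(D(z_input)) - f(D(z_base)).
    print(attr.sum(), model(x1, w) - model(x0, w))
```

Increasing `steps` tightens the completeness gap, as for standard IG.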
If this is right
- MA-GIG yields higher-fidelity explanations than prior path-based methods across multiple datasets and classifiers.
- The method reduces off-manifold noise by restricting gradient aggregation to decoded points near the input.
- Explanations become more reliable for diagnosing model behavior because paths avoid regions with meaningless gradients.
- The approach extends the axiomatic benefits of Integrated Gradients while mitigating a known practical failure mode.
Where Pith is reading between the lines
- The same latent-space alignment idea could be tested with other generative models such as diffusion models or GANs to see if the benefit is specific to VAEs.
- High-stakes applications like medical imaging or autonomous driving might see more stable debugging if this manifold constraint is adopted.
- If VAE training data differs substantially from the classifier training data, the benefit could shrink, suggesting a need for joint training or domain-matched VAEs.
Load-bearing premise
Decoding intermediate latent states from the pre-trained VAE produces inputs that lie sufficiently close to the true data manifold without adding significant artifacts or gradient biases.
What would settle it
A direct comparison showing that MA-GIG attributions have equal or lower fidelity than Guided IG when evaluated on a dataset where the VAE reconstruction error is large would falsify the claim that manifold alignment reliably improves explanations.
Original abstract
Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low-gradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose Manifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold and reduces exposure to implausible input-space regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma-gig/.
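The input-space guidance that the abstract attributes to Guided IG (moving low-gradient-magnitude features first) can be illustrated with the simplified sketch below. It is written in the spirit of Guided IG (Kapishnikov et al.) under stated assumptions, not taken from that paper's or MA-GIG's code: `guided_path_attribution`, `grad_fn`, the `fraction` quantile rule, and one gradient evaluation per step are illustrative choices. Because each intermediate point is produced by editing input features directly, nothing keeps it on the data manifold, which is the limitation MA-GIG targets.

```python
import numpy as np


def guided_path_attribution(x_input, x_base, grad_fn, steps=200, fraction=0.1):
    """Simplified Guided-IG-style attribution in input space.

    At each step the L1 distance to x_input shrinks by total/steps, and the
    features with the smallest |gradient| are moved toward the input first.
    """
    x_input = np.asarray(x_input, dtype=float)
    x = np.asarray(x_base, dtype=float).copy()
    attr = np.zeros_like(x)
    total = np.abs(x_input - x).sum()
    eps = 1e-9 * max(total, 1.0)
    for k in range(1, steps + 1):
        target = total * (1.0 - k / steps)  # L1 distance left after this step
        g = grad_fn(x)                       # one gradient evaluation per step
        x_start = x.copy()
        while np.abs(x_input - x).sum() > target + eps:
            remaining = x != x_input
            abs_g = np.abs(g)
            # Low-gradient features among those not yet at the input.
            sel = remaining & (abs_g <= np.quantile(abs_g[remaining], fraction))
            need = np.abs(x_input - x).sum() - target
            full = np.abs(x_input - x)[sel].sum()
            if full <= need + eps:
                x[sel] = x_input[sel]                              # finish these features
            else:
                x[sel] += (need / full) * (x_input[sel] - x[sel])  # partial move
        attr += g * (x - x_start)  # step-start gradient times the step taken
    return attr, x
```

With a linear model the attributions sum exactly to the output difference; for a nonlinear model the sum approaches it as `steps` grows.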
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Manifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs the Integrated Gradients integration path inside the latent space of a pre-trained VAE rather than in input space. Intermediate latent points are decoded to input space to produce a path that is biased toward the learned generative manifold, with the goal of reducing exposure to off-manifold regions that produce noisy gradients. The authors claim that this yields more faithful attributions than Guided IG and other path-based baselines, supported by qualitative visualizations and quantitative comparisons across multiple datasets and classifiers.
Significance. If the central mechanism holds, the work offers a practical way to improve the reliability of axiomatic attribution methods by leveraging existing generative models to constrain paths to the data manifold. The code release supports reproducibility. However, the significance is limited by the method's dependence on VAE reconstruction quality, which is not isolated in the experiments and may not generalize to domains where VAEs exhibit posterior collapse or high reconstruction error.
major comments (3)
- [§3.2] Method: The claim that decoded latent states remain 'proximal to the input' and thereby reduce off-manifold noise rests on the untested assumption that the pre-trained VAE produces faithful reconstructions. No reconstruction error, FID scores, or manifold-distance metrics are reported to quantify how close the decoded path points actually lie to the true data manifold.
- [§4.3] Quantitative Evaluation: The reported outperformance over Guided IG is not accompanied by an ablation that disables the VAE decoding step while retaining the guidance rule. Without this isolation, it is impossible to determine whether gains derive from manifold alignment or from other implementation details of the path construction.
- [Table 2] Results table (or equivalent): Performance differences are presented without error bars, standard deviations across runs, or statistical significance tests, making it difficult to assess whether the claimed superiority is robust or could be explained by variance in the experimental setup.
minor comments (2)
- [Abstract] The abstract states that MA-GIG 'aggregates gradients on path features proximal to the input,' but the precise aggregation rule (e.g., weighting or selection of points) is not formalized with an equation in §3.
- [§3.1] Notation for the latent-space path (e.g., z(t) vs. x(t)) is introduced without an explicit comparison table to the standard IG formulation, which would aid readability.
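For reference, the general path-method form that the minor comments ask to be made explicit can be written as follows. This is the standard path-integral attribution of Sundararajan et al. with a decoded latent path substituted for the straight line; it is an illustrative formalization assuming a decoder D, a baseline latent z', and an input latent z, and it omits MA-GIG's guided updates and any step weighting the paper may use.

```latex
% Standard IG: straight-line path from baseline x' to input x
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1
  \frac{\partial f\big(x' + t\,(x - x')\big)}{\partial x_i}\, dt

% Path method along a decoded latent path \gamma(t) = D(z(t)), with z(0) = z', z(1) = z
A_i = \int_0^1 \frac{\partial f(\gamma(t))}{\partial x_i}\,
  \frac{d\gamma_i(t)}{dt}\, dt,
\qquad
\sum_i A_i = f\big(D(z)\big) - f\big(D(z')\big)
```

The completeness identity on the right follows from the fundamental theorem of calculus for line integrals; it is stated relative to the decoded endpoints, not the raw input x.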
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of our empirical evaluations.
Point-by-point responses
- Referee: [§3.2] Method: The claim that decoded latent states remain 'proximal to the input' and thereby reduce off-manifold noise rests on the untested assumption that the pre-trained VAE produces faithful reconstructions. No reconstruction error, FID scores, or manifold-distance metrics are reported to quantify how close the decoded path points actually lie to the true data manifold.
Authors: We agree that providing quantitative evidence of VAE reconstruction quality is important to support our claims. In the revised manuscript, we will report reconstruction errors (MSE) on held-out test data for each dataset and include FID scores where applicable to quantify the closeness of decoded points to the data manifold. This will help validate that the integration paths remain proximal to the input and reduce off-manifold exposure. revision: yes
- Referee: [§4.3] Quantitative Evaluation: The reported outperformance over Guided IG is not accompanied by an ablation that disables the VAE decoding step while retaining the guidance rule. Without this isolation, it is impossible to determine whether gains derive from manifold alignment or from other implementation details of the path construction.
Authors: We acknowledge the value of such an ablation for isolating the contribution of manifold alignment. However, the guidance mechanism in MA-GIG is inherently tied to the latent-space representation and relies on the decoding step to map latent points back to input space, where gradients are computed. Disabling decoding while keeping the guidance rule would require a fundamentally different implementation that no longer corresponds to the proposed method. We will add a detailed discussion in §4.3 explaining this design choice and include an alternative ablation, such as varying the VAE's reconstruction fidelity or comparing against a non-generative baseline with similar path guidance, to better isolate the effect. revision: partial
- Referee: [Table 2] Results table (or equivalent): Performance differences are presented without error bars, standard deviations across runs, or statistical significance tests, making it difficult to assess whether the claimed superiority is robust or could be explained by variance in the experimental setup.
Authors: We appreciate this suggestion for improving the statistical robustness of our results. In the revised version, we will rerun the experiments with multiple random seeds (at least 5), report mean performance with standard deviations, add error bars to relevant tables and figures, and include statistical significance tests (e.g., Wilcoxon signed-rank tests) with p-values to confirm the observed improvements are significant. revision: yes
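The statistics promised in this response could be computed along the following lines, assuming paired per-image (or per-seed) scores for MA-GIG and one baseline. The function names are hypothetical. The exact p-value counts rank subsets with a small dynamic program, which only holds when there are no zero or tied absolute differences; for real data with ties, a library routine such as `scipy.stats.wilcoxon` (with an explicit `zero_method`) is the safer choice.

```python
import statistics


def mean_std(values):
    """Mean and sample standard deviation, e.g. of one metric over random seeds."""
    return statistics.mean(values), statistics.stdev(values)


def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired scores a[i], b[i].

    Simplifying assumption: no zero differences and no tied |differences|,
    so the ranks are exactly 1..n.
    """
    d = [x - y for x, y in zip(a, b)]
    mags = sorted(abs(v) for v in d)
    if 0 in mags or len(set(mags)) != len(mags):
        raise ValueError("zeros or ties in |differences| are not handled here")
    rank = {m: i + 1 for i, m in enumerate(mags)}
    n = len(d)
    w_plus = sum(rank[abs(v)] for v in d if v > 0)  # sum of ranks of positive diffs
    # counts[s]: number of the 2**n equally likely sign patterns with W+ == s.
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    total = 2 ** n
    p_low = sum(counts[: w_plus + 1]) / total   # P(W+ <= observed)
    p_high = sum(counts[w_plus:]) / total       # P(W+ >= observed)
    return w_plus, min(1.0, 2.0 * min(p_low, p_high))
```

For example, if MA-GIG beats a baseline on all 10 paired images with distinct margins, W+ = 55 and the exact two-sided p-value is 2/1024.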
Circularity Check
No circularity: MA-GIG derivation is self-contained via external VAE and empirical validation
full rationale
The paper proposes Manifold-Aligned Guided Integrated Gradients by constructing attribution paths in the latent space of a pre-trained variational autoencoder and decoding intermediate states to bias toward the generative manifold. No derivation step reduces a claimed prediction or result to its own inputs by construction, nor does any load-bearing premise rely on self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation. The central claims rest on qualitative/quantitative evaluations across datasets and classifiers rather than definitional equivalence or fitted-parameter renaming. This is the normal case of an independent methodological contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The pre-trained variational autoencoder accurately captures the data manifold.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear?)
unclear: Relation between the paper passage and the cited Recognition theorem.
constructs attribution paths in the latent space of a pre-trained variational autoencoder... By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear?)
unclear: Relation between the paper passage and the cited Recognition theorem.
the decoder is a smooth immersion... Im(JD(z)) = T_{D(z)} M
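The quoted condition states that the column space of the decoder Jacobian at z is the tangent space of the learned manifold at D(z). A small numerical sketch of that statement, assuming a hypothetical two-dimensional toy decoder and central finite differences (autodiff would be used for a real VAE), projects an input-space vector, such as a classifier gradient, onto that tangent space. Whether and where MA-GIG applies such a projection is not confirmed by this excerpt.

```python
import numpy as np


def toy_decoder(z):
    """Hypothetical smooth decoder R^2 -> R^3 describing a curved 2-D surface."""
    return np.array([z[0], z[1], np.sin(z[0]) + z[1] ** 2])


def jacobian_fd(f, z, h=1e-6):
    """Central finite-difference Jacobian of f at z, shape (out_dim, in_dim)."""
    cols = []
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = h
        cols.append((f(z + e) - f(z - e)) / (2.0 * h))
    return np.stack(cols, axis=1)


def project_to_tangent(J, v):
    """Orthogonal projection of v onto Im(J), computed by least squares."""
    coeffs, *_ = np.linalg.lstsq(J, v, rcond=None)
    return J @ coeffs
```

The residual `v - project_to_tangent(J, v)` is orthogonal to every column of `J`, i.e. it is the off-manifold (normal) component of `v`.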
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution
Spectral Integrated Gradients constructs SVD-based integration paths that activate singular components from largest to smallest, producing cleaner attribution maps and better quantitative scores than standard Integrat...
Reference graph
Works this paper leans on
-
[1]
Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049.
-
[2]
Jeon, G., Jeong, H., and Choi, J. Beyond single path integrated gradients for reliable input attribution via randomized path sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2052–2061.
-
[3]
Kharitenko, A., Shen, Z., de Santi, R., He, N., and Doerfler, F. Landing with the score: Riemannian optimization through denoising. arXiv preprint arXiv:2509.23357, 2025.
-
[4]
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
-
[5]
Lee, K. and Choi, J. Local manifold approximation and projection for manifold-aware diffusion planning. arXiv preprint arXiv:2506.00867, 2025a.
Lee, K. and Choi, J. State-covering trajectory stitching for diffusion planners. In Advances in Neural Information Processing Systems (NeurIPS), 2025b.
Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framew...
-
[6]
Lee, K., Kim, S., and Choi, J. Adaptive and explainable deployment of navigation skills via hierarchical deep reinforcement learning. arXiv preprint arXiv:2305.19746, 2023a.
Lee, K., Kim, S., and Choi, J. Refining diffusion planner for reliable behavior synthesis by automatic detection of infeasible plans. In Advances in Neural Information Processing Syst...
-
[7]
Razzhigaev, A., Shakhmatov, A., Maltseva, A., Arkhipkin, V., Pavlov, I., Ryabov, I., Kuts, A., Panchenko, A., Kuznetsov, A., and Dimitrov, D. Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion. arXiv preprint arXiv:2310.03502.
-
[8]
Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
-
[9]
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
-
[10]
Zhang, B., Zheng, W., Zhou, J., and Lu, J. Path choice matters for clear attribution in path methods. arXiv preprint arXiv:2401.10442.
-
[11]
ISSN 1939-3539. doi: 10.1109/TPAMI.2024.3388092. Appendix table of contents: Appendix A. Notations; Appen...
-
[12]
Aggregated Score. The final scalar metric is obtained by integrating the DiffID curve over the perturbation range: $\mathrm{DiffID} = \int_0^1 \psi(x, \delta)\,d\delta$ (31). A higher DiffID score indicates a more faithful attribution map, distinguishing salient features from irrelevant ones more effectivel...
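In practice the integral in (31) is approximated from ψ sampled on a finite grid of perturbation levels. The sketch below assumes those samples are already available (the definition of ψ is cut off in this excerpt) and applies the trapezoidal rule; the function name and grid are hypothetical.

```python
def diffid_from_samples(deltas, psi_values):
    """Trapezoidal estimate of DiffID, the integral of psi(x, delta) over delta in [0, 1].

    deltas: increasing perturbation levels starting at 0.0 and ending at 1.0.
    psi_values: psi(x, delta) evaluated at each level (assumed precomputed).
    """
    if len(deltas) != len(psi_values) or len(deltas) < 2:
        raise ValueError("need at least two matching (delta, psi) samples")
    if deltas[0] != 0.0 or deltas[-1] != 1.0:
        raise ValueError("samples must cover the whole range [0, 1]")
    area = 0.0
    for d0, d1, p0, p1 in zip(deltas[:-1], deltas[1:], psi_values[:-1], psi_values[1:]):
        area += 0.5 * (p0 + p1) * (d1 - d0)  # trapezoid on [d0, d1]
    return area
```

A finer δ grid reduces the discretization error; the trapezoidal rule is exact whenever ψ is piecewise linear in δ between grid points.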
-
[13]
based on unCLIP architecture. Unlike the SD family, it utilizes mCLIP for embeddings and employs a MoVQGAN decoder for latent reconstruction. It is trained on large-scale datasets including LAION HighRes and fine-tuned on high-quality internal datasets. We use the kandinsky-community/kandinsky-2-1 checkpoint. F.3. Baselines: Here we describe the specific im...
-
[14]
0.1996 0.2861 0.0866 AGI (Pan et al.,
-
[15]
Method MPRT (↑) G×I (Shrikumar et al., 2016)−0.005 IG (Sundararajan et al.,
0.2532 0.3212 0.0680 MA-GIG 0.2670 0.3429 0.0760 Table 7. Efficient MPRT sanity check on ImageNet with ResNet18.
-
[16]
0.073 IG2 (Zhuo & Ge, 2024) −0.063 AGI (Pan et al., 2021) −0.066 EIG (Jha et al.,
-
[17]
0.013 MIG (Zaher et al., 2024) −0.003 GIG (Kapishnikov et al.,
-
[18]
0.308 MA-GIG 0.625 Large-scale ImageNet evaluation. To verify that the ImageNet results are not an artifact of the 500-image subset, we additionally evaluate 5,000 ImageNet validation images using ResNet18. MA-GIG preserves the ranking observed in the main evaluation and achieves the best DiffID and Insertion AUC. Model-randomization sanity check. We evalua...
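The Insertion AUC reported above is commonly computed by revealing input features in decreasing attribution order on top of a baseline and integrating the model score over the fraction revealed. A minimal sketch under assumptions: a flat feature vector, equal-sized chunks per step, and a hypothetical `score_fn`; the paper's exact protocol (baseline image, step count, score normalization) is not given in this excerpt.

```python
import numpy as np


def insertion_auc(x, baseline, attribution, score_fn, n_steps=10):
    """Reveal features of x in decreasing-attribution order; return the area under the score curve."""
    order = np.argsort(-attribution.ravel())  # most important feature first
    flat_x = x.ravel()
    current = baseline.ravel().copy()
    per_step = int(np.ceil(order.size / n_steps))
    scores = [score_fn(current.reshape(x.shape))]
    for k in range(n_steps):
        idx = order[k * per_step:(k + 1) * per_step]
        current[idx] = flat_x[idx]  # insert the next chunk of input features
        scores.append(score_fn(current.reshape(x.shape)))
    # Trapezoidal area with the x-axis (fraction revealed) normalized to [0, 1].
    scores = np.asarray(scores, dtype=float)
    return float(np.sum(0.5 * (scores[:-1] + scores[1:])) / n_steps)
```

A faithful attribution ranks the features the model relies on first, so the score rises early and the area is large.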
-
[19]
0.2384 0.4378 0.1994 0.4060 0.5174 0.1114 0.2255 0.3940 0.1685 IG (Sundararajan et al.,
-
[20]
0.3634 0.5093 0.1459 0.5556 0.5889 0.0333 0.3586 0.4880 0.1294 MA-GIG (SD1) 0.4255 0.5550 0.1294 0.5745 0.6547 0.0802 0.4033 0.5147 0.1114 MA-GIG (SD1, w/ Slerp) 0.4387 0.5652 0.1264 0.5985 0.6751 0.0766 0.3937 0.5042 0.1105 MA-GIG (SD2) 0.4495 0.5697 0.1201 0.5901 0.6601 0.0700 0.4075 0.5213 0.1138 MA-GIG (SD2, w/ Slerp) 0.4474 0.5739 0.1264 0.6171 0.69790....
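The rows labelled "w/ Slerp" presumably refer to spherical linear interpolation between latent codes instead of a straight line. A minimal sketch of standard slerp on flattened latent vectors follows; the zero-vector and parallel-vector fallbacks are illustrative choices, and how MA-GIG combines slerp with its guided updates is not shown in this excerpt.

```python
import numpy as np


def slerp(z0, z1, t, eps=1e-7):
    """Spherical linear interpolation between flat latent vectors z0 and z1, t in [0, 1]."""
    z0, z1 = np.asarray(z0, dtype=float), np.asarray(z1, dtype=float)
    n0, n1 = np.linalg.norm(z0), np.linalg.norm(z1)
    if n0 < eps or n1 < eps:
        # The angle is undefined for a zero vector (e.g. an all-zero baseline latent).
        return (1.0 - t) * z0 + t * z1
    omega = np.arccos(np.clip(np.dot(z0, z1) / (n0 * n1), -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly (anti)parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)
```

Unlike a straight line, slerp between two unit-norm codes keeps every intermediate point at unit norm, which matters when typical latents concentrate near a shell, as high-dimensional Gaussian samples do.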
-
[21]
0.1222 0.2338 0.1116 0.2784 0.3576 0.0791 0.2000 0.2813 0.0813 IG (Sundararajan et al.,
-
[22]
0.0193 0.1713 0.1520 0.2224 0.3304 0.1080 0.0816 0.2053 0.1238 AGI (Pan et al.,
-
[23]
0.0136 0.1569 0.1433 0.1402 0.2682 0.1280 0.0787 0.1947 0.1160 EIG (Jha et al.,
-
[24]
0.1891 0.2720 0.0829 0.2542 0.3424 0.0882 0.2551 0.3282 0.0731 MA-GIG (SD1) 0.2249 0.3240 0.0991 0.3367 0.4109 0.0742 0.3067 0.3800 0.0733 MA-GIG (SD1, w/ Slerp) 0.2009 0.3064 0.1056 0.3333 0.4131 0.0798 0.2884 0.3638 0.0753 MA-GIG (SD2) 0.2324 0.3253 0.0929 0.3429 0.4191 0.0762 0.3073 0.3756 0.0682 MA-GIG (SD2, w/ Slerp) 0.1933 0.2998 0.1064 0.3404 0.4196 ...
-
[25]
0.1358 0.2507 0.1149 0.2469 0.3222 0.0753 0.2020 0.3138 0.1118 IG2 (Zhuo & Ge,
-
[26]
0.1447 0.2507 0.1060 0.2242 0.3040 0.0798 0.2067 0.2993 0.0927 EIG (Jha et al.,
-
[27]
0.1327 0.2467 0.1140 0.2436 0.3222 0.0787 0.1942 0.3013 0.1071 MIG (Zaher et al.,
-
[28]
0.2536 0.3304 0.0769 0.2842 0.3367 0.0524 0.2771 0.3667 0.0896 MA-GIG (SD1) 0.2467 0.3333 0.0867 0.3247 0.3869 0.0622 0.3020 0.3871 0.0851 MA-GIG (SD1, w/ Slerp) 0.2542 0.3382 0.0840 0.3329 0.3933 0.0604 0.3118 0.3893 0.0776 MA-GIG (SD2) 0.2598 0.3409 0.0811 0.3284 0.3900 0.0616 0.3082 0.3838 0.0756 MA-GIG (SD2, w/ Slerp) 0.2638 0.3478 0.0840 0.3429 0.3982 0.05...
-
[29]
These visualizations cover various classifiers on the ImageNet2012, Oxford-IIIT Pet, and Oxford 102 Flower datasets. The visualized intermediate steps (second and third rows in each panel) provide empirical evidence for the reliability of our framework. First, regarding gradient behavior, pixel-space guidance methods like GIG often suffer from manifold de...