IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients
Pith reviewed 2026-07-01 06:16 UTC · model grok-4.3
The pith
IG-Lens attributes the exact change in target token probability to each transformer layer by telescoping integrated gradients along a hidden-state path.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Crediting each segment of a single integrated-gradients path to the layer it terminates at produces a layer-wise map whose values sum exactly to the change in target probability. Two choices make this work: the softmax is kept inside the integration path, and each step is credited its observed probability delta rather than its raw gradient.
What carries the argument
Telescoping integrated gradients: the integral of the probability gradient along a straight hidden-state path is partitioned at layer boundaries so that each partition equals the difference in the probability function evaluated at the segment endpoints.
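The identity described above can be written out as follows. This is a reconstruction from the summary, not the paper's own notation: here h_0 is assumed to denote the baseline, h_ℓ the hidden state after layer ℓ, p(h) the softmax probability of the target token read out from h, and m the number of integration steps per segment.

```latex
\begin{align}
\gamma_\ell(\alpha) &= h_{\ell-1} + \alpha\,(h_\ell - h_{\ell-1}), \qquad \alpha \in [0,1], \\
A_\ell &= \int_0^1 \nabla p\bigl(\gamma_\ell(\alpha)\bigr)^{\top} (h_\ell - h_{\ell-1})\, d\alpha
        = p(h_\ell) - p(h_{\ell-1}), \\
\sum_{\ell=1}^{L} A_\ell &= p(h_L) - p(h_0), \\
\sum_{k=1}^{m} \Bigl[ p\bigl(\gamma_\ell(\tfrac{k}{m})\bigr) - p\bigl(\gamma_\ell(\tfrac{k-1}{m})\bigr) \Bigr]
  &= p(h_\ell) - p(h_{\ell-1}) \quad \text{for every } m \ge 1.
\end{align}
```

The second line is the gradient theorem applied to one segment; the last line is the discrete counterpart, in which interior terms cancel so that the step count m does not affect the result.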
If this is right
- The layer attributions remain additive in probability space at any chosen number of integration steps.
- A single forward pass through the model with batched baselines computes the complete token-by-layer map without any backward call.
- The method suppresses integration steps that produce gradient signal without an actual change in the target probability.
- Completeness holds to floating-point accuracy rather than depending on Riemann-sum approximation quality.
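The properties listed above can be illustrated with a minimal NumPy sketch. It assumes, as simplifications not taken from the paper, that the readout is a plain softmax over `h @ W_U` (no final layer norm), that the path is piecewise linear through the stacked hidden states, and that row 0 of `hidden_states` is the baseline. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def target_prob(h, W_U, target):
    """Softmax probability of `target` read out from hidden state(s) h (..., d)."""
    logits = h @ W_U
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[..., target]

def telescoping_layer_attributions(hidden_states, W_U, target, n_steps=4):
    """Per-layer attribution from probability deltas along a piecewise-linear path.

    hidden_states: array (L + 1, d); row 0 is the assumed baseline,
    row l is the hidden state after layer l.
    Returns an array (L,) whose sum is p(h_L) - p(h_0) up to rounding.
    """
    H = np.asarray(hidden_states, dtype=np.float64)
    alphas = np.linspace(0.0, 1.0, n_steps + 1)            # (m + 1,)
    starts, ends = H[:-1], H[1:]                            # (L, d) each
    # Every interpolation point of every segment in one batch: (L, m + 1, d).
    points = starts[:, None, :] + alphas[None, :, None] * (ends - starts)[:, None, :]
    p = target_prob(points, W_U, target)                    # (L, m + 1)
    step_deltas = np.diff(p, axis=1)                        # observed change per step
    return step_deltas.sum(axis=1)                          # credit to terminating layer

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.normal(size=(5, 8))      # toy baseline plus 4 "layers"
    W_U = rng.normal(size=(8, 10))   # toy unembedding
    attr = telescoping_layer_attributions(H, W_U, target=2, n_steps=3)
    total = target_prob(H[-1], W_U, 2) - target_prob(H[0], W_U, 2)
    print(attr, attr.sum() - total)
```

Only forward evaluations of the readout are used, and the residual between the summed attributions and the total change stays at rounding level for any `n_steps` in this toy setting.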
Where Pith is reading between the lines
- The same telescoping construction could be applied to other nonlinear readouts, such as cross-entropy loss or next-token distributions over multiple tokens.
- If the path-based attributions align with causal interventions, they could serve as a diagnostic for identifying the depth at which particular factual associations are assembled.
- The exactness property removes one source of variance in attribution studies, allowing cleaner comparisons of how different model scales or training regimes distribute probability mass across depth.
Load-bearing premise
That a single straight-line path through hidden states from a chosen baseline to the final layer produces attributions that reflect each layer's causal contribution to the probability change.
What would settle it
An input where the summed layer attributions differ from the total probability change by more than machine precision, or where layer attributions contradict the probability shift observed after ablating the corresponding layer.
Original abstract
We ask a simple question about decoder-only transformers: between which two layers is the probability of a predicted token actually produced? Existing layer-wise readout tools answer only approximately. The logit lens and its trained variant report a per-layer level of probability but give no additive decomposition; their estimates are biased and non-monotone across depth. Direct Logit Attribution and related residual-stream methods are additive, but only in logit space: the softmax nonlinearity breaks additivity in probability space, precisely the quantity one usually cares about. Layer Conductance integrates gradients per layer, but attributes each to its own baseline and so does not sum to the total change in prediction. We introduce IG-Lens, a telescoping application of Integrated Gradients along a single path through the hidden states from a baseline to the final layer. Crediting each segment to the layer it terminates at yields a layer-wise attribution whose sum is exactly the change in target probability, with the softmax inside the integration path rather than linearized away. Our default estimator credits each integration step its observed change in target probability (a prediction-aware reweighting in the spirit of IDGI) rather than its raw gradient. Because the readout is a one-dimensional probability, this collapses each segment to a telescoping sum of endpoint values, so completeness holds exactly (to floating point) at any step count, removing Riemann discretization error while suppressing steps that show gradient sensitivity without a change in output. We give the telescoping identity and its proof, verify completeness to floating point, and describe a single-pass batched implementation computing the full token-by-layer map without any backward call. Code: https://github.com/anhnda/IGLens.
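The abstract's remark that residual-stream methods are additive in logit space but not in probability space can be checked on a two-token toy example. The setup below is an illustrative assumption (identity unembedding, two identical residual updates, each update scored on its own from the baseline); it is not taken from the paper.

```python
import numpy as np

def target_prob(h, W_U, target):
    """Softmax probability of `target` for a single hidden state h (d,)."""
    logits = h @ W_U
    e = np.exp(logits - logits.max())
    return e[target] / e.sum()

W_U = np.eye(2)                       # toy unembedding: logits equal the hidden state
target = 0
h0 = np.zeros(2)                      # toy baseline
updates = np.array([[1.0, 0.0],       # residual update from "layer 1"
                    [1.0, 0.0]])      # residual update from "layer 2"
h_final = h0 + updates.sum(axis=0)

# Logit space: per-update contributions add up to the total logit change.
logit_parts = updates @ W_U[:, target]
logit_total = (h_final - h0) @ W_U[:, target]

# Probability space: scoring each update alone from the baseline does not add up.
p0 = target_prob(h0, W_U, target)
prob_parts = np.array([target_prob(h0 + u, W_U, target) - p0 for u in updates])
prob_total = target_prob(h_final, W_U, target) - p0

print("logit parts sum:", logit_parts.sum(), "total:", logit_total)
print("prob parts sum: ", prob_parts.sum(), "total:", prob_total)
```

The logit contributions sum to the total logit change, while the per-update probability changes overshoot the total probability change because the softmax saturates.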
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IG-Lens, a telescoping variant of integrated gradients applied along a single straight-line path in the hidden-state space of decoder-only transformers. By crediting each segment of the path to the layer at which it terminates and using observed endpoint probability deltas (rather than raw gradients), the method produces layer-wise attributions whose sum equals exactly the change in target-token probability, with the softmax kept inside the path; completeness holds to floating point at any step count because the telescoping identity collapses the integral to endpoint differences.
Significance. If the straight-line path is accepted as a meaningful interpolation, IG-Lens supplies the first additive decomposition of probability change across layers that is exact (no discretization error, no linearization of softmax) and complete by construction. The single-pass batched implementation without any backward pass is a practical advantage over standard IG, and the explicit telescoping identity plus floating-point verification constitute a clear technical contribution over prior readout tools.
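The contrast between a gradient-based Riemann estimate and the endpoint-delta estimate can be sketched on a single segment. The example assumes a left Riemann sum, a plain softmax readout with an analytic gradient, and an identity unembedding; the function names are hypothetical and the setup is a toy, not the paper's experiment.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def riemann_ig(a, b, W_U, target, n_steps):
    """Left Riemann sum of grad p(h) . (b - a) along the straight segment a -> b."""
    total = 0.0
    for k in range(n_steps):
        h = a + (k / n_steps) * (b - a)
        p = softmax(h @ W_U)
        onehot = np.eye(len(p))[target]
        grad_h = W_U @ (p[target] * (onehot - p))   # d p_target / d h for softmax readout
        total += grad_h @ (b - a) / n_steps
    return total

def delta_ig(a, b, W_U, target, n_steps):
    """Sum of observed probability deltas over the same steps (telescopes)."""
    alphas = np.linspace(0.0, 1.0, n_steps + 1)
    probs = np.array([softmax((a + t * (b - a)) @ W_U)[target] for t in alphas])
    return np.diff(probs).sum()

W_U = np.eye(2)
a, b, target = np.zeros(2), np.array([4.0, 0.0]), 0
exact = softmax(b @ W_U)[target] - softmax(a @ W_U)[target]

for m in (2, 8, 64):
    print(m,
          abs(riemann_ig(a, b, W_U, target, m) - exact),   # shrinks as m grows
          abs(delta_ig(a, b, W_U, target, m) - exact))     # rounding level at any m
```

In this toy case the Riemann error depends on the step count, whereas the delta estimate matches the endpoint difference regardless of how many steps are taken.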
major comments (1)
- [Abstract] Abstract (paragraph describing the path): the central interpretive claim—that crediting segments of the straight-line hidden-state path to terminating layers answers 'between which two layers the probability is actually produced'—rests on the assumption that this particular continuous path isolates each layer's causal contribution. The telescoping completeness identity holds for any continuous path connecting the same endpoints, so different paths would produce different layer-wise breakdowns while still summing exactly to the total probability change; the manuscript does not provide justification, ablation, or empirical test showing why the straight line is the appropriate path for causal attribution.
minor comments (1)
- The abstract states that the telescoping identity and its proof are given and that completeness is verified to floating point; the corresponding section should include the explicit statement of the identity (with equation numbers) so readers can check the derivation without reconstructing it from the surrounding text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The single major comment identifies a substantive point regarding path selection that we address directly below. We will revise the manuscript to incorporate additional discussion on this issue.
Point-by-point responses
Referee: [Abstract] Abstract (paragraph describing the path): the central interpretive claim—that crediting segments of the straight-line hidden-state path to terminating layers answers 'between which two layers the probability is actually produced'—rests on the assumption that this particular continuous path isolates each layer's causal contribution. The telescoping completeness identity holds for any continuous path connecting the same endpoints, so different paths would produce different layer-wise breakdowns while still summing exactly to the total probability change; the manuscript does not provide justification, ablation, or empirical test showing why the straight line is the appropriate path for causal attribution.
Authors: We agree that the telescoping completeness property is path-independent and that the specific layer-wise breakdown depends on the chosen path. The manuscript adopts the straight-line path in hidden-state space because it is the canonical interpolation used in the original Integrated Gradients formulation, corresponding to a uniform progression between the baseline and observed hidden states as layers are applied sequentially. This choice aligns with the residual-stream structure of decoder-only transformers, where each layer produces an incremental update to the hidden state. We do not claim that the straight-line path is the unique or universally optimal choice for isolating causal contributions; rather, it yields an exact additive decomposition along a standard, well-defined path while keeping the softmax inside the integration. The manuscript would benefit from explicit elaboration on this rationale. In revision we will expand the relevant section (and the abstract if space permits) to clarify the motivation for the straight-line path, reference its use in prior IG literature, and note that alternative paths remain an avenue for future investigation. No ablation across paths is currently present, and we will not add one in this revision as it would require substantial new experiments beyond the scope of addressing the current comment.
Revision: yes
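The path-dependence point raised in this exchange can be made concrete with a toy example. The sketch below is a hypothetical illustration, not an experiment from the paper: it assumes an identity unembedding and two residual updates named `A` and `B`, walks from the same baseline to the same final state in two different orders, and credits each segment's probability delta to the update it adds.

```python
import numpy as np

def target_prob(h, target=0):
    """Softmax probability of `target`; logits are the hidden state itself (toy assumption)."""
    e = np.exp(h - h.max())
    return e[target] / e.sum()

def credit_in_order(h0, updates, order):
    """Add updates in `order`, crediting each segment's probability delta to its update."""
    credits = {}
    h = h0.copy()
    for name in order:
        nxt = h + updates[name]
        credits[name] = target_prob(nxt) - target_prob(h)
        h = nxt
    return credits

h0 = np.zeros(2)
updates = {"A": np.array([2.0, 0.0]), "B": np.array([1.0, 0.0])}

ab = credit_in_order(h0, updates, ["A", "B"])   # path h0 -> h0+A -> h0+A+B
ba = credit_in_order(h0, updates, ["B", "A"])   # path h0 -> h0+B -> h0+B+A

print("A then B:", ab, "sum =", sum(ab.values()))
print("B then A:", ba, "sum =", sum(ba.values()))
```

Both orderings sum to the same total probability change, but the credit given to `A` is roughly 0.38 on the first path and roughly 0.22 on the second, so completeness alone does not pin down the per-segment breakdown.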
Circularity Check
No circularity: derivation uses standard IG completeness plus explicit telescoping identity on chosen path
The paper defines IG-Lens as the application of Integrated Gradients along one straight-line path in hidden-state space, then credits each segment's probability delta to the terminating layer. The exact additive sum follows directly from the telescoping property of any sequence of endpoint differences (sum of deltas equals total change), which is a general mathematical fact independent of the model or the specific path. No parameters are fitted to data and then relabeled as predictions. No self-citations are invoked to justify uniqueness or to smuggle an ansatz. The straight-line path choice is stated as an explicit modeling decision rather than derived from prior results by the same authors. The method is therefore self-contained against external benchmarks (standard IG axiom plus elementary summation identity) and receives a non-finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Integrated gradients completeness holds when the path is taken through the sequence of hidden states.
- standard math: The telescoping sum of endpoint probability values equals the integrated contribution for each segment.
Reference graph
Works this paper leans on
- [1] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
- [2] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space, 2023.
- [3] Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron? In International Conference on Learning Representations (ICLR), 2019.
- [4] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html
- [5] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Empirical Methods in Natural Language Processing (EMNLP), 2022.
- [6] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Empirical Methods in Natural Language Processing (EMNLP), 2021.
- [7] Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson. Captum: A unified and generic model interpretability library for PyTorch, 2020.
- [8] Kevin Meng, David Bau, Andrew Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [9] Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.
- [10] nostalgebraist. Interpreting GPT: The logit lens. LessWrong, 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
- [11] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML), 2017.
- [12] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.
- [13] Ruo Yang, Binxia Wang, and Mustafa Bilgic. IDGI: A framework to eliminate explanation noise from integrated gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23725–23734, 2023.