IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients

Duc Anh Nguyen

arxiv: 2606.29693 · v2 · pith:C5LYRGAJnew · submitted 2026-06-29 · 💻 cs.LG

Pith reviewed 2026-07-01 06:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords integrated gradients · layer-wise attribution · transformer interpretability · probability decomposition · decoder-only models · telescoping sum · logit lens

The pith

IG-Lens attributes the exact change in target token probability to each transformer layer by telescoping integrated gradients along a hidden-state path.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines where the probability of a predicted token arises between layers in decoder-only transformers. Prior methods either produce non-additive or biased probability estimates or achieve additivity only in logit space, where the final softmax breaks the decomposition. IG-Lens integrates the gradient of the target probability along one straight path through the hidden states from a baseline to the final layer, then credits each segment to the layer at which it ends. Because the readout remains one-dimensional probability, each segment collapses to a telescoping difference of endpoint values, so the layer attributions sum exactly to the observed probability change with the softmax kept inside the path.

Core claim

Crediting each segment of a single integrated-gradients path to the layer it terminates at produces a layer-wise map whose values sum exactly to the change in target probability, achieved by placing the softmax inside the integration path and replacing raw gradients with observed probability deltas at each step.

What carries the argument

Telescoping integrated gradients: the integral of the probability gradient along a straight hidden-state path is partitioned at layer boundaries so that each partition equals the difference in the probability function evaluated at the segment endpoints.
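
The identity can be written out as follows. This is a sketch in notation assumed here, not the paper's own statement: $p$ is the target-probability readout with the softmax included, $\gamma$ is the integration path, $t_0 < t_1 < \dots < t_L$ are the breakpoints, $h_\ell = \gamma(t_\ell)$ is the path point at which segment $\ell$ ends (with $h_0$ the baseline), and $A_\ell$ is the credit given to layer $\ell$.

```latex
\begin{aligned}
A_\ell &= \int_{t_{\ell-1}}^{t_\ell} \nabla p\big(\gamma(t)\big) \cdot \gamma'(t)\, dt
        = p(h_\ell) - p(h_{\ell-1}), \\
\sum_{\ell=1}^{L} A_\ell &= p(h_L) - p(h_0).
\end{aligned}
```

The second equality in the first line is the fundamental theorem of calculus applied to the scalar function $t \mapsto p(\gamma(t))$, assuming $p$ is differentiable along a piecewise-smooth path. Splitting a segment into $m$ steps and summing the per-step differences of $p$ gives the same value for every $m$, which is why no Riemann-sum error appears.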

If this is right

The layer attributions remain additive in probability space at any chosen number of integration steps.
A single forward pass through the model with batched baselines computes the complete token-by-layer map without any backward call.
The method suppresses integration steps that produce gradient signal without an actual change in the target probability.
Completeness holds to floating-point accuracy rather than depending on Riemann-sum approximation quality.
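
The contrast between a Riemann-sum estimate and endpoint deltas can be illustrated with a toy readout. The following is a minimal sketch under stated assumptions, not the author's implementation: the "model" is just a random unembedding matrix `W` followed by a softmax, the path is assumed piecewise-linear through a list of stand-in hidden states, and the names `softmax_probs` and `riemann_vs_delta` are hypothetical.

```python
import numpy as np

def softmax_probs(h, W):
    """Full softmax over a toy vocabulary for hidden state h (assumed readout)."""
    z = W @ h
    z = z - z.max()                      # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def riemann_vs_delta(hidden, W, target, steps=4):
    """Per-segment credits along a piecewise-linear path through `hidden`.

    hidden: (L + 1, d) array, row 0 is the baseline (toy stand-in).
    Returns (delta_credits, riemann_credits), each of shape (L,).
    """
    delta_credits, riemann_credits = [], []
    alphas = np.linspace(0.0, 1.0, steps + 1)
    for a, b in zip(hidden[:-1], hidden[1:]):
        points = a + alphas[:, None] * (b - a)                    # (steps + 1, d)
        probs = np.array([softmax_probs(x, W) for x in points])   # (steps + 1, V)
        p = probs[:, target]
        # Endpoint-delta estimator: sum of observed probability changes per step.
        delta_credits.append(np.diff(p).sum())
        # Left Riemann sum of grad(p) . (b - a), using the analytic softmax
        # gradient p_t * (W[t] - sum_j p_j W[j]).
        grads = p[:-1, None] * (W[target] - probs[:-1] @ W)       # (steps, d)
        riemann_credits.append((grads @ (b - a)).sum() / steps)
    return np.array(delta_credits), np.array(riemann_credits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(5, 8))     # 4 toy "layers", width 8
    W = rng.normal(size=(10, 8))         # toy vocabulary of 10 tokens
    delta, riemann = riemann_vs_delta(hidden, W, target=3, steps=4)
    total = softmax_probs(hidden[-1], W)[3] - softmax_probs(hidden[0], W)[3]
    print("delta residual:  ", abs(delta.sum() - total))
    print("riemann residual:", abs(riemann.sum() - total))
```

In this toy setting the delta residual stays at rounding level for any `steps`, while the Riemann residual shrinks only as `steps` grows.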

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same telescoping construction could be applied to other nonlinear readouts, such as cross-entropy loss or next-token distributions over multiple tokens.
If the path-based attributions align with causal interventions, they could serve as a diagnostic for identifying the depth at which particular factual associations are assembled.
The exactness property removes one source of variance in attribution studies, allowing cleaner comparisons of how different model scales or training regimes distribute probability mass across depth.

Load-bearing premise

That a single straight-line path through hidden states from a chosen baseline to the final layer produces attributions that reflect each layer's causal contribution to the probability change.

What would settle it

An input where the summed layer attributions differ from the total probability change by more than machine precision, or where layer attributions contradict the probability shift observed after ablating the corresponding layer.
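
Both checks can be prototyped on a toy residual stream. The sketch below is an illustration under strong assumptions, not a test of the paper's method: each "layer" adds a fixed update vector that does not depend on the stream, ablation means subtracting that update from the final hidden state, and `target_prob`, `credits_vs_ablation`, `h0`, `updates` and `W` are hypothetical names.

```python
import numpy as np

def target_prob(h, W, target):
    """Softmax probability of `target` under a toy unembedding W (assumed readout)."""
    z = W @ h
    z = z - z.max()
    e = np.exp(z)
    return e[target] / e.sum()

def credits_vs_ablation(h0, updates, W, target):
    """Compare telescoping credits with single-layer ablation effects.

    Toy residual stream (an assumption): h_l = h_{l-1} + updates[l - 1].
    Returns (credits, ablation, residual) where residual is the gap between
    the summed credits and the total probability change.
    """
    hidden = h0 + np.vstack([np.zeros_like(h0), np.cumsum(updates, axis=0)])
    p = np.array([target_prob(h, W, target) for h in hidden])
    credits = np.diff(p)                              # p(h_l) - p(h_{l-1})
    residual = credits.sum() - (p[-1] - p[0])         # should be rounding-level
    # Ablation effect: drop in final probability when one update is removed.
    ablation = np.array(
        [p[-1] - target_prob(hidden[-1] - u, W, target) for u in updates]
    )
    return credits, ablation, residual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    credits, ablation, residual = credits_vs_ablation(
        rng.normal(size=6), rng.normal(size=(4, 6)), rng.normal(size=(9, 6)), target=2
    )
    print("residual:", residual)
    print("credits: ", credits)
    print("ablation:", ablation)
```

A rounding-level residual speaks only to completeness. Whether the credits track the ablation effects is a separate, empirical question, and in a real transformer later layers also react to the ablation, which this toy ignores.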

Original abstract

We ask a simple question about decoder-only transformers: between which two layers is the probability of a predicted token actually produced? Existing layer-wise readout tools answer only approximately. The logit lens and its trained variant report a per-layer level of probability but give no additive decomposition; their estimates are biased and non-monotone across depth. Direct Logit Attribution and related residual-stream methods are additive, but only in logit space; the softmax nonlinearity breaks additivity in probability space, precisely the quantity one usually cares about. Layer Conductance integrates gradients per layer, but attributes each to its own baseline and so does not sum to the total change in prediction. We introduce IG-Lens, a telescoping application of Integrated Gradients along a single path through the hidden states from a baseline to the final layer. Crediting each segment to the layer it terminates at yields a layer-wise attribution whose sum is exactly the change in target probability, with the softmax inside the integration path rather than linearized away. Our default estimator credits each integration step its observed change in target probability (a prediction-aware reweighting in the spirit of IDGI) rather than its raw gradient. Because the readout is a one-dimensional probability, this collapses each segment to a telescoping sum of endpoint values, so completeness holds exactly (to floating point) at any step count, removing Riemann discretization error while suppressing steps that show gradient sensitivity without a change in output. We give the telescoping identity and its proof, verify completeness to floating point, and describe a single-pass batched implementation computing the full token-by-layer map without any backward call. Code: https://github.com/anhnda/IGLens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IG-Lens delivers exact additive probability attributions across layers via telescoping, but the straight-line hidden-state path makes the layer credits path-dependent rather than causal.

The main point is that this paper supplies a telescoping integrated-gradients construction that turns layer-wise attributions into an exact sum of observed probability changes, with no discretization error left once you use endpoint deltas.

It does two things cleanly. First, the math collapses the integral so completeness holds to floating point at any step count. Second, the default estimator reweights by the actual probability shift at each step instead of the raw gradient, which suppresses noise from gradient sensitivity without output movement. The single-pass batched implementation without any backward pass is also practical.
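
Why no backward pass is needed can be seen in a small sketch: once the estimator uses only observed probability values, every interpolation point can be stacked into one batch and read out in a single matrix product. This is an assumption-laden toy, not the paper's batched implementation: the readout is a bare unembedding `W` plus softmax, the path is assumed piecewise-linear through stand-in hidden states, and `batched_layer_credits` is a hypothetical name.

```python
import numpy as np

def batched_layer_credits(hidden, W, target, steps=8):
    """Layer credits from one batched readout, with no gradients computed.

    hidden: (L + 1, d) toy hidden states, row 0 is the baseline.
    Returns (layer_credits of shape (L,), step_deltas of shape (L, steps)).
    """
    hidden = np.asarray(hidden, dtype=float)
    n_layers, d = hidden.shape[0] - 1, hidden.shape[1]
    alphas = np.arange(1, steps + 1) / steps                 # (steps,), ends at 1.0
    starts, ends = hidden[:-1], hidden[1:]
    # Interior and end points of every segment: shape (L, steps, d).
    inner = starts[:, None, :] + alphas[None, :, None] * (ends - starts)[:, None, :]
    # One ordered batch: baseline first, then each segment's points in turn.
    points = np.concatenate([hidden[:1], inner.reshape(-1, d)])   # (L * steps + 1, d)
    logits = points @ W.T                                    # single batched readout
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    p = probs[:, target]
    step_deltas = np.diff(p).reshape(n_layers, steps)        # observed change per step
    return step_deltas.sum(axis=1), step_deltas

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    credits, step_deltas = batched_layer_credits(
        rng.normal(size=(5, 8)), rng.normal(size=(10, 8)), target=3
    )
    print(credits, credits.sum())
```

The per-step deltas are kept alongside the per-layer sums because a finer-grained map would presumably redistribute them. How the paper assigns them to tokens is not described on this page.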

The soft spot is the path. The method integrates along a straight line in hidden-state space from a chosen baseline to the final layer. That line is not the sequence of activations the model actually produces when it runs the layers in order. Any other continuous path between the same endpoints would give a different layer-by-layer breakdown while still summing exactly to the total probability change. The paper therefore gives a well-defined decomposition, but the claim that it tells you “between which two layers the probability is produced” rests on an untested assumption that the straight line is the right one.
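
The path dependence is easy to reproduce in two dimensions. In the sketch below (a toy illustration with hypothetical names `target_prob` and `segment_credits`, an identity unembedding, and hand-picked waypoints), two paths share a start and an end but cross the intermediate boundary at different points. Both sum to the same total, yet the split between the two segments differs, and one path even gives its first segment negative credit.

```python
import numpy as np

def target_prob(h, W, target):
    """Softmax probability of `target` under a toy unembedding W (assumed readout)."""
    z = W @ h
    z = z - z.max()
    e = np.exp(z)
    return e[target] / e.sum()

def segment_credits(waypoints, W, target):
    """Telescoping credits: difference of the readout at consecutive waypoints."""
    p = np.array([target_prob(h, W, target) for h in waypoints])
    return np.diff(p)

W = np.eye(2)            # two-token toy vocabulary, identity unembedding
target = 0
start, end = np.array([0.0, 0.0]), np.array([2.0, 0.0])

# Same endpoints, different point at the intermediate "layer boundary".
path_a = [start, np.array([1.0, 0.0]), end]
path_b = [start, np.array([0.0, 2.0]), end]

credits_a = segment_credits(path_a, W, target)
credits_b = segment_credits(path_b, W, target)

if __name__ == "__main__":
    print("path A:", credits_a, "sum", credits_a.sum())
    print("path B:", credits_b, "sum", credits_b.sum())
```

Exactness of the sum therefore says nothing about which split is the meaningful one. That choice has to be justified separately.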

The abstract states they prove the identity and verify completeness numerically, which is the right evidence to supply. No fitted parameters or circular predictions appear.

This is for interpretability researchers who need additive probability-space readouts on decoder-only transformers. A reader who already works with logit lens or layer conductance will see the difference immediately and can judge whether the path choice is acceptable for their use case.

The work is coherent on its own terms and the implementation detail is useful, so it deserves a serious referee.

Referee Report

1 major / 1 minor

Summary. The paper introduces IG-Lens, a telescoping variant of integrated gradients applied along a single straight-line path in the hidden-state space of decoder-only transformers. By crediting each segment of the path to the layer at which it terminates and using observed endpoint probability deltas (rather than raw gradients), the method produces layer-wise attributions whose sum equals exactly the change in target-token probability, with the softmax kept inside the path; completeness holds to floating point at any step count because the telescoping identity collapses the integral to endpoint differences.

Significance. If the straight-line path is accepted as a meaningful interpolation, IG-Lens supplies the first additive decomposition of probability change across layers that is exact (no discretization error, no linearization of softmax) and complete by construction. The single-pass batched implementation without any backward pass is a practical advantage over standard IG, and the explicit telescoping identity plus floating-point verification constitute a clear technical contribution over prior readout tools.

major comments (1)

[Abstract] Abstract (paragraph describing the path): the central interpretive claim—that crediting segments of the straight-line hidden-state path to terminating layers answers 'between which two layers the probability is actually produced'—rests on the assumption that this particular continuous path isolates each layer's causal contribution. The telescoping completeness identity holds for any continuous path connecting the same endpoints, so different paths would produce different layer-wise breakdowns while still summing exactly to the total probability change; the manuscript does not provide justification, ablation, or empirical test showing why the straight line is the appropriate path for causal attribution.

minor comments (1)

The abstract states that the telescoping identity and its proof are given and that completeness is verified to floating point; the corresponding section should include the explicit statement of the identity (with equation numbers) so readers can check the derivation without reconstructing it from the surrounding text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The single major comment identifies a substantive point regarding path selection that we address directly below. We will revise the manuscript to incorporate additional discussion on this issue.

Referee: [Abstract] Abstract (paragraph describing the path): the central interpretive claim—that crediting segments of the straight-line hidden-state path to terminating layers answers 'between which two layers the probability is actually produced'—rests on the assumption that this particular continuous path isolates each layer's causal contribution. The telescoping completeness identity holds for any continuous path connecting the same endpoints, so different paths would produce different layer-wise breakdowns while still summing exactly to the total probability change; the manuscript does not provide justification, ablation, or empirical test showing why the straight line is the appropriate path for causal attribution.

Authors: We agree that the telescoping completeness property is path-independent and that the specific layer-wise breakdown depends on the chosen path. The manuscript adopts the straight-line path in hidden-state space because it is the canonical interpolation used in the original Integrated Gradients formulation, corresponding to a uniform progression between the baseline and observed hidden states as layers are applied sequentially. This choice aligns with the residual-stream structure of decoder-only transformers, where each layer produces an incremental update to the hidden state. We do not claim that the straight-line path is the unique or universally optimal choice for isolating causal contributions; rather, it yields an exact additive decomposition along a standard, well-defined path while keeping the softmax inside the integration. The manuscript would benefit from explicit elaboration on this rationale. In revision we will expand the relevant section (and the abstract if space permits) to clarify the motivation for the straight-line path, reference its use in prior IG literature, and note that alternative paths remain an avenue for future investigation. No ablation across paths is currently present, and we will not add one in this revision as it would require substantial new experiments beyond the scope of addressing the current comment.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses standard IG completeness plus explicit telescoping identity on chosen path

The paper defines IG-Lens as the application of Integrated Gradients along one straight-line path in hidden-state space, then credits each segment's probability delta to the terminating layer. The exact additive sum follows directly from the telescoping property of any sequence of endpoint differences (sum of deltas equals total change), which is a general mathematical fact independent of the model or the specific path. No parameters are fitted to data and then relabeled as predictions. No self-citations are invoked to justify uniqueness or to smuggle an ansatz. The straight-line path choice is stated as an explicit modeling decision rather than derived from prior results by the same authors. The method is therefore self-contained against external benchmarks (standard IG axiom plus elementary summation identity) and receives a non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim depends on the validity of applying integrated gradients along the residual-stream path and on the telescoping identity holding for the probability function; no free parameters, new entities, or ad-hoc axioms beyond standard calculus are introduced.

axioms (2)

domain assumption: Integrated gradients completeness holds when the path is taken through the sequence of hidden states. The method applies the standard IG path integral to the residual stream.
standard math: The telescoping sum of endpoint probability values equals the integrated contribution for each segment. Derived from the fundamental theorem of calculus applied to the one-dimensional probability readout.

pith-pipeline@v0.9.1-grok · 5830 in / 1379 out tokens · 31467 ms · 2026-07-01T06:16:21.457928+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 1 canonical work pages · 1 internal anchor

[1] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.

[2] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space, 2023.

[3] Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron? In International Conference on Learning Representations (ICLR), 2019.

[4] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html

[5] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Empirical Methods in Natural Language Processing (EMNLP), 2022.

[6] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Empirical Methods in Natural Language Processing (EMNLP), 2021.

[7] Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson. Captum: A unified and generic model interpretability library for PyTorch, 2020.

[8] Kevin Meng, David Bau, Andrew Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[9] Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

[10] nostalgebraist. Interpreting GPT: The logit lens. LessWrong, 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

[11] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML), 2017.

[12] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.

[13] Ruo Yang, Binxia Wang, and Mustafa Bilgic. IDGI: A framework to eliminate explanation noise from integrated gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23725–23734, 2023.
