arxiv: 2606.29139 · v1 · pith:2QH6X3CLnew · submitted 2026-06-28 · 💻 cs.LG

How Token Influence Decays with Distance: A Green-Function View of Trained Language Models

Pith reviewed 2026-06-30 08:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords: language models · transformers · jacobian · power-law decay · green's function · token influence · autoregressive · sensitivity analysis

The pith

Trained language models exhibit power-law decay of token influence with distance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors measure how perturbations to input token embeddings affect next-token predictions in autoregressive Transformers by computing Jacobians via autograd. Over the measured distance range, the median sensitivity decays as a power law with exponent p ≈ 0.7-0.9 (typically 0.8-0.9) rather than exponentially, on coherent text from Gutenberg and WikiText-103. This profile is a learned feature: it appears in trained Pythia and Qwen models but not in randomly initialized ones, and it persists even after token shuffling disrupts syntax.

Core claim

The median Jacobian sensitivity profile in trained models is described by the diagonal-normalized form \overline G(r) ≈ γ + β(r+1)^{-p} with p ≈ 0.7--0.9, and this slowly decaying long-range sensitivity is a learned property of trained autoregressive Transformer operators.
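
A short worked note on why the two candidate decay laws can be told apart, assuming only the fitted form quoted above. The exponential alternative is written here as γ + β e^{-r/ξ}; that form and the decay length ξ are illustrative assumptions, since this page does not state the exact alternative the authors fitted. Subtracting the offset γ and taking logarithms gives

$$
\log\bigl(\overline G(r)-\gamma\bigr) \approx \log\beta - p\,\log(r+1)
\qquad\text{versus}\qquad
\log\bigl(\overline G(r)-\gamma\bigr) \approx \log\beta - \frac{r}{\xi}.
$$

A power law is therefore a straight line against log(r+1), while an exponential is a straight line against r. For p = 0.8, a tenfold increase of r+1 (say from 10 to 100) shrinks the decaying term by a factor of 10^{-0.8} ≈ 0.16, whereas an exponential shrinks by a fixed factor for every additive step in r.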

What carries the argument

The empirical distance-resolved median Jacobian sensitivity profile, treated as a Green's function for the model's forward operator.
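
A minimal sketch of how a distance-resolved, diagonal-normalized median profile could be assembled once per-pair sensitivities are available. The matrix layout `S[i][j]` (sensitivity of output position i to input position j), the normalization by the same-row diagonal entry, and the function name `distance_profile` are assumptions made for illustration; this page does not give the authors' exact definition.

```python
from statistics import median


def distance_profile(S):
    """Median diagonal-normalized sensitivity as a function of distance r.

    S[i][j] is a nonnegative sensitivity of output position i to input
    position j (hypothetical layout). Only causal pairs j <= i are used.
    """
    T = len(S)
    profile = []
    for r in range(T):
        # All causal pairs (i, j = i - r) at distance r, each divided by the
        # diagonal entry S[i][i] of the same row (assumed normalization).
        ratios = [S[i][i - r] / S[i][i] for i in range(r, T) if S[i][i] > 0]
        profile.append(median(ratios))
    return profile


# Synthetic input (not data from the paper): a matrix built from a known
# power law with exponent 0.8, so the profile should reproduce (r+1)^(-0.8).
T, p = 8, 0.8
S = [[(i - j + 1) ** -p if j <= i else 0.0 for j in range(T)] for i in range(T)]
G = distance_profile(S)
```

Taking the median rather than the mean keeps a few unusually sensitive position pairs from dominating the profile, which matches the page's description of a median profile.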

If this is right

  • Coarse-level or hierarchical mechanisms can exploit the long-tailed sensitivity for global interactions.
  • The power-law behavior persists under token shuffling, indicating it is not dependent on intact syntax or high prediction quality.
  • Randomly initialized models lack the power-law profile, confirming it emerges from training.
  • Multilevel preconditioning ideas from differential equations may apply to language model inference or training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Transformers may handle long contexts effectively because distant tokens retain measurable influence through this slow decay.
  • Similar measurements could be applied to other sequence models to test if power-law decay is architecture-specific.
  • Models could be modified to enforce different decay profiles to study effects on performance.

Load-bearing premise

The autograd-computed Jacobian entries faithfully capture the 'influence' of one token on the prediction of another.
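
One way to probe this premise is to compare the first-order (Jacobian) prediction with the actual output change under a finite perturbation. The sketch below does this on a toy scalar map with a known gradient; the map `f`, the mixing weights `a`, and all sizes are hypothetical stand-ins, not the models or the procedure from the paper, where the gradient would come from autograd instead.

```python
import math
import random

rng = random.Random(0)
T, d = 6, 4                                   # toy sizes (hypothetical)
X = [[0.3 * rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
w = [rng.gauss(0, 1) for _ in range(d)]
a = [(T - j) ** -0.8 for j in range(T)]       # made-up mixing weights


def hidden(X):
    """Weighted sum of token embeddings, one value per feature."""
    return [sum(a[j] * X[j][k] for j in range(T)) for k in range(d)]


def f(X):
    """Toy scalar 'logit': w . tanh(sum_j a[j] * X[j])."""
    return sum(w[k] * math.tanh(h) for k, h in enumerate(hidden(X)))


def grad_wrt_token(X, j):
    """Exact gradient of f with respect to the embedding X[j]."""
    return [a[j] * w[k] * (1.0 - math.tanh(h) ** 2)
            for k, h in enumerate(hidden(X))]


def relative_linearization_error(j, eps):
    """|actual change - first-order prediction| / |first-order prediction|."""
    g = grad_wrt_token(X, j)
    norm = math.sqrt(sum(v * v for v in g))
    u = [v / norm for v in g]                 # unit step along the gradient
    Xp = [row[:] for row in X]
    Xp[j] = [X[j][k] + eps * u[k] for k in range(d)]
    actual = f(Xp) - f(X)
    predicted = eps * norm                    # Jacobian-based prediction
    return abs(actual - predicted) / abs(predicted)


errors = {eps: relative_linearization_error(0, eps) for eps in (1e-3, 1e-2, 1e-1)}
```

If the relative error stays small only for very small `eps`, the Jacobian describes infinitesimal sensitivity but not the effect of a token-sized change, which is the gap this premise has to bridge.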

What would settle it

Finding that trained models display exponential decay in Jacobian sensitivity or that random models exhibit the same power-law profile would contradict the claim.
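
A sketch of how the power-law and exponential descriptions could be compared on a measured profile. Several things here are assumptions: the exponential alternative is taken to be γ + β·exp(-λr), the nonlinear parameter is found by a plain grid search with γ and β solved exactly by least squares (a swapped-in technique that needs only the standard library, not the authors' fitting routine), and the profile `y` is synthetic rather than data from the paper.

```python
import math


def linear_fit_rss(phi, y):
    """Least-squares gamma, beta for y ~ gamma + beta * phi, plus the RSS."""
    n = len(y)
    pm, ym = sum(phi) / n, sum(y) / n
    sxx = sum((u - pm) ** 2 for u in phi)
    sxy = sum((u - pm) * (v - ym) for u, v in zip(phi, y))
    beta = sxy / sxx
    gamma = ym - beta * pm
    rss = sum((v - gamma - beta * u) ** 2 for u, v in zip(phi, y))
    return gamma, beta, rss


def fit_family(r, y, basis, grid):
    """Grid-search the nonlinear parameter; report RSS, R^2 and AIC."""
    best = None
    for theta in grid:
        gamma, beta, rss = linear_fit_rss([basis(x, theta) for x in r], y)
        if best is None or rss < best["rss"]:
            best = {"theta": theta, "gamma": gamma, "beta": beta, "rss": rss}
    n, k = len(y), 3                          # three fitted parameters per model
    ym = sum(y) / n
    tss = sum((v - ym) ** 2 for v in y)
    best["r2"] = 1.0 - best["rss"] / tss
    # AIC for least squares with Gaussian errors, up to an additive constant.
    best["aic"] = n * math.log(best["rss"] / n) + 2 * k
    return best


def power_basis(x, p):
    return (x + 1.0) ** -p


def exp_basis(x, lam):
    return math.exp(-lam * x)


# Synthetic profile: power law with p = 0.8 plus a small deterministic ripple
# standing in for measurement noise.
r = list(range(1, 101))
y = [0.01 + (x + 1.0) ** -0.8 + 1e-4 * math.sin(7.0 * x) for x in r]

fit_pow = fit_family(r, y, power_basis, [0.05 + 0.01 * i for i in range(200)])
fit_exp = fit_family(r, y, exp_basis, [0.001 * 1.05 ** i for i in range(200)])
```

Because both models have three parameters, the AIC difference here is driven entirely by the residual sum of squares. A result where `fit_exp` had the lower AIC on a trained model's profile would count against the claim.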

Figures

Figures reproduced from arXiv: 2606.29139 by Matthias Brändel, Oliver Rheinbach, Stephan Köhler.

Figure 1: Median distance-resolved Jacobian sensitivity profile for Qwen2.5-0.5B using the true next-token …
Figure 2: Individual diagonal-normalized sensitivity profiles for Qwen2.5-0.5B using the true next-token logit.
Figure 3: Median distance-resolved Jacobian sensitivity profile for Pythia-410M using the true next-token …
Figure 4: Median distance-resolved sensitivity profile for Qwen2.5-0.5B using the model-predicted next-token …
Original abstract

We study how the next-token prediction of an autoregressive Transformer language model changes under small perturbations of earlier input token embeddings. Motivated by operator learning and iterative solvers for differential equations, we investigate how the influence of one token on another decays with distance in a trained model. In multilevel methods for differential equations, such as domain decomposition, multigrid, and multilevel preconditioning, one often exploits a separation between strong local interactions and weaker but essential global interactions. The latter correspond to the long tail of the Green's function and are typically handled by a coarse-level operator. Inspired by this perspective, we compute an empirical, distance-resolved gradient profile of token dependencies using autograd. Experiments on trained Pythia models and Qwen2.5-0.5B show that, over the measured distance range, the median Jacobian sensitivity is much better described by a power-law-type decay than by an exponential alternative: the diagonal-normalized profile is well described by $$\overline G(r) \approx \gamma+\beta(r+1)^{-p}$$ with exponents $p \approx 0.7$--$0.9$ (typically $0.8$--$0.9$). This behavior appears on coherent text from Gutenberg and WikiText-103. Token-shuffling experiments show that the power-law profile persists even when syntax and prediction quality collapse, whereas randomly initialized models do not exhibit it. The slowly decaying long-range sensitivity thus appears to be a learned property of trained autoregressive Transformer operators. These findings suggest that hierarchical or coarse-level mechanisms in language models may be able to exploit the long-tailed sensitivity profiles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that the distance-resolved influence of input tokens on next-token prediction in trained autoregressive Transformers follows a power-law rather than an exponential decay, \overline G(r) ≈ γ + β(r+1)^{-p} with p ≈ 0.7-0.9, on coherent text from Gutenberg and WikiText-103. Influence is measured via the median autograd Jacobian of the output logit with respect to earlier embeddings, diagonal-normalized. The long-range tail is presented as a learned property because it is absent in randomly initialized models yet persists under token shuffling, which the authors take to suggest implications for hierarchical or coarse-level mechanisms in LMs.

Significance. If the Jacobian-to-Green's-function identification holds and the power-law characterization is quantitatively supported, the result would supply an empirical signature of long-range token sensitivity in trained models that could inform architecture design (e.g., multilevel preconditioners) and explain why Transformers capture distant dependencies; the multi-model, multi-dataset empirical design with random-init and shuffling controls is a strength that makes the observation reproducible even if the interpretation requires further grounding.

major comments (3)
  1. [Abstract] Abstract, the displayed equation for \overline G(r): the central claim equates the median autograd Jacobian profile with the Green's function of the forward autoregressive operator, yet the Jacobian supplies only the first-order linear sensitivity of a single scalar output at a fixed input point; the manuscript provides no explicit validation that this linear response reproduces the change in next-token prediction under finite token-level perturbations, which is required to interpret the observed decay (and its contrast to exponential) as the claimed learned long-range property.
  2. [Abstract] Abstract: the assertion that the profile is 'much better described' by a power law than by an exponential lacks any quantitative support (R² values, residual statistics, cross-validation error, or confidence intervals on p); the three-parameter fit (γ, β, p) is performed post-hoc on median profiles without reported details on optimization, weighting, or distance range, making the superiority claim impossible to assess from the supplied information.
  3. [Abstract] Abstract (token-shuffling and random-init controls): while the persistence under shuffling and absence in random models are presented as evidence that the power-law tail is learned, the manuscript does not address whether the Jacobian remains a faithful influence measure once syntax and prediction quality have collapsed, nor does it report whether the fitted exponents themselves change significantly under these controls.
minor comments (1)
  1. [Abstract] The abstract refers to 'Pythia models and Qwen2.5-0.5B' and 'coherent text from Gutenberg and WikiText-103' without specifying model sizes, layer counts, or exact corpus subsets used for the median profiles.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract would benefit from quantitative fit statistics, clearer language on the linear nature of the Jacobian, and additional reporting on the control experiments. We will revise accordingly while preserving the core empirical contribution. Point-by-point responses follow.

  1. Referee: [Abstract] Abstract, the displayed equation for \overline G(r): the central claim equates the median autograd Jacobian profile with the Green's function of the forward autoregressive operator, yet the Jacobian supplies only the first-order linear sensitivity of a single scalar output at a fixed input point; the manuscript provides no explicit validation that this linear response reproduces the change in next-token prediction under finite token-level perturbations, which is required to interpret the observed decay (and its contrast to exponential) as the claimed learned long-range property.

    Authors: We acknowledge the distinction: the reported quantity is the first-order Jacobian sensitivity, used as an empirical proxy for influence decay in the linear-response regime (consistent with the operator-learning motivation in the introduction). The manuscript does not claim exact reproduction of finite perturbations. To strengthen the interpretation, the revision will add a targeted validation experiment applying small finite perturbations to selected token embeddings and comparing the resulting logit changes against the Jacobian prediction over the same distance range. We will also revise the abstract wording to emphasize 'Jacobian sensitivity profile' rather than direct identification with the Green's function. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that the profile is 'much better described' by a power law than by an exponential lacks any quantitative support (R² values, residual statistics, cross-validation error, or confidence intervals on p); the three-parameter fit (γ, β, p) is performed post-hoc on median profiles without reported details on optimization, weighting, or distance range, making the superiority claim impossible to assess from the supplied information.

    Authors: We agree that quantitative support is needed. The revision will report R², residual sum of squares, and AIC values for both the power-law and exponential models across all reported settings. Fitting details (nonlinear least-squares via curve_fit on median profiles for r = 1 to 100, uniform weighting) and bootstrap-derived 95% confidence intervals on p will be added to the methods section and referenced from the abstract. revision: yes

  3. Referee: [Abstract] Abstract (token-shuffling and random-init controls): while the persistence under shuffling and absence in random models are presented as evidence that the power-law tail is learned, the manuscript does not address whether the Jacobian remains a faithful influence measure once syntax and prediction quality have collapsed, nor does it report whether the fitted exponents themselves change significantly under these controls.

    Authors: The shuffling result shows the power-law profile survives the removal of coherent syntax, indicating it is not an artifact of grammatical structure alone. We will add the fitted exponents (with confidence intervals) for the shuffled condition to the revision and note that the autograd Jacobian continues to compute local output sensitivity regardless of downstream prediction quality. The random-init contrast already demonstrates the tail is absent without training. A short discussion of the measure's continued validity under these controls will be included. revision: partial
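
The responses above promise bootstrap confidence intervals on p without saying what is resampled. A minimal sketch under one plausible reading: resample the per-example profiles with replacement, recompute the pointwise median, and refit p each time. The resampling unit, the percentile interval, the grid-search fit, and the synthetic profiles are all assumptions for illustration, not the authors' procedure or data.

```python
import random
from statistics import median


def fit_p(r, y, grid):
    """Grid-search p in y ~ gamma + beta*(r+1)^(-p); gamma, beta by least squares."""
    n = len(y)
    ym = sum(y) / n
    best_p, best_rss = None, float("inf")
    for p in grid:
        phi = [(x + 1.0) ** -p for x in r]
        pm = sum(phi) / n
        sxx = sum((u - pm) ** 2 for u in phi)
        sxy = sum((u - pm) * (v - ym) for u, v in zip(phi, y))
        beta = sxy / sxx
        gamma = ym - beta * pm
        rss = sum((v - gamma - beta * u) ** 2 for u, v in zip(phi, y))
        if rss < best_rss:
            best_p, best_rss = p, rss
    return best_p


def bootstrap_ci(profiles, r, grid, n_boot=100, seed=0):
    """Percentile 95% interval for p, resampling example profiles."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(profiles) for _ in profiles]
        med = [median(prof[k] for prof in sample) for k in range(len(r))]
        estimates.append(fit_p(r, med, grid))
    estimates.sort()
    lo = estimates[int(0.025 * n_boot)]
    hi = estimates[int(0.975 * n_boot) - 1]
    return lo, hi, estimates


# Synthetic per-example profiles: a p = 0.8 power law with 5% multiplicative noise.
data_rng = random.Random(1)
r = list(range(1, 31))
profiles = [[(0.01 + (x + 1.0) ** -0.8) * (1.0 + 0.05 * data_rng.gauss(0, 1))
             for x in r] for _ in range(40)]
grid = [0.5 + 0.01 * i for i in range(61)]    # candidate p values in [0.5, 1.1]
lo, hi, boot = bootstrap_ci(profiles, r, grid)
```

Running the same routine on the coherent-text and shuffled conditions and checking whether the two intervals overlap would be one way to address the referee's question of whether the exponent changes under the controls.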

Circularity Check

0 steps flagged

No circularity: purely empirical Jacobian profiles with post-hoc fit


The paper computes autograd Jacobians on trained Pythia and Qwen models, aggregates median diagonal-normalized profiles \overline G(r) over token distance on Gutenberg and WikiText-103 data, and fits the power-law form \overline G(r) ≈ γ + β(r+1)^{-p} (p ≈ 0.7-0.9) after the fact. Token-shuffle and random-init controls supply independent evidence that the long-range tail is learned. No derivation chain reduces any reported exponent, profile shape, or 'learned property' conclusion to a quantity defined by the fit itself; the functional form is descriptive only. No self-citations are load-bearing, no ansatz is smuggled, and no uniqueness theorem is invoked. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 3 axioms · 0 invented entities

The central claim rests on the empirical computation of Jacobians in trained autoregressive models and the post-hoc fitting of a three-parameter functional form to the resulting median profiles. No new physical entities are introduced; the power-law parameters are fitted quantities.

free parameters (3)
  • p = 0.7-0.9
    Exponent in the power-law fit \overline G(r) ≈ γ+β(r+1)^{-p} to the median Jacobian profile; reported range 0.7-0.9.
  • β
    Scaling coefficient in the same power-law fit.
  • γ
    Additive offset in the same power-law fit.
axioms (3)
  • [standard math] The Jacobian of next-token logits with respect to earlier token embeddings exists and can be obtained via automatic differentiation.
    Invoked to obtain the distance-resolved sensitivity profile.
  • [domain assumption] The Pythia and Qwen2.5-0.5B checkpoints are trained autoregressive Transformers whose forward pass implements next-token prediction.
    Required for the autograd experiments to be meaningful.
  • [domain assumption] The median over position pairs yields a representative distance profile.
    Used to summarize the Jacobian data across examples.

pith-pipeline@v0.9.1-grok · 5830 in / 2040 out tokens · 104640 ms · 2026-06-30T08:17:37.859134+00:00 · methodology

