pith. sign in

arxiv: 2605.21726 · v1 · pith:D5TXA47Enew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Probabilistic Attribution For Large Language Models

Pith reviewed 2026-05-22 08:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language modelsprobabilistic attributiontoken attributionBayes rulestochastic processesmodel interpretabilityconditional probabilitiesentropy analysis
0
0 comments X

The pith

Large language models' next-token probabilities can be inverted with Bayes' rule to create a structure-independent attribution score for each prompt token's influence on the response.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a way to attribute parts of an LLM's generated response back to specific tokens in the input prompt using only the probabilities the model outputs. By treating the generation as a stochastic process and applying Bayes' rule to reverse the conditional probabilities, the method produces a score that shows how much the response probability would change if a particular prompt token were removed. The approach works for any model because it does not rely on the model's internal computations or architecture. It also calculates the entropy of the token distributions at each step to show where the model is most uncertain. Together these tools make it easier to understand and debug how LLMs arrive at their outputs across different models and prompts.

Core claim

We situate LLMs within the mathematical theory of stochastic processes. Using Bayes rule to invert the next-token log-probabilities, we capture the model's internal representation of the distribution over token sequences in a way that is independent of the model's computational structure. This yields the conditional probability of the response given the prompt and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities.

What carries the argument

The probabilistic token attribution score obtained as the log ratio of the conditional response probability given the full prompt to the conditional probability with one prompt token marginalized out using Bayes inversion of next-token probabilities.

If this is right

  • The attribution measure applies to any LLM regardless of its size or architecture.
  • High attribution tokens indicate prompt elements that strongly shape the generated response.
  • Entropy calculations identify positions where the model has high uncertainty in its token choices.
  • Model comparisons reveal variations in how different LLMs respond to token marginalization.
  • Users can focus attention on uncertain or high-sensitivity parts of the prompt for better control over generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This attribution could be extended to measure influence over multiple tokens at once for more complex explanations.
  • Combining it with other interpretability techniques might help trace back factual errors to specific prompt parts.
  • Observing how attribution scores evolve during training could indicate when a model has learned stable representations.
  • The stochastic process view might apply to other autoregressive models in different domains like music or code generation.

Load-bearing premise

Applying Bayes' rule to invert the next-token log-probabilities produces a faithful representation of the distribution over full token sequences that allows valid marginalization over individual prompt tokens.

What would settle it

Generate responses with and without a high-attribution prompt token removed and check if the observed change in response probability matches the predicted log-ratio from the attribution score; significant deviation would indicate the inversion does not faithfully represent the sequence distribution.

Figures

Figures reproduced from arXiv: 2605.21726 by Bethany Lusch, Carlo Graziani, Michael E. Papka, Shilpika Shilpika, Venkatram Vishwanath.

Figure 1
Figure 1. Figure 1: Attribution Scores for prompts given the response for the 5 LLMs with greedy [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Contextual Entropy versus Attribution Scores for seven prompts and 8 LLMs [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Contextual Entropy versus Attribution Scores for six prompts and 8 LLMs and a [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: KL divergences defined in Equation (15) for greedy and top-p sampling. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Token probabilities used in the computation of [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Figure (a) shows the entropy of the frequency counts of exact matches in the [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: XAI Evaluation Metrics for 6 models and 7 prompts for greedy sampling. Model [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: XAI Evaluation Metrics for 6 models and 7 prompts for top-p sampling. Model [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Compute time for AS for greedy and nucleus (top-p) decoding across multiple LLMs. Top row shows the total GPU hours and bottom row shows GPU seconds for each token replacement from Equation 6. The red text on each stacked bar is the total tokens in the model vocabulary. The prompt token length is the length of the LLM tokenized prompt. autotuned kernel selection, yielding small floating-point differences b… view at source ↗
Figure 10
Figure 10. Figure 10: Attribution Scores for prompts given the response for the 8 LLMs and 4 prompts [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Attribution Scores for prompts given the response for the 8 LLMs and 3 prompts [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Attribution Scores for prompts given the response for the 8 LLMs and 4 prompts [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Attribution Scores for prompts given the response for the 8 LLMs and 3 prompts [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
read the original abstract

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes rule to invert the next-token log-probabilities so as to capture the models internal representation of the distribution over token sequences. The representation is independent of the models computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompts token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper situates LLMs within stochastic processes and proposes a model-agnostic probabilistic token attribution measure. It applies Bayes' rule to invert next-token log-probabilities into a representation of the distribution over token sequences that is claimed to be independent of computational structure. This yields P(response | prompt) and P(response | prompt with token marginalized), with the attribution score defined as the log ratio of these quantities. The work also examines conditional entropies of prompt token distributions and reports evaluations across 8 models and 7 prompts on anomalies, token sensitivity, response stability, model stability, and training convergence.

Significance. If the central derivation holds, the approach supplies a parameter-free attribution method grounded directly in the model's output distribution rather than gradients or internal states. This could improve interpretability by highlighting influential or uncertain prompt tokens and offers a stochastic-process framing that may generalize beyond current attribution techniques. The multi-model empirical component provides concrete data on stability and convergence that could guide practical use.

major comments (1)
  1. [Derivation of attribution score] The section deriving the attribution measure (following the stochastic-process framing): the claim that Bayes inversion of the product of next-token conditionals produces a structure-independent joint over full sequences from which exact marginalization over a prompt token is possible is load-bearing. Inverting P(x_t | x_<t) to obtain P(response | prompt with token i marginalized) requires either an explicit prior p(sequence) or the marginal p(prompt); neither is specified, and summation over the vocabulary at the marginalized position while preserving downstream dependencies cannot be performed from the chain rule alone without additional modeling choices. This risks contradicting the stated independence from computational structure and absence of extra assumptions.
minor comments (2)
  1. [Abstract and Evaluation] The abstract and evaluation description mention 8 models and 7 prompts but do not list them explicitly; a table or appendix listing the exact models, prompts, and any preprocessing would improve reproducibility.
  2. [Notation] Notation for the inverted representation versus the original next-token probabilities should be introduced with a clear symbol (e.g., Q or P*) and used consistently when defining the log-ratio score.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and insightful comments on the derivation of our attribution measure. We address the major comment below, providing clarification on the stochastic process framing and marginalization procedure. We have revised the manuscript to expand the relevant section with explicit steps and an algorithmic outline.

read point-by-point responses
  1. Referee: The section deriving the attribution measure (following the stochastic-process framing): the claim that Bayes inversion of the product of next-token conditionals produces a structure-independent joint over full sequences from which exact marginalization over a prompt token is possible is load-bearing. Inverting P(x_t | x_<t) to obtain P(response | prompt with token i marginalized) requires either an explicit prior p(sequence) or the marginal p(prompt); neither is specified, and summation over the vocabulary at the marginalized position while preserving downstream dependencies cannot be performed from the chain rule alone without additional modeling choices. This risks contradicting the stated independence from computational structure and absence of extra assumptions.

    Authors: The joint distribution over any token sequence is defined exactly by the chain rule applied to the model's next-token conditionals: P(x_1, ..., x_n) = ∏_t P(x_t | x_<t). This product is the model's representation of the distribution over sequences and depends solely on the output probabilities it returns; it is therefore independent of internal architecture details such as attention weights or layer computations. Bayes inversion is used to re-express the relevant conditional probabilities in terms of these output distributions, allowing us to obtain both P(response | prompt) and the marginalized version without reference to model internals. For marginalization over prompt token i, the model's own conditional at position i supplies the weights P(v | context before i) for each vocabulary item v. The marginalized conditional is then the expectation E_v [P(response | prompt with position i set to v)], where each term P(response | prompt with i = v) is again computed from the model's next-token probabilities on the modified prompt. No external prior is introduced; the weighting distribution is the model's learned conditional. We acknowledge that exact computation requires a sum over the vocabulary and have added a precise algorithmic description, including the number of forward passes required, to the revised derivation section. This procedure remains model-agnostic in the sense that it uses only black-box next-token queries. revision: partial

Circularity Check

0 steps flagged

No circularity; attribution derived directly from model's next-token probabilities

full rationale

The paper defines its probabilistic token attribution by applying Bayes' rule directly to the LLM's existing next-token log-probabilities to obtain conditional probabilities over responses. This uses the model's output distribution as input without introducing fitted parameters, self-referential definitions, or load-bearing self-citations that reduce the result to its own assumptions. The derivation remains self-contained because the joint distribution over sequences is the product of the observed conditionals, and the attribution ratio follows immediately from that product without additional modeling choices that loop back to the target quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on treating LLM token generation as a stochastic process and on the validity of Bayes inversion to recover marginal conditionals; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption LLM token generation can be modeled as a stochastic process whose conditional probabilities encode the learned distribution over sequences.
    Invoked when the abstract states that the generative nature is reflected in conditional probabilities and situates LLMs within stochastic process theory.

pith-pipeline@v0.9.0 · 5746 in / 1377 out tokens · 48852 ms · 2026-05-22T08:52:00.826887+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 4 internal anchors

  1. [1]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman and Jan Buys and Maxwell Forbes and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1904.09751 , timestamp =

  2. [2]

    Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K

    P. Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K. Greene and David Broniatowski and Mark A. Przybocki , title =. 2021 , month =. doi:https://doi.org/10.6028/NIST.IR.8312 , language =

  3. [3]

    Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

    Rudin, Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell

  4. [4]

    Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsi- ble ai,

    Alejandro. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.inffus.2019.12.012 , url =

  5. [5]

    2020 , eprint=

    The Curious Case of Neural Text Degeneration , author=. 2020 , eprint=

  6. [6]

    C. K. Chow and C. N. Liu , title =. IEEE Transactions on Information Theory , year =

  7. [7]

    1988 , address =

    Judea Pearl , title =. 1988 , address =

  8. [8]

    1995 , publisher=

    Probability and Measure , author=. 1995 , publisher=

  9. [9]

    Elements of Information Theory , doi =

  10. [10]

    2019 , eprint=

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. 2019 , eprint=

  11. [11]

    Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav , title...

  12. [12]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Paperno, Denis and Kruszewski, Germ \'a n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern \'a ndez, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (...

  13. [13]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

    Williams, Adina and Nangia, Nikita and Bowman, Samuel R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1101

  14. [14]

    , title =

    Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

  15. [15]

    Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

    Sinha, Sanchit and Chen, Hanjie and Sekhon, Arshdeep and Ji, Yangfeng and Qi, Yanjun. Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing. Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 2021. doi:10.18653/v1/2021.blackboxnlp-1.33

  16. [16]

    doi:10.5281/zenodo.5297715 , url =

    Black, Sid and Leo, Gao and Wang, Phil and Leahy, Connor and Biderman, Stella , title =. doi:10.5281/zenodo.5297715 , url =

  17. [17]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy , title =. CoRR , volume =. 2021 , url =. 2101.00027 , timestamp =

  18. [18]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  19. [19]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1905.07830 , timestamp =

  20. [20]

    Inseq: An Interpretability Toolkit for Sequence Generation Models

    Sarti, Gabriele and Feldhus, Nils and Sickert, Ludwig and van der Wal, Oskar and Nissim, Malvina and Bisazza, Arianna. Inseq: An Interpretability Toolkit for Sequence Generation Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2023. doi:10.18653/v1/2023.acl-demo.40

  21. [21]

    and Ravikumar, Pradeep , title =

    Yeh, Chih-Kuan and Hsieh, Cheng-Yu and Suggala, Arun Sai and Inouye, David I. and Ravikumar, Pradeep , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

  22. [22]

    and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria

    Edin, Joakim and Motzfeldt, Andreas Geert and Christensen, Casper L. and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria. Normalized AOPC : Fixing Misleading Faithfulness Metrics for Feature Attributions Explainability. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/...

  23. [23]

    Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

    Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , doi =

  24. [24]

    Lundberg and Su-In Lee , title =

    Scott M. Lundberg and Su-In Lee , title =. Advances in Neural Information Processing Systems , volume =

  25. [25]

    Proceedings of the 34th International Conference on Machine Learning , series =

    Mukund Sundararajan and Ankur Taly and Qiqi Yan , title =. Proceedings of the 34th International Conference on Machine Learning , series =

  26. [26]

    2020 , eprint =

    Narine Kokhlikyan and Vivek Miglani and Miguel Martin and Edward Wang and Bilal Alsallakh and Jonathan Reynolds and Alexander Melnikov and Natalia Kliushkina and Carlos Araya and Siqi Yan and Orion Reblitz-Richardson , title =. 2020 , eprint =

  27. [27]

    Wallace , title =

    Jay DeYoung and Sarthak Jain and Nazneen Fatema Rajani and Eric Lehman and Caiming Xiong and Richard Socher and Byron C. Wallace , title =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =

  28. [28]

    Wallace , title =

    Sarthak Jain and Byron C. Wallace , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages =. 2019 , doi =

  29. [29]

    Sarah Wiegreffe and Yuval Pinter , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages =. 2019 , doi =

  30. [30]

    A Primer in BERT ology: What We Know About How BERT Works

    Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna. A Primer in BERT ology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349

  31. [31]

    2021 , howpublished =

    Nelson Elhage and Neel Nanda and Catherine Olsson and Tom Henighan and Nicholas Joseph and Ben Mann and Amanda Askell and Yuntao Bai and Anna Chen and Tom Conerly and Nova DasSarma and Dawn Drain and Deep Ganguli and Zac Hatfield-Dodds and Danny Hernandez and Andy Jones and Jackson Kernion and Liane Lovitt and Kamal Ndousse and Dario Amodei and Tom Brown ...

  32. [32]

    ArXiv , year=

    From Understanding to Utilization: A Survey on Explainability for Large Language Models , author=. ArXiv , year=

  33. [33]

    Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =

    Shrikumar, Avanti and Greenside, Peyton and Kundaje, Anshul , title =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =. 2017 , publisher =

  34. [34]

    Why Should I Trust You?

    Ribeiro, Marco Tulio and Singh, Sameer and Guestrin, Carlos , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , isbn =. doi:10.1145/2939672.2939778 , abstract =

  35. [35]

    and Fergus, Rob

    Zeiler, Matthew D. and Fergus, Rob. Visualizing and Understanding Convolutional Networks. Computer Vision -- ECCV 2014. 2014

  36. [36]

    2024 , eprint=

    ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models , author=. 2024 , eprint=