Probabilistic Attribution For Large Language Models
Pith reviewed 2026-05-22 08:52 UTC · model grok-4.3
The pith
Large language models' next-token probabilities can be inverted with Bayes' rule to create a structure-independent attribution score for each prompt token's influence on the response.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We situate LLMs within the mathematical theory of stochastic processes. Using Bayes rule to invert the next-token log-probabilities, we capture the model's internal representation of the distribution over token sequences in a way that is independent of the model's computational structure. This yields the conditional probability of the response given the prompt and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities.
What carries the argument
The probabilistic token attribution score obtained as the log ratio of the conditional response probability given the full prompt to the conditional probability with one prompt token marginalized out using Bayes inversion of next-token probabilities.
If this is right
- The attribution measure applies to any LLM regardless of its size or architecture.
- High attribution tokens indicate prompt elements that strongly shape the generated response.
- Entropy calculations identify positions where the model has high uncertainty in its token choices.
- Model comparisons reveal variations in how different LLMs respond to token marginalization.
- Users can focus attention on uncertain or high-sensitivity parts of the prompt for better control over generation.
Where Pith is reading between the lines
- This attribution could be extended to measure influence over multiple tokens at once for more complex explanations.
- Combining it with other interpretability techniques might help trace back factual errors to specific prompt parts.
- Observing how attribution scores evolve during training could indicate when a model has learned stable representations.
- The stochastic process view might apply to other autoregressive models in different domains like music or code generation.
Load-bearing premise
Applying Bayes' rule to invert the next-token log-probabilities produces a faithful representation of the distribution over full token sequences that allows valid marginalization over individual prompt tokens.
What would settle it
Generate responses with and without a high-attribution prompt token removed and check if the observed change in response probability matches the predicted log-ratio from the attribution score; significant deviation would indicate the inversion does not faithfully represent the sequence distribution.
Figures
read the original abstract
The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes rule to invert the next-token log-probabilities so as to capture the models internal representation of the distribution over token sequences. The representation is independent of the models computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompts token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper situates LLMs within stochastic processes and proposes a model-agnostic probabilistic token attribution measure. It applies Bayes' rule to invert next-token log-probabilities into a representation of the distribution over token sequences that is claimed to be independent of computational structure. This yields P(response | prompt) and P(response | prompt with token marginalized), with the attribution score defined as the log ratio of these quantities. The work also examines conditional entropies of prompt token distributions and reports evaluations across 8 models and 7 prompts on anomalies, token sensitivity, response stability, model stability, and training convergence.
Significance. If the central derivation holds, the approach supplies a parameter-free attribution method grounded directly in the model's output distribution rather than gradients or internal states. This could improve interpretability by highlighting influential or uncertain prompt tokens and offers a stochastic-process framing that may generalize beyond current attribution techniques. The multi-model empirical component provides concrete data on stability and convergence that could guide practical use.
major comments (1)
- [Derivation of attribution score] The section deriving the attribution measure (following the stochastic-process framing): the claim that Bayes inversion of the product of next-token conditionals produces a structure-independent joint over full sequences from which exact marginalization over a prompt token is possible is load-bearing. Inverting P(x_t | x_<t) to obtain P(response | prompt with token i marginalized) requires either an explicit prior p(sequence) or the marginal p(prompt); neither is specified, and summation over the vocabulary at the marginalized position while preserving downstream dependencies cannot be performed from the chain rule alone without additional modeling choices. This risks contradicting the stated independence from computational structure and absence of extra assumptions.
minor comments (2)
- [Abstract and Evaluation] The abstract and evaluation description mention 8 models and 7 prompts but do not list them explicitly; a table or appendix listing the exact models, prompts, and any preprocessing would improve reproducibility.
- [Notation] Notation for the inverted representation versus the original next-token probabilities should be introduced with a clear symbol (e.g., Q or P*) and used consistently when defining the log-ratio score.
Simulated Author's Rebuttal
We thank the referee for their careful reading and insightful comments on the derivation of our attribution measure. We address the major comment below, providing clarification on the stochastic process framing and marginalization procedure. We have revised the manuscript to expand the relevant section with explicit steps and an algorithmic outline.
read point-by-point responses
-
Referee: The section deriving the attribution measure (following the stochastic-process framing): the claim that Bayes inversion of the product of next-token conditionals produces a structure-independent joint over full sequences from which exact marginalization over a prompt token is possible is load-bearing. Inverting P(x_t | x_<t) to obtain P(response | prompt with token i marginalized) requires either an explicit prior p(sequence) or the marginal p(prompt); neither is specified, and summation over the vocabulary at the marginalized position while preserving downstream dependencies cannot be performed from the chain rule alone without additional modeling choices. This risks contradicting the stated independence from computational structure and absence of extra assumptions.
Authors: The joint distribution over any token sequence is defined exactly by the chain rule applied to the model's next-token conditionals: P(x_1, ..., x_n) = ∏_t P(x_t | x_<t). This product is the model's representation of the distribution over sequences and depends solely on the output probabilities it returns; it is therefore independent of internal architecture details such as attention weights or layer computations. Bayes inversion is used to re-express the relevant conditional probabilities in terms of these output distributions, allowing us to obtain both P(response | prompt) and the marginalized version without reference to model internals. For marginalization over prompt token i, the model's own conditional at position i supplies the weights P(v | context before i) for each vocabulary item v. The marginalized conditional is then the expectation E_v [P(response | prompt with position i set to v)], where each term P(response | prompt with i = v) is again computed from the model's next-token probabilities on the modified prompt. No external prior is introduced; the weighting distribution is the model's learned conditional. We acknowledge that exact computation requires a sum over the vocabulary and have added a precise algorithmic description, including the number of forward passes required, to the revised derivation section. This procedure remains model-agnostic in the sense that it uses only black-box next-token queries. revision: partial
Circularity Check
No circularity; attribution derived directly from model's next-token probabilities
full rationale
The paper defines its probabilistic token attribution by applying Bayes' rule directly to the LLM's existing next-token log-probabilities to obtain conditional probabilities over responses. This uses the model's output distribution as input without introducing fitted parameters, self-referential definitions, or load-bearing self-citations that reduce the result to its own assumptions. The derivation remains self-contained because the joint distribution over sequences is the product of the observed conditionals, and the attribution ratio follows immediately from that product without additional modeling choices that loop back to the target quantity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM token generation can be modeled as a stochastic process whose conditional probabilities encode the learned distribution over sequences.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
using Bayes rule to invert the next-token log-probabilities so as to capture the model’s internal representation of the distribution over token sequences... attribution score is the log of the ratio of these probabilities
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
situate LLMs within the mathematical theory of stochastic processes... chain rule of probability
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The Curious Case of Neural Text Degeneration
Ari Holtzman and Jan Buys and Maxwell Forbes and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1904.09751 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[2]
Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K
P. Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K. Greene and David Broniatowski and Mark A. Przybocki , title =. 2021 , month =. doi:https://doi.org/10.6028/NIST.IR.8312 , language =
-
[3]
Rudin, Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell
-
[4]
Alejandro. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.inffus.2019.12.012 , url =
-
[5]
The Curious Case of Neural Text Degeneration , author=. 2020 , eprint=
work page 2020
-
[6]
C. K. Chow and C. N. Liu , title =. IEEE Transactions on Information Theory , year =
- [7]
- [8]
-
[9]
Elements of Information Theory , doi =
-
[10]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. 2019 , eprint=
work page 2019
-
[11]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav , title...
-
[12]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Paperno, Denis and Kruszewski, Germ \'a n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern \'a ndez, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (...
-
[13]
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Williams, Adina and Nangia, Nikita and Bowman, Samuel R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1101
work page internal anchor Pith review doi:10.18653/v1/n18-1101 2018
-
[14]
Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =
work page 2019
-
[15]
Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing
Sinha, Sanchit and Chen, Hanjie and Sekhon, Arshdeep and Ji, Yangfeng and Qi, Yanjun. Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing. Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 2021. doi:10.18653/v1/2021.blackboxnlp-1.33
-
[16]
doi:10.5281/zenodo.5297715 , url =
Black, Sid and Leo, Gao and Wang, Phil and Leahy, Connor and Biderman, Stella , title =. doi:10.5281/zenodo.5297715 , url =
-
[17]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy , title =. CoRR , volume =. 2021 , url =. 2101.00027 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
-
[19]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1905.07830 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[20]
Inseq: An Interpretability Toolkit for Sequence Generation Models
Sarti, Gabriele and Feldhus, Nils and Sickert, Ludwig and van der Wal, Oskar and Nissim, Malvina and Bisazza, Arianna. Inseq: An Interpretability Toolkit for Sequence Generation Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2023. doi:10.18653/v1/2023.acl-demo.40
-
[21]
and Ravikumar, Pradeep , title =
Yeh, Chih-Kuan and Hsieh, Cheng-Yu and Suggala, Arun Sai and Inouye, David I. and Ravikumar, Pradeep , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =
work page 2019
-
[22]
and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria
Edin, Joakim and Motzfeldt, Andreas Geert and Christensen, Casper L. and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria. Normalized AOPC : Fixing Misleading Faithfulness Metrics for Feature Attributions Explainability. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/...
-
[23]
Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , doi =
work page 2016
-
[24]
Lundberg and Su-In Lee , title =
Scott M. Lundberg and Su-In Lee , title =. Advances in Neural Information Processing Systems , volume =
-
[25]
Proceedings of the 34th International Conference on Machine Learning , series =
Mukund Sundararajan and Ankur Taly and Qiqi Yan , title =. Proceedings of the 34th International Conference on Machine Learning , series =
-
[26]
Narine Kokhlikyan and Vivek Miglani and Miguel Martin and Edward Wang and Bilal Alsallakh and Jonathan Reynolds and Alexander Melnikov and Natalia Kliushkina and Carlos Araya and Siqi Yan and Orion Reblitz-Richardson , title =. 2020 , eprint =
work page 2020
-
[27]
Jay DeYoung and Sarthak Jain and Nazneen Fatema Rajani and Eric Lehman and Caiming Xiong and Richard Socher and Byron C. Wallace , title =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =
-
[28]
Sarthak Jain and Byron C. Wallace , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages =. 2019 , doi =
work page 2019
-
[29]
Sarah Wiegreffe and Yuval Pinter , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages =. 2019 , doi =
work page 2019
-
[30]
A Primer in BERT ology: What We Know About How BERT Works
Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna. A Primer in BERT ology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349
-
[31]
Nelson Elhage and Neel Nanda and Catherine Olsson and Tom Henighan and Nicholas Joseph and Ben Mann and Amanda Askell and Yuntao Bai and Anna Chen and Tom Conerly and Nova DasSarma and Dawn Drain and Deep Ganguli and Zac Hatfield-Dodds and Danny Hernandez and Andy Jones and Jackson Kernion and Liane Lovitt and Kamal Ndousse and Dario Amodei and Tom Brown ...
work page 2021
-
[32]
From Understanding to Utilization: A Survey on Explainability for Large Language Models , author=. ArXiv , year=
-
[33]
Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =
Shrikumar, Avanti and Greenside, Peyton and Kundaje, Anshul , title =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =. 2017 , publisher =
work page 2017
-
[34]
Ribeiro, Marco Tulio and Singh, Sameer and Guestrin, Carlos , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , isbn =. doi:10.1145/2939672.2939778 , abstract =
-
[35]
Zeiler, Matthew D. and Fergus, Rob. Visualizing and Understanding Convolutional Networks. Computer Vision -- ECCV 2014. 2014
work page 2014
-
[36]
ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models , author=. 2024 , eprint=
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.