Probabilistic Attribution For Large Language Models

Bethany Lusch; Carlo Graziani; Michael E. Papka; Shilpika Shilpika; Venkatram Vishwanath

arxiv: 2605.21726 · v1 · pith:D5TXA47Enew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Probabilistic Attribution For Large Language Models

Shilpika Shilpika , Carlo Graziani , Bethany Lusch , Venkatram Vishwanath , Michael E. Papka This is my paper

Pith reviewed 2026-05-22 08:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language modelsprobabilistic attributiontoken attributionBayes rulestochastic processesmodel interpretabilityconditional probabilitiesentropy analysis

0 comments

The pith

Large language models' next-token probabilities can be inverted with Bayes' rule to create a structure-independent attribution score for each prompt token's influence on the response.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a way to attribute parts of an LLM's generated response back to specific tokens in the input prompt using only the probabilities the model outputs. By treating the generation as a stochastic process and applying Bayes' rule to reverse the conditional probabilities, the method produces a score that shows how much the response probability would change if a particular prompt token were removed. The approach works for any model because it does not rely on the model's internal computations or architecture. It also calculates the entropy of the token distributions at each step to show where the model is most uncertain. Together these tools make it easier to understand and debug how LLMs arrive at their outputs across different models and prompts.

Core claim

We situate LLMs within the mathematical theory of stochastic processes. Using Bayes rule to invert the next-token log-probabilities, we capture the model's internal representation of the distribution over token sequences in a way that is independent of the model's computational structure. This yields the conditional probability of the response given the prompt and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities.

What carries the argument

The probabilistic token attribution score obtained as the log ratio of the conditional response probability given the full prompt to the conditional probability with one prompt token marginalized out using Bayes inversion of next-token probabilities.

If this is right

The attribution measure applies to any LLM regardless of its size or architecture.
High attribution tokens indicate prompt elements that strongly shape the generated response.
Entropy calculations identify positions where the model has high uncertainty in its token choices.
Model comparisons reveal variations in how different LLMs respond to token marginalization.
Users can focus attention on uncertain or high-sensitivity parts of the prompt for better control over generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This attribution could be extended to measure influence over multiple tokens at once for more complex explanations.
Combining it with other interpretability techniques might help trace back factual errors to specific prompt parts.
Observing how attribution scores evolve during training could indicate when a model has learned stable representations.
The stochastic process view might apply to other autoregressive models in different domains like music or code generation.

Load-bearing premise

Applying Bayes' rule to invert the next-token log-probabilities produces a faithful representation of the distribution over full token sequences that allows valid marginalization over individual prompt tokens.

What would settle it

Generate responses with and without a high-attribution prompt token removed and check if the observed change in response probability matches the predicted log-ratio from the attribution score; significant deviation would indicate the inversion does not faithfully represent the sequence distribution.

Figures

Figures reproduced from arXiv: 2605.21726 by Bethany Lusch, Carlo Graziani, Michael E. Papka, Shilpika Shilpika, Venkatram Vishwanath.

**Figure 2.** Figure 2: Contextual Entropy versus Attribution Scores for seven prompts and 8 LLMs [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Contextual Entropy versus Attribution Scores for six prompts and 8 LLMs and a [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: KL divergences defined in Equation (15) for greedy and top-p sampling. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Token probabilities used in the computation of [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Figure (a) shows the entropy of the frequency counts of exact matches in the [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: XAI Evaluation Metrics for 6 models and 7 prompts for greedy sampling. Model [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: XAI Evaluation Metrics for 6 models and 7 prompts for top-p sampling. Model [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Compute time for AS for greedy and nucleus (top-p) decoding across multiple LLMs. Top row shows the total GPU hours and bottom row shows GPU seconds for each token replacement from Equation 6. The red text on each stacked bar is the total tokens in the model vocabulary. The prompt token length is the length of the LLM tokenized prompt. autotuned kernel selection, yielding small floating-point differences b… view at source ↗

**Figure 10.** Figure 10: Attribution Scores for prompts given the response for the 8 LLMs and 4 prompts [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: Attribution Scores for prompts given the response for the 8 LLMs and 3 prompts [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Attribution Scores for prompts given the response for the 8 LLMs and 4 prompts [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: Attribution Scores for prompts given the response for the 8 LLMs and 3 prompts [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗

read the original abstract

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes rule to invert the next-token log-probabilities so as to capture the models internal representation of the distribution over token sequences. The representation is independent of the models computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompts token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper tries to turn next-token probabilities into prompt-token attributions via Bayes inversion, but the marginalization step looks underspecified.

read the letter

Colleague, the core idea is using Bayes rule on an LLM's next-token probabilities to estimate how much each prompt token affects the probability of the full response. They frame generation as a stochastic process, invert the conditionals to represent the distribution over sequences, then define the attribution score as the log ratio of P(response | prompt) to P(response | prompt with one token marginalized out). They also track entropy of the token distributions under the remaining context and run checks on anomalies and stability. That combination is the main new piece relative to gradient or attention methods, and it has the advantage of using only the model's existing output probabilities with no extra fitted parameters or architecture dependence. The evaluation on eight models and seven prompts gives at least a basic check on whether the scores behave sensibly across different sizes and training stages. Credit for keeping the method grounded in the probabilities the model already computes rather than post-hoc probes. The soft spot is exactly where the stress-test note flags it. Marginalizing a single token requires summing over all possible replacements at that position while preserving the downstream conditionals, but the abstract gives no explicit prior or independence assumption to make that sum well-defined from the product of next-token distributions alone. If the paper supplies a concrete procedure for this step, the claim of structure independence holds; if not, the inversion may implicitly add modeling choices that undermine the central selling point. The prompt set is also small, so the stability and anomaly observations are more suggestive than conclusive. This is for interpretability researchers who want architecture-agnostic tools and are willing to work through the probabilistic details. A reader focused on practical debugging or safety checks could extract usable ideas even if the math needs tightening. It should go to peer review so referees can verify the exact marginalization computation and see whether the reported experiments actually support the stability claims.

Referee Report

1 major / 2 minor

Summary. The paper situates LLMs within stochastic processes and proposes a model-agnostic probabilistic token attribution measure. It applies Bayes' rule to invert next-token log-probabilities into a representation of the distribution over token sequences that is claimed to be independent of computational structure. This yields P(response | prompt) and P(response | prompt with token marginalized), with the attribution score defined as the log ratio of these quantities. The work also examines conditional entropies of prompt token distributions and reports evaluations across 8 models and 7 prompts on anomalies, token sensitivity, response stability, model stability, and training convergence.

Significance. If the central derivation holds, the approach supplies a parameter-free attribution method grounded directly in the model's output distribution rather than gradients or internal states. This could improve interpretability by highlighting influential or uncertain prompt tokens and offers a stochastic-process framing that may generalize beyond current attribution techniques. The multi-model empirical component provides concrete data on stability and convergence that could guide practical use.

major comments (1)

[Derivation of attribution score] The section deriving the attribution measure (following the stochastic-process framing): the claim that Bayes inversion of the product of next-token conditionals produces a structure-independent joint over full sequences from which exact marginalization over a prompt token is possible is load-bearing. Inverting P(x_t | x_<t) to obtain P(response | prompt with token i marginalized) requires either an explicit prior p(sequence) or the marginal p(prompt); neither is specified, and summation over the vocabulary at the marginalized position while preserving downstream dependencies cannot be performed from the chain rule alone without additional modeling choices. This risks contradicting the stated independence from computational structure and absence of extra assumptions.

minor comments (2)

[Abstract and Evaluation] The abstract and evaluation description mention 8 models and 7 prompts but do not list them explicitly; a table or appendix listing the exact models, prompts, and any preprocessing would improve reproducibility.
[Notation] Notation for the inverted representation versus the original next-token probabilities should be introduced with a clear symbol (e.g., Q or P*) and used consistently when defining the log-ratio score.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and insightful comments on the derivation of our attribution measure. We address the major comment below, providing clarification on the stochastic process framing and marginalization procedure. We have revised the manuscript to expand the relevant section with explicit steps and an algorithmic outline.

read point-by-point responses

Referee: The section deriving the attribution measure (following the stochastic-process framing): the claim that Bayes inversion of the product of next-token conditionals produces a structure-independent joint over full sequences from which exact marginalization over a prompt token is possible is load-bearing. Inverting P(x_t | x_<t) to obtain P(response | prompt with token i marginalized) requires either an explicit prior p(sequence) or the marginal p(prompt); neither is specified, and summation over the vocabulary at the marginalized position while preserving downstream dependencies cannot be performed from the chain rule alone without additional modeling choices. This risks contradicting the stated independence from computational structure and absence of extra assumptions.

Authors: The joint distribution over any token sequence is defined exactly by the chain rule applied to the model's next-token conditionals: P(x_1, ..., x_n) = ∏_t P(x_t | x_<t). This product is the model's representation of the distribution over sequences and depends solely on the output probabilities it returns; it is therefore independent of internal architecture details such as attention weights or layer computations. Bayes inversion is used to re-express the relevant conditional probabilities in terms of these output distributions, allowing us to obtain both P(response | prompt) and the marginalized version without reference to model internals. For marginalization over prompt token i, the model's own conditional at position i supplies the weights P(v | context before i) for each vocabulary item v. The marginalized conditional is then the expectation E_v [P(response | prompt with position i set to v)], where each term P(response | prompt with i = v) is again computed from the model's next-token probabilities on the modified prompt. No external prior is introduced; the weighting distribution is the model's learned conditional. We acknowledge that exact computation requires a sum over the vocabulary and have added a precise algorithmic description, including the number of forward passes required, to the revised derivation section. This procedure remains model-agnostic in the sense that it uses only black-box next-token queries. revision: partial

Circularity Check

0 steps flagged

No circularity; attribution derived directly from model's next-token probabilities

full rationale

The paper defines its probabilistic token attribution by applying Bayes' rule directly to the LLM's existing next-token log-probabilities to obtain conditional probabilities over responses. This uses the model's output distribution as input without introducing fitted parameters, self-referential definitions, or load-bearing self-citations that reduce the result to its own assumptions. The derivation remains self-contained because the joint distribution over sequences is the product of the observed conditionals, and the attribution ratio follows immediately from that product without additional modeling choices that loop back to the target quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on treating LLM token generation as a stochastic process and on the validity of Bayes inversion to recover marginal conditionals; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption LLM token generation can be modeled as a stochastic process whose conditional probabilities encode the learned distribution over sequences.
Invoked when the abstract states that the generative nature is reflected in conditional probabilities and situates LLMs within stochastic process theory.

pith-pipeline@v0.9.0 · 5746 in / 1377 out tokens · 48852 ms · 2026-05-22T08:52:00.826887+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

using Bayes rule to invert the next-token log-probabilities so as to capture the model’s internal representation of the distribution over token sequences... attribution score is the log of the ratio of these probabilities
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

situate LLMs within the mathematical theory of stochastic processes... chain rule of probability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 4 internal anchors

[1]

The Curious Case of Neural Text Degeneration

Ari Holtzman and Jan Buys and Maxwell Forbes and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1904.09751 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K

P. Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K. Greene and David Broniatowski and Mark A. Przybocki , title =. 2021 , month =. doi:https://doi.org/10.6028/NIST.IR.8312 , language =

work page doi:10.6028/nist.ir.8312 2021
[3]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

Rudin, Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell

work page
[4]

Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsi- ble ai,

Alejandro. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.inffus.2019.12.012 , url =

work page doi:10.1016/j.inffus.2019.12.012 2020
[5]

2020 , eprint=

The Curious Case of Neural Text Degeneration , author=. 2020 , eprint=

work page 2020
[6]

C. K. Chow and C. N. Liu , title =. IEEE Transactions on Information Theory , year =

work page
[7]

1988 , address =

Judea Pearl , title =. 1988 , address =

work page 1988
[8]

1995 , publisher=

Probability and Measure , author=. 1995 , publisher=

work page 1995
[9]

Elements of Information Theory , doi =

work page
[10]

2019 , eprint=

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. 2019 , eprint=

work page 2019
[11]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav , title...

work page doi:10.1162/tacl_a_00276 2019
[12]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Paperno, Denis and Kruszewski, Germ \'a n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern \'a ndez, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (...

work page doi:10.18653/v1/p16-1144 2016
[13]

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Williams, Adina and Nangia, Nikita and Bowman, Samuel R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1101

work page internal anchor Pith review doi:10.18653/v1/n18-1101 2018
[14]

, title =

Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

work page 2019
[15]

Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Sinha, Sanchit and Chen, Hanjie and Sekhon, Arshdeep and Ji, Yangfeng and Qi, Yanjun. Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing. Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 2021. doi:10.18653/v1/2021.blackboxnlp-1.33

work page doi:10.18653/v1/2021.blackboxnlp-1.33 2021
[16]

doi:10.5281/zenodo.5297715 , url =

Black, Sid and Leo, Gao and Wang, Phil and Leahy, Connor and Biderman, Stella , title =. doi:10.5281/zenodo.5297715 , url =

work page doi:10.5281/zenodo.5297715
[17]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy , title =. CoRR , volume =. 2021 , url =. 2101.00027 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page
[19]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1905.07830 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2019
[20]

Inseq: An Interpretability Toolkit for Sequence Generation Models

Sarti, Gabriele and Feldhus, Nils and Sickert, Ludwig and van der Wal, Oskar and Nissim, Malvina and Bisazza, Arianna. Inseq: An Interpretability Toolkit for Sequence Generation Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2023. doi:10.18653/v1/2023.acl-demo.40

work page doi:10.18653/v1/2023.acl-demo.40 2023
[21]

and Ravikumar, Pradeep , title =

Yeh, Chih-Kuan and Hsieh, Cheng-Yu and Suggala, Arun Sai and Inouye, David I. and Ravikumar, Pradeep , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

work page 2019
[22]

and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria

Edin, Joakim and Motzfeldt, Andreas Geert and Christensen, Casper L. and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria. Normalized AOPC : Fixing Misleading Faithfulness Metrics for Feature Attributions Explainability. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/...

work page doi:10.18653/v1/2025.acl-long.86 2025
[23]

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , doi =

work page 2016
[24]

Lundberg and Su-In Lee , title =

Scott M. Lundberg and Su-In Lee , title =. Advances in Neural Information Processing Systems , volume =

work page
[25]

Proceedings of the 34th International Conference on Machine Learning , series =

Mukund Sundararajan and Ankur Taly and Qiqi Yan , title =. Proceedings of the 34th International Conference on Machine Learning , series =

work page
[26]

2020 , eprint =

Narine Kokhlikyan and Vivek Miglani and Miguel Martin and Edward Wang and Bilal Alsallakh and Jonathan Reynolds and Alexander Melnikov and Natalia Kliushkina and Carlos Araya and Siqi Yan and Orion Reblitz-Richardson , title =. 2020 , eprint =

work page 2020
[27]

Wallace , title =

Jay DeYoung and Sarthak Jain and Nazneen Fatema Rajani and Eric Lehman and Caiming Xiong and Richard Socher and Byron C. Wallace , title =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =

work page
[28]

Wallace , title =

Sarthak Jain and Byron C. Wallace , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages =. 2019 , doi =

work page 2019
[29]

Sarah Wiegreffe and Yuval Pinter , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages =. 2019 , doi =

work page 2019
[30]

A Primer in BERT ology: What We Know About How BERT Works

Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna. A Primer in BERT ology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349

work page doi:10.1162/tacl_a_00349 2020
[31]

2021 , howpublished =

Nelson Elhage and Neel Nanda and Catherine Olsson and Tom Henighan and Nicholas Joseph and Ben Mann and Amanda Askell and Yuntao Bai and Anna Chen and Tom Conerly and Nova DasSarma and Dawn Drain and Deep Ganguli and Zac Hatfield-Dodds and Danny Hernandez and Andy Jones and Jackson Kernion and Liane Lovitt and Kamal Ndousse and Dario Amodei and Tom Brown ...

work page 2021
[32]

ArXiv , year=

From Understanding to Utilization: A Survey on Explainability for Large Language Models , author=. ArXiv , year=

work page
[33]

Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =

Shrikumar, Avanti and Greenside, Peyton and Kundaje, Anshul , title =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =. 2017 , publisher =

work page 2017
[34]

Why Should I Trust You?

Ribeiro, Marco Tulio and Singh, Sameer and Guestrin, Carlos , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , isbn =. doi:10.1145/2939672.2939778 , abstract =

work page doi:10.1145/2939672.2939778 2016
[35]

and Fergus, Rob

Zeiler, Matthew D. and Fergus, Rob. Visualizing and Understanding Convolutional Networks. Computer Vision -- ECCV 2014. 2014

work page 2014
[36]

2024 , eprint=

ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models , author=. 2024 , eprint=

work page 2024

[1] [1]

The Curious Case of Neural Text Degeneration

Ari Holtzman and Jan Buys and Maxwell Forbes and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1904.09751 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2019

[2] [2]

Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K

P. Jonathon Phillips and Carina Hahn and Peter Fontana and Amy Yates and Kristen K. Greene and David Broniatowski and Mark A. Przybocki , title =. 2021 , month =. doi:https://doi.org/10.6028/NIST.IR.8312 , language =

work page doi:10.6028/nist.ir.8312 2021

[3] [3]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

Rudin, Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell

work page

[4] [4]

Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsi- ble ai,

Alejandro. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.inffus.2019.12.012 , url =

work page doi:10.1016/j.inffus.2019.12.012 2020

[5] [5]

2020 , eprint=

The Curious Case of Neural Text Degeneration , author=. 2020 , eprint=

work page 2020

[6] [6]

C. K. Chow and C. N. Liu , title =. IEEE Transactions on Information Theory , year =

work page

[7] [7]

1988 , address =

Judea Pearl , title =. 1988 , address =

work page 1988

[8] [8]

1995 , publisher=

Probability and Measure , author=. 1995 , publisher=

work page 1995

[9] [9]

Elements of Information Theory , doi =

work page

[10] [10]

2019 , eprint=

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. 2019 , eprint=

work page 2019

[11] [11]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav , title...

work page doi:10.1162/tacl_a_00276 2019

[12] [12]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Paperno, Denis and Kruszewski, Germ \'a n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern \'a ndez, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (...

work page doi:10.18653/v1/p16-1144 2016

[13] [13]

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Williams, Adina and Nangia, Nikita and Bowman, Samuel R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1101

work page internal anchor Pith review doi:10.18653/v1/n18-1101 2018

[14] [14]

, title =

Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

work page 2019

[15] [15]

Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Sinha, Sanchit and Chen, Hanjie and Sekhon, Arshdeep and Ji, Yangfeng and Qi, Yanjun. Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing. Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 2021. doi:10.18653/v1/2021.blackboxnlp-1.33

work page doi:10.18653/v1/2021.blackboxnlp-1.33 2021

[16] [16]

doi:10.5281/zenodo.5297715 , url =

Black, Sid and Leo, Gao and Wang, Phil and Leahy, Connor and Biderman, Stella , title =. doi:10.5281/zenodo.5297715 , url =

work page doi:10.5281/zenodo.5297715

[17] [17]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy , title =. CoRR , volume =. 2021 , url =. 2101.00027 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2021

[18] [18]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page

[19] [19]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1905.07830 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2019

[20] [20]

Inseq: An Interpretability Toolkit for Sequence Generation Models

Sarti, Gabriele and Feldhus, Nils and Sickert, Ludwig and van der Wal, Oskar and Nissim, Malvina and Bisazza, Arianna. Inseq: An Interpretability Toolkit for Sequence Generation Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2023. doi:10.18653/v1/2023.acl-demo.40

work page doi:10.18653/v1/2023.acl-demo.40 2023

[21] [21]

and Ravikumar, Pradeep , title =

Yeh, Chih-Kuan and Hsieh, Cheng-Yu and Suggala, Arun Sai and Inouye, David I. and Ravikumar, Pradeep , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

work page 2019

[22] [22]

and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria

Edin, Joakim and Motzfeldt, Andreas Geert and Christensen, Casper L. and Ruotsalo, Tuukka and Maal e, Lars and Maistro, Maria. Normalized AOPC : Fixing Misleading Faithfulness Metrics for Feature Attributions Explainability. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/...

work page doi:10.18653/v1/2025.acl-long.86 2025

[23] [23]

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , doi =

work page 2016

[24] [24]

Lundberg and Su-In Lee , title =

Scott M. Lundberg and Su-In Lee , title =. Advances in Neural Information Processing Systems , volume =

work page

[25] [25]

Proceedings of the 34th International Conference on Machine Learning , series =

Mukund Sundararajan and Ankur Taly and Qiqi Yan , title =. Proceedings of the 34th International Conference on Machine Learning , series =

work page

[26] [26]

2020 , eprint =

Narine Kokhlikyan and Vivek Miglani and Miguel Martin and Edward Wang and Bilal Alsallakh and Jonathan Reynolds and Alexander Melnikov and Natalia Kliushkina and Carlos Araya and Siqi Yan and Orion Reblitz-Richardson , title =. 2020 , eprint =

work page 2020

[27] [27]

Wallace , title =

Jay DeYoung and Sarthak Jain and Nazneen Fatema Rajani and Eric Lehman and Caiming Xiong and Richard Socher and Byron C. Wallace , title =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =

work page

[28] [28]

Wallace , title =

Sarthak Jain and Byron C. Wallace , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages =. 2019 , doi =

work page 2019

[29] [29]

Sarah Wiegreffe and Yuval Pinter , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages =. 2019 , doi =

work page 2019

[30] [30]

A Primer in BERT ology: What We Know About How BERT Works

Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna. A Primer in BERT ology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349

work page doi:10.1162/tacl_a_00349 2020

[31] [31]

2021 , howpublished =

Nelson Elhage and Neel Nanda and Catherine Olsson and Tom Henighan and Nicholas Joseph and Ben Mann and Amanda Askell and Yuntao Bai and Anna Chen and Tom Conerly and Nova DasSarma and Dawn Drain and Deep Ganguli and Zac Hatfield-Dodds and Danny Hernandez and Andy Jones and Jackson Kernion and Liane Lovitt and Kamal Ndousse and Dario Amodei and Tom Brown ...

work page 2021

[32] [32]

ArXiv , year=

From Understanding to Utilization: A Survey on Explainability for Large Language Models , author=. ArXiv , year=

work page

[33] [33]

Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =

Shrikumar, Avanti and Greenside, Peyton and Kundaje, Anshul , title =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =. 2017 , publisher =

work page 2017

[34] [34]

Why Should I Trust You?

Ribeiro, Marco Tulio and Singh, Sameer and Guestrin, Carlos , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =. 2016 , isbn =. doi:10.1145/2939672.2939778 , abstract =

work page doi:10.1145/2939672.2939778 2016

[35] [35]

and Fergus, Rob

Zeiler, Matthew D. and Fergus, Rob. Visualizing and Understanding Convolutional Networks. Computer Vision -- ECCV 2014. 2014

work page 2014

[36] [36]

2024 , eprint=

ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models , author=. 2024 , eprint=

work page 2024