
arxiv: 2503.16771 · v3 · submitted 2025-03-21 · 💻 cs.SE · cs.LG

Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation

Pith reviewed 2026-05-22 23:38 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords: code rationales, global interpretability, token-level explanations, LM4Code, Shannon entropy, syntactic cues, human alignment, code generation

The pith

Aggregating token-level rationales into code categories enables global analysis that exposes LLM preference for syntactic cues and misalignment with humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces code rationales as a way to lift token-level explanations up to high-level programming categories. Aggregating thousands of these mapped explanations produces statistical signals about how models reason across entire code snippets. The aggregation reduces explanation uncertainty, measured as Shannon entropy, by more than 50 percent and shows that one model, codeparrot-small, consistently relies on surface features such as indentation rather than deeper semantic logic. A user study with 37 participants finds that the model's reasoning is significantly misaligned with that of human developers. These patterns remain invisible to standard accuracy scores, so the work argues that global, code-based views are needed to build trust in code-generating models.

Core claim

CodeQ maps token-level rationales onto programming categories, and the resulting aggregates distill a clearer signal that reveals consistent model behaviors, including a preference for shallow syntactic cues over semantic logic, while also showing statistically significant misalignment with human developer reasoning.

What carries the argument

code rationales (CodeQ), the mapping from token-level rationales to high-level programming categories that supports aggregation and statistical analysis
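
As a rough illustration of the mapping step, the Python sketch below sums token-level rationale scores per category and normalizes the result into a distribution. The category names, the `TOKEN_CATEGORY` lookup, the example scores, and the choice of summation are all assumptions made for illustration; this page does not specify the paper's taxonomy or its aggregation function.

```python
from collections import defaultdict

# Hypothetical token -> category lookup (illustrative only, not the paper's taxonomy).
TOKEN_CATEGORY = {
    "def": "structural",
    "return": "structural",
    "    ": "indentation",
    "x": "identifier",
    "+": "operator",
}


def aggregate_rationales(token_scores, token_category=TOKEN_CATEGORY, default="other"):
    """Sum token-level rationale scores per category and normalize to a distribution.

    token_scores is a list of (token, score) pairs. Summation is an assumed
    aggregation; a mean or median would be equally plausible choices.
    """
    totals = defaultdict(float)
    for token, score in token_scores:
        totals[token_category.get(token, default)] += score
    mass = sum(totals.values())
    if mass == 0:
        return dict(totals)
    return {category: value / mass for category, value in totals.items()}


# Made-up rationale scores for one generated snippet.
scores = [("def", 0.2), ("    ", 0.4), ("x", 0.1), ("return", 0.2), ("+", 0.1)]
category_dist = aggregate_rationales(scores)
```

Repeating this over many generated snippets and pooling the per-category distributions is what would make a global, statistical comparison possible; the sketch covers a single snippet only.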

If this is right

  • Statistical patterns become visible that traditional token-level or accuracy metrics miss.
  • Explanation uncertainty measured by Shannon entropy drops by more than 50 percent after aggregation.
  • Models exhibit a measurable preference for indentation and other surface syntax over deeper semantic features.
  • User studies can quantify misalignment between model reasoning and human developer reasoning.
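
The entropy claim can be made concrete with a toy calculation. The sketch below computes Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, over a token-level distribution and again after grouping tokens into categories. The tokens, masses, and category mapping are hypothetical values chosen so the arithmetic is easy to follow; they are not the paper's data.

```python
import math
from collections import Counter


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Toy rationale mass spread evenly over eight tokens (hypothetical values).
token_mass = {"def": 1, "(": 1, ")": 1, ":": 1, "    ": 1, "\n": 1, "x": 1, "y": 1}

# Hypothetical token -> category mapping.
category_of = {
    "def": "syntax", "(": "syntax", ")": "syntax", ":": "syntax",
    "    ": "whitespace", "\n": "whitespace",
    "x": "identifier", "y": "identifier",
}

category_mass = Counter()
for token, mass in token_mass.items():
    category_mass[category_of[token]] += mass

h_tokens = shannon_entropy(normalize(list(token_mass.values())))
h_categories = shannon_entropy(normalize(list(category_mass.values())))
reduction = 1 - h_categories / h_tokens
```

In this toy case the token-level entropy is 3 bits (uniform over eight tokens), the category-level entropy is 1.5 bits, and the reduction is 50 percent. Because merging outcomes can never raise entropy, part of any such drop comes from the grouping itself, which is why a baseline with shuffled mappings matters when interpreting the number.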

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could run CodeQ-style aggregation on their own model outputs to detect hidden biases before deployment.
  • The same category-mapping step might apply to non-code generation tasks where token explanations are noisy.
  • Training objectives could be adjusted to penalize over-reliance on the syntactic cues the aggregates flag.
  • Global views of this kind might become part of standard model cards for code models.

Load-bearing premise

The mapping of token rationales to programming categories faithfully reflects the model's actual reasoning without adding aggregation artifacts.

What would settle it

Apply the same aggregation procedure to a different code model and measure whether the entropy reduction disappears or the syntactic bias reverses.

Figures

Figures reproduced from arXiv: 2503.16771 by Alejandro Velasco, Daniel Rodriguez-Cardenas, David N. Palacio, Denys Poshyvanyk, Dipin Khati, Michele Tufano.

Figure 1: Conceptual Dependency Map. Surrounding text captured by the extraction: "… reasoning, indicating directions for future work; (4) We published an online appendix [26] that contains documented notebooks for researchers, experimental data, source code, models, and the statistical analysis of the results of the user study. [Section 2: Why code-based global explanations?] Although researchers acknowledge the need for interpretability in LM4Code, existing techniques pro…"
Figure 3: CodeQ Interpretability Framework. Surrounding text captured by the extraction: "[Section 4: Research Questions] We conducted an exploratory analysis and a user study to explore the following RQs: RQ1 [Applicability]: How applicable is CodeQ to interpret code generation? We define applicability as the ability to use CodeQ in crafting understandable explanations. This RQ focuses on exploring the experience of using CodeQ to explain the behavior of LM4Code in code g…"
Figure 2: eWASH Focal Context Window for Test Generation. Diagram labels recovered from the extraction: Prompt, Generated Code, Testbeds, Focal Method, Encoder-Decoder, Compatible Model, Code Concepts, Categories, eWASH, AST-Tree, Decoder, Pre-conditions, Code completion, Test generation, Local, Global, Dependency Maps, Heatmaps, Frequencies, Code Explanation, Rationalization, Mapping, Reduction, Interpretability Tensor.
Figure 4: Interpretability Concepts ($C$) for Java and Python. Surrounding text captured by the extraction: "(1) signature + truncated body ($TB_1$), (2) docstring + signature + body ($TB_2$), (3) docstring + signature ($TB_3$), and (4) docstring only ($TB_4$). Each testbed contains 100 unique prompts. To support robust statistical analysis, we sampled 30 rationale sets for each prompt, obtaining a total of 12K model-generated sequences for code generation. For test case …"
Figure 5: CodeQ Interpretability Tensors. Surrounding text captured by the extraction: "Component 2: Mapping. To group token rationales into interpretable structures, we map each token in the rationale matrix $\phi$ to a concept $c \in C$ ($L_1$). This results in a matrix over the concept space: $\phi_C = \phi_{[|C| \times |C|]} = \mathrm{map}(\phi_{[T \times T]}, C)$ (2). Here, $\phi_C \in \mathbb{R}^{|C| \times |C|}$ is the concept-level rationale matrix, where rows represent source concepts and columns represent target con…"
Figure 6: Category and Context window level rationales
Original abstract

As Large Language Models for Code (LM4Code) become integral to software engineering, establishing trust in their output becomes critical. However, standard accuracy metrics obscure the underlying reasoning of generative models, offering little insight into how decisions are made. Although post-hoc interpretability methods attempt to fill this gap, they often restrict explanations to local, token-level insights, which fail to provide a developer-understandable global analysis. Our work highlights the urgent need for global, code-based explanations that reveal how models reason across code. To support this vision, we introduce code rationales (CodeQ), a framework that enables global interpretability by mapping token-level rationales to high-level programming categories. Aggregating thousands of these token-level explanations allows us to perform statistical analyses that expose systemic reasoning behaviors. We validate this aggregation by showing it distills a clear signal from noisy token data, reducing explanation uncertainty (Shannon entropy) by over 50%. Additionally, we find that a code generation model (codeparrot-small) consistently favors shallow syntactic cues (e.g., indentation) over deeper semantic logic. Furthermore, in a user study with 37 participants, we find its reasoning is significantly misaligned with that of human developers. These findings, hidden from traditional metrics, demonstrate the importance of global interpretability techniques to foster trust in LM4Code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the CodeQ framework, which maps token-level rationales from code generation models to high-level programming categories. Aggregating thousands of these mappings enables statistical analyses that purportedly expose systemic reasoning behaviors; the authors validate the approach by reporting a >50% reduction in Shannon entropy and apply it to show that codeparrot-small favors syntactic cues (e.g., indentation) over semantic logic, with further evidence from a 37-participant user study indicating misalignment with human developers.

Significance. If the entropy reduction and category mappings can be shown to reflect genuine model reasoning rather than methodological artifacts, the work would offer a practical route to global, developer-interpretable explanations for LM4Code that standard accuracy metrics cannot provide. The user-study component supplies direct evidence of misalignment, which is a useful empirical contribution. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are described.

major comments (1)
  1. [Abstract and, presumably, the results/validation section describing the entropy calculation] The central claim that aggregation 'distills a clear signal from noisy token data' and reduces explanation uncertainty by over 50% is load-bearing for validating CodeQ, yet no control is reported (e.g., random or frequency-preserving permuted mappings from tokens to the same category set). Any collapse from a large token vocabulary to a small number of categories necessarily lowers entropy by construction; without the control, it is impossible to separate a binning artifact from genuine distillation of coherent reasoning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting a methodological gap in our validation of the entropy reduction. We agree that demonstrating the reduction exceeds what would be expected from binning alone is essential to support the claim of distilling coherent reasoning signals. We will add the requested control experiment in the revision.

Point-by-point responses
  1. Referee: [Abstract and, presumably, the results/validation section describing the entropy calculation] The central claim that aggregation 'distills a clear signal from noisy token data' and reduces explanation uncertainty by over 50% is load-bearing for validating CodeQ, yet no control is reported (e.g., random or frequency-preserving permuted mappings from tokens to the same category set). Any collapse from a large token vocabulary to a small number of categories necessarily lowers entropy by construction; without the control, it is impossible to separate a binning artifact from genuine distillation of coherent reasoning.

    Authors: We acknowledge the validity of this concern. The reported >50% entropy reduction is intended to show that CodeQ mappings capture non-random structure, but without a baseline using random or permuted token-to-category assignments (while preserving the category set and ideally frequency distribution), the reduction could partly result from vocabulary compression. In the revised manuscript we will add a control: (1) generate random mappings from the original token vocabulary to the same category set, (2) compute the resulting aggregate distributions and Shannon entropy, and (3) compare the observed reduction against this null distribution across multiple random seeds. We will report the statistical significance of the difference and update the abstract and results section to reflect the control. This directly addresses the load-bearing claim. revision: yes
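
The three-step control described in the rebuttal can be sketched as a permutation test. The Python below is one possible reading, not the authors' implementation: it shuffles category labels across tokens (which keeps the category set and the number of tokens per category fixed), recomputes the category-level entropy for each shuffle, and compares the observed entropy with that null distribution. All function names, the toy masses, and the add-one p-value estimator are assumptions introduced here.

```python
import math
import random


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def category_entropy(token_mass, category_of):
    """Entropy of the distribution obtained by pooling token mass per category."""
    totals = {}
    for token, mass in token_mass.items():
        category = category_of[token]
        totals[category] = totals.get(category, 0.0) + mass
    grand_total = sum(totals.values())
    return shannon_entropy([value / grand_total for value in totals.values()])


def permuted_null(token_mass, category_of, n_permutations=1000, seed=0):
    """Entropies under random token-to-category assignments with shuffled labels."""
    rng = random.Random(seed)
    tokens = list(category_of)
    labels = [category_of[token] for token in tokens]
    null = []
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(category_entropy(token_mass, dict(zip(tokens, shuffled))))
    return null


def empirical_p_value(observed, null):
    """Share of null entropies at or below the observed one, with add-one smoothing."""
    hits = sum(1 for value in null if value <= observed)
    return (hits + 1) / (len(null) + 1)


# Hypothetical rationale mass per token and a hypothetical category mapping.
token_mass = {"    ": 40.0, "\n": 30.0, "def": 5.0, "(": 5.0,
              ")": 5.0, ":": 5.0, "x": 5.0, "y": 5.0}
category_of = {"    ": "whitespace", "\n": "whitespace",
               "def": "syntax", "(": "syntax", ")": "syntax", ":": "syntax",
               "x": "identifier", "y": "identifier"}

observed = category_entropy(token_mass, category_of)
null = permuted_null(token_mass, category_of, n_permutations=200, seed=0)
p_value = empirical_p_value(observed, null)
```

Running `permuted_null` with several values of `seed` gives the across-seed comparison the rebuttal mentions. A small `p_value` would suggest the real mapping concentrates rationale mass more than label shuffling does; a large one would point toward the binning artifact the referee describes.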

Circularity Check

0 steps flagged

No significant circularity; empirical validation stands independently

full rationale

The paper presents an empirical framework for mapping token rationales to programming categories, followed by aggregation, entropy measurement, and a 37-participant user study. No equations, fitted parameters, or first-principles derivations appear in the text. The entropy reduction is reported as an observed outcome of the aggregation step rather than a constructed equivalence or prediction that reduces to the input by definition. The work relies on external benchmarks (user study) and does not invoke self-citations as load-bearing premises. This matches the default case of a self-contained empirical study with no reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is abstract-only; the central claim rests on the assumption that token-to-category mapping is faithful and that aggregation yields a true signal rather than noise.

axioms (1)
  • [domain assumption] Token-level rationales from post-hoc methods can be reliably grouped into high-level programming categories without distorting model behavior.
    This mapping is the core operation of CodeQ and is invoked to justify the statistical analyses.
invented entities (1)
  • CodeQ framework (no independent evidence)
    purpose: To convert local token rationales into global code-category explanations
    Newly introduced construct whose validity is asserted by the entropy reduction and user study.

pith-pipeline@v0.9.0 · 5811 in / 1271 out tokens · 50375 ms · 2026-05-22T23:38:41.941256+00:00 · methodology



Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 15 internal anchors

  1. [1]

    Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. Technical Report. Anthropic. https://www.anthropic.com/news/claude-3-family

  2. [2]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:cs.CL/2212.08073 https://arxiv.org/abs/2212.08073

  3. [3]

    Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju

  4. [4]

    An Empirical Exploration of Trust Dynamics in LLM Supply Chains. arXiv:cs.HC/2405.16310 https://arxiv.org/abs/2405.16310

  5. [5]

    Sebastian Baltes and Paul Ralph. 2021. Sampling in Software Engineering Research: A Critical Review and Guidelines. arXiv:2002.07764

  6. [6]

    Jasmijn Bastings and Katja Filippova. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? (2020), 149–155. https://doi.org/10.18653/v1/2020.blackboxnlp-1.14 arXiv: 2010.05607

  7. [7]

    Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media.

  8. [8]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. 2020. Language Models are Few-Shot Learners. arXiv:cs.CL/2005.14165 https://arxiv.org/abs/2005.14165

  9. [9]

    Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, et al. 2020. On Identifiability in Transformers. arXiv:cs.CL/1908.04211

  10. [10]

    Max Brunsfeld, Andrew Hlynskyi, Patrick Thomson, Josh Vera, Phil Turnbull, et al. 2023. tree-sitter/tree-sitter: v0.20.8. https://doi.org/10.5281/zenodo.7798573

  11. [11]

    Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, et al. 2023. Rethink reporting of evaluation results in AI. 380, 6641 (2023), 136–138. https://doi.org/10.1126/science.adf6369

  12. [12]

    Kathy Charmaz. 2006. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. SAGE Publications Inc.

  13. [13]

    Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. 2018. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. arXiv:cs.LG/1802.07814 https://arxiv.org/abs/1802.07814

  14. [14]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 http://arxiv.org/abs/2107.03374

  15. [15]

    Zixi Chen, Varshini Subhash, Marton Havasi, Weiwei Pan, and Finale Doshi-Velez

  16. [16]

    What Makes a Good Explanation?: A Harmonized View of Properties of Explanations. http://arxiv.org/abs/2211.05667 arXiv:2211.05667 [cs]

  17. [17]

    Jürgen Cito, Isil Dillig, Vijayaraghavan Murali, and Satish Chandra. 2022. Counterfactual explanations for models of code. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice. ACM, Pittsburgh Pennsylvania, 125–134. https://doi.org/10.1145/3510457.3513081

  18. [18]

    Colin B. Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, et al. [n. d.]. Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy. arXiv:2109.08780 [cs] http://arxiv.org/abs/2109.08780

  19. [19]

    Sudipta Dey and Tathagata Roy Chowdhury. 2024. A Comparative Survey of SHAP and LIME: Explaining Machine Learning Models for Transparent AI. International Journal of Innovative Research in Education 11 (11 2024), 827–835

  20. [20]

    Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017). https://arxiv.org/abs/1702.08608

  21. [21]

    Fabian Fagerholm, Michael Felderer, Davide Fucci, Michael Unterkalmsteiner, Bogdan Marculescu, et al. 2022. Cognition in Software Engineering: A Taxonomy and Survey of a Half-Century of Research. arXiv:cs.SE/2201.05551 https://arxiv.org/abs/2201.05551

  22. [22]

    Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of Programs from Large Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1469–1481. https://doi.org/10.1109/ICSE48619.2023.00128

  23. [23]

    Amirata Ghorbani, James Wexler, James Zou, and Been Kim. 2019. Towards automatic concept-based explanations. Curran Associates Inc., Red Hook, NY, USA

  24. [24]

    Hugging Face. 2022. CodeParrot. https://huggingface.co/codeparrot. Accessed: 2024-07-23

  25. [25]

    Alon Jacovi and Yoav Goldberg. 2020. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? arXiv:cs.CL/2004.03685

  26. [26]

    Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. https://doi.org/10.48550/arXiv.1902.10186 arXiv:1902.10186 [cs]

  27. [27]

    Dipin Khati, Yijin Liu, David N. Palacio, Yixuan Zhang, and Denys Poshyvanyk

  28. [28]

    Mapping the Trust Terrain: LLMs in Software Engineering – Insights and Perspectives. arXiv:cs.SE/2503.13793 https://arxiv.org/abs/2503.13793

  29. [29]

    SEMERU Lab. 2024. Code Rationale. https://github.com/WM-SEMERU/code-rationales/. Accessed: 2025-05-19

  30. [30]

    Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, et al. 2019. Human Evaluation of Models Built for Interpretability. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7 (Oct. 2019), 59–67. https://doi.org/10.1609/hcomp.v7i1.5280

  31. [31]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, et al. 2023. StarCoder: may the source be with you! arXiv:cs.CL/2305.06161 https://arxiv.org/abs/2305.06161

  32. [32]

    Zheng Li, Fuxiang Sun, Haifeng Wang, Yifan Ding, Yong Liu, et al. 2021. CLACER: A Deep Learning-based Compilation Error Classification Method for Novice Students’ Programs. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC). 74–83. https://doi.org/10.1109/COMPSAC51774.2021.00022

  33. [33]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, et al

  34. [34]

    Holistic Evaluation of Language Models. arXiv:cs.CL/2211.09110 https://arxiv.org/abs/2211.09110

  35. [35]

    Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. 2021. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 23, 1 (2021). https://doi.org/10.3390/e23010018

  36. [36]

    Zachary C. Lipton. 2017. The Mythos of Model Interpretability. arXiv:1606.03490

  37. [37]

    Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2023. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. arXiv:cs.SE/2308.04838

  38. [38]

    Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). Curran Associates Inc., 4765–4774

  39. [39]

    Marina Meilă. 2007. Comparing clusterings—an information based distance. Journal of Multivariate Analysis 98, 5 (2007), 873–895. https://doi.org/10.1016/j.jmva.2006.11.013

  40. [40]

    Ahmad Haji Mohammadkhani, Chakkrit Tantithamthavorn, and Hadi Hemmati

  41. [41]

    Explainable AI for Pre-Trained Code Models: What Do They Learn? When They Do Not Work?. In Proceedings of the 23rd IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 1–11

  42. [42]

    Akash Kumar Mohankumar, Preksha Nema, Sharan Narasimhan, Mitesh M. Khapra, Balaji Vasan Srinivasan, et al. 2020. Towards Transparent and Explainable Attention Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4206–4216. https://doi.org/10.18653/v1/2020.acl-main.387

  43. [43]

    John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, et al. 2025. How much do language models memorize? arXiv:cs.CL/2505.24832 https://arxiv.org/abs/2505.24832

  44. [44]

    Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, et al. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv:cs.CL/1904.01038 https://arxiv.org/abs/1904.01038

  45. [45]

    David N. Palacio, Daniel Rodriguez-Cardenas, Alejandro Velasco, Dipin Khati, Kevin Moran, et al. 2024. Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations. arXiv:cs.SE/2407.08983 https://arxiv.org/abs/2407.08983

  46. [46]

    David N. Palacio, Alejandro Velasco, Nathan Cooper, Alvaro Rodriguez, Kevin Moran, et al. 2024. Toward a Theory of Causation for Interpreting Neural Code Models. IEEE Transactions on Software Engineering 50, 5 (May 2024), 1215–1243. https://doi.org/10.1109/TSE.2024.3379943 arXiv:2302.03788 [cs, stat]

  47. [47]

    Tobias Peters and Roel Visser. 2023. The Importance of Distrust in AI. https://doi.org/10.48550/arXiv.2307.13601

  48. [48]

    Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. 2021. Manipulating and Measuring Model Interpretability. http://arxiv.org/abs/1802.07810 arXiv:1802.07810 [cs]

  49. [49]

    Qualtrics. 2024. Qualtrics XM. https://www.qualtrics.com/. Accessed: 2024-02-13

  50. [50]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, et al. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:cs.SE/2009.10297 https://arxiv.org/abs/2009.10297

  51. [51]

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. https://doi.org/10.48550/arXiv.1602.04938 arXiv:1602.04938 [cs, stat]

  52. [52]

    Marco Tulio Ribeiro, Tongshuang Sherry Wu, Carlos Guestrin, and Sameer Singh

  53. [53]

    Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:218551201

  54. [54]

    Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, and Denys Poshyvanyk. 2023. Benchmarking Causal Study to Interpret Large Language Models for Source Code. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). 329–334. https://doi.org/10.1109/ICSME58846.2023.00040

  55. [55]

    Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? https://doi.org/10.48550/arXiv.1906.03731 arXiv:1906.03731 [cs]

  56. [56]

    C. E. Shannon. 1948. A Mathematical Theory of Communication. Bell System Technical Journal 27, 3 (1948), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

  57. [57]

    Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2019. Learning Important Features Through Propagating Activation Differences. https://doi.org/10.48550/arXiv.1704.02685 arXiv:1704.02685 [cs]

    (Page header carried over from the source PDF: Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation. ICSE ’26, April 12–18, 2026, Rio de Janeiro, Brazil)

  58. [58]

    Kacper Sokol and Peter Flach. 2019. Desiderata for Interpretability: Explaining Decision Tree Predictions with Counterfactuals. Proceedings of the AAAI Conference on Artificial Intelligence 33 (07 2019), 10035–10036. https://doi.org/10.1609/aaai.v33i01.330110035

  59. [59]

    Management Solutions. 2022. Explainable Artificial Intelligence (XAI). Challenges of model interpretability. https://www.managementsolutions.com/sites/default/files/minisite/static/22959b0f-b3da-47c8-9d5c-80ec3216552b/iax/pdf/explainable-artificial-intelligence-en-04.pdf. Accessed: 18 June 2024

  60. [60]

    Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, et al. 2024. TrustLLM: Trustworthiness in Large Language Models. arXiv:cs.CL/2401.05561 https://arxiv.org/abs/2401.05561

  61. [61]

    Simon Thorne. 2024. Understanding the Interplay Between Trust, Reliability, and Human Factors in the Age of Generative AI. International Journal of Simulation: Systems, Science & technology (5 May 2024). https://doi.org/10.5013/ijssst.a.25.01.10

  62. [62]

    Michele Tufano, Shao Kun Deng, Neel Sundaresan, and Alexey Svyatkovskiy. 2022. Methods2Test: a dataset of focal methods mapped to test cases. In Proceedings of the 19th International Conference on Mining Software Repositories (MSR ’22). ACM, 299–303. https://doi.org/10.1145/3524842.3528009

  63. [63]

    Keyon Vafa, Yuntian Deng, David M. Blei, and Alexander M. Rush. 2017. Rationales for Sequential Predictions. arXiv:2109.06387 [cs] http://arxiv.org/abs/2109.06387

  64. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al

  65. [65]

    Attention Is All You Need. arXiv:1706.03762 [cs] http://arxiv.org/abs/1706.03762

  66. [66]

    David Gray Widder, Laura Dabbish, James D. Herbsleb, Alexandra Holloway, and Scott Davidoff. 2021. Trust in Collaborative Automation in High Stakes Software Engineering Work: A Case Study at NASA. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 184, 1...

  67. [67]

    Ralph A Wiggins. 1978. Minimum entropy deconvolution. Geoexploration 16, 1 (1978), 21–35. https://doi.org/10.1016/0016-7142(78)90005-4

  68. [68]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, et al. 2012. Experimentation in Software Engineering. Springer Science & Business Media.