Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation

Alejandro Velasco; Daniel Rodriguez-Cardenas; David N. Palacio; Denys Poshyvanyk; Dipin Khati; Michele Tufano

arxiv: 2503.16771 · v3 · submitted 2025-03-21 · 💻 cs.SE · cs.LG

Dipin Khati, Daniel Rodriguez-Cardenas, David N. Palacio, Alejandro Velasco, Michele Tufano, Denys Poshyvanyk

Pith reviewed 2026-05-22 23:38 UTC · model grok-4.3

keywords: code rationales, global interpretability, token-level explanations, LM4Code, Shannon entropy, syntactic cues, human alignment, code generation

The pith

Aggregating token-level rationales into code categories enables global analysis that exposes LLM preference for syntactic cues and misalignment with humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces code rationales as a way to lift token-level explanations up to high-level programming categories. Aggregating thousands of these mapped explanations produces statistical signals about how models reason across entire code snippets. This process cuts explanation uncertainty by more than half and shows one model consistently relies on surface features such as indentation rather than deeper logic. A study with human developers finds the model's reasoning patterns diverge from human ones in measurable ways. These patterns remain invisible to standard accuracy scores, so the work argues that global, code-based views are needed to build trust in code-generating models.

Core claim

CodeQ maps token-level rationales onto programming categories, and the resulting aggregates distill a clearer signal that reveals consistent model behaviors, including a preference for shallow syntactic cues over semantic logic, while also showing statistically significant misalignment with human developer reasoning.

What carries the argument

code rationales (CodeQ), the mapping from token-level rationales to high-level programming categories that supports aggregation and statistical analysis
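A toy illustration of that mapping and aggregation step may help. The sketch below is not the paper's implementation: the token-to-category table, the rationale scores, and the helper names `aggregate_by_category` and `shannon_entropy` are all made up for illustration, and summing scores within a category is an assumed aggregation rule.

```python
import math
from collections import defaultdict

# Hypothetical token -> category table; the paper's actual taxonomy is not shown here.
TOKEN_TO_CATEGORY = {
    "def": "keyword",
    "return": "keyword",
    "    ": "indentation",
    "\n": "indentation",
    "x": "identifier",
    "y": "identifier",
    "+": "operator",
    "(": "punctuation",
}

# Made-up token-level rationale scores (illustrative only, not experimental data).
TOKEN_SCORES = {
    "def": 1.0, "return": 1.0, "    ": 2.0, "\n": 2.0,
    "x": 0.5, "y": 0.5, "+": 0.5, "(": 0.5,
}


def shannon_entropy(dist):
    """Shannon entropy in bits of a {label: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def aggregate_by_category(token_scores, mapping):
    """Normalize token scores and sum them within each category (assumed rule)."""
    total = sum(token_scores.values())
    category_scores = defaultdict(float)
    for token, score in token_scores.items():
        category_scores[mapping.get(token, "unknown")] += score / total
    return dict(category_scores)


total = sum(TOKEN_SCORES.values())
token_dist = {t: s / total for t, s in TOKEN_SCORES.items()}
category_dist = aggregate_by_category(TOKEN_SCORES, TOKEN_TO_CATEGORY)

print(f"token-level entropy:    {shannon_entropy(token_dist):.3f} bits")
print(f"category-level entropy: {shannon_entropy(category_dist):.3f} bits")
print(f"mass on indentation:    {category_dist['indentation']:.2f}")
```

On this made-up input the token-level distribution carries 2.75 bits and the category-level one 1.875 bits, with half of the mass on the hypothetical `indentation` category. The numbers only show the mechanics; they say nothing about any real model.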

If this is right

Statistical patterns become visible that traditional token-level or accuracy metrics miss.
Explanation uncertainty measured by Shannon entropy drops by more than 50 percent after aggregation.
Models exhibit a measurable preference for indentation and other surface syntax over deeper semantic features.
User studies can quantify misalignment between model reasoning and human developer reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could run CodeQ-style aggregation on their own model outputs to detect hidden biases before deployment.
The same category-mapping step might apply to non-code generation tasks where token explanations are noisy.
Training objectives could be adjusted to penalize over-reliance on the syntactic cues the aggregates flag.
Global views of this kind might become part of standard model cards for code models.

Load-bearing premise

The mapping of token rationales to programming categories faithfully reflects the model's actual reasoning without adding aggregation artifacts.

What would settle it

Apply the same aggregation procedure to a different code model and measure whether the entropy reduction disappears or the syntactic bias reverses.

Figures

Figures reproduced from arXiv: 2503.16771 by Alejandro Velasco, Daniel Rodriguez-Cardenas, David N. Palacio, Denys Poshyvanyk, Dipin Khati, Michele Tufano.

**Figure 1.** Conceptual Dependency Map. Adjacent text in the paper: "… reasoning, indicating directions for future work; (4) We published an online appendix [26] that contains documented notebooks for researchers, experimental data, source code, models, and the statistical analysis of the results of the user study. 2 Why code-based global explanations? Although researchers acknowledge the need for interpretability in LM4Code, existing techniques pro…"

**Figure 3.** CodeQ Interpretability Framework. Adjacent text in the paper: "4 Research Questions. We conducted an exploratory analysis and a user study to explore the following RQs: RQ1 [Applicability]: How applicable is CodeQ to interpret code generation? We define applicability as the ability to use CodeQ in crafting understandable explanations. This RQ focuses on exploring the experience of using CodeQ to explain the behavior of LM4Code in code g…"

**Figure 2.** eWASH Focal Context Window for Test Generation. Diagram labels recovered from the image: Prompt; Generated Code; Testbeds; Focal Method; Encoder-Decoder; Compatible Model; Code Concepts; Categories; eWASH; AST-Tree; Decoder; Pre-conditions; Code completion; Test generation; Local; Global; Dependency Maps; Heatmaps; Frequencies; Code Explanation; Rationalization; Mapping; Reduction; Interpretability Tensor; C1, C2, C3.

**Figure 4.** Interpretability Concepts (C) for Java and Python. Adjacent text in the paper: "(1) signature + truncated body ($TB_1$), (2) docstring + signature + body ($TB_2$), (3) docstring + signature ($TB_3$), and (4) docstring only ($TB_4$). Each testbed contains 100 unique prompts. To support robust statistical analysis, we sampled 30 rationale sets for each prompt, obtaining a total of 12K model-generated sequences for code generation. For test case …"

**Figure 5.** CodeQ Interpretability Tensors. Adjacent text in the paper: "Component 2: Mapping. To group token rationales into interpretable structures, we map each token in the rationale matrix $\phi$ to a concept $c \in \mathcal{C}$ ($L_1$). This results in a matrix over the concept space: $\phi_{\mathcal{C}} = \phi_{[|\mathcal{C}| \times |\mathcal{C}|]} = \mathrm{map}(\phi_{[T \times T]}, \mathcal{C})$ (2). Here, $\phi_{\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}| \times |\mathcal{C}|}$ is the concept-level rationale matrix, where rows represent source concepts and columns represent target con…"

**Figure 6.** Category and Context window level rationales.
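The mapping step quoted in the Figure 5 text (Equation 2) collapses a token-by-token rationale matrix into a concept-by-concept one. The following is a minimal sketch of one way to do that, not the authors' code: the function name `map_to_concepts`, the toy matrix, the concept labels, and the choice to aggregate by summation are all assumptions.

```python
def map_to_concepts(phi, token_concepts, concepts):
    """Collapse a T x T token rationale matrix into a |C| x |C| concept matrix.

    phi[i][j] is the rationale score linking token i (row) to token j (column).
    token_concepts[i] is the concept label assigned to token i.
    Scores are summed per (row concept, column concept) pair; summation is an
    assumption of this sketch, not a rule stated in the source.
    """
    index = {concept: k for k, concept in enumerate(concepts)}
    phi_c = [[0.0] * len(concepts) for _ in concepts]
    for i, row in enumerate(phi):
        for j, score in enumerate(row):
            phi_c[index[token_concepts[i]]][index[token_concepts[j]]] += score
    return phi_c


# Hypothetical concept set and per-token labels for a four-token snippet.
concepts = ["keyword", "identifier", "indentation"]
token_concepts = ["keyword", "identifier", "indentation", "identifier"]

# Made-up lower-triangular rationale matrix (each token relates only to earlier ones).
phi = [
    [0.0,   0.0,   0.0, 0.0],
    [0.5,   0.0,   0.0, 0.0],
    [0.25,  0.25,  0.0, 0.0],
    [0.125, 0.125, 0.5, 0.0],
]

phi_c = map_to_concepts(phi, token_concepts, concepts)
for concept, row in zip(concepts, phi_c):
    print(f"{concept:12s} {row}")
```

Summation keeps the total rationale mass unchanged, which makes the concept matrix easy to normalize afterwards. Averaging within each cell would be an equally plausible choice and would change the resulting values.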

Original abstract

As Large Language Models for Code (LM4Code) become integral to software engineering, establishing trust in their output becomes critical. However, standard accuracy metrics obscure the underlying reasoning of generative models, offering little insight into how decisions are made. Although post-hoc interpretability methods attempt to fill this gap, they often restrict explanations to local, token-level insights, which fail to provide a developer-understandable global analysis. Our work highlights the urgent need for **global, code-based** explanations that reveal how models reason across code. To support this vision, we introduce *code rationales* (CodeQ), a framework that enables global interpretability by mapping token-level rationales to high-level programming categories. Aggregating thousands of these token-level explanations allows us to perform statistical analyses that expose systemic reasoning behaviors. We validate this aggregation by showing it distills a clear signal from noisy token data, reducing explanation uncertainty (Shannon entropy) by over 50%. Additionally, we find that a code generation model (*codeparrot-small*) consistently favors shallow syntactic cues (e.g., **indentation**) over deeper semantic logic. Furthermore, in a user study with 37 participants, we find its reasoning is significantly misaligned with that of human developers. These findings, hidden from traditional metrics, demonstrate the importance of global interpretability techniques to foster trust in LM4Code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CodeQ aggregates token explanations into code categories for global stats on LM4Code behavior, but the >50% entropy drop lacks a control to separate binning effects from actual signal.

The paper introduces CodeQ as a way to lift token-level rationales up to high-level programming categories, then aggregate thousands of them for statistical views of model reasoning across many examples. This produces claims about systemic biases like favoring indentation over semantics, plus a reported misalignment with human developers in a 37-person study. The entropy reduction is positioned as evidence that aggregation extracts coherent patterns from noisy token data.

That direction addresses a genuine gap: local explanations rarely scale to the kind of global understanding developers need for trust in code models. The user-study component also brings in a human-centered angle that most token-only work skips. The framework itself looks like a fresh attempt relative to the token-focused baselines cited.

The main soft spot is the entropy claim. Collapsing a large token space into a small number of categories will shrink Shannon entropy by construction, so the drop could be mostly methodological compression rather than proof of distilled reasoning. A control that applies the same mapping to randomized or permuted token data would separate those cases, but nothing like that appears in the abstract. The category definitions and mapping validation also stay thin, which makes it hard to judge whether the aggregation preserves fidelity or introduces its own artifacts. The syntactic-cue and misalignment findings are worth checking but rest on the same untested mapping step.

This is for people already working on post-hoc interpretability for code LLMs who want ideas for moving past per-token views. A reader could pull the aggregation concept and try it, but would need the full methods to replicate or extend. It deserves peer review because the problem is real and the empirical angle is concrete enough for referees to test the controls and details.

Referee Report

1 major / 0 minor

Summary. The paper introduces the CodeQ framework, which maps token-level rationales from code generation models to high-level programming categories. Aggregating thousands of these mappings enables statistical analyses that purportedly expose systemic reasoning behaviors; the authors validate the approach by reporting a >50% reduction in Shannon entropy and apply it to show that codeparrot-small favors syntactic cues (e.g., indentation) over semantic logic, with further evidence from a 37-participant user study indicating misalignment with human developers.

Significance. If the entropy reduction and category mappings can be shown to reflect genuine model reasoning rather than methodological artifacts, the work would offer a practical route to global, developer-interpretable explanations for LM4Code that standard accuracy metrics cannot provide. The user-study component supplies direct evidence of misalignment, which is a useful empirical contribution. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are described.

major comments (1)

[Abstract, and presumably the results/validation section describing the entropy calculation] The central claim that aggregation 'distills a clear signal from noisy token data' and reduces explanation uncertainty by over 50% is load-bearing for validating CodeQ, yet no control is reported (e.g., random or frequency-preserving permuted mappings from tokens to the same category set). Any collapse from a large token vocabulary to a small number of categories necessarily lowers entropy by construction; without the control, it is impossible to separate a binning artifact from genuine distillation of coherent reasoning.
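The "by construction" part of this comment follows from a standard information-theoretic identity. In the derivation below, $T$ denotes a token-level random variable and $C = f(T)$ its category under a deterministic mapping $f$; this notation is introduced here for illustration and is not the paper's.

```latex
\begin{align*}
H(T) &= H(T, C)
  && \text{because } C = f(T) \text{ is fully determined by } T \\
     &= H(C) + H(T \mid C)
  && \text{chain rule for entropy} \\
\Longrightarrow \quad H(C) &= H(T) - H(T \mid C) \;\le\; H(T)
  && \text{because } H(T \mid C) \ge 0
\end{align*}
```

Equality holds only when $f$ is one-to-one on the tokens that actually occur, so any mapping that merges tokens cannot raise entropy and will usually lower it. How large a drop to expect from merging alone depends on the token distribution and the number of categories, which is why a comparison against random mappings to the same category set is the informative test.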

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting a methodological gap in our validation of the entropy reduction. We agree that demonstrating the reduction exceeds what would be expected from binning alone is essential to support the claim of distilling coherent reasoning signals. We will add the requested control experiment in the revision.

Referee: [Abstract, and presumably the results/validation section describing the entropy calculation] The central claim that aggregation 'distills a clear signal from noisy token data' and reduces explanation uncertainty by over 50% is load-bearing for validating CodeQ, yet no control is reported (e.g., random or frequency-preserving permuted mappings from tokens to the same category set). Any collapse from a large token vocabulary to a small number of categories necessarily lowers entropy by construction; without the control, it is impossible to separate a binning artifact from genuine distillation of coherent reasoning.

Authors: We acknowledge the validity of this concern. The reported >50% entropy reduction is intended to show that CodeQ mappings capture non-random structure, but without a baseline using random or permuted token-to-category assignments (while preserving the category set and ideally frequency distribution), the reduction could partly result from vocabulary compression. In the revised manuscript we will add a control: (1) generate random mappings from the original token vocabulary to the same category set, (2) compute the resulting aggregate distributions and Shannon entropy, and (3) compare the observed reduction against this null distribution across multiple random seeds. We will report the statistical significance of the difference and update the abstract and results section to reflect the control. This directly addresses the load-bearing claim.

Revision: yes
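The three-step control described in this response can be sketched as a small permutation test. Everything below is an assumption for illustration: the toy scores and mapping, the function names, the choice to shuffle category labels across tokens (which preserves how many tokens each category receives), and the one-sided empirical p-value with add-one smoothing. It is not the authors' planned analysis.

```python
import math
import random
from collections import Counter

# Made-up token-level rationale scores and a hypothetical token -> category table.
TOKEN_SCORES = {
    "def": 1.0, "return": 1.0, "    ": 2.0, "\n": 2.0,
    "x": 0.5, "y": 0.5, "+": 0.5, "(": 0.5,
}
MAPPING = {
    "def": "keyword", "return": "keyword",
    "    ": "indentation", "\n": "indentation",
    "x": "identifier", "y": "identifier",
    "+": "operator", "(": "punctuation",
}


def entropy_bits(weights):
    """Shannon entropy in bits of non-negative weights, normalized internally."""
    weights = list(weights)
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def category_entropy(token_scores, mapping):
    """Entropy of the category distribution obtained by summing token scores."""
    aggregated = Counter()
    for token, score in token_scores.items():
        aggregated[mapping[token]] += score
    return entropy_bits(aggregated.values())


def null_entropies(token_scores, mapping, n_seeds=200):
    """Step (1) and (2): entropies under frequency-preserving random mappings."""
    tokens = list(mapping)
    labels = [mapping[t] for t in tokens]
    results = []
    for seed in range(n_seeds):
        shuffled = labels[:]
        random.Random(seed).shuffle(shuffled)  # one seeded shuffle per replicate
        results.append(category_entropy(token_scores, dict(zip(tokens, shuffled))))
    return results


def empirical_p_value(observed, null):
    """Step (3): share of null entropies at or below the observed one."""
    return (1 + sum(h <= observed for h in null)) / (1 + len(null))


observed = category_entropy(TOKEN_SCORES, MAPPING)
null = null_entropies(TOKEN_SCORES, MAPPING, n_seeds=200)
p_value = empirical_p_value(observed, null)
print(f"observed entropy: {observed:.3f} bits")
print(f"null mean:        {sum(null) / len(null):.3f} bits")
print(f"empirical p:      {p_value:.3f}")
```

A small p-value would indicate that the real mapping concentrates rationale mass more than a random grouping of the same shape does. With a realistic vocabulary the null distribution would be built from many more tokens and seeds than this toy uses.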

Circularity Check

0 steps flagged

No significant circularity; empirical validation stands independently

The paper presents an empirical framework for mapping token rationales to programming categories, followed by aggregation, entropy measurement, and a 37-participant user study. No equations, fitted parameters, or first-principles derivations appear in the text. The entropy reduction is reported as an observed outcome of the aggregation step rather than a constructed equivalence or prediction that reduces to the input by definition. The work relies on external benchmarks (user study) and does not invoke self-citations as load-bearing premises. This matches the default case of a self-contained empirical study with no reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is abstract-only; the central claim rests on the assumption that token-to-category mapping is faithful and that aggregation yields a true signal rather than noise.

axioms (1)

domain assumption: Token-level rationales from post-hoc methods can be reliably grouped into high-level programming categories without distorting model behavior.
This mapping is the core operation of CodeQ and is invoked to justify the statistical analyses.

invented entities (1)

CodeQ framework (no independent evidence)
purpose: To convert local token rationales into global code-category explanations
Newly introduced construct whose validity is asserted by the entropy reduction and user study.

pith-pipeline@v0.9.0 · 5811 in / 1271 out tokens · 50375 ms · 2026-05-22T23:38:41.941256+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "mapping token-level rationales to high-level programming categories... reducing explanation uncertainty (Shannon entropy) by over 50%"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "global interpretability tensor Φ... concept-level rationale matrix"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 15 internal anchors

[1] Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. Technical Report. Anthropic. https://www.anthropic.com/news/claude-3-family
[2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:cs.CL/2212.08073 https://arxiv.org/abs/2212.08073
[3–4] Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju. An Empirical Exploration of Trust Dynamics in LLM Supply Chains. arXiv:cs.HC/2405.16310 https://arxiv.org/abs/2405.16310
[5] Sebastian Baltes and Paul Ralph. 2021. Sampling in Software Engineering Research: A Critical Review and Guidelines. arXiv:2002.07764
[6] Jasmijn Bastings and Katja Filippova. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? (2020), 149–155. https://doi.org/10.18653/v1/2020.blackboxnlp-1.14 arXiv:2010.05607
[7] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. 2020. Language Models are Few-Shot Learners. arXiv:cs.CL/2005.14165 https://arxiv.org/abs/2005.14165
[9] Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, et al. 2020. On Identifiability in Transformers. arXiv:cs.CL/1908.04211
[10] Max Brunsfeld, Andrew Hlynskyi, Patrick Thomson, Josh Vera, Phil Turnbull, et al. 2023. tree-sitter/tree-sitter: v0.20.8. https://doi.org/10.5281/zenodo.7798573
[11] Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, et al. 2023. Rethink reporting of evaluation results in AI. 380, 6641 (2023), 136–138. https://doi.org/10.1126/science.adf6369
[12] Kathy Charmaz. 2006. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. SAGE Publications Inc
[13] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. 2018. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. arXiv:cs.LG/1802.07814 https://arxiv.org/abs/1802.07814
[14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 http://arxiv.org/abs/2107.03374
[15–16] Zixi Chen, Varshini Subhash, Marton Havasi, Weiwei Pan, and Finale Doshi-Velez. What Makes a Good Explanation?: A Harmonized View of Properties of Explanations. http://arxiv.org/abs/2211.05667 arXiv:2211.05667 [cs]
[17] Jürgen Cito, Isil Dillig, Vijayaraghavan Murali, and Satish Chandra. 2022. Counterfactual explanations for models of code. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice. ACM, Pittsburgh Pennsylvania, 125–134. https://doi.org/10.1145/3510457.3513081
[18] Colin B. Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, et al. [n. d.]. Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy. arXiv:2109.08780 [cs] http://arxiv.org/abs/2109.08780
[19] Sudipta Dey and Tathagata Roy Chowdhury. 2024. A Comparative Survey of SHAP and LIME: Explaining Machine Learning Models for Transparent AI. International Journal of Innovative Research in Education 11 (11 2024), 827–835
[20] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017). https://arxiv.org/abs/1702.08608
[21] Fabian Fagerholm, Michael Felderer, Davide Fucci, Michael Unterkalmsteiner, Bogdan Marculescu, et al. 2022. Cognition in Software Engineering: A Taxonomy and Survey of a Half-Century of Research. arXiv:cs.SE/2201.05551 https://arxiv.org/abs/2201.05551
[22] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of Programs from Large Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1469–1481. https://doi.org/10.1109/ICSE48619.2023.00128
[23] Amirata Ghorbani, James Wexler, James Zou, and Been Kim. 2019. Towards automatic concept-based explanations. Curran Associates Inc., Red Hook, NY, USA
[24] Hugging Face. 2022. CodeParrot. https://huggingface.co/codeparrot. Accessed: 2024-07-23
[25] Alon Jacovi and Yoav Goldberg. 2020. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? arXiv:cs.CL/2004.03685
[26] Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. https://doi.org/10.48550/arXiv.1902.10186 arXiv:1902.10186 [cs]
[27–28] Dipin Khati, Yijin Liu, David N. Palacio, Yixuan Zhang, and Denys Poshyvanyk. Mapping the Trust Terrain: LLMs in Software Engineering – Insights and Perspectives. arXiv:cs.SE/2503.13793 https://arxiv.org/abs/2503.13793
[29] SEMERU Lab. 2024. Code Rationale. https://github.com/WM-SEMERU/code-rationales/. Accessed: 2025-05-19
[30] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, et al. 2019. Human Evaluation of Models Built for Interpretability. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7 (Oct. 2019), 59–67. https://doi.org/10.1609/hcomp.v7i1.5280
[31] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, et al. 2023. StarCoder: may the source be with you! arXiv:cs.CL/2305.06161 https://arxiv.org/abs/2305.06161
[32] Zheng Li, Fuxiang Sun, Haifeng Wang, Yifan Ding, Yong Liu, et al. 2021. CLACER: A Deep Learning-based Compilation Error Classification Method for Novice Students’ Programs. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC). 74–83. https://doi.org/10.1109/COMPSAC51774.2021.00022
[33–34] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, et al. Holistic Evaluation of Language Models. arXiv:cs.CL/2211.09110 https://arxiv.org/abs/2211.09110
[35] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. 2021. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 23, 1 (2021). https://doi.org/10.3390/e23010018
[36] Zachary C. Lipton. 2017. The Mythos of Model Interpretability. arXiv:1606.03490
[37] Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2023. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. arXiv:cs.SE/2308.04838
[38] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). Curran Associates Inc., 4765–4774
[39] Marina Meilă. 2007. Comparing clusterings—an information based distance. Journal of Multivariate Analysis 98, 5 (2007), 873–895. https://doi.org/10.1016/j.jmva.2006.11.013
[40–41] Ahmad Haji Mohammadkhani, Chakkrit Tantithamthavorn, and Hadi Hemmati. Explainable AI for Pre-Trained Code Models: What Do They Learn? When They Do Not Work?. In Proceedings of the 23rd IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 1–11
[42] Akash Kumar Mohankumar, Preksha Nema, Sharan Narasimhan, Mitesh M. Khapra, Balaji Vasan Srinivasan, et al. 2020. Towards Transparent and Explainable Attention Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4206–4216. https://doi.org/10.18653/v1/2020.acl-main.387
[43] John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, et al. 2025. How much do language models memorize? arXiv:cs.CL/2505.24832 https://arxiv.org/abs/2505.24832
[44] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, et al. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv:cs.CL/1904.01038 https://arxiv.org/abs/1904.01038
[45] David N. Palacio, Daniel Rodriguez-Cardenas, Alejandro Velasco, Dipin Khati, Kevin Moran, et al. 2024. Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations. arXiv:cs.SE/2407.08983 https://arxiv.org/abs/2407.08983
[46] David N. Palacio, Alejandro Velasco, Nathan Cooper, Alvaro Rodriguez, Kevin Moran, et al. 2024. Toward a Theory of Causation for Interpreting Neural Code Models. IEEE Transactions on Software Engineering 50, 5 (May 2024), 1215–1243. https://doi.org/10.1109/TSE.2024.3379943 arXiv:2302.03788 [cs, stat]
[47] Tobias Peters and Roel Visser. 2023. The Importance of Distrust in AI. https://doi.org/10.48550/arXiv.2307.13601
[48] Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. 2021. Manipulating and Measuring Model Interpretability. http://arxiv.org/abs/1802.07810 arXiv:1802.07810 [cs]
[49] Qualtrics. 2024. Qualtrics XM. https://www.qualtrics.com/. Accessed: 2024-02-13
[50] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, et al. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:cs.SE/2009.10297 https://arxiv.org/abs/2009.10297
[51] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. https://doi.org/10.48550/arXiv.1602.04938 arXiv:1602.04938 [cs, stat]
[52–53] Marco Tulio Ribeiro, Tongshuang Sherry Wu, Carlos Guestrin, and Sameer Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:218551201
[54]

Palacio, Dipin Khati, Henry Burke, and Denys Poshyvanyk

Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, and Denys Poshyvanyk. 2023. Benchmarking Causal Study to Interpret Large Lan- guage Models for Source Code. In2023 IEEE International Conference on Soft- ware Maintenance and Evolution (ICSME). 329–334. https://doi.org/10.1109/ ICSME58846.2023.00040

work page arXiv 2023
[55]

Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? https: //doi.org/10.48550/arXiv.1906.03731 arXiv:1906.03731 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1906.03731 2019
[56]

C. E. Shannon. 1948. A Mathematical Theory of Communication.Bell Sys- tem Technical Journal27, 3 (1948), 379–423. https://doi.org/10.1002/j.1538- 7305.1948.tb01338.x arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/j.1538- 7305.1948.tb01338.x

work page doi:10.1002/j.1538- 1948
[57]

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2019. Learning Important Features Through Propagating Activation Differences. https: //doi.org/10.48550/arXiv.1704.02685 arXiv:1704.02685 [cs]. Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation ICSE ’26, April 12–18, 2026, Rio de Janeiro, Brazil

work page doi:10.48550/arxiv.1704.02685 2019
[58]

Kacper Sokol and Peter Flach. 2019. Desiderata for Interpretability: Explaining Decision Tree Predictions with Counterfactuals.Proceedings of the AAAI Confer- ence on Artificial Intelligence33 (07 2019), 10035–10036. https://doi.org/10.1609/ aaai.v33i01.330110035

work page 2019
[59]

Management Solutions. 2022. Explainable Artificial Intelligence (XAI). Chal- lenges of model interpretability. https://www.managementsolutions.com/sites/ default/files/minisite/static/22959b0f-b3da-47c8-9d5c-80ec3216552b/iax/pdf/ explainable-artificial-intelligence-en-04.pdf. Accessed: 18 June 2024

work page 2022
[60]

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, et al . 2024. TrustLLM: Trustworthiness in Large Language Models. arXiv:cs.CL/2401.05561 https://arxiv.org/abs/2401.05561

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Simon Thorne. 2024. Understanding the Interplay Between Trust, Reliability, and Human Factors in the Age of Generative AI.International Journal of Simulation: Systems, Science & technology(5 May 2024). https://doi.org/10.5013/ijssst.a.25.01. 10

work page doi:10.5013/ijssst.a.25.01 2024
[62]

Michele Tufano, Shao Kun Deng, Neel Sundaresan, and Alexey Svyatkovskiy. 2022. Methods2Test: a dataset of focal methods mapped to test cases. InProceedings of the 19th International Conference on Mining Software Repositories (MSR ’22). ACM, 299–303. https://doi.org/10.1145/3524842.3528009

[63]

Keyon Vafa, Yuntian Deng, David M. Blei, and Alexander M. Rush. 2021. Rationales for Sequential Predictions. arXiv:2109.06387 [cs] http://arxiv.org/abs/2109.06387

[64]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. Attention Is All You Need. arXiv:1706.03762 [cs] http://arxiv.org/abs/1706.03762

[66]

David Gray Widder, Laura Dabbish, James D. Herbsleb, Alexandra Holloway, and Scott Davidoff. 2021. Trust in Collaborative Automation in High Stakes Software Engineering Work: A Case Study at NASA. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 184, 1...

[67]

Ralph A. Wiggins. 1978. Minimum entropy deconvolution. Geoexploration 16, 1 (1978), 21–35. https://doi.org/10.1016/0016-7142(78)90005-4

[68]

Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, et al. 2012. Experimentation in Software Engineering. Springer Science & Business Media.

