From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
Pith reviewed 2026-05-21 08:43 UTC · model grok-4.3
The pith
The quality of LLM-generated failure explanations depends causally on the composition of the debugging context provided.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study partitions the available debugging information into distinct context compositions and scores the resulting LLM outputs with an LLM-as-a-judge on six criteria for faithfulness and actionability. It shows that explanation quality is causally affected by context composition: evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are also associated with higher downstream repair pass rates.
What carries the argument
Context partitioning, the systematic construction of 93 distinct debugging contexts from program slices and other artifacts, combined with LLM-as-a-judge scoring against human-validated criteria.
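One way to picture context partitioning is as an enumeration over subsets of debugging artifacts. The sketch below is a minimal illustration, not the paper's procedure: the artifact names follow the kinds of artifacts mentioned on this page (code slices, test cases, error messages, stack traces), while the subset scheme and the helper names `context_configurations` and `build_prompt` are assumptions. It does not reproduce the paper's 93 configurations.

```python
from itertools import combinations

# Hypothetical artifact types; the paper's actual partitioning may differ.
ARTIFACTS = ["program_slice", "failing_test", "error_message", "stack_trace"]


def context_configurations(artifacts):
    """Yield every non-empty subset of artifacts as one context configuration."""
    for size in range(1, len(artifacts) + 1):
        for subset in combinations(artifacts, size):
            yield subset


def build_prompt(config, bug_artifacts):
    """Concatenate only the selected artifacts into one debugging context."""
    sections = [f"## {name}\n{bug_artifacts[name]}" for name in config]
    return "\n\n".join(sections)


# Four artifact types give 2**4 - 1 non-empty configurations.
configs = list(context_configurations(ARTIFACTS))
```

Each configuration would then be rendered with `build_prompt` for a given bug and sent to the model under test, so that differences in explanation quality can be attributed to the artifacts included.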
If this is right
- Higher explanation scores correlate with higher success rates on subsequent bug repair tasks.
- Overly large contexts produce vague explanations that lack causal specificity.
- Evidence-rich failure-specific artifacts improve the action-oriented usefulness of the explanations.
- Low-scoring explanations can reduce repair performance below the level achieved with no explanation at all.
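The quartile comparison behind these implications can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the record layout (an explanation score paired with a repair pass flag), the function names, and the use of `statistics.quantiles` for the cut points are not taken from the paper.

```python
from statistics import quantiles


def pass_rate_by_quartile(records):
    """Group (explanation_score, repair_passed) pairs by score quartile.

    Returns the repair pass rate per quartile, or None for an empty quartile.
    """
    scores = [score for score, _ in records]
    q1, q2, q3 = quantiles(scores, n=4)  # three cut points
    buckets = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for score, passed in records:
        if score <= q1:
            buckets["Q1"].append(passed)
        elif score <= q2:
            buckets["Q2"].append(passed)
        elif score <= q3:
            buckets["Q3"].append(passed)
        else:
            buckets["Q4"].append(passed)
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in buckets.items()
    }


def quartiles_below_baseline(rates, baseline_rate):
    """List quartiles whose pass rate falls below the no-explanation baseline."""
    return [
        name
        for name, rate in rates.items()
        if rate is not None and rate < baseline_rate
    ]
```

Comparing each quartile against a separately measured no-explanation pass rate is what would surface the last point above, where low-scoring explanations do worse than giving the repair model no explanation.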
Where Pith is reading between the lines
- Debugging assistants could benefit from automated selection of relevant slices rather than feeding all available artifacts to the model.
- Treating explanation quality as an explicit optimization target may improve reliability in other LLM-assisted software engineering workflows.
- The partitioning technique offers a way to isolate which artifacts drive diagnostic performance when applying LLMs to fault localization or root-cause analysis.
Load-bearing premise
The 93 context configurations and the selected real bugs are representative enough for the observed quality differences to generalize beyond the tested dataset and models.
What would settle it
A replication on a fresh collection of bugs or with additional LLMs that fails to reproduce the reported correlation between context type and both explanation scores and repair pass rates.
Original abstract
Large language model (LLM)-based debugging systems can generate failure explanations, but these explanations may be incomplete or incorrect. Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). We investigate how explanation quality is affected by various LLM context configurations. Existing work predominantly treats LLM-generated failure explanations as an ad hoc by-product of debugging or repair workflows, using generic prompting over undifferentiated artifacts such as code, tests, and error messages rather than targeting explanations as a first-class output with dedicated quality assessment. Consequently, existing approaches provide limited support for assessing whether these explanations capture the underlying fault-error-failure mechanism and for actionable next steps, and most techniques instead prioritize task success (e.g., patch correctness or review quality) over the explicit causal explanation quality. We systematically vary the debugging information to study how distinct context compositions affect the quality of LLM-generated failure explanations. Across 93 context configurations on real bugs and three economically viable models (gpt-5-mini, DeepSeek-V3.2, and Grok-4.1-fast), we evaluate explanations with six criteria and validate the LLM-as-a-judge scores against human ratings in a user study. Our results indicate that explanation quality is causally affected by context composition. Evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are associated with higher downstream repair pass rates and, for some models, with fixes that are closer to the reference minimal fixes. In contrast, low-score quartiles can even underperform the no-explanation baseline. Reproduction package is publicly available.
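The abstract reports that some fixes are "closer to the reference minimal fixes" without naming the distance measure here. As a hedged illustration only, character-level edit distance is one common way to quantify such closeness. The choice of Levenshtein distance and the normalization below are assumptions, not the paper's confirmed metric.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(candidate_fix, reference_fix):
    """Scale edit distance to [0, 1]; 0.0 means the two fixes are identical."""
    longest = max(len(candidate_fix), len(reference_fix))
    if longest == 0:
        return 0.0
    return levenshtein(candidate_fix, reference_fix) / longest
```

Under this assumed measure, a lower normalized distance between a generated patch and the reference patch would count as a fix that is closer to minimal.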
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-generated failure explanations for debugging are causally affected by context composition. Using 93 systematically varied context configurations on real bugs and three models (gpt-5-mini, DeepSeek-V3.2, Grok-4.1-fast), it evaluates explanations on six quality criteria, validates LLM-as-a-judge scores via human ratings, and links higher explanation scores to improved downstream repair pass rates (and, for some models, closer-to-minimal fixes). Evidence-rich, failure-specific artifacts outperform overly large contexts, which produce vaguer explanations; low-score explanations can underperform a no-explanation baseline.
Significance. If the results hold, the work supplies concrete, actionable guidance on context design for LLM debugging systems, moving beyond ad-hoc prompting. Credit is due for the systematic variation across 93 configurations, evaluation on three economically viable models, human validation of the LLM judge, explicit downstream repair-rate measurements, and the public reproduction package.
major comments (1)
- [Abstract and experimental results sections] The central causal claim—that context composition affects explanation quality and repair outcomes across economically viable models—rests on the assumption that the 93 configurations and chosen real bugs adequately sample failure mechanisms and artifact types. The abstract (and presumably the experimental sections) does not detail exact bug-selection criteria or statistical controls for representativeness; without this, the observed quality differences and quartile-repair correlations risk being dataset-specific rather than general.
minor comments (1)
- [Abstract and §4] Clarify the exact model identifiers (e.g., whether 'gpt-5-mini' is a typo or a specific variant) and ensure that all six evaluation criteria are defined with explicit rubrics or examples in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the systematic variation across 93 configurations, the use of multiple models, human validation of the LLM judge, and the downstream repair measurements. We address the major comment below.
Referee: [Abstract and experimental results sections] The central causal claim—that context composition affects explanation quality and repair outcomes across economically viable models—rests on the assumption that the 93 configurations and chosen real bugs adequately sample failure mechanisms and artifact types. The abstract (and presumably the experimental sections) does not detail exact bug-selection criteria or statistical controls for representativeness; without this, the observed quality differences and quartile-repair correlations risk being dataset-specific rather than general.
Authors: We agree that the abstract provides limited detail on bug selection and that the experimental sections would benefit from greater transparency on this point. The manuscript describes the bugs as real bugs drawn from open-source projects with failing tests and ground-truth fixes, and the 93 configurations systematically vary context elements such as code slices, test cases, error messages, and stack traces. To strengthen the presentation, we will revise the experimental results section to include a dedicated paragraph on dataset construction: bugs were selected from established benchmarks to include a range of failure mechanisms (e.g., null dereferences, incorrect conditionals, resource management errors) and artifact types, with explicit criteria for inclusion (reproducible failures, availability of minimal patches). We will also add a limitations paragraph noting that, while the design supports causal claims about context composition within the sampled space and across three models, full statistical representativeness of all possible software failures would require a substantially larger corpus. These changes will clarify the scope of the claims without altering the core results.
Revision: yes
Circularity Check
No significant circularity: empirical claims grounded in external human ratings and repair outcomes
The paper performs an empirical study that varies 93 context configurations across real bugs and three models, scores explanations on six criteria, validates LLM-as-a-judge outputs via a separate human user study, and correlates scores with measured downstream repair pass rates. All load-bearing claims (context composition causally affects quality; higher-score quartiles link to better repair) are supported by these observed data and external benchmarks rather than any derivation, fitted parameter, or self-referential definition. No equations, self-citations, or ansatzes are invoked to force the results; the reproduction package further enables independent verification. This matches the default expectation of a non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The chosen real bugs and 93 context configurations are representative of typical debugging scenarios.
- Domain assumption: LLM-as-a-judge scores on the six criteria align sufficiently with human judgment for the quality assessment.
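The second assumption is testable by measuring agreement between judge scores and human ratings on the same explanations. The sketch below uses Spearman rank correlation as one plausible agreement statistic. The page does not say which statistic the user study used, so both the choice of Spearman and the function names are assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(judge_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(judge_scores), average_ranks(human_ratings))
```

A value near 1 would support treating the judge as a stand-in for human raters on that criterion. A low or negative value would undercut the second assumption for that criterion.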