arxiv: 2604.18309 · v2 · pith:QEGTRZSKnew · submitted 2026-04-20 · 💻 cs.SE

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

Pith reviewed 2026-05-21 08:43 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM-generated explanations · failure explanations · context partitioning · program slices · LLM-as-a-judge · software debugging · bug repair · explanation quality

The pith

The quality of LLM-generated failure explanations depends causally on the composition of the debugging context provided.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how different assemblies of debugging information affect the causal accuracy and usefulness of explanations that large language models produce for software failures. It runs experiments across 93 context configurations built from real bugs, varying which artifacts such as program slices, tests, and error messages are included. Focused, failure-specific evidence improves the explanations while very large undifferentiated contexts tend to produce vague ones. These quality differences also track with success rates on later repair tasks, and the automated scores receive validation from human raters. The work therefore treats explanation quality itself as a measurable first-class output rather than an incidental byproduct of debugging workflows.

Core claim

By partitioning the available debugging information into distinct context compositions and scoring the resulting LLM outputs with an LLM-as-a-judge on six criteria for faithfulness and actionability, the study shows that explanation quality is causally affected by context composition. Evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are also associated with higher downstream repair pass rates.

What carries the argument

Context partitioning, the systematic construction of 93 distinct debugging contexts from program slices and other artifacts, combined with LLM-as-a-judge scoring against human-validated criteria.
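
As a rough illustration of what context partitioning can look like in code, the sketch below enumerates every non-empty combination of a few artifact types and assembles a labelled context for each. The artifact names, the `context_compositions` and `build_context` helpers, and the subset-enumeration scheme are all assumptions made for illustration; the paper's actual construction of its 93 configurations is not described on this page.

```python
from itertools import combinations

# Hypothetical artifact types (assumed for illustration, not the paper's list).
ARTIFACTS = ("program_slice", "failing_test", "error_message", "stack_trace")


def context_compositions(artifacts):
    """Return every non-empty subset of the artifact types, smallest first."""
    return [
        combo
        for size in range(1, len(artifacts) + 1)
        for combo in combinations(artifacts, size)
    ]


def build_context(composition, artifact_text):
    """Concatenate the selected artifacts into one labelled context string."""
    return "\n\n".join(
        f"## {name}\n{artifact_text[name]}" for name in composition
    )


compositions = context_compositions(ARTIFACTS)
print(len(compositions))  # 2**4 - 1 = 15 compositions for four artifact types
```

Each composition can then be sent to the model with the same prompt, so that differences in the resulting explanations can be attributed to the context rather than to the instructions.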

If this is right

  • Higher explanation scores correlate with higher success rates on subsequent bug repair tasks.
  • Overly large contexts produce vague explanations that lack causal specificity.
  • Evidence-rich failure-specific artifacts improve the action-oriented usefulness of the explanations.
  • Low-scoring explanations can reduce repair performance below the level achieved with no explanation at all.
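
The quartile comparison behind these points can be sketched as follows. This is a minimal sketch on made-up toy data, not the paper's analysis code: `pass_rate_by_quartile` and `quartiles_below_baseline` are hypothetical helpers, and the records and baseline value are invented purely to show the mechanics.

```python
from statistics import quantiles


def pass_rate_by_quartile(records):
    """records: list of (explanation_score, repair_passed) pairs.

    Returns the repair pass rate per score quartile, Q1 being the lowest.
    """
    scores = [score for score, _ in records]
    cut1, cut2, cut3 = quantiles(scores, n=4)  # three quartile cut points
    buckets = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for score, passed in records:
        if score <= cut1:
            label = "Q1"
        elif score <= cut2:
            label = "Q2"
        elif score <= cut3:
            label = "Q3"
        else:
            label = "Q4"
        buckets[label].append(passed)
    return {
        label: (sum(outcomes) / len(outcomes) if outcomes else None)
        for label, outcomes in buckets.items()
    }


def quartiles_below_baseline(rates, baseline_rate):
    """List the quartiles whose pass rate falls below a no-explanation baseline."""
    return [
        label
        for label, rate in rates.items()
        if rate is not None and rate < baseline_rate
    ]


# Toy data (invented): eight repair attempts with their explanation scores.
toy_records = [
    (1, False), (2, False), (3, False), (4, True),
    (5, True), (6, False), (7, True), (8, True),
]
rates = pass_rate_by_quartile(toy_records)
print(rates)
print(quartiles_below_baseline(rates, baseline_rate=0.25))
```

Comparing each quartile against the no-explanation baseline, rather than only against the other quartiles, is what makes it possible to say that a poor explanation is worse than none.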

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Debugging assistants could benefit from automated selection of relevant slices rather than feeding all available artifacts to the model.
  • Treating explanation quality as an explicit optimization target may improve reliability in other LLM-assisted software engineering workflows.
  • The partitioning technique offers a way to isolate which artifacts drive diagnostic performance when applying LLMs to fault localization or root-cause analysis.

Load-bearing premise

The 93 context configurations and the selected real bugs are representative enough for the observed quality differences to generalize beyond the tested dataset and models.

What would settle it

A replication on a fresh collection of bugs or with additional LLMs that fails to reproduce the reported correlation between context type and both explanation scores and repair pass rates.

Figures

Figures reproduced from arXiv: 2604.18309 by Christian Medeiros Adriano, Holger Giese, and Julius Porbeck (Hasso Plattner Institute, University of Potsdam, Germany).

Figure 1: Study overview: from 12 Defects4J defects translated from Java to Python (contamination control: Java …

Figure 2: Distributions of expected total explanation scores

Figure 3: Distributions of configuration-level expected total explanation scores

Figure 4: Minimality effect sizes (Δ = Q4–Q1) for passing fixes in composed batches. Blue filled circles denote two-way batches; orange open circles denote three-way batches. Whiskers show Bonferroni-adjusted defect-bootstrap CIs (𝑚 = 2 per model). Positive values indicate that higher-quality explanations coincide with a higher minimal-fix rate.

Figure 5: Effect sizes comparing high- vs. low-quality explanations on passing fixes in composed batches ( …
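
The Figure 4 caption mentions an effect size Δ = Q4–Q1 with Bonferroni-adjusted defect-bootstrap confidence intervals. A minimal sketch of one way to compute such an interval is shown below, assuming a percentile bootstrap that resamples whole defects and pools their outcomes. The data layout, the function names, and the percentile method are assumptions; the paper's exact procedure may differ.

```python
import random


def delta_q4_q1(defects):
    """Difference in pooled pass rate between top- and bottom-quartile explanations.

    defects: list of dicts with "q4" and "q1" lists of 0/1 outcomes.
    Assumes every defect contributes at least one outcome to each list.
    """
    q4 = [x for d in defects for x in d["q4"]]
    q1 = [x for d in defects for x in d["q1"]]
    return sum(q4) / len(q4) - sum(q1) / len(q1)


def bootstrap_ci(defects, m=2, alpha=0.05, n_boot=2000, seed=0):
    """Percentile bootstrap CI for delta, resampling defects with replacement.

    The Bonferroni adjustment divides alpha by m, the number of comparisons.
    """
    rng = random.Random(seed)
    adjusted_alpha = alpha / m
    stats = sorted(
        delta_q4_q1([rng.choice(defects) for _ in defects])
        for _ in range(n_boot)
    )
    lower = stats[int((adjusted_alpha / 2) * n_boot)]
    upper = stats[int((1 - adjusted_alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (invented): three defects with per-quartile repair outcomes.
toy_defects = [
    {"q4": [1, 1], "q1": [0, 1]},
    {"q4": [1, 0], "q1": [0, 0]},
    {"q4": [1], "q1": [0]},
]
print(delta_q4_q1(toy_defects))
print(bootstrap_ci(toy_defects))
```

Resampling at the defect level, rather than at the level of individual repair attempts, keeps correlated attempts on the same defect together, which is why the resulting intervals tend to be wider and more honest.
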
Original abstract

Large language model (LLM)-based debugging systems can generate failure explanations, but these explanations may be incomplete or incorrect. Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). We investigate how explanation quality is affected by various LLM context configurations. Existing work predominantly treats LLM-generated failure explanations as an ad hoc by-product of debugging or repair workflows, using generic prompting over undifferentiated artifacts such as code, tests, and error messages rather than targeting explanations as a first-class output with dedicated quality assessment. Consequently, existing approaches provide limited support for assessing whether these explanations capture the underlying fault-error-failure mechanism and for actionable next steps, and most techniques instead prioritize task success (e.g., patch correctness or review quality) over the explicit causal explanation quality. We systematically vary the debugging information to study how distinct context compositions affect the quality of LLM-generated failure explanations. Across 93 context configurations on real bugs and three economically viable models (gpt-5-mini, DeepSeek-V3.2, and Grok-4.1-fast), we evaluate explanations with six criteria and validate the LLM-as-a-judge scores against human ratings in a user study. Our results indicate that explanation quality is causally affected by context composition. Evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are associated with higher downstream repair pass rates and, for some models, with fixes that are closer to the reference minimal fixes. In contrast, low-score quartiles can even underperform the no-explanation baseline. Reproduction package is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that LLM-generated failure explanations for debugging are causally affected by context composition. Using 93 systematically varied context configurations on real bugs and three models (gpt-5-mini, DeepSeek-V3.2, Grok-4.1-fast), it evaluates explanations on six quality criteria, validates LLM-as-a-judge scores via human ratings, and links higher explanation scores to improved downstream repair pass rates (and, for some models, closer-to-minimal fixes). Evidence-rich, failure-specific artifacts outperform overly large contexts, which produce vaguer explanations; low-score explanations can underperform a no-explanation baseline.

Significance. If the results hold, the work supplies concrete, actionable guidance on context design for LLM debugging systems, moving beyond ad-hoc prompting. Credit is due for the systematic variation across 93 configurations, evaluation on three economically viable models, human validation of the LLM judge, explicit downstream repair-rate measurements, and the public reproduction package.

major comments (1)
  1. [Abstract and experimental results sections] The central causal claim—that context composition affects explanation quality and repair outcomes across economically viable models—rests on the assumption that the 93 configurations and chosen real bugs adequately sample failure mechanisms and artifact types. The abstract (and presumably the experimental sections) does not detail exact bug-selection criteria or statistical controls for representativeness; without this, the observed quality differences and quartile-repair correlations risk being dataset-specific rather than general.
minor comments (1)
  1. [Abstract and §4] Clarify the exact model identifiers (e.g., whether 'gpt-5-mini' is a typo or specific variant) and ensure all six evaluation criteria are defined with explicit rubrics or examples in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the systematic variation across 93 configurations, the use of multiple models, human validation of the LLM judge, and the downstream repair measurements. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract and experimental results sections] The central causal claim—that context composition affects explanation quality and repair outcomes across economically viable models—rests on the assumption that the 93 configurations and chosen real bugs adequately sample failure mechanisms and artifact types. The abstract (and presumably the experimental sections) does not detail exact bug-selection criteria or statistical controls for representativeness; without this, the observed quality differences and quartile-repair correlations risk being dataset-specific rather than general.

    Authors: We agree that the abstract provides limited detail on bug selection and that the experimental sections would benefit from greater transparency on this point. The manuscript describes the bugs as real bugs drawn from open-source projects with failing tests and ground-truth fixes, and the 93 configurations systematically vary context elements such as code slices, test cases, error messages, and stack traces. To strengthen the presentation, we will revise the experimental results section to include a dedicated paragraph on dataset construction: bugs were selected from established benchmarks to include a range of failure mechanisms (e.g., null dereferences, incorrect conditionals, resource management errors) and artifact types, with explicit criteria for inclusion (reproducible failures, availability of minimal patches). We will also add a limitations paragraph noting that, while the design supports causal claims about context composition within the sampled space and across three models, full statistical representativeness of all possible software failures would require a substantially larger corpus. These changes will clarify the scope of the claims without altering the core results.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claims grounded in external human ratings and repair outcomes

full rationale

The paper performs an empirical study that varies 93 context configurations across real bugs and three models, scores explanations on six criteria, validates LLM-as-a-judge outputs via a separate human user study, and correlates scores with measured downstream repair pass rates. All load-bearing claims (context composition causally affects quality; higher-score quartiles link to better repair) are supported by these observed data and external benchmarks rather than any derivation, fitted parameter, or self-referential definition. No equations, self-citations, or ansatzes are invoked to force the results; the reproduction package further enables independent verification. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the selected bugs and context partitions plus the validity of the six evaluation criteria; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The chosen real bugs and 93 context configurations are representative of typical debugging scenarios
    Invoked when generalizing the causal effect of context composition beyond the studied cases.
  • domain assumption LLM-as-a-judge scores on the six criteria align sufficiently with human judgment for the quality assessment
    Supported by the user study but required for treating the automated scores as reliable.
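
The second axiom can be checked with a rank correlation between judge scores and human ratings on the same explanations. The sketch below implements Spearman's coefficient from scratch on made-up ratings. Which agreement statistic the paper's user study actually reports is not stated on this page, so the choice of Spearman, the helper names, and the data are assumptions.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy ratings (invented): judge vs. human scores for six explanations.
judge_scores = [4, 2, 5, 3, 1, 4]
human_scores = [5, 2, 4, 3, 1, 4]
print(spearman(judge_scores, human_scores))
```

A rank-based statistic is a natural fit here because ordinal rubric scores from a judge model and from human raters need not share the same scale, only the same ordering.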

pith-pipeline@v0.9.0 · 5858 in / 1395 out tokens · 54406 ms · 2026-05-21T08:43:49.173538+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

  1. [1]

    Fannar Steinn Aðalsteinsson, Björn Borgar Magnússon, Mislav Milicevic, Adam Nirving Davidsson, and Chih-Hong Cheng. 2025. Rethinking code review workflows with llm assistance: An empirical study. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 488–497

  2. [2]

    Elijah Kayode Adejumo and Brittany Johnson. 2025. Explaining Code Risk in OSS: Towards LLM-Generated Fault Prediction Interpretations. arXiv:2510.06104

  3. [3]

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. 2025. Why Does the Effective Context Length of LLMs Fall Short? In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=eoln5WgrPx

  4. [4]

    Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1, 1 (2004), 11–33

  5. [5]

    DeepSeek-AI. 2026. DeepSeek-V3.2. https://huggingface.co/deepseek-ai/DeepSeek-V3.2 Hugging Face model card

  6. [6]

    Bryan Guan, Mehdi Rezagholizadeh, Tanya G. Roosta, and Peyman Passban. 2025. The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs. In First International KDD Workshop on Prompt Optimization, 2025. https://openreview.net/forum?id=QcYyYvrPNU

  8. [8]

    Maurice H. Halstead. 1977. Elements of Software Science. Elsevier North-Holland, New York, NY

  9. [9]

    Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo. 2026. LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. ACM Transactions on Software Engineering and Methodology (2026)

  10. [10]

    Tyler Holloway and Ethan R. Elenberg. 2024. On the Role of Context Granularity in LLM-Driven Program Repair. In Machine Learning for Systems Workshop (NeurIPS ’24 Workshop). NeurIPS Foundation, Vancouver, BC, Canada, 8 pages. https://neurips.cc/virtual/2024/103609 Workshop paper

  11. [11]

    René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 23rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2014). ACM, 312–315. doi:10.1145/2610384.2628055

  12. [12]

    J. Peter Kincaid, Robert P. Fishburne, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical Report Research Branch Report 8-75. Naval Technical Training Command. https://stars.library.ucf.edu/istlibrary/56/

  13. [13]

    Lucas Layman, Madeline Diep, Meiyappan Nagappan, Janice Singer, Robert Deline, and Gina Venolia. 2013. Debugging revisited: Toward understanding the debugging needs of contemporary software developers. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. IEEE, 383–392

  14. [14]

    Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 8 (1966), 707–710

  15. [15]

    Haolin Li and Michael Coblenz. 2026. A Grounded Theory of Debugging in Professional Software Engineering Practice. arXiv:2602.11435

  16. [16]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172

  17. [17]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2511–2522

  18. [18]

    Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, and Chun Zuo. 2025. Deepcrceval: Revisiting the evaluation of code review comment generation. In International Conference on Fundamental Approaches to Software Engineering. Springer, 43–64

  19. [19]

    Luca Mariotto, Christian Medeiros Adriano, Daniel Burgstahler, René Eichhorn, and Holger Giese. 2025. From Assessment to Enhancement of Pull Requests at Scale: Aligning Code Reviews with Developer Competencies Using Large Language Models. To appear

  20. [20]

    Christian Medeiros Adriano. 2022. Microtasking software failure resolution: early results. ACM SIGSOFT Software Engineering Notes 44, 1 (2022), 36–39

  21. [21]

    Martin Monperrus. 2014. A critical review of “automatic patch generation learned from human-written patches”: essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14). ACM, 234–242. doi:10.1145/2568225.2568324

  22. [22]

    OpenRouter. 2026. openai/gpt-5-mini. https://openrouter.ai/openai/gpt-5-mini Model route page

  23. [23]

    OpenRouter. 2026. x-ai/grok-4.1-fast. https://openrouter.ai/x-ai/grok-4.1-fast Model route page

  24. [24]

    Julius Porbeck. 2026. From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge. Master’s thesis. Hasso Plattner Institute

  25. [25]

    Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An Analysis of Patch Plausibility and Correctness for Generate-And-Validate Patch Generation Systems. (02 2015). doi:10.1145/2771783.2771791

  26. [26]

    Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. In Second Conference on Language Modeling. https://openreview.net/forum?id=GFPoM8Ylp8

  27. [27]

    Audrey Salmon, Katie Hammer, Eddie Antonio Santos, and Brett A Becker. 2025. Debugging Without Error Messages: How LLM Prompting Strategy Affects Programming Error Explanation Effectiveness. (2025). arXiv:2501.05706

  28. [28]

    Ezekiel Soremekun, Lukas Kirschner, Marcel Böhme, and Andreas Zeller. 2021. Locating faults with program slicing: an empirical analysis. Empirical Software Engineering 26, 3 (2021), 51

  29. [29]

    J.M. Voas and K.W. Miller. 1995. Software testability: the new verification. IEEE Software 12, 3 (1995), 17–28. doi:10.1109/52.382180

  30. [30]

    Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Armaghan Eshaghi. 2024. Beyond the limits: a survey of techniques to extend the context length in large language models. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 8299–8307

  31. [31]

    Ratnadira Widyasari, Jia Wei Ang, Truong Giang Nguyen, Neil Sharma, and David Lo. 2024. Demystifying faulty code: Step-by-step reasoning for explainable fault localization. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 568–579

  32. [32]

    James Woodward. 1989. The Causal Mechanical Model of Explanation. Philosophy of Science 56, 2 (1989), 345–363. https://conservancy.umn.edu/bitstreams/f470d3c2-57fc-4764-8c0a-88c8747acc36/download Review of Wesley C. Salmon, Scientific Explanation and the Causal Structure of the World

  33. [33]

    Jiwei Yan, Jinhao Huang, Chunrong Fang, Jun Yan, and Jian Zhang. 2024. Better debugging: Combining static analysis and llms for explainable crashing fault localization. arXiv:2408.12070

  34. [34]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623