From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

Julius Porbeck, Christian Medeiros Adriano, Holger Giese (Hasso Plattner Institute, University of Potsdam, Germany)

REVIEW 1 major objections 1 minor 34 references

The quality of LLM-generated failure explanations depends causally on the composition of the debugging context provided.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-21 08:43 UTC pith:QEGTRZSK

Load-bearing objection: Context composition affects LLM failure explanation quality with measurable repair links, but the 93 configs and bug set leave generalizability open.

arxiv 2604.18309 v2 pith:QEGTRZSK submitted 2026-04-20 cs.SE

Julius Porbeck, Christian Medeiros Adriano, Holger Giese (Hasso Plattner Institute, University of Potsdam, Germany)

classification cs.SE

keywords LLM-generated explanations, failure explanations, context partitioning, program slices, LLM-as-a-judge, software debugging, bug repair, explanation quality

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how different assemblies of debugging information affect the causal accuracy and usefulness of explanations that large language models produce for software failures. It runs experiments across 93 context configurations built from real bugs, varying which artifacts such as program slices, tests, and error messages are included. Focused, failure-specific evidence improves the explanations while very large undifferentiated contexts tend to produce vague ones. These quality differences also track with success rates on later repair tasks, and the automated scores receive validation from human raters. The work therefore treats explanation quality itself as a measurable first-class output rather than an incidental byproduct of debugging workflows.

Core claim

By partitioning the available debugging information into distinct context compositions and scoring the resulting LLM outputs with an LLM-as-a-judge on six criteria for faithfulness and actionability, the study shows that explanation quality is causally affected by context composition: evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations, with higher explanation-score quartiles associated with higher downstream repair pass rates.

What carries the argument

Context partitioning, the systematic construction of 93 distinct debugging contexts from program slices and other artifacts, combined with LLM-as-a-judge scoring against human-validated criteria.
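One way to picture context partitioning is as enumerating subsets of the available debugging artifacts and assembling one prompt context per subset. The sketch below is illustrative only: the artifact names and both helper functions are hypothetical, and plain subset enumeration does not reproduce the paper's 93 configurations, whose exact construction is not described on this page.

```python
from itertools import combinations

# Hypothetical artifact kinds; the paper's actual partition scheme is an open detail here.
ARTIFACTS = ("program_slice", "failing_test", "error_message", "stack_trace")


def context_configurations(artifacts):
    """Return every non-empty subset of artifacts, smallest subsets first."""
    configs = []
    for size in range(1, len(artifacts) + 1):
        configs.extend(combinations(artifacts, size))
    return configs


def build_context(config, artifact_text):
    """Concatenate the selected artifacts into a single prompt context."""
    return "\n\n".join(f"## {name}\n{artifact_text[name]}" for name in config)


configs = context_configurations(ARTIFACTS)
print(len(configs))  # 2**4 - 1 = 15 non-empty subsets

# Assemble the context for one two-artifact configuration from placeholder text.
example_text = {name: f"<{name} contents>" for name in ARTIFACTS}
print(build_context(("program_slice", "error_message"), example_text))
```

Holding the bug and the model fixed while only the configuration changes is what lets differences in explanation scores be attributed to context composition.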

Load-bearing premise

The 93 context configurations and the selected real bugs are representative enough for the observed quality differences to generalize beyond the tested dataset and models.

What would settle it

A replication on a fresh collection of bugs or with additional LLMs that fails to reproduce the reported correlation between context type and both explanation scores and repair pass rates.

If this is right

Higher explanation scores correlate with higher success rates on subsequent bug repair tasks.
Overly large contexts produce vague explanations that lack causal specificity.
Evidence-rich failure-specific artifacts improve the action-oriented usefulness of the explanations.
Low-scoring explanations can reduce repair performance below the level achieved with no explanation at all.
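The quartile comparison behind these points can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's analysis: the `pass_rate_by_quartile` helper, the records, and the baseline value are all assumptions made for the example.

```python
from statistics import quantiles


def pass_rate_by_quartile(records):
    """records: list of (explanation_score, repair_passed) pairs.

    Returns the repair pass rate per score quartile, where Q1 holds the
    lowest-scoring explanations and Q4 the highest.
    """
    scores = [score for score, _ in records]
    q1_cut, median_cut, q3_cut = quantiles(scores, n=4)
    buckets = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for score, passed in records:
        if score <= q1_cut:
            buckets["Q1"].append(passed)
        elif score <= median_cut:
            buckets["Q2"].append(passed)
        elif score <= q3_cut:
            buckets["Q3"].append(passed)
        else:
            buckets["Q4"].append(passed)
    return {
        name: sum(outcomes) / len(outcomes) if outcomes else float("nan")
        for name, outcomes in buckets.items()
    }


# Synthetic records and a hypothetical no-explanation baseline, for illustration only.
records = [(1, False), (2, False), (3, False), (4, True),
           (5, True), (6, False), (7, True), (8, True)]
NO_EXPLANATION_BASELINE = 0.25

rates = pass_rate_by_quartile(records)
for name, rate in rates.items():
    flag = "below baseline" if rate < NO_EXPLANATION_BASELINE else "at or above baseline"
    print(f"{name}: pass rate {rate:.2f} ({flag})")
```

In this toy data the lowest quartile falls below the assumed baseline, which is the pattern the last point describes; whether real data behaves this way is exactly what the paper measures.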

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Debugging assistants could benefit from automated selection of relevant slices rather than feeding all available artifacts to the model.
Treating explanation quality as an explicit optimization target may improve reliability in other LLM-assisted software engineering workflows.
The partitioning technique offers a way to isolate which artifacts drive diagnostic performance when applying LLMs to fault localization or root-cause analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Context composition affects LLM failure explanation quality with measurable repair links, but the 93 configs and bug set leave generalizability open.

The main thing to know is that this paper shows context composition has a real effect on how well LLMs explain failures, backed by 93 varied setups, human-checked scores, and downstream repair measurements on actual bugs. Evidence-rich partitions beat large undifferentiated ones for causal and actionable quality, and higher scores tie to better patch rates while low ones can fall below a no-explanation baseline. That mapping is more direct than most prior work that treats explanations as a side effect of repair prompts. The public reproduction package and the three-model comparison add practical value. The design is systematic enough to support the claim that certain artifacts improve explanation quality over generic prompting.

The soft spot is representativeness. The causal generalization rests on whether the chosen real bugs and the 93 partitions sample failure mechanisms and artifact types broadly enough. If the bugs cluster in particular domains or miss key signals, the quality differences and quartile correlations could be narrower than presented. The abstract is light on exact bug-selection criteria and full statistical controls, which makes it harder to judge how far the patterns extend to other models or codebases.

This is for researchers building or evaluating LLM tools for software maintenance and debugging. A reader who needs concrete data on what context elements help or hurt explanations will get usable findings here. It has enough empirical grounding and external validation to deserve a serious referee rather than a desk reject, though the generalizability section will likely need tightening.

Referee Report

1 major / 1 minor

Summary. The paper claims that LLM-generated failure explanations for debugging are causally affected by context composition. Using 93 systematically varied context configurations on real bugs and three models (gpt-5-mini, DeepSeek-V3.2, Grok-4.1-fast), it evaluates explanations on six quality criteria, validates LLM-as-a-judge scores via human ratings, and links higher explanation scores to improved downstream repair pass rates (and, for some models, closer-to-minimal fixes). Evidence-rich, failure-specific artifacts outperform overly large contexts, which produce vaguer explanations; low-score explanations can underperform a no-explanation baseline.

Significance. If the results hold, the work supplies concrete, actionable guidance on context design for LLM debugging systems, moving beyond ad-hoc prompting. Credit is due for the systematic variation across 93 configurations, evaluation on three economically viable models, human validation of the LLM judge, explicit downstream repair-rate measurements, and the public reproduction package.

major comments (1)

[Abstract and experimental results sections] The central causal claim—that context composition affects explanation quality and repair outcomes across economically viable models—rests on the assumption that the 93 configurations and chosen real bugs adequately sample failure mechanisms and artifact types. The abstract (and presumably the experimental sections) does not detail exact bug-selection criteria or statistical controls for representativeness; without this, the observed quality differences and quartile-repair correlations risk being dataset-specific rather than general.

minor comments (1)

[Abstract and §4] Clarify the exact model identifiers (e.g., whether 'gpt-5-mini' is a typo or specific variant) and ensure all six evaluation criteria are defined with explicit rubrics or examples in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the systematic variation across 93 configurations, the use of multiple models, human validation of the LLM judge, and the downstream repair measurements. We address the major comment below.

Referee: [Abstract and experimental results sections] The central causal claim—that context composition affects explanation quality and repair outcomes across economically viable models—rests on the assumption that the 93 configurations and chosen real bugs adequately sample failure mechanisms and artifact types. The abstract (and presumably the experimental sections) does not detail exact bug-selection criteria or statistical controls for representativeness; without this, the observed quality differences and quartile-repair correlations risk being dataset-specific rather than general.

Authors: We agree that the abstract provides limited detail on bug selection and that the experimental sections would benefit from greater transparency on this point. The manuscript describes the bugs as real bugs drawn from open-source projects with failing tests and ground-truth fixes, and the 93 configurations systematically vary context elements such as code slices, test cases, error messages, and stack traces. To strengthen the presentation, we will revise the experimental results section to include a dedicated paragraph on dataset construction: bugs were selected from established benchmarks to include a range of failure mechanisms (e.g., null dereferences, incorrect conditionals, resource management errors) and artifact types, with explicit criteria for inclusion (reproducible failures, availability of minimal patches). We will also add a limitations paragraph noting that, while the design supports causal claims about context composition within the sampled space and across three models, full statistical representativeness of all possible software failures would require a substantially larger corpus. These changes will clarify the scope of the claims without altering the core results.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claims grounded in external human ratings and repair outcomes

The paper performs an empirical study that varies 93 context configurations across real bugs and three models, scores explanations on six criteria, validates LLM-as-a-judge outputs via a separate human user study, and correlates scores with measured downstream repair pass rates. All load-bearing claims (context composition causally affects quality; higher-score quartiles link to better repair) are supported by these observed data and external benchmarks rather than any derivation, fitted parameter, or self-referential definition. No equations, self-citations, or ansatzes are invoked to force the results; the reproduction package further enables independent verification. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the selected bugs and context partitions plus the validity of the six evaluation criteria; no free parameters or invented entities are introduced.

axioms (2)

domain assumption The chosen real bugs and 93 context configurations are representative of typical debugging scenarios
Invoked when generalizing the causal effect of context composition beyond the studied cases.
domain assumption LLM-as-a-judge scores on the six criteria align sufficiently with human judgment for the quality assessment
Supported by the user study but required for treating the automated scores as reliable.
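The second axiom is the kind of assumption a rank-agreement check can probe. The sketch below computes Spearman's rank correlation between judge and human scores in plain Python. It is an assumption that rank correlation is a suitable agreement measure here; this page does not say which statistic the user study used, and the scores shown are made up.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


# Made-up scores for five explanations, for illustration only.
judge = [4.0, 2.5, 5.0, 1.0, 3.0]
human = [4, 2, 5, 2, 3]
print(round(spearman(judge, human), 3))
```

A value near 1 would indicate that the judge orders explanations much as the human raters do; it says nothing about whether the absolute scores match.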

pith-pipeline@v0.9.0 · 5858 in / 1395 out tokens · 54406 ms · 2026-05-21T08:43:49.173538+00:00 · methodology

Original abstract

Large language model (LLM)-based debugging systems can generate failure explanations, but these explanations may be incomplete or incorrect. Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). We investigate how explanation quality is affected by various LLM context configurations. Existing work predominantly treats LLM-generated failure explanations as an ad hoc by-product of debugging or repair workflows, using generic prompting over undifferentiated artifacts such as code, tests, and error messages rather than targeting explanations as a first-class output with dedicated quality assessment. Consequently, existing approaches provide limited support for assessing whether these explanations capture the underlying fault-error-failure mechanism and for actionable next steps, and most techniques instead prioritize task success (e.g., patch correctness or review quality) over the explicit causal explanation quality. We systematically vary the debugging information to study how distinct context compositions affect the quality of LLM-generated failure explanations. Across 93 context configurations on real bugs and three economically viable models (gpt-5-mini, DeepSeek-V3.2, and Grok-4.1-fast), we evaluate explanations with six criteria and validate the LLM-as-a-judge scores against human ratings in a user study. Our results indicate that explanation quality is causally affected by context composition. Evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are associated with higher downstream repair pass rates and, for some models, with fixes that are closer to the reference minimal fixes. In contrast, low-score quartiles can even underperform the no-explanation baseline. Reproduction package is publicly available.

Figures

Figures reproduced from arXiv: 2604.18309 by Julius Porbeck, Christian Medeiros Adriano, Holger Giese (Hasso Plattner Institute, University of Potsdam, Germany).

**Figure 2.** Distributions of expected total explanation scores.

**Figure 3.** Distributions of configuration-level expected total explanation scores.

**Figure 4.** Minimality effect sizes (Δ = Q4–Q1) for passing fixes in composed batches. Blue filled circles denote two-way batches; orange open circles denote three-way batches. Whiskers show Bonferroni-adjusted defect-bootstrap CIs (m = 2 per model). Positive values indicate that higher-quality explanations coincide with a higher minimal-fix rate.

**Figure 5.** Effect sizes comparing high- vs. low-quality explanations on passing fixes in composed batches […]
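The Figure 4 caption mentions Bonferroni-adjusted defect-bootstrap confidence intervals for Δ = Q4–Q1. A minimal sketch of that idea follows, assuming "defect-bootstrap" means resampling whole defects with replacement and using a percentile interval; the data layout, function names, and resample count are hypothetical and not taken from the paper.

```python
import random


def delta_q4_q1(defects):
    """Minimal-fix rate in Q4 minus the rate in Q1, pooled over defects.

    defects: list of dicts with non-empty 'q1' and 'q4' lists of booleans,
    one boolean per passing fix (True = fix was minimal).
    """
    q1 = [outcome for defect in defects for outcome in defect["q1"]]
    q4 = [outcome for defect in defects for outcome in defect["q4"]]
    return sum(q4) / len(q4) - sum(q1) / len(q1)


def bootstrap_ci(defects, n_comparisons=2, alpha=0.05, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for the delta, resampling at the defect level."""
    rng = random.Random(seed)
    adjusted_alpha = alpha / n_comparisons  # Bonferroni: split alpha across comparisons
    deltas = sorted(
        delta_q4_q1([rng.choice(defects) for _ in defects])
        for _ in range(n_resamples)
    )
    lower = deltas[int((adjusted_alpha / 2) * n_resamples)]
    upper = deltas[int((1 - adjusted_alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-defect outcomes, for illustration only.
defects = [
    {"q1": [False, False], "q4": [True, True]},
    {"q1": [False, True], "q4": [True, False]},
    {"q1": [False], "q4": [True]},
]
point = delta_q4_q1(defects)
low, high = bootstrap_ci(defects)
print(f"delta = {point:.2f}, adjusted CI = [{low:.2f}, {high:.2f}]")
```

Resampling defects rather than individual fixes keeps all outcomes from one bug together, so the interval reflects variation across bugs; with m = 2 comparisons the nominal 95% interval widens to 97.5%.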

Review history (2 revisions)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

[1] Fannar Steinn Aðalsteinsson, Björn Borgar Magnússon, Mislav Milicevic, Adam Nirving Davidsson, and Chih-Hong Cheng. 2025. Rethinking code review workflows with llm assistance: An empirical study. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 488–497

[2] Elijah Kayode Adejumo and Brittany Johnson. 2025. Explaining Code Risk in OSS: Towards LLM-Generated Fault Prediction Interpretations. arXiv:2510.06104

[3] Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. 2025. Why Does the Effective Context Length of LLMs Fall Short? In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=eoln5WgrPx

[4] Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing 1, 1 (2004), 11–33

[5] DeepSeek-AI. 2026. DeepSeek-V3.2. https://huggingface.co/deepseek-ai/DeepSeek-V3.2 Hugging Face model card

[6] Bryan Guan, Mehdi Rezagholizadeh, Tanya G. Roosta, and Peyman Passban

[7] The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs. In First International KDD Workshop on Prompt Optimization, 2025. https://openreview.net/forum?id=QcYyYvrPNU

[8] Maurice H. Halstead. 1977. Elements of Software Science. Elsevier North-Holland, New York, NY

[9] Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo. 2026. LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. ACM Transactions on Software Engineering and Methodology (2026)

[10] Tyler Holloway and Ethan R. Elenberg. 2024. On the Role of Context Granularity in LLM-Driven Program Repair. In Machine Learning for Systems Workshop (NeurIPS ’24 Workshop). NeurIPS Foundation, Vancouver, BC, Canada, 8 pages. https://neurips.cc/virtual/2024/103609 Workshop paper

[11] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 23rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2014). ACM, 312–315. doi:10.1145/2610384.2628055

[12] J. Peter Kincaid, Robert P. Fishburne, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical Report Research Branch Report 8-75. Naval Technical Training Command. https://stars.library.ucf.edu/istlibrary/56/

[13] Lucas Layman, Madeline Diep, Meiyappan Nagappan, Janice Singer, Robert Deline, and Gina Venolia. 2013. Debugging revisited: Toward understanding the debugging needs of contemporary software developers. In 2013 ACM/IEEE international symposium on empirical software engineering and measurement. IEEE, 383–392

[14] Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 8 (1966), 707–710

[15] Haolin Li and Michael Coblenz. 2026. A Grounded Theory of Debugging in Professional Software Engineering Practice. arXiv:2602.11435

[16] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172

[17] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing. 2511–2522

[18] Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, and Chun Zuo. 2025. Deepcrceval: Revisiting the evaluation of code review comment generation. In International Conference on Fundamental Approaches to Software Engineering. Springer, 43–64

[19] Luca Mariotto, Christian Medeiros Adriano, Daniel Burgstahler, René Eichhorn, and Holger Giese. 2025. From Assessment to Enhancement of Pull Requests at Scale: Aligning Code Reviews with Developer Competencies Using Large Language Models. To appear

[20] Christian Medeiros Adriano. 2022. Microtasking software failure resolution: early results. ACM SIGSOFT Software Engineering Notes 44, 1 (2022), 36–39

[21] Martin Monperrus. 2014. A critical review of “automatic patch generation learned from human-written patches”: essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14). ACM, 234–242. doi:10.1145/2568225.2568324

[22] OpenRouter. 2026. openai/gpt-5-mini. https://openrouter.ai/openai/gpt-5-mini Model route page

[23] OpenRouter. 2026. x-ai/grok-4.1-fast. https://openrouter.ai/x-ai/grok-4.1-fast Model route page

[24] Julius Porbeck. 2026. From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge. Master’s thesis. Hasso Plattner Institute

[25] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An Analysis of Patch Plausibility and Correctness for Generate-And-Validate Patch Generation Systems. (02 2015). doi:10.1145/2771783.2771791

[26] Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. In Second Conference on Language Modeling. https://openreview.net/forum?id=GFPoM8Ylp8

[27] Audrey Salmon, Katie Hammer, Eddie Antonio Santos, and Brett A Becker. 2025. Debugging Without Error Messages: How LLM Prompting Strategy Affects Programming Error Explanation Effectiveness. (2025). arXiv:2501.05706

[28] Ezekiel Soremekun, Lukas Kirschner, Marcel Böhme, and Andreas Zeller. 2021. Locating faults with program slicing: an empirical analysis. Empirical Software Engineering 26, 3 (2021), 51

[29] J.M. Voas and K.W. Miller. 1995. Software testability: the new verification. IEEE Software 12, 3 (1995), 17–28. doi:10.1109/52.382180

[30] Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Armaghan Eshaghi. 2024. Beyond the limits: a survey of techniques to extend the context length in large language models. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 8299–8307

[31] Ratnadira Widyasari, Jia Wei Ang, Truong Giang Nguyen, Neil Sharma, and David Lo. 2024. Demystifying faulty code: Step-by-step reasoning for explainable fault localization. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 568–579

[32] James Woodward. 1989. The Causal Mechanical Model of Explanation. Philosophy of Science 56, 2 (1989), 345–363. https://conservancy.umn.edu/bitstreams/f470d3c2-57fc-4764-8c0a-88c8747acc36/download Review of Wesley C. Salmon, Scientific Explanation and the Causal Structure of the World

[33] Jiwei Yan, Jinhao Huang, Chunrong Fang, Jun Yan, and Jian Zhang. 2024. Better debugging: Combining static analysis and llms for explainable crashing fault localization. arXiv:2408.12070

[34] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623
