pith. machine review for the scientific record.

arxiv: 2604.05955 · v1 · submitted 2026-04-07 · 💻 cs.SE · cs.AI

Recognition: 2 Lean theorem links

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:42 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: LLM agents · issue resolution · design constraints · software benchmarks · patch quality · test pass rates · SWE-bench

The pith

Test pass rates substantially overestimate LLM patch quality because fewer than half of resolved issues meet design constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that benchmarks for LLM-based issue resolution rely too heavily on test pass rates, which fail to capture whether patches follow project-specific design rules such as architectural conventions and error-handling policies. These rules are typically implicit in code reviews rather than encoded in tests. The authors create a benchmark by extracting and validating design constraints from real pull requests, then use an LLM verifier to check compliance on resolved issues. Results show widespread violations and almost no statistical connection between passing tests and satisfying design rules. This matters for anyone relying on LLM agents for real software maintenance, where maintainability and consistency matter as much as functionality.
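In outline, the construction step works like this: collect review comments from merged pull requests, have an LLM restate them as explicit, checkable constraint statements, then drop near-duplicates before validation and linking to issues. The snippet below is an editorial illustration of only the deduplication step, using a stdlib string-similarity heuristic; the candidate strings, threshold, and heuristic are hypothetical, and the paper's actual mining pipeline is not reproduced here.

```python
# Editorial sketch of the constraint-mining step: review comments from merged PRs
# are turned into candidate design-constraint statements (by an LLM in the paper's
# pipeline), then near-duplicates are dropped before validation and linking to
# issues. The dedup heuristic, threshold, and sample strings below are hypothetical.
from difflib import SequenceMatcher

def dedupe(candidates: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a candidate only if it is not a near-duplicate of one already kept."""
    kept: list[str] = []
    for cand in candidates:
        if all(SequenceMatcher(None, cand.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(cand)
    return kept

if __name__ == "__main__":
    # Hypothetical candidates, phrased the way an LLM extractor might phrase them.
    candidates = [
        "New settings must be documented in docs/ref/settings.txt.",
        "New settings must be documented in docs/ref/settings.txt",
        "Database errors should be wrapped in the project's OperationalError.",
        "Public helpers require a deprecation period before removal.",
    ]
    for c in dedupe(candidates):
        print("-", c)
```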

Core claim

The paper claims that functional correctness measured by test pass rates substantially overestimates patch quality in LLM-based issue resolution. Fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. Providing issue-specific design guidance reduces violations but leaves substantial non-compliance.
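The negligible-association claim concerns two binary outcomes per resolved issue: whether the patch passes the tests and whether it satisfies every linked design constraint. The paper's exact test statistic is not reproduced in this summary; the sketch below shows one standard way such an association could be quantified, via a chi-squared test and Cramér's V on a 2x2 contingency table. The labels are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch: quantify association between test-pass and design-satisfaction
# labels for a set of resolved issues. The labels below are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

passes_tests      = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])  # functional correctness
design_satisfying = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])  # verifier verdict

# 2x2 contingency table: rows = passes_tests, cols = design_satisfying
table = np.zeros((2, 2), dtype=int)
for t, d in zip(passes_tests, design_satisfying):
    table[t, d] += 1

chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # equals phi for a 2x2 table

print(f"chi2={chi2:.3f}  p={p_value:.3f}  Cramér's V={cramers_v:.3f}")
# A Cramér's V near zero with a non-significant p-value is what "negligible
# statistical association" would look like under this kind of test.
```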

What carries the argument

The design-aware issue resolution benchmark that mines validated design constraints from pull requests and applies an LLM-based verifier to measure patch compliance.
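In outline: mined constraints are attached to each issue instance, and an LLM verifier is asked, per constraint, whether a candidate patch complies; an issue counts as fully design-satisfying only if every linked constraint holds. The sketch below is an editorial illustration of that verifier loop under assumed interfaces; `call_llm`, the prompt wording, and the `Constraint` fields are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a design-constraint compliance check over a candidate patch.
# `call_llm` and the prompt format are placeholders; the paper's actual verifier
# prompts and schemas are not reproduced here.
from dataclasses import dataclass

@dataclass
class Constraint:
    constraint_id: str
    text: str          # e.g. "new settings must be documented in docs/ref/settings.txt"
    source_pr: str     # PR discussion the constraint was mined from

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client the verifier would use."""
    raise NotImplementedError

def check_compliance(patch_diff: str, constraints: list[Constraint]) -> dict[str, bool]:
    """Ask the verifier, per constraint, whether the patch complies."""
    verdicts = {}
    for c in constraints:
        prompt = (
            "You are reviewing a patch for design-constraint compliance.\n"
            f"Constraint (mined from {c.source_pr}): {c.text}\n"
            f"Patch:\n{patch_diff}\n"
            "Answer strictly with COMPLIES or VIOLATES."
        )
        verdicts[c.constraint_id] = call_llm(prompt).strip().upper().startswith("COMPLIES")
    return verdicts

def fully_design_satisfying(verdicts: dict[str, bool]) -> bool:
    # An issue counts as design-satisfying only if every linked constraint holds.
    return all(verdicts.values())
```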

If this is right

  • Test-based evaluations give an inflated view of current agent performance on real-world patches.
  • Design violations occur frequently even among patches that pass all tests.
  • Functional correctness and design satisfaction show negligible statistical association.
  • Explicit design guidance to agents reduces but does not eliminate non-compliance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents will need new mechanisms to extract and apply implicit project conventions from code history and reviews.
  • Evaluation suites should adopt multi-dimensional metrics that separately score functionality and design adherence.
  • In practice, developers may still need manual review for design fit even when LLM patches pass tests.

Load-bearing premise

An LLM-based verifier can reliably judge whether patches comply with design constraints extracted from pull requests without systematic human validation of the judgments.

What would settle it

A human expert review of the LLM verifier's compliance decisions on a representative sample of patches, where high rates of disagreement would show that the reported gap between test pass rates and design satisfaction is unreliable.

Figures

Figures reproduced from arXiv: 2604.05955 by Chong Wang, Junhao Zeng, Junwei Liu, Kai Yu, Xin Peng, Xueying Du, Ying Wang, Yujia Wang, Zhenhao Zhou, Zhiqiang Yuan, Ziyu Zhou.

Figure 1: A motivating example illustrating the resolution process of a realistic issue, along with relevant code review threads.
Figure 2: Overview of the construction pipeline of …
Figure 3: Venn Diagram for Violated Design Constraints of …
Figure 4: An example of design-constraint violation in …
Figure 5: Comparison of DVR Before and After Constraint …
Figure 6: An example of design-constraint violation in …
Original abstract

Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents \bench{}, a benchmark that makes such implicit design constraints explicit and measurable. \bench{} is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper argues that test-pass-rate metrics for LLM-based issue resolution agents substantially overestimate patch quality by ignoring compliance with implicit design constraints (e.g., architectural conventions and maintainability rules) that are documented only in PR discussions. It introduces design-aware issue resolution and a benchmark constructed by mining 1,787 validated constraints from real-world PRs across six repositories and linking them to 495 issues aligned with SWE-bench variants. An LLM-based verifier automatically extracts constraints and checks patch compliance. Experiments with state-of-the-art agents show fewer than half of resolved issues are fully design-satisfying, design violations are widespread, functional correctness has negligible statistical association with design satisfaction, and providing design guidance reduces but does not eliminate violations.

Significance. If the LLM verifier's judgments prove reliable, the work would be significant for exposing a fundamental limitation in current agent evaluation: functional correctness alone is insufficient and orthogonal to design compliance. The use of real PR data for constraint mining is a clear strength, grounding the benchmark in practice rather than synthetic rules. This motivates design-aware benchmarks and agents, with potential impact on how progress in repository-level code generation is measured.

major comments (1)
  1. The central quantitative claims (fewer than half of resolved issues fully design-satisfying; negligible correlation with test pass rate; widespread violations) are computed entirely from the outputs of the LLM-based verifier that extracts constraints from PR discussions and labels patch compliance. The abstract and method description indicate this process is fully automatic, with no reported human adjudication, inter-rater reliability metrics, calibration details, or error analysis on the 1,787 constraints. Because the constraints are implicitly documented, any systematic LLM bias would propagate directly into all headline statistics and the association test. A human validation study on a representative sample is required to make the empirical findings defensible.
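One way to see why verifier reliability is load-bearing: if, at the issue level, the verifier wrongly flags a compliant patch with probability false_flag and misses every violation in a non-compliant patch with probability missed_violation, then the observed design-satisfaction rate is a mixture of the true rate and those errors, and modest error rates can move the headline figure noticeably. The sketch below is an editorial illustration of that propagation with assumed error rates, not a calculation from the paper.

```python
# Editorial illustration: how issue-level verifier errors shift the observed
# fraction of "fully design-satisfying" resolved issues. All rates below are
# assumed values, not estimates from the paper.

def observed_rate(true_rate: float, false_flag: float, missed_violation: float) -> float:
    """Observed satisfaction rate given issue-level verifier error rates.

    false_flag:        P(verifier reports a violation | issue truly compliant)
    missed_violation:  P(verifier reports no violation | issue truly violating)
    """
    return true_rate * (1.0 - false_flag) + (1.0 - true_rate) * missed_violation

def corrected_rate(observed: float, false_flag: float, missed_violation: float) -> float:
    """Invert the mixture to recover the implied true rate (Rogan-Gladen style correction)."""
    return (observed - missed_violation) / (1.0 - false_flag - missed_violation)

if __name__ == "__main__":
    obs = 0.45  # illustrative "fewer than half" headline, not the paper's exact number
    for ff, mv in [(0.0, 0.0), (0.10, 0.05), (0.05, 0.15)]:
        print(f"false_flag={ff:.2f} missed_violation={mv:.2f} "
              f"implied true rate={corrected_rate(obs, ff, mv):.2f}")
```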
minor comments (2)
  1. The benchmark name is introduced with LaTeX formatting but would benefit from an explicit definition and expansion on first use in the abstract and introduction.
  2. Clarify the exact overlap and differences with SWE-bench-Verified and SWE-bench-Pro (e.g., via a table or paragraph in the benchmark construction section) to help readers assess generalizability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The concern regarding the reliability of the LLM-based verifier is well-taken, and we address it directly below while committing to strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: The central quantitative claims (fewer than half of resolved issues fully design-satisfying; negligible correlation with test pass rate; widespread violations) are computed entirely from the outputs of the LLM-based verifier that extracts constraints from PR discussions and labels patch compliance. The abstract and method description indicate this process is fully automatic, with no reported human adjudication, inter-rater reliability metrics, calibration details, or error analysis on the 1,787 constraints. Because the constraints are implicitly documented, any systematic LLM bias would propagate directly into all headline statistics and the association test. A human validation study on a representative sample is required to make the empirical findings defensible.

    Authors: We agree that the absence of reported human validation for the LLM verifier's outputs on constraint extraction and compliance labeling represents a limitation that affects the defensibility of the headline statistics. Although the benchmark construction mines constraints from real PR discussions (providing grounding in practice) and the verifier is applied at scale, we did not include human adjudication, inter-rater metrics, or error analysis in the submitted version. To address this, we will add a human validation study: we will randomly sample 150 constraints (approximately 8.5% of the total) and 150 patch-compliance judgments across the six repositories. Two independent human annotators (with software engineering expertise) will label each item for extraction accuracy and compliance correctness. We will report agreement via Cohen's kappa, per-category error rates, and any systematic biases observed. These results, along with calibration details for the verifier prompts, will be incorporated into the Methods, Results, and a new Limitations subsection. We believe this addition will directly mitigate concerns about bias propagation while preserving the automatic nature of the benchmark for reproducibility. revision: yes
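The proposed study reduces to agreement statistics between two human annotators and the verifier on the sampled items. A minimal sketch of how those agreements might be computed, assuming scikit-learn is available; the labels below are hypothetical placeholders, not data from the planned study.

```python
# Minimal sketch: inter-rater agreement for the proposed human validation study.
# The labels below are hypothetical placeholders, not results from the paper.
from sklearn.metrics import cohen_kappa_score

# 1 = "complies", 0 = "violates", one entry per sampled compliance judgment
annotator_a  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
annotator_b  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
llm_verifier = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]

print("human-human kappa:   ", cohen_kappa_score(annotator_a, annotator_b))
print("human A vs verifier: ", cohen_kappa_score(annotator_a, llm_verifier))
print("human B vs verifier: ", cohen_kappa_score(annotator_b, llm_verifier))
# High human-human agreement but low human-verifier agreement would undercut the
# headline statistics; high agreement across all pairs would support them.
```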

Circularity Check

0 steps flagged

No significant circularity; empirical results from external PR-derived benchmark

full rationale

The paper constructs its benchmark by mining design constraints from real-world pull requests (external data), links them to SWE-bench issues, and applies an LLM verifier to label patch compliance. All headline statistics (percentages of design-satisfying issues, violation rates, correlation with pass rates) are direct empirical counts and association tests on these labels. No equations, derivations, or first-principles claims exist that reduce to fitted parameters or self-referential definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The chain is data collection → automated labeling → reporting; no step feeds the reported numbers back into the benchmark's construction, and the issue instances are anchored to external benchmarks (SWE-bench-Verified and SWE-bench-Pro) rather than to the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No mathematical derivations, free parameters, or invented entities are described in the abstract. The work rests on the domain assumption that design constraints can be reliably extracted from PRs and checked by LLMs.

pith-pipeline@v0.9.0 · 5548 in / 1071 out tokens · 37294 ms · 2026-05-10T18:42:01.904973+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

