Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:42 UTC · model grok-4.3
The pith
Test pass rates substantially overestimate LLM patch quality because fewer than half of resolved issues meet design constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that functional correctness measured by test pass rates substantially overestimates patch quality in LLM-based issue resolution. Fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. Providing issue-specific design guidance reduces violations but leaves substantial non-compliance.
What carries the argument
The design-aware issue resolution benchmark that mines validated design constraints from pull requests and applies an LLM-based verifier to measure patch compliance.
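The pipeline described above (mine constraints, link them to issues, verify patches) can be sketched as follows. This is a toy stand-in, not the paper's implementation: the `Constraint` type, the keyword heuristic inside `judge_compliance` (which replaces the actual LLM verifier), and the all-constraints-must-pass rule are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Constraint:
    id: str
    text: str  # natural-language rule mined from a PR discussion

def judge_compliance(patch: str, constraint: Constraint) -> bool:
    """Stand-in for the paper's LLM verifier: a trivial keyword heuristic,
    so the pipeline shape is runnable without any model API."""
    # e.g. a rule forbidding bare excepts is violated if the patch adds one
    if "bare except" in constraint.text:
        return "except:" not in patch
    return True  # unknown rules pass in this toy stand-in

def design_satisfying(patch: str, constraints: list[Constraint]) -> bool:
    # An issue counts as fully design-satisfying only if every
    # linked constraint is judged compliant.
    return all(judge_compliance(patch, c) for c in constraints)

rules = [Constraint("C1", "avoid bare except clauses")]
print(design_satisfying("try:\n    run()\nexcept:\n    pass", rules))  # False
```

The key design point the benchmark encodes is the conjunction: a single violated constraint makes a test-passing patch non-design-satisfying, which is why pass rate alone overstates quality.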
If this is right
- Test-based evaluations give an inflated view of current agent performance on real-world patches.
- Design violations occur frequently even among patches that pass all tests.
- Functional correctness and design satisfaction show negligible statistical association.
- Explicit design guidance to agents reduces but does not eliminate non-compliance.
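The "negligible statistical association" bullet is the kind of claim a χ² statistic and Cramér's V make precise. The sketch below uses made-up counts (not the paper's data) to show how a V near zero reads as "passing tests tells you almost nothing about design compliance":

```python
import math

def cramers_v(table):
    """Cramér's V for an r×c contingency table
    (here rows = tests pass/fail, cols = design satisfied/violated)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Pearson chi-squared against independence expectations
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(table))
        for j in range(len(col_tot))
    )
    k = min(len(table), len(col_tot)) - 1
    return math.sqrt(chi2 / (n * k))

# Illustrative counts only, chosen so the margins are near-independent.
table = [[120, 130],
         [110, 135]]
print(round(cramers_v(table), 3))  # small value → negligible association
```

By the usual reading of Cramér's V, values below roughly 0.1 indicate a negligible association, which is the regime the paper reports (V ≤ 0.11).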
Where Pith is reading between the lines
- Agents will need new mechanisms to extract and apply implicit project conventions from code history and reviews.
- Evaluation suites should adopt multi-dimensional metrics that separately score functionality and design adherence.
- In practice, developers may still need manual review for design fit even when LLM patches pass tests.
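The multi-dimensional-metric suggestion above can be sketched minimally; every name here is hypothetical, not from the paper or any benchmark API:

```python
def score_patch(tests_passed: bool, satisfied: int, total: int) -> dict:
    """Report functionality and design adherence as separate scores
    instead of collapsing them into a single pass/fail bit."""
    return {
        "functional": 1.0 if tests_passed else 0.0,
        "design": satisfied / total if total else 1.0,
        "fully_design_satisfying": tests_passed and satisfied == total,
    }

# A patch that passes all tests but violates 1 of 4 linked constraints:
print(score_patch(True, 3, 4))
```

Scoring this way keeps the paper's headline finding visible in the metric itself: a patch can score 1.0 on functionality while still failing the design dimension.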
Load-bearing premise
An LLM-based verifier can reliably judge whether patches comply with design constraints extracted from pull requests without systematic human validation of the judgments.
What would settle it
A human expert review of the LLM verifier's compliance decisions on a representative sample of patches, where high rates of disagreement would show that the reported gap between test pass rates and design satisfaction is unreliable.
Original abstract
Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents SWE-Shield, a benchmark that makes such implicit design constraints explicit and measurable. SWE-Shield is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that test-pass-rate metrics for LLM-based issue resolution agents substantially overestimate patch quality by ignoring compliance with implicit design constraints (e.g., architectural conventions and maintainability rules) that are documented only in PR discussions. It introduces design-aware issue resolution and the SWE-Shield benchmark, constructed by mining 1,787 validated constraints from real-world PRs across six repositories and linking them to 495 issues aligned with SWE-bench variants. An LLM-based verifier automatically extracts constraints and checks patch compliance. Experiments with state-of-the-art agents show fewer than half of resolved issues are fully design-satisfying, design violations are widespread, functional correctness has negligible statistical association with design satisfaction, and providing design guidance reduces but does not eliminate violations.
Significance. If the LLM verifier's judgments prove reliable, the work would be significant for exposing a fundamental limitation in current agent evaluation: functional correctness alone is insufficient and orthogonal to design compliance. The use of real PR data for constraint mining is a clear strength, grounding the benchmark in practice rather than synthetic rules. This motivates design-aware benchmarks and agents, with potential impact on how progress in repository-level code generation is measured.
Major comments (1)
- The central quantitative claims (fewer than half of resolved issues fully design-satisfying; negligible correlation with test pass rate; widespread violations) are computed entirely from the outputs of the LLM-based verifier that extracts constraints from PR discussions and labels patch compliance. The abstract and method description indicate this process is fully automatic, with no reported human adjudication, inter-rater reliability metrics, calibration details, or error analysis on the 1,787 constraints. Because the constraints are implicitly documented, any systematic LLM bias would propagate directly into all headline statistics and the association test. A human validation study on a representative sample is required to make the empirical findings defensible.
Minor comments (2)
- The benchmark name is introduced with LaTeX formatting but would benefit from an explicit definition and expansion on first use in the abstract and introduction.
- Clarify the exact overlap and differences with SWE-bench-Verified and SWE-bench-Pro (e.g., via a table or paragraph in the benchmark construction section) to help readers assess generalizability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The concern regarding the reliability of the LLM-based verifier is well-taken, and we address it directly below while committing to strengthen the manuscript accordingly.
Point-by-point responses
-
Referee: The central quantitative claims (fewer than half of resolved issues fully design-satisfying; negligible correlation with test pass rate; widespread violations) are computed entirely from the outputs of the LLM-based verifier that extracts constraints from PR discussions and labels patch compliance. The abstract and method description indicate this process is fully automatic, with no reported human adjudication, inter-rater reliability metrics, calibration details, or error analysis on the 1,787 constraints. Because the constraints are implicitly documented, any systematic LLM bias would propagate directly into all headline statistics and the association test. A human validation study on a representative sample is required to make the empirical findings defensible.
Authors: We agree that the absence of reported human validation for the LLM verifier's outputs on constraint extraction and compliance labeling represents a limitation that affects the defensibility of the headline statistics. Although the benchmark construction mines constraints from real PR discussions (providing grounding in practice) and the verifier is applied at scale, we did not include human adjudication, inter-rater metrics, or error analysis in the submitted version. To address this, we will add a human validation study: we will randomly sample 150 constraints (approximately 8.4% of the total) and 150 patch-compliance judgments across the six repositories. Two independent human annotators (with software engineering expertise) will label each item for extraction accuracy and compliance correctness. We will report agreement via Cohen's kappa, per-category error rates, and any systematic biases observed. These results, along with calibration details for the verifier prompts, will be incorporated into the Methods, Results, and a new Limitations subsection. We believe this addition will directly mitigate concerns about bias propagation while preserving the automatic nature of the benchmark for reproducibility.
Revision: yes
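The agreement statistic the authors propose to report can be sketched as a standard Cohen's kappa for two annotators with binary labels. The toy label vectors below are illustrative only, not real annotation data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement under independent marginals
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# Toy compliance labels (1 = compliant, 0 = violation) from two annotators
r1 = [1, 1, 0, 1, 0, 1, 1, 0]
r2 = [1, 1, 0, 1, 1, 1, 0, 0]
print(round(cohens_kappa(r1, r2), 3))
```

High kappa between the annotators, and between annotators and the LLM verifier, is what would make the headline statistics defensible; low kappa would support the referee's concern.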
Circularity Check
No significant circularity; empirical results from external PR-derived benchmark
full rationale
The paper constructs its benchmark by mining design constraints from real-world pull requests (external data), links them to SWE-bench issues, and applies an LLM verifier to label patch compliance. All headline statistics (percentages of design-satisfying issues, violation rates, correlation with pass rates) are direct empirical counts and association tests on these labels. No equations, derivations, or first-principles claims exist that reduce to fitted parameters or self-referential definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The chain is data collection → automated labeling → reporting, with each stage independent of the reported numbers and grounded in external benchmarks.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
SWE-Shield is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
a χ² test shows no significant association ... Cramér’s V ≤ 0.11
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.