SWE-QA: Can Language Models Answer Repository-level Code Questions?

Beijun Shen; Weihan Peng; Xiaodong Gu; Xinyun Zhang; Yuhang Wang; Yuling Shi

arxiv: 2509.14635 · v2 · submitted 2025-09-18 · 💻 cs.CL · cs.PL · cs.SE

Pith reviewed 2026-05-18 16:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.PL · cs.SE

keywords repository-level code QA · language model evaluation · software repository reasoning · GitHub issues · agentic frameworks · cross-file reasoning · benchmark construction

The pith

SWE-QA supplies 576 repository-level code questions drawn from real GitHub issues to test language models on cross-file and multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks limit evaluation to isolated code snippets that do not reflect how developers actually work across files and dependencies. The authors analyze 77,100 issues from eleven repositories to build a two-level taxonomy and then manually curate 576 validated question-answer pairs covering intention understanding, cross-file reasoning, and multi-hop dependency analysis. They introduce SWE-QA-Agent as an agentic framework that lets LLMs reason step by step and act to retrieve answers from the full repository. Experiments with six advanced models show measurable gains when the agent framework augments context, yet performance gaps remain on harder multi-hop cases.

Core claim

SWE-QA is a repository-level code QA benchmark containing 576 high-quality pairs that better capture real-world developer questions than prior snippet-based datasets. The pairs are derived from naturally occurring issues and organized by a two-level taxonomy that emphasizes navigation of architecture and long-range dependencies. The accompanying SWE-QA-Agent framework demonstrates that LLM agents can improve answer accuracy by reasoning and acting across repository files.

What carries the argument

SWE-QA benchmark and its two-level taxonomy of repository-level questions, which organizes evaluation around intention understanding, cross-file reasoning, and multi-hop dependency analysis extracted from GitHub issues.

If this is right

Agent-based augmentation can measurably raise accuracy on repository-scale questions that require file navigation.
Benchmark results expose persistent weaknesses in current models for multi-hop dependency tracing and architectural understanding.
SWE-QA supplies a concrete testbed for developing automated tools that answer developer questions inside full codebases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tools built on this style of benchmark could reduce time developers spend searching across files for answers to their own questions.
The taxonomy may transfer to other large code artifacts such as documentation sets or test suites if the same construction process is repeated.
Extending the benchmark to additional repositories would test whether the observed performance patterns hold outside the original eleven projects.

Load-bearing premise

The manually chosen 576 questions accurately mirror the distribution and difficulty of questions that developers actually ask about repositories without selection bias.

What would settle it

A direct comparison showing that models ranked highly on SWE-QA still fail to resolve comparable questions when inserted into live, unmodified repositories or that the taxonomy categories miss major types of questions observed in additional issue corpora.

Figures

Figures reproduced from arXiv: 2509.14635 by Beijun Shen, Weihan Peng, Xiaodong Gu, Xinyun Zhang, Yuhang Wang, Yuling Shi.

**Figure 1.** Workflow of Benchmark Construction. • We construct SWE-QA, a repository-level code QA benchmark comprising 576 high-quality question-answer pairs from 12 diverse open-source Python repositories to evaluate comprehensive repository understanding. • We propose SWE-QA-Agent, an intelligent ReAct-style agent designed to answer repository-level questions. The agent leverages a variety of tools to assist in rea…

**Figure 2.** Distribution of Collected Issues. Our data collection process began by crawling 77,100 GitHub issues from the 11 popular repositories used in SWE-bench [12] (excluding Django, which has issues disabled). To focus on substantive discussions, we filtered for issues with a body length of at least 1,000 characters, resulting in a dataset of 41,955 issues. Given that issues often contain extensive descript…

**Figure 3.** Distribution of Question Types. We first manually analyzed a random sample of 1,000 questions through an iterative open coding process to identify recurring patterns and developer intentions. This yielded a structured two-level taxonomy for repository-level QA, as summarized in …

**Figure 4.** Core Elements and their Relations Extracted

**Figure 5.** QA Generation and Reference Answer Example.

**Figure 6.** Overview of the SWE-QA-Agent. While CoReQA [3] operates at the repository level by collecting data from GitHub issues and comments, it emphasizes issue resolution rather than code module comprehension. Critically, none of these existing benchmarks require models to understand code modules as interconnected architectural components within a repository. SWE-QA addresses this fundamental gap by specifically …

**Figure 7.** Case Study on sqlfluff.
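
The length filter described in the Figure 2 caption (keep issues whose body has at least 1,000 characters) can be sketched as below. This is a minimal illustration, not the authors' crawler: the record layout (`number`, `body`) and the function name `filter_substantive_issues` are assumptions made for the example.

```python
from typing import Iterable

MIN_BODY_CHARS = 1_000  # threshold stated in the Figure 2 caption


def filter_substantive_issues(
    issues: Iterable[dict], min_chars: int = MIN_BODY_CHARS
) -> list[dict]:
    """Keep issues whose body has at least `min_chars` characters."""
    kept = []
    for issue in issues:
        body = issue.get("body") or ""  # treat a missing or null body as empty
        if len(body) >= min_chars:
            kept.append(issue)
    return kept


# Hypothetical toy records; the field names are an assumption.
sample = [
    {"number": 1, "body": "short report"},
    {"number": 2, "body": "x" * 1_000},
    {"number": 3, "body": None},
]
kept_numbers = [issue["number"] for issue in filter_substantive_issues(sample)]
```

Only the second record survives the filter in this toy input; the first is too short and the third has no body.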

Abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
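
The "reason and act" loop mentioned in the abstract can be illustrated with a small sketch. The action names `ReadFile`, `GetRepoStructure`, `SearchContent`, and `Finish` are the ones quoted from the paper elsewhere on this page; everything else is an assumption for illustration, including the in-memory `REPO`, the `scripted_policy` that stands in for an LLM, and the step budget. It is not the authors' implementation.

```python
from typing import Callable

# Toy in-memory "repository": path -> file contents (hypothetical data).
REPO = {
    "pkg/config.py": "DEFAULT_TIMEOUT = 30\n",
    "pkg/client.py": (
        "from pkg.config import DEFAULT_TIMEOUT\n\n"
        "def connect():\n    return DEFAULT_TIMEOUT\n"
    ),
}


def get_repo_structure(_: str) -> str:
    """List every file path in the toy repository."""
    return "\n".join(sorted(REPO))


def read_file(path: str) -> str:
    """Return a file's contents, or an error string the agent can react to."""
    return REPO.get(path, f"error: no such file {path}")


def search_content(needle: str) -> str:
    """Return the paths of files whose text contains `needle`."""
    hits = [path for path, text in sorted(REPO.items()) if needle in text]
    return "\n".join(hits) or "no matches"


TOOLS: dict[str, Callable[[str], str]] = {
    "GetRepoStructure": get_repo_structure,
    "ReadFile": read_file,
    "SearchContent": search_content,
}


def scripted_policy(question: str, history: list) -> tuple[str, str]:
    """Stand-in for the LLM: picks the next (action, argument) from a fixed script."""
    script = [
        ("SearchContent", "DEFAULT_TIMEOUT"),
        ("ReadFile", "pkg/config.py"),
        ("Finish", "DEFAULT_TIMEOUT is defined in pkg/config.py"),
    ]
    return script[len(history)]


def run_agent(question: str, policy, max_steps: int = 5):
    """Alternate between choosing an action and observing its result."""
    history = []  # list of (action, argument, observation) triples
    for _ in range(max_steps):
        action, argument = policy(question, history)
        if action == "Finish":
            return argument, history
        observation = TOOLS[action](argument)
        history.append((action, argument, observation))
    return "no answer within step budget", history


answer, trace = run_agent("Where is DEFAULT_TIMEOUT defined?", scripted_policy)
```

In a real system the policy would be a model call that sees the question and the accumulated trace; keeping it as a plain function here makes the control flow testable without any network access.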

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SWE-QA builds a useful new benchmark from real GitHub issues but the manual step from 77k issues to 576 pairs needs clearer checks on representativeness.

SWE-QA gives us a new benchmark of 576 repository-level QA pairs pulled from real GitHub issues, along with a taxonomy and an agent setup for tackling them. The construction starts from 77,100 issues across 11 repos and derives categories that cover intention understanding, cross-file reasoning, and multi-hop dependencies. This moves past the small-snippet focus of CoSQA and CodeQA, and grounding the questions in actual developer issues is a practical step forward.

The SWE-QA-Agent framework is a straightforward prototype that shows one way LLMs can be prompted to navigate repos for answers. The evaluations on six models under different context strategies add some early data on where current systems stand. The paper does a clean job of naming the gap and shipping a concrete resource to fill it.

The main soft spot is the curation process itself. Going from a large issue set to a small validated set by hand is reasonable, but without reported inter-annotator agreement or a direct comparison of category frequencies between the full 77k issues and the final 576, it is hard to judge how well the benchmark reflects typical questions rather than the ones curators found interesting. If harder multi-hop cases were over-selected, the difficulty numbers and the agent results would look stronger than they are in the wild. The abstract flags open challenges, which is honest, but the full paper should include the sampling protocol and any validation stats to make the claims stick.

This work is for researchers building or testing LLM agents on full-repository software engineering tasks. Anyone who needs a measurable target beyond single-file code QA would get immediate use from the dataset and taxonomy. I would send it for peer review. The gap is real, the resource is new, and referees can push the methodology details without rejecting the core idea.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SWE-QA, a repository-level code QA benchmark consisting of 576 high-quality question-answer pairs derived from analysis of 77,100 GitHub issues across 11 repositories. It develops a two-level taxonomy of repository-level questions (covering intention understanding, cross-file reasoning, and multi-hop dependency analysis), manually curates and validates the pairs from seed questions, and presents SWE-QA-Agent, an agentic LLM framework evaluated on six advanced models under different context augmentation strategies.

Significance. If the manually curated benchmark proves representative of real developer questions, it would meaningfully extend beyond snippet-based resources like CoSQA and CodeQA by emphasizing repository-scale navigation and long-range dependencies, with the agent framework offering a practical prototype for automated QA. The scale of issue analysis and manual validation effort provides a solid foundation for future work on repository-level reasoning.

major comments (2)

[Benchmark Construction] Benchmark Construction section: The process of selecting and curating the final 576 pairs from the initial 77,100 issues provides no inter-annotator agreement statistics, no explicit sampling protocol, and no quantitative comparison (e.g., chi-squared or frequency tables) of category distributions between the full issue corpus and the curated set. This directly affects the central claim that SWE-QA better captures real-world complexity, as unquantified selection bias could over- or under-represent multi-hop or cross-file questions relative to naturally occurring developer queries.
[Evaluation] Evaluation section: While the abstract and prototype description reference LLM evaluations and highlight 'promise' for the SWE-QA-Agent, the manuscript supplies no concrete performance metrics, baseline comparisons, error breakdowns, or ablation results on the 576 pairs, leaving the empirical support for the framework's effectiveness difficult to assess.

minor comments (2)

[Abstract] Abstract: The claim of 'high-quality' pairs is stated without reference to the specific validation criteria or agreement thresholds used during manual curation.
[Taxonomy Development] The two-level taxonomy is introduced but its derivation from the issue analysis is described at a high level without example questions or decision rules for category assignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions to improve clarity and rigor.

Referee: [Benchmark Construction] Benchmark Construction section: The process of selecting and curating the final 576 pairs from the initial 77,100 issues provides no inter-annotator agreement statistics, no explicit sampling protocol, and no quantitative comparison (e.g., chi-squared or frequency tables) of category distributions between the full issue corpus and the curated set. This directly affects the central claim that SWE-QA better captures real-world complexity, as unquantified selection bias could over- or under-represent multi-hop or cross-file questions relative to naturally occurring developer queries.

Authors: We agree that greater transparency in the curation process is needed to support claims of representativeness. In the revised manuscript, we will expand the Benchmark Construction section with: an explicit description of the sampling protocol from the 77,100 issues; inter-annotator agreement statistics (e.g., Cohen's kappa) from the manual validation; and quantitative comparisons including frequency tables and statistical tests (such as chi-squared) of category distributions between the full corpus and the final 576 pairs. These additions will directly address potential selection bias concerns.
Revision: yes

Referee: [Evaluation] Evaluation section: While the abstract and prototype description reference LLM evaluations and highlight 'promise' for the SWE-QA-Agent, the manuscript supplies no concrete performance metrics, baseline comparisons, error breakdowns, or ablation results on the 576 pairs, leaving the empirical support for the framework's effectiveness difficult to assess.

Authors: We acknowledge that the current Evaluation section would benefit from more explicit and detailed reporting to facilitate assessment. We will revise this section to include concrete performance metrics for the six models across context strategies, direct baseline comparisons, categorized error breakdowns on the 576 pairs, and ablation results for the SWE-QA-Agent components. This will strengthen the empirical presentation while building on the existing experimental findings.
Revision: yes
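
The two statistics named in the first response, Cohen's kappa for inter-annotator agreement and a chi-squared comparison of category distributions, can be computed along the following lines. This is a sketch in plain Python under stated assumptions: the annotator labels and the per-category counts are made-up illustration data, not figures from the paper, and the function names are hypothetical. The category names reuse the first-level interrogatives (What, Why, Where, How) quoted from the paper elsewhere on this page.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    # Agreement expected if each annotator labelled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def chi_squared_statistic(curated_counts: dict, corpus_counts: dict) -> float:
    """Goodness-of-fit statistic: curated set against proportions in the full corpus.

    Assumes every category in `corpus_counts` has a positive count.
    """
    total_curated = sum(curated_counts.values())
    total_corpus = sum(corpus_counts.values())
    statistic = 0.0
    for category, corpus_n in corpus_counts.items():
        expected = total_curated * corpus_n / total_corpus
        observed = curated_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up labels from two hypothetical annotators on six questions.
annotator_a = ["What", "Why", "Where", "How", "What", "Why"]
annotator_b = ["What", "Why", "Where", "How", "Why", "Why"]
kappa = cohens_kappa(annotator_a, annotator_b)

# Made-up category counts for a full corpus and a curated subset.
corpus = {"What": 400, "Why": 200, "Where": 200, "How": 200}
curated = {"What": 30, "Why": 20, "Where": 20, "How": 30}
chi2 = chi_squared_statistic(curated, corpus)
```

The statistic alone is not a verdict: it would be compared against a chi-squared distribution with (number of categories minus one) degrees of freedom to obtain a p-value, which this sketch leaves out.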

Circularity Check

0 steps flagged

No circularity: new benchmark constructed from external issues and evaluated on independent LLMs

The paper constructs SWE-QA by crawling 77,100 GitHub issues from 11 repositories, deriving a two-level taxonomy from them, manually curating 576 QA pairs, and then evaluating six external LLMs plus the SWE-QA-Agent framework on the resulting benchmark. No mathematical derivations, fitted parameters renamed as predictions, self-citations bearing central claims, or uniqueness theorems imported from prior author work appear in the provided text. The contribution is an empirical dataset and prototype system whose validity rests on the curation process and external model evaluations rather than any internal reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on domain assumptions about data representativeness rather than new mathematical axioms or invented entities; no free parameters are introduced.

axioms (1)

domain assumption: The 77,100 crawled GitHub issues from 11 repositories yield a representative sample of real developer questions suitable for taxonomy and question construction.
Stated in the construction process described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono_of_one_lt · unclear

Paper passage: We first manually analyzed a random sample of 1,000 questions through an iterative open coding process to identify recurring patterns and developer intentions. This yielded a structured two-level taxonomy for repository-level QA... The first level categorizes questions based on their primary interrogative: What... Why... Where... and How...

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Paper passage: SWE-QA-Agent operates through an iterative ReAct-based workflow... actions: ReadFile, GetRepoStructure, SearchContent, Finish

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We evaluate six advanced LLMs... under various context augmentation strategies... rubric-guided evaluation system... correctness, completeness, relevance, clarity, and reasoning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Context Gathering Decision Process: A POMDP Framework for Agentic Search
cs.AI · 2026-05 · accept · novelty 7.0

Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...

ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
cs.CV · 2026-04 · unverdicted · novelty 7.0

ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

Neurosymbolic Repo-level Code Localization
cs.SE · 2026-04 · unverdicted · novelty 7.0

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
cs.CL · 2026-02 · unverdicted · novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs
cs.SE · 2026-05 · unverdicted · novelty 6.0

AOCI creates an incremental symbolic-semantic index per code unit that gives LLMs a complete, consistent repository view, outperforming baselines with zero defects on 19 industrial tasks while using far fewer tokens.

Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
cs.CL · 2026-04 · unverdicted · novelty 6.0

The ChangAn benchmark demonstrates that existing AI detectors perform poorly at distinguishing LLM-generated classical Chinese poetry from human-written examples.

Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
cs.AI · 2026-05 · unverdicted · novelty 5.0

LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 7 Pith papers · 2 internal anchors

[1] Anonymous Authors. 2025. Replication Package of SWE-QA Benchmark and SWE-QA-Agent. https://anonymous.4open.science/r/SWE-QA-D3EC
[2] Anthropic. 2025. Claude 3.7 Sonnet Model. https://www.anthropic.com/news/claude-3-7-sonnet/
[3] Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, and Shuiguang Deng. CoReQA: uncovering potentials of language models in code repository question answering. arXiv preprint arXiv:2501.03447 (2025)
[4] Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, and Qianxiang Wang. 2025. SWE-Exp: Experience-Driven Software Issue Resolution. arXiv preprint arXiv:2507.23361 (2025)
[5] DeepSeek-AI. 2025. DeepSeek-V3 Model. https://api.deepseek.com
[6] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1536–1547
[7] Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13
[8] Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo. From code to courtroom: Llms as the new software judges. arXiv preprint arXiv:2503.02246 (2025)
[9] Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, and Cuiyun Gao. 2024. CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering. arXiv preprint arXiv:2412.14764 (2024)
[10] Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20,000+ Web Queries for Code Search and Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volum...
[11] Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, et al. 2024. Opencoder: The open cookbook for top-tier code large language models. arXiv preprint arXiv:2411.04905 (2024)
[12] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In ICLR
[13] Changyoon Lee, Yeon Seonwoo, and Alice Oh. 2022. CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course. CoRR abs/2210.14494 (2022)
[14] Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution. arXiv preprint arXiv:2507.23348 (2025)
[15] Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024. Infibench: Evaluating the question-answering capabilities of code large language models. Advances in Neural Information Processing Systems 37 (2024), 128668–128698
[16] Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, and Wenge Rong. 2024. ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 13057–13067
[17] Chenxiao Liu and Xiaojun Wan. 2021. CodeQA: A Question Answering Dataset for Source Code Comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. Association for Computational Linguistics, 2618–2632
[18] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2511–2522
[19] Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2024. How to understand whole software repository. arXiv preprint arXiv:2406.01422 (2024)
[20] Mistral AI. 2025. Devstral Model. https://mistral.ai/news/devstral
[21] OpenAI. 2024. GPT-4o Model. https://openai.com/index/hello-gpt-4o/
[22] OpenAI. 2025. GPT-5 Model. https://openai.com/index/introducing-gpt-5/
[23] Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2024. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. In The Thirteenth International Conference on Learning Representations
[24] Qwen Team. 2024. Qwen2.5-Coder Model. https://qwenlm.github.io/blog/qwen2.5-coder/
[25] Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, and Shirish K. Shevade. CodeQueries: A Dataset of Semantic Queries over Code. In Proceedings of the 17th Innovations in Software Engineering Conference, ISEC 2024, Bangalore, India, February 22-24, 2024. ACM, 7:1–7:11
[26] Yuling Shi, Songsong Wang, Chengcheng Wan, and Xiaodong Gu. 2024. From code to correctness: Closing the last mile of code generation with hierarchical debugging. arXiv preprint arXiv:2410.01215 (2024)
[27] Disha Shrivastava, Denis Kocetkov, Harm De Vries, Dzmitry Bahdanau, and Torsten Scholak. 2023. Repofusion: Training code models to understand your repository. arXiv preprint arXiv:2306.10998 (2023)
[28] Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, and Saab Mansour. 2024. FineSurE: Fine-grained Summarization Evaluation using LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 906–922
[29] Jan Strich, Florian Schneider, Irina Nikishina, and Chris Biemann. 2024. On Improving Repository-Level Code QA for Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 209–244
[30] Qwen Team. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)
[31] Tree-sitter. 2025. Tree-sitter. https://tree-sitter.github.io/tree-sitter/
[32] Chaofan Wang, Tingrui Yu, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. 2025. EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation. arXiv preprint arXiv:2508.04295 (2025)
[33] Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 1955–1977
[34] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708
[35] Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. RLCoder: Reinforcement Learning for Repository-Level Code Completion. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 165–177
[36] Yanli Wang, Yanlin Wang, Suiquan Wang, Daya Guo, Jiachi Chen, John Grundy, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, et al. 2024. Repotransbench: A real-world benchmark for repository-level code translation. arXiv preprint arXiv:2412.17744 (2024)
[37] Xiaorui Xue, Jiansong Zhang, and Yunfeng Chen. 2024. Question-answering framework for building codes using fine-tuned and distilled pre-trained transformer models. Automation in Construction 168 (2024), 105730
[38] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652
[39] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net
[40] Tingrui Yu, Xiaodong Gu, and Beijun Shen. 2022. Code question answering via task-adaptive sequence-to-sequence pre-training. In 2022 29th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 229–238
[41] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, 2471–2484
[42] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13643–13658
[43] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024)
