SWE-QA: Can Language Models Answer Repository-level Code Questions?
Pith reviewed 2026-05-18 16:38 UTC · model grok-4.3
The pith
SWE-QA supplies 576 repository-level code questions drawn from real GitHub issues to test language models on cross-file and multi-hop reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE-QA is a repository-level code QA benchmark containing 576 high-quality pairs that better capture real-world developer questions than prior snippet-based datasets. The pairs are derived from naturally occurring issues and organized by a two-level taxonomy that emphasizes navigation of architecture and long-range dependencies. The accompanying SWE-QA-Agent framework demonstrates that LLM agents can improve answer accuracy by reasoning and acting across repository files.
What carries the argument
SWE-QA benchmark and its two-level taxonomy of repository-level questions, which organizes evaluation around intention understanding, cross-file reasoning, and multi-hop dependency analysis extracted from GitHub issues.
If this is right
- Agent-based augmentation can measurably raise accuracy on repository-scale questions that require file navigation.
- Benchmark results expose persistent weaknesses in current models for multi-hop dependency tracing and architectural understanding.
- SWE-QA supplies a concrete testbed for developing automated tools that answer developer questions inside full codebases.
Where Pith is reading between the lines
- Tools built on this style of benchmark could reduce time developers spend searching across files for answers to their own questions.
- The taxonomy may transfer to other large code artifacts such as documentation sets or test suites if the same construction process is repeated.
- Extending the benchmark to additional repositories would test whether the observed performance patterns hold outside the original eleven projects.
Load-bearing premise
The manually chosen 576 questions accurately mirror the distribution and difficulty of questions that developers actually ask about repositories without selection bias.
What would settle it
A direct comparison showing that models ranked highly on SWE-QA still fail to resolve comparable questions when inserted into live, unmodified repositories or that the taxonomy categories miss major types of questions observed in additional issue corpora.
Figures
read the original abstract
Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-QA, a repository-level code QA benchmark consisting of 576 high-quality question-answer pairs derived from analysis of 77,100 GitHub issues across 11 repositories. It develops a two-level taxonomy of repository-level questions (covering intention understanding, cross-file reasoning, and multi-hop dependency analysis), manually curates and validates the pairs from seed questions, and presents SWE-QA-Agent, an agentic LLM framework evaluated on six advanced models under different context augmentation strategies.
Significance. If the manually curated benchmark proves representative of real developer questions, it would meaningfully extend beyond snippet-based resources like CoSQA and CodeQA by emphasizing repository-scale navigation and long-range dependencies, with the agent framework offering a practical prototype for automated QA. The scale of issue analysis and manual validation effort provides a solid foundation for future work on repository-level reasoning.
major comments (2)
- [Benchmark Construction] Benchmark Construction section: The process of selecting and curating the final 576 pairs from the initial 77,100 issues provides no inter-annotator agreement statistics, no explicit sampling protocol, and no quantitative comparison (e.g., chi-squared or frequency tables) of category distributions between the full issue corpus and the curated set. This directly affects the central claim that SWE-QA better captures real-world complexity, as unquantified selection bias could over- or under-represent multi-hop or cross-file questions relative to naturally occurring developer queries.
- [Evaluation] Evaluation section: While the abstract and prototype description reference LLM evaluations and highlight 'promise' for the SWE-QA-Agent, the manuscript supplies no concrete performance metrics, baseline comparisons, error breakdowns, or ablation results on the 576 pairs, leaving the empirical support for the framework's effectiveness difficult to assess.
minor comments (2)
- [Abstract] Abstract: The claim of 'high-quality' pairs is stated without reference to the specific validation criteria or agreement thresholds used during manual curation.
- [Taxonomy Development] The two-level taxonomy is introduced but its derivation from the issue analysis is described at a high level without example questions or decision rules for category assignment.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [Benchmark Construction] Benchmark Construction section: The process of selecting and curating the final 576 pairs from the initial 77,100 issues provides no inter-annotator agreement statistics, no explicit sampling protocol, and no quantitative comparison (e.g., chi-squared or frequency tables) of category distributions between the full issue corpus and the curated set. This directly affects the central claim that SWE-QA better captures real-world complexity, as unquantified selection bias could over- or under-represent multi-hop or cross-file questions relative to naturally occurring developer queries.
Authors: We agree that greater transparency in the curation process is needed to support claims of representativeness. In the revised manuscript, we will expand the Benchmark Construction section with: an explicit description of the sampling protocol from the 77,100 issues; inter-annotator agreement statistics (e.g., Cohen's kappa) from the manual validation; and quantitative comparisons including frequency tables and statistical tests (such as chi-squared) of category distributions between the full corpus and the final 576 pairs. These additions will directly address potential selection bias concerns. revision: yes
-
Referee: [Evaluation] Evaluation section: While the abstract and prototype description reference LLM evaluations and highlight 'promise' for the SWE-QA-Agent, the manuscript supplies no concrete performance metrics, baseline comparisons, error breakdowns, or ablation results on the 576 pairs, leaving the empirical support for the framework's effectiveness difficult to assess.
Authors: We acknowledge that the current Evaluation section would benefit from more explicit and detailed reporting to facilitate assessment. We will revise this section to include concrete performance metrics for the six models across context strategies, direct baseline comparisons, categorized error breakdowns on the 576 pairs, and ablation results for the SWE-QA-Agent components. This will strengthen the empirical presentation while building on the existing experimental findings. revision: yes
Circularity Check
No circularity: new benchmark constructed from external issues and evaluated on independent LLMs
full rationale
The paper constructs SWE-QA by crawling 77,100 GitHub issues from 11 repositories, deriving a two-level taxonomy from them, manually curating 576 QA pairs, and then evaluating six external LLMs plus the SWE-QA-Agent framework on the resulting benchmark. No mathematical derivations, fitted parameters renamed as predictions, self-citations bearing central claims, or uniqueness theorems imported from prior author work appear in the provided text. The contribution is an empirical dataset and prototype system whose validity rests on the curation process and external model evaluations rather than any internal reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 77,100 crawled GitHub issues from 11 repositories yield a representative sample of real developer questions suitable for taxonomy and question construction.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat induction and embed_strictMono_of_one_lt unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We first manually analyzed a random sample of 1,000 questions through an iterative open coding process to identify recurring patterns and developer intentions. This yielded a structured two-level taxonomy for repository-level QA... The first level categorizes questions based on their primary interrogative: What... Why... Where... and How...
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SWE-QA-Agent operates through an iterative ReAct-based workflow... actions: ReadFile, GetRepoStructure, SearchContent, Finish
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We evaluate six advanced LLMs... under various context augmentation strategies... rubric-guided evaluation system... correctness, completeness, relevance, clarity, and reasoning
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
-
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...
-
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
-
Neurosymbolic Repo-level Code Localization
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
-
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
-
AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs
AOCI creates an incremental symbolic-semantic index per code unit that gives LLMs a complete, consistent repository view, outperforming baselines with zero defects on 19 industrial tasks while using far fewer tokens.
-
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
The ChangAn benchmark demonstrates that existing AI detectors perform poorly at distinguishing LLM-generated classical Chinese poetry from human-written examples.
-
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
Reference graph
Works this paper leans on
-
[1]
Anonymous Authors. 2025. Replication Package of SWE-QA Benchmark and SWE-QA-Agent. https://anonymous. 4open.science/r/SWE-QA-D3EC
work page 2025
-
[2]
Anthropic. 2025. Claude 3.7 Sonnet Model. https://www.anthropic.com/news/claude-3-7-sonnet/
work page 2025
-
[3]
Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, and Shuiguang Deng
- [4]
- [5]
-
[6]
DeepSeek-AI. 2025. DeepSeek-V3 Model. https://api.deepseek.com
work page 2025
-
[7]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. InFindings of the Association for Computational Linguistics: EMNLP 2020. 1536–1547
work page 2020
-
[8]
Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao
-
[9]
In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering
Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13
-
[10]
Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo
- [11]
-
[12]
Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, and Cuiyun Gao. 2024. CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering.arXiv preprint arXiv:2412.14764(2024). , Vol. 1, No. 1, Article . Publication date: September 2025. 20 W. Peng, Y. Shi, Y. Wang, X. Zhang, X. Gu, and B. Shen
-
[13]
Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20, 000+ Web Queries for Code Search and Question Answering. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volum...
work page 2021
- [14]
-
[15]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. InICLR
work page 2024
- [16]
-
[17]
Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang
- [18]
-
[19]
Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024. Infibench: Evaluating the question-answering capabilities of code large language models.Advances in Neural Information Processing Systems37 (2024), 128668–128698
work page 2024
-
[20]
Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, and Wenge Rong. 2024. ProCQA: A Large-scale Community- based Programming Question Answering Dataset for Code Search. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 13057–13067
work page 2024
-
[21]
Chenxiao Liu and Xiaojun Wan. 2021. CodeQA: A Question Answering Dataset for Source Code Comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. Association for Computational Linguistics, 2618–2632
work page 2021
-
[22]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2511–2522
work page 2023
- [23]
-
[24]
Mistral AI. 2025. Devstral Model. https://mistral.ai/news/devstral
work page 2025
-
[25]
OpenAI. 2024. GPT-4o Model. https://openai.com/index/hello-gpt-4o/
work page 2024
-
[26]
OpenAI. 2025. GPT-5 Model. https://openai.com/index/introducing-gpt-5/
work page 2025
-
[27]
Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2024. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. InThe Thirteenth International Conference on Learning Representations
work page 2024
-
[28]
Qwen Team. 2024. Qwen2.5-Coder Model. https://qwenlm.github.io/blog/qwen2.5-coder/
work page 2024
-
[29]
Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, and Shirish K. Shevade
-
[30]
CodeQueries: A Dataset of Semantic Queries over Code. InProceedings of the 17th Innovations in Software Engineering Conference, ISEC 2024, Bangalore, India, February 22-24, 2024. ACM, 7:1–7:11
work page 2024
- [31]
- [32]
-
[33]
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, and Saab Mansour. 2024. FineSurE: Fine-grained Summarization Evaluation using LLMs. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 906–922
work page 2024
-
[34]
Jan Strich, Florian Schneider, Irina Nikishina, and Chris Biemann. 2024. On Improving Repository-Level Code QA for Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 209–244
work page 2024
-
[35]
Qwen Team. 2024. Qwen2 technical report.arXiv preprint arXiv:2407.10671(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Tree-sitter 2025. Tree-sitter. https://tree-sitter.github.io/tree-sitter/
work page 2025
- [37]
-
[38]
Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering.Proceedings of the ACM on Software Engineering2, ISSTA (2025), 1955–1977. , Vol. 1, No. 1, Article . Publication date: September 2025. SWE-QA: Can Language Models Answer Reposito...
work page 2025
-
[39]
Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder- Decoder Models for Code Understanding and Generation. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708
work page 2021
-
[40]
Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. RLCoder: Reinforcement Learning for Repository-Level Code Completion. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 165–177
work page 2024
- [41]
-
[42]
Xiaorui Xue, Jiansong Zhang, and Yunfeng Chen. 2024. Question-answering framework for building codes using fine-tuned and distilled pre-trained transformer models.Automation in Construction168 (2024), 105730
work page 2024
-
[43]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652
work page 2024
-
[44]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Syner- gizing Reasoning and Acting in Language Models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net
work page 2023
-
[45]
Tingrui Yu, Xiaodong Gu, and Beijun Shen. 2022. Code question answering via task-adaptive sequence-to-sequence pre-training. In2022 29th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 229–238
work page 2022
-
[46]
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen
-
[47]
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, 2471–2484
work page 2023
-
[48]
Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13643–13658
work page 2024
-
[49]
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence.arXiv preprint arXiv:2406.11931(2024). , Vol. 1, No. 1, Article . Publication date: September 2025
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.