
arXiv: 2509.14635 · v2 · submitted 2025-09-18 · 💻 cs.CL · cs.PL · cs.SE

SWE-QA: Can Language Models Answer Repository-level Code Questions?

Pith reviewed 2026-05-18 16:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.PL · cs.SE
keywords repository-level code QA · language model evaluation · software repository reasoning · GitHub issues · agentic frameworks · cross-file reasoning · benchmark construction

The pith

SWE-QA supplies 576 repository-level code questions drawn from real GitHub issues to test language models on cross-file and multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks limit evaluation to isolated code snippets that do not reflect how developers actually work across files and dependencies. The authors analyze 77,100 issues from eleven repositories to build a two-level taxonomy and then manually curate 576 validated question-answer pairs covering intention understanding, cross-file reasoning, and multi-hop dependency analysis. They introduce SWE-QA-Agent as an agentic framework that lets LLMs reason step by step and act to retrieve answers from the full repository. Experiments with six advanced models show measurable gains when the agent framework augments context, yet performance gaps remain on harder multi-hop cases.

Core claim

SWE-QA is a repository-level code QA benchmark containing 576 high-quality pairs that better capture real-world developer questions than prior snippet-based datasets. The pairs are derived from naturally occurring issues and organized by a two-level taxonomy that emphasizes navigation of architecture and long-range dependencies. The accompanying SWE-QA-Agent framework demonstrates that LLM agents can improve answer accuracy by reasoning and acting across repository files.
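The "reason and act" pattern described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' SWE-QA-Agent: the in-memory `REPO`, the two tools (`search`, `read_file`), and the `scripted_policy` that stands in for an LLM are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "repository": path -> file contents.
REPO: Dict[str, str] = {
    "pkg/config.py": "DEFAULT_TIMEOUT = 30\n",
    "pkg/client.py": (
        "from pkg.config import DEFAULT_TIMEOUT\n\n"
        "def connect():\n    return DEFAULT_TIMEOUT\n"
    ),
}

def search(term: str) -> str:
    """Hypothetical tool: list files whose text contains `term`."""
    hits = sorted(path for path, text in REPO.items() if term in text)
    return ", ".join(hits) or "no matches"

def read_file(path: str) -> str:
    """Hypothetical tool: return the contents of one file."""
    return REPO.get(path, "file not found")

TOOLS: Dict[str, Callable[[str], str]] = {"search": search, "read_file": read_file}

Step = Tuple[str, str, str]  # (thought, action, observation)

def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call: returns (thought, action, argument)."""
    if not history:
        return ("Find where the symbol appears.", "search", "DEFAULT_TIMEOUT")
    if len(history) == 1:
        return ("Open the config module.", "read_file", "pkg/config.py")
    return ("The definition is visible.", "finish",
            "DEFAULT_TIMEOUT is 30, defined in pkg/config.py")

def react_loop(question: str, policy, max_steps: int = 5) -> str:
    """Alternate reasoning and tool use until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # act, then record the observation
        history.append((thought, action, observation))
    return "no answer within step budget"

answer = react_loop("What timeout does connect() use?", scripted_policy)
print(answer)
```

The step budget (`max_steps`) is an assumption added here to keep the loop bounded; the question requires two files, which is the kind of cross-file lookup the benchmark targets.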

What carries the argument

SWE-QA benchmark and its two-level taxonomy of repository-level questions, which organizes evaluation around intention understanding, cross-file reasoning, and multi-hop dependency analysis extracted from GitHub issues.

If this is right

  • Agent-based augmentation can measurably raise accuracy on repository-scale questions that require file navigation.
  • Benchmark results expose persistent weaknesses in current models for multi-hop dependency tracing and architectural understanding.
  • SWE-QA supplies a concrete testbed for developing automated tools that answer developer questions inside full codebases.
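Weaknesses of the kind named in the second bullet would show up as a per-category breakdown. The sketch below assumes a binary correct/incorrect judgment per answer, which is a simplification and not necessarily how the paper scores answers; the `Result` record and the sample data are hypothetical, and only the three category names come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Result:
    category: str   # taxonomy category of the question
    correct: bool   # assumed binary judgment of the model's answer

def accuracy_by_category(results: List[Result]) -> Dict[str, float]:
    """Fraction of correct answers within each category."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        hits[r.category] += int(r.correct)
    return {category: hits[category] / totals[category] for category in totals}

# Invented results, for illustration only.
results = [
    Result("intention understanding", True),
    Result("intention understanding", True),
    Result("cross-file reasoning", True),
    Result("cross-file reasoning", False),
    Result("multi-hop dependency analysis", False),
    Result("multi-hop dependency analysis", False),
    Result("multi-hop dependency analysis", True),
]
scores = accuracy_by_category(results)
for category, score in scores.items():
    print(f"{category}: {score:.2f}")
```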

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tools built on this style of benchmark could reduce time developers spend searching across files for answers to their own questions.
  • The taxonomy may transfer to other large code artifacts such as documentation sets or test suites if the same construction process is repeated.
  • Extending the benchmark to additional repositories would test whether the observed performance patterns hold outside the original eleven projects.

Load-bearing premise

The 576 manually curated questions mirror the distribution and difficulty of the questions developers actually ask about repositories, and the curation introduced no selection bias.

What would settle it

A direct comparison showing that models ranked highly on SWE-QA still fail to resolve comparable questions when placed in live, unmodified repositories, or evidence that the taxonomy categories miss major types of questions observed in additional issue corpora.

Figures

Figures reproduced from arXiv: 2509.14635 by Beijun Shen, Weihan Peng, Xiaodong Gu, Xinyun Zhang, Yuhang Wang, Yuling Shi.

Figure 1
Figure 1: Workflow of Benchmark Construction. • We construct SWE-QA, a repository-level code QA benchmark comprising 576 high-quality question-answer pairs from 12 diverse open-source Python repositories to evaluate comprehensive repository understanding. • We propose SWE-QA-Agent, an intelligent ReAct-style agent designed to answer repository-level questions. The agent leverages a variety of tools to assist in rea…
Figure 2
Figure 2: Distribution of Collected Issues. Our data collection process began by crawling 77,100 GitHub issues from the 11 popular repositories used in SWE-bench [12] (excluding Django, which has issues disabled). To focus on substantive discussions, we filtered for issues with a body length of at least 1,000 characters, resulting in a dataset of 41,955 issues. Given that issues often contain extensive descript…
Figure 3
Figure 3: Distribution of Question Types. We first manually analyzed a random sample of 1,000 questions through an iterative open coding process to identify recurring patterns and developer intentions. This yielded a structured two-level taxonomy for repository-level QA, as summarized in …
Figure 4
Figure 4: Core Elements and their Relations Extracted
Figure 5
Figure 5: QA Generation and Reference Answer Example.
Figure 6
Figure 6: Overview of the SWE-QA-Agent. While CoReQA [3] operates at the repository level by collecting data from GitHub issues and comments, it emphasizes issue resolution rather than code module comprehension. Critically, none of these existing benchmarks require models to understand code modules as interconnected architectural components within a repository. SWE-QA addresses this fundamental gap by specifically …
Figure 7
Figure 7: Case Study on sqlfluff.
Original abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
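The Figure 2 excerpt states that crawled issues were filtered to those with a body of at least 1,000 characters. A minimal sketch of such a filter follows; the issue record layout (`repo`, `number`, `body`) and the sample data are hypothetical, and the paper's actual crawler and any further filtering steps are not shown here.

```python
from typing import Dict, List, Optional

MIN_BODY_CHARS = 1000  # threshold quoted in the Figure 2 excerpt

Issue = Dict[str, Optional[object]]

def filter_substantive(issues: List[Issue], min_chars: int = MIN_BODY_CHARS) -> List[Issue]:
    """Keep issues whose body has at least `min_chars` characters.

    A missing or None body is treated as an empty string.
    """
    return [issue for issue in issues if len(issue.get("body") or "") >= min_chars]

# Invented records, for illustration only.
issues = [
    {"repo": "example/repo", "number": 1, "body": "x" * 1200},
    {"repo": "example/repo", "number": 2, "body": "short"},
    {"repo": "example/repo", "number": 3, "body": None},
    {"repo": "example/repo", "number": 4, "body": "y" * 1000},
]
kept = filter_substantive(issues)
print([issue["number"] for issue in kept])
```

The comparison is inclusive (`>=`) to match the phrase "at least 1,000 characters".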

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SWE-QA, a repository-level code QA benchmark consisting of 576 high-quality question-answer pairs derived from analysis of 77,100 GitHub issues across 11 repositories. It develops a two-level taxonomy of repository-level questions (covering intention understanding, cross-file reasoning, and multi-hop dependency analysis), manually curates and validates the pairs from seed questions, and presents SWE-QA-Agent, an agentic LLM framework evaluated on six advanced models under different context augmentation strategies.

Significance. If the manually curated benchmark proves representative of real developer questions, it would meaningfully extend beyond snippet-based resources like CoSQA and CodeQA by emphasizing repository-scale navigation and long-range dependencies, with the agent framework offering a practical prototype for automated QA. The scale of issue analysis and manual validation effort provides a solid foundation for future work on repository-level reasoning.

major comments (2)
  1. [Benchmark Construction] Benchmark Construction section: The process of selecting and curating the final 576 pairs from the initial 77,100 issues provides no inter-annotator agreement statistics, no explicit sampling protocol, and no quantitative comparison (e.g., chi-squared or frequency tables) of category distributions between the full issue corpus and the curated set. This directly affects the central claim that SWE-QA better captures real-world complexity, as unquantified selection bias could over- or under-represent multi-hop or cross-file questions relative to naturally occurring developer queries.
  2. [Evaluation] Evaluation section: While the abstract and prototype description reference LLM evaluations and highlight 'promise' for the SWE-QA-Agent, the manuscript supplies no concrete performance metrics, baseline comparisons, error breakdowns, or ablation results on the 576 pairs, leaving the empirical support for the framework's effectiveness difficult to assess.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'high-quality' pairs is stated without reference to the specific validation criteria or agreement thresholds used during manual curation.
  2. [Taxonomy Development] The two-level taxonomy is introduced but its derivation from the issue analysis is described at a high level without example questions or decision rules for category assignment.
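The distribution comparison requested in major comment 1 can be sketched as a Pearson chi-squared goodness-of-fit statistic. This is an illustration only: all counts below are invented and none come from the paper, and the function uses plain Python rather than a statistics library.

```python
from typing import Sequence

def chi_squared_statistic(observed: Sequence[int], reference: Sequence[int]) -> float:
    """Pearson chi-squared statistic of `observed` against the proportions in `reference`.

    observed:  category counts in the curated set
    reference: category counts in the full corpus (defines expected proportions)
    """
    if len(observed) != len(reference) or not observed:
        raise ValueError("observed and reference must be non-empty and equal length")
    n = sum(observed)
    total = sum(reference)
    statistic = 0.0
    for obs, ref in zip(observed, reference):
        expected = n * ref / total  # expected count under the corpus distribution
        if expected == 0:
            raise ValueError("reference category with zero count")
        statistic += (obs - expected) ** 2 / expected
    return statistic

# Invented counts for three categories, for illustration only.
curated = [50, 30, 20]       # hypothetical curated-set counts
corpus = [400, 400, 200]     # hypothetical full-corpus counts
stat = chi_squared_statistic(curated, corpus)
print(f"chi-squared = {stat:.2f} with {len(curated) - 1} degrees of freedom")
```

With two degrees of freedom the 5% critical value is about 5.99, so a statistic below that would not indicate a significant shift between the two distributions.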

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: The process of selecting and curating the final 576 pairs from the initial 77,100 issues provides no inter-annotator agreement statistics, no explicit sampling protocol, and no quantitative comparison (e.g., chi-squared or frequency tables) of category distributions between the full issue corpus and the curated set. This directly affects the central claim that SWE-QA better captures real-world complexity, as unquantified selection bias could over- or under-represent multi-hop or cross-file questions relative to naturally occurring developer queries.

    Authors: We agree that greater transparency in the curation process is needed to support claims of representativeness. In the revised manuscript, we will expand the Benchmark Construction section with: an explicit description of the sampling protocol from the 77,100 issues; inter-annotator agreement statistics (e.g., Cohen's kappa) from the manual validation; and quantitative comparisons including frequency tables and statistical tests (such as chi-squared) of category distributions between the full corpus and the final 576 pairs. These additions will directly address potential selection bias concerns.
    revision: yes

  2. Referee: [Evaluation] Evaluation section: While the abstract and prototype description reference LLM evaluations and highlight 'promise' for the SWE-QA-Agent, the manuscript supplies no concrete performance metrics, baseline comparisons, error breakdowns, or ablation results on the 576 pairs, leaving the empirical support for the framework's effectiveness difficult to assess.

    Authors: We acknowledge that the current Evaluation section would benefit from more explicit and detailed reporting to facilitate assessment. We will revise this section to include concrete performance metrics for the six models across context strategies, direct baseline comparisons, categorized error breakdowns on the 576 pairs, and ablation results for the SWE-QA-Agent components. This will strengthen the empirical presentation while building on the existing experimental findings.
    revision: yes
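The inter-annotator agreement statistic named in the first response, Cohen's kappa, is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a plain-Python illustration; the two label sequences are invented and do not reflect any annotation data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[label] * counts_b[label]
              for label in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Invented keep/drop decisions for ten candidate questions, for illustration only.
annotator_1 = ["keep"] * 6 + ["drop"] * 4
annotator_2 = ["keep"] * 5 + ["drop"] * 4 + ["keep"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.52, giving a kappa of about 0.583.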

Circularity Check

0 steps flagged

No circularity: new benchmark constructed from external issues and evaluated on independent LLMs

Full rationale

The paper constructs SWE-QA by crawling 77,100 GitHub issues from 11 repositories, deriving a two-level taxonomy from them, manually curating 576 QA pairs, and then evaluating six external LLMs plus the SWE-QA-Agent framework on the resulting benchmark. No mathematical derivations, fitted parameters renamed as predictions, self-citations bearing central claims, or uniqueness theorems imported from prior author work appear in the provided text. The contribution is an empirical dataset and prototype system whose validity rests on the curation process and external model evaluations rather than any internal reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on domain assumptions about data representativeness rather than new mathematical axioms or invented entities; no free parameters are introduced.

axioms (1)
  • domain assumption: The 77,100 crawled GitHub issues from 11 repositories yield a representative sample of real developer questions suitable for taxonomy and question construction.
    Stated in the construction process described in the abstract.

pith-pipeline@v0.9.0 · 5815 in / 1235 out tokens · 41388 ms · 2026-05-18T16:38:05.182216+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag legend
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Context Gathering Decision Process: A POMDP Framework for Agentic Search

    cs.AI 2026-05 accept novelty 7.0

    Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...

  2. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  3. Neurosymbolic Repo-level Code Localization

    cs.SE 2026-04 unverdicted novelty 7.0

    LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

  4. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    cs.CL 2026-02 unverdicted novelty 7.0

    Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

  5. AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    AOCI creates an incremental symbolic-semantic index per code unit that gives LLMs a complete, consistent repository view, outperforming baselines with zero defects on 19 industrial tasks while using far fewer tokens.

  6. Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry

    cs.CL 2026-04 unverdicted novelty 6.0

    The ChangAn benchmark demonstrates that existing AI detectors perform poorly at distinguishing LLM-generated classical Chinese poetry from human-written examples.

  7. Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 7 Pith papers · 2 internal anchors

  1. [1]

    Anonymous Authors. 2025. Replication Package of SWE-QA Benchmark and SWE-QA-Agent. https://anonymous.4open.science/r/SWE-QA-D3EC

  2. [2]

    Anthropic. 2025. Claude 3.7 Sonnet Model. https://www.anthropic.com/news/claude-3-7-sonnet/

  3. [3]

    Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, and Shuiguang Deng

  4. [4]

    CoReQA: uncovering potentials of language models in code repository question answering. arXiv preprint arXiv:2501.03447 (2025)

  5. [5]

    Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, and Qianxiang Wang. 2025. SWE-Exp: Experience-Driven Software Issue Resolution. arXiv preprint arXiv:2507.23361 (2025)

  6. [6]

    DeepSeek-AI. 2025. DeepSeek-V3 Model. https://api.deepseek.com

  7. [7]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1536–1547

  8. [8]

    Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao

  9. [9]

    Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13

  10. [10]

    Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, and David Lo

  11. [11]

    From code to courtroom: LLMs as the new software judges. arXiv preprint arXiv:2503.02246 (2025)

  12. [12]

    Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, and Cuiyun Gao. 2024. CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering. arXiv preprint arXiv:2412.14764 (2024)

  13. [13]

    Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20,000+ Web Queries for Code Search and Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volum...

  14. [14]

    Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, et al. 2024. Opencoder: The open cookbook for top-tier code large language models. arXiv preprint arXiv:2411.04905 (2024)

  15. [15]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In ICLR

  16. [16]

    Changyoon Lee, Yeon Seonwoo, and Alice Oh. 2022. CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course. CoRR abs/2210.14494 (2022)

  17. [17]

    Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang

  18. [18]

    SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution. arXiv preprint arXiv:2507.23348 (2025)

  19. [19]

    Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024. Infibench: Evaluating the question-answering capabilities of code large language models. Advances in Neural Information Processing Systems 37 (2024), 128668–128698

  20. [20]

    Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, and Wenge Rong. 2024. ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 13057–13067

  21. [21]

    Chenxiao Liu and Xiaojun Wan. 2021. CodeQA: A Question Answering Dataset for Source Code Comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. Association for Computational Linguistics, 2618–2632

  22. [22]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2511–2522

  23. [23]

    Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2024. How to understand whole software repository. arXiv preprint arXiv:2406.01422 (2024)

  24. [24]

    Mistral AI. 2025. Devstral Model. https://mistral.ai/news/devstral

  25. [25]

    OpenAI. 2024. GPT-4o Model. https://openai.com/index/hello-gpt-4o/

  26. [26]

    OpenAI. 2025. GPT-5 Model. https://openai.com/index/introducing-gpt-5/

  27. [27]

    Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2024. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. In The Thirteenth International Conference on Learning Representations

  28. [28]

    Qwen Team. 2024. Qwen2.5-Coder Model. https://qwenlm.github.io/blog/qwen2.5-coder/

  29. [29]

    Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, and Shirish K. Shevade

  30. [30]

    CodeQueries: A Dataset of Semantic Queries over Code. In Proceedings of the 17th Innovations in Software Engineering Conference, ISEC 2024, Bangalore, India, February 22-24, 2024. ACM, 7:1–7:11

  31. [31]

    Yuling Shi, Songsong Wang, Chengcheng Wan, and Xiaodong Gu. 2024. From code to correctness: Closing the last mile of code generation with hierarchical debugging. arXiv preprint arXiv:2410.01215 (2024)

  32. [32]

    Disha Shrivastava, Denis Kocetkov, Harm De Vries, Dzmitry Bahdanau, and Torsten Scholak. 2023. Repofusion: Training code models to understand your repository. arXiv preprint arXiv:2306.10998 (2023)

  33. [33]

    Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, and Saab Mansour. 2024. FineSurE: Fine-grained Summarization Evaluation using LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 906–922

  34. [34]

    Jan Strich, Florian Schneider, Irina Nikishina, and Chris Biemann. 2024. On Improving Repository-Level Code QA for Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 209–244

  35. [35]

    Qwen Team. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)

  36. [36]

    Tree-sitter. 2025. Tree-sitter. https://tree-sitter.github.io/tree-sitter/

  37. [37]

    Chaofan Wang, Tingrui Yu, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. 2025. EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation. arXiv preprint arXiv:2508.04295 (2025)

  38. [38]

    Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs replace human evaluators? An empirical study of LLM-as-a-judge in software engineering. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 1955–1977

  39. [39]

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708

  40. [40]

    Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. RLCoder: Reinforcement Learning for Repository-Level Code Completion. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 165–177

  41. [41]

    Yanli Wang, Yanlin Wang, Suiquan Wang, Daya Guo, Jiachi Chen, John Grundy, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, et al. 2024. Repotransbench: A real-world benchmark for repository-level code translation. arXiv preprint arXiv:2412.17744 (2024)

  42. [42]

    Xiaorui Xue, Jiansong Zhang, and Yunfeng Chen. 2024. Question-answering framework for building codes using fine-tuned and distilled pre-trained transformer models. Automation in Construction 168 (2024), 105730

  43. [43]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652

  44. [44]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net

  45. [45]

    Tingrui Yu, Xiaodong Gu, and Beijun Shen. 2022. Code question answering via task-adaptive sequence-to-sequence pre-training. In 2022 29th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 229–238

  46. [46]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen

  47. [47]

    RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, 2471–2484

  48. [48]

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13643–13658

  49. [49]

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024)