pith. machine review for the scientific record.

arxiv: 2509.14635 · v2 · submitted 2025-09-18 · 💻 cs.CL · cs.PL · cs.SE

Recognition: unknown

SWE-QA: Can Language Models Answer Repository-level Code Questions?

Authors on Pith: no claims yet
classification 💻 cs.CL cs.PL cs.SE
keywords code · questions · swe-qa · repository-level · understanding · answers · reasoning · repositories
Original abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
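
The abstract's description of SWE-QA-Agent, in which LLM agents reason and act over a repository to find answers, maps naturally onto a ReAct-style tool loop. The sketch below is a minimal illustration under that assumption only: the `RepoTools` class, the SEARCH/READ/ANSWER action format, and `answer_repo_question` are invented names for illustration and are not the paper's actual interface, and the LLM call is left as a user-supplied callable.

```python
# Minimal sketch of a ReAct-style loop for repository-level QA.
# Assumption-heavy: tool names, the action format, and the prompt framing
# are illustrative, not taken from the SWE-QA paper.
from pathlib import Path
from typing import Callable


class RepoTools:
    """Two simple actions over a checked-out repository: text search and file reading."""

    def __init__(self, root: str):
        self.root = Path(root)

    def search(self, pattern: str, max_hits: int = 20) -> str:
        """Return up to max_hits lines in Python files containing the pattern."""
        hits = []
        for path in self.root.rglob("*.py"):
            try:
                for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                    if pattern in line:
                        hits.append(f"{path.relative_to(self.root)}:{lineno}: {line.strip()}")
                        if len(hits) >= max_hits:
                            return "\n".join(hits)
            except OSError:
                continue
        return "\n".join(hits) or "(no matches)"

    def read_file(self, rel_path: str, start: int = 1, end: int = 120) -> str:
        """Return a window of lines from one file, 1-indexed and inclusive of start."""
        lines = (self.root / rel_path).read_text(errors="ignore").splitlines()
        return "\n".join(lines[start - 1:end])


def answer_repo_question(question: str, repo_root: str,
                         llm: Callable[[str], str], max_steps: int = 8) -> str:
    """Let the model alternate between gathering repository context and answering."""
    tools = RepoTools(repo_root)
    transcript = f"Question about the repository: {question}\n"
    for _ in range(max_steps):
        # The model is expected to reply with one line:
        #   SEARCH <pattern>, READ <relative/path.py>, or ANSWER <final answer>.
        step = llm(transcript + "\nNext action (SEARCH/READ/ANSWER):").strip()
        if step.upper().startswith("ANSWER"):
            return step[len("ANSWER"):].strip()
        if step.upper().startswith("SEARCH"):
            observation = tools.search(step[len("SEARCH"):].strip())
        elif step.upper().startswith("READ"):
            observation = tools.read_file(step[len("READ"):].strip())
        else:
            observation = "(unrecognized action)"
        transcript += f"\n> {step}\n{observation}\n"
    # Fall back to answering from whatever context was gathered.
    return llm(transcript + "\nGive your best final answer:")
```

In the benchmark setting described by the abstract, `question` would be one of SWE-QA's 576 curated questions and the returned answer would be judged against the manually collected reference answer; the loop above only shows the general reason-and-act shape, not the paper's agent.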

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Context Gathering Decision Process: A POMDP Framework for Agentic Search

    cs.AI 2026-05 accept novelty 7.0

    Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...

  2. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  3. Neurosymbolic Repo-level Code Localization

    cs.SE 2026-04 unverdicted novelty 7.0

    LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

  4. AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    AOCI creates an incremental symbolic-semantic index per code unit that gives LLMs a complete, consistent repository view, outperforming baselines with zero defects on 19 industrial tasks while using far fewer tokens.

  5. Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry

    cs.CL 2026-04 unverdicted novelty 6.0

    The ChangAn benchmark demonstrates that existing AI detectors perform poorly at distinguishing LLM-generated classical Chinese poetry from human-written examples.