pith. machine review for the scientific record.

arxiv: 2604.24814 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.AI

Recognition: unknown

SWE-QA: A Dataset and Benchmark for Complex Code Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:20 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code comprehension, multi-hop reasoning, benchmark dataset, language model evaluation, Python repositories, software engineering, SWE-bench, question generation

The pith

SWE-QA benchmark shows language models reach at most 74.41 percent accuracy on multi-hop code questions drawn from real repositories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-QA, a collection of 9,072 multiple-choice questions built from 12 Python codebases in the SWE-bench collection. Questions target recurring reasoning patterns such as linking a declaration to its later calls or tracing interactions among several collaborating components spread across files. The construction process uses static parsing to identify entities and an LLM to draft questions, while validated distractors reduce the chance of surface-level shortcuts. Across 15 models ranging from 360 million to 671 billion parameters, the strongest result is 74.41 percent accuracy; dense models outperform mixture-of-experts models by 10 to 14 points, and reasoning-enhanced variants give inconsistent gains.

Core claim

The central claim is that existing code benchmarks are too simple because they examine isolated snippets, whereas real development requires repeatedly connecting facts across dispersed segments of a codebase; SWE-QA supplies a controlled set of questions that force exactly this multi-hop integration and thereby exposes a clear performance ceiling for current language models.

What carries the argument

The SWE-QA dataset, built by parsing-based entity extraction from SWE-bench repositories followed by LLM-assisted question generation and distractor validation. The pipeline produces two main question families: Declaration-and-Call items and Interacting-Entity items.
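A minimal sketch of what parsing-based entity extraction for Declaration-and-Call candidates might look like, using Python's standard `ast` module. The function name and two-pass structure are illustrative assumptions; the paper describes its pipeline only at the level of "parsing-based entity extraction".

```python
import ast
from collections import defaultdict

def extract_declarations_and_calls(source: str):
    """Map each function name to its declaration line and call-site lines.

    Illustrative sketch only, not the paper's actual pipeline: entities
    declared in one place and called in another are the raw material for
    Declaration-and-Call questions.
    """
    tree = ast.parse(source)
    declarations = {}            # name -> line number of the def
    calls = defaultdict(list)    # name -> line numbers where it is called
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            declarations[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls[node.func.id].append(node.lineno)
    # Keep only entities that are both declared and used somewhere else.
    return {name: (line, calls[name])
            for name, line in declarations.items() if calls[name]}

src = """
def helper(x):
    return x + 1

def main():
    return helper(41)
"""
print(extract_declarations_and_calls(src))  # → {'helper': (2, [6])}
```

In a real repository the same pairing would be done across files, which is exactly what makes the resulting questions multi-hop.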

If this is right

  • Dense model architectures appear better suited than mixture-of-experts designs for tasks that require tracking long-range code dependencies.
  • Standard chain-of-thought or reasoning enhancements do not reliably improve performance on this style of code question.
  • Current evaluation suites underestimate the difficulty developers face when information is distributed across files.
  • Training regimes that explicitly reward cross-file entity tracking could close the observed gap.
  • The 74.41 percent ceiling indicates that production code agents still need additional mechanisms to maintain consistent understanding of large repositories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A similar construction pipeline could be applied to other languages or to non-code domains where multi-hop factual linking is required.
  • If the distractors are sufficiently hard, SWE-QA could serve as a diagnostic tool for identifying which specific reasoning failures occur in large models.
  • The consistent dense-model advantage suggests that parameter sharing across all tokens may matter more than sparse routing when the task involves precise entity resolution.
  • Future work might measure whether fine-tuning on SWE-QA transfers to downstream software-engineering tasks such as bug localization or refactoring.

Load-bearing premise

The questions generated by parsing and LLM assistance actually require connecting information across multiple code locations rather than allowing answers from local patterns or distractor cues alone.

What would settle it

If the same models achieve comparable accuracy when the questions are answered with the original code context replaced by unrelated but syntactically similar code, the benchmark would fail to isolate multi-hop comprehension.
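A hedged sketch of that control experiment, assuming questions are stored as dicts with `context` and `answer` fields and that `answer_fn(question, context)` wraps a model; both the data layout and the wrapper are illustrative assumptions, not the paper's interface.

```python
import random

def context_ablation(questions, answer_fn, decoy_contexts, seed=0):
    """Accuracy with the true code context vs. an unrelated decoy context.

    Hypothetical harness: `decoy_contexts` holds syntactically similar but
    unrelated snippets. If decoy accuracy approaches true-context accuracy,
    the questions are answerable from surface cues rather than multi-hop
    comprehension.
    """
    rng = random.Random(seed)
    true_hits = decoy_hits = 0
    for q in questions:
        decoy = rng.choice(decoy_contexts)
        true_hits += answer_fn(q, q["context"]) == q["answer"]
        decoy_hits += answer_fn(q, decoy) == q["answer"]
    n = len(questions)
    return true_hits / n, decoy_hits / n
```

A large gap between the two accuracies would support the benchmark's claim to isolate multi-hop comprehension; a small gap would undermine it.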

Figures

Figures reproduced from arXiv:2604.24814 by Julien Perez (EPITA) and Laïla Elkoussy (LRE).

Figure 1. Multi-hop question sampled from SWE-QA requiring cross-file reasoning. view at source ↗
Figure 2. Distribution of SWE-QA questions across the 12 open-source GitHub repositories, each containing the source code of a popular, widely downloaded PyPI package. view at source ↗
Figure 3. Ratios of code chunks containing specific programming constructs (loops, conditions, functions, classes, imports, async operations, and exceptions). view at source ↗
Figure 5. Histogram of question difficulty, measured by the number of models that answered each question correctly; fewer correct responses indicate higher difficulty. view at source ↗
Figure 6. Model accuracy in the Oracle Question Answering setting as a function of parameter count. view at source ↗
read the original abstract

In this paper, we introduce SWE-QA, a text and code corpus aimed at benchmarking multi-hop code comprehension, addressing the gap between simplified evaluation tasks and the complex reasoning required in real-world software development. While existing code understanding benchmarks focus on isolated snippets, developers must routinely connect information across multiple dispersed code segments. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories of SWE-bench, evaluating several recurrent reasoning patterns like Declaration-and-Call questions that link entity definitions to their usage, and Interacting-Entity questions that examine the dynamic relationships among multiple collaborating components. Generated through parsing-based entity extraction and Large Language Model assisted question construction with carefully validated distractors, the benchmark distinguishes genuine comprehension from superficial pattern matching. Evaluation of 15 language models (360M to 671B parameters) reveals significant challenges in multi-hop reasoning, with best performance reaching 74.41% accuracy. Dense architectures consistently outperform mixture-of-experts models by 10-14 percentage points, while reasoning-enhanced variants show inconsistent benefits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SWE-QA, a dataset of 9,072 multiple-choice questions generated from 12 Python repositories in SWE-bench to benchmark multi-hop code comprehension. Questions target recurrent patterns such as Declaration-and-Call (linking definitions to usage) and Interacting-Entity (dynamic relationships among components), produced via parsing-based entity extraction and LLM-assisted construction with validated distractors. Evaluation of 15 language models (360M to 671B parameters) shows peak accuracy of 74.41%, with dense models outperforming mixture-of-experts architectures by 10-14 points and reasoning-enhanced variants showing inconsistent gains.

Significance. If the distractors and generation pipeline successfully enforce multi-hop reasoning over dispersed code segments rather than superficial cues, SWE-QA would address a clear gap in existing code benchmarks that rely on isolated snippets. The empirical results on architectural differences (dense vs. MoE) and model scale would offer actionable insights for software engineering applications requiring complex code understanding.

major comments (2)
  1. [Dataset construction and question generation (inferred from abstract and § on methodology)] The central claim that the questions test genuine multi-hop comprehension (rather than pattern matching or generation artifacts) rests on the distractor validation step, yet the manuscript provides no details on human validation volume, inter-annotator agreement, or adversarial checks such as model performance on isolated snippets versus full context. This directly undermines the assertion in the abstract that the benchmark 'distinguishes genuine comprehension from superficial pattern matching.'
  2. [Model evaluation and results] The reported 10-14 percentage point advantage of dense over MoE models is presented as a key finding, but the evaluation section lacks statistical significance tests, confidence intervals, or error analysis to establish that the gaps are robust rather than attributable to other variables such as training data or prompt sensitivity.
minor comments (2)
  1. [Abstract] The abstract refers to SWE-QA as a 'text and code corpus' while the content is a set of multiple-choice questions; clarify the exact composition and whether raw code repositories are also released.
  2. [Introduction and dataset description] The description of 'carefully validated distractors' is repeated without concrete metrics or examples; adding a small table of sample questions with distractors and validation notes would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing SWE-QA. We address each major comment point by point below and will revise the paper to incorporate additional details and analyses where this strengthens the work without misrepresenting our existing methodology or results.

read point-by-point responses
  1. Referee: [Dataset construction and question generation (inferred from abstract and § on methodology)] The central claim that the questions test genuine multi-hop comprehension (rather than pattern matching or generation artifacts) rests on the distractor validation step, yet the manuscript provides no details on human validation volume, inter-annotator agreement, or adversarial checks such as model performance on isolated snippets versus full context. This directly undermines the assertion in the abstract that the benchmark 'distinguishes genuine comprehension from superficial pattern matching.'

    Authors: We agree that expanding the description of distractor validation would better support the central claim. The manuscript outlines parsing-based entity extraction combined with LLM-assisted construction and states that distractors were carefully validated to avoid superficial cues. In revision, we will add specifics on the human validation process, including the number of questions reviewed, inter-annotator agreement metrics, and a new ablation experiment comparing model accuracy on full multi-hop contexts versus isolated code snippets. This directly addresses the concern about distinguishing genuine comprehension. revision: yes

  2. Referee: [Model evaluation and results] The reported 10-14 percentage point advantage of dense over MoE models is presented as a key finding, but the evaluation section lacks statistical significance tests, confidence intervals, or error analysis to establish that the gaps are robust rather than attributable to other variables such as training data or prompt sensitivity.

    Authors: We concur that statistical rigor and error analysis would make the architectural comparison more robust. The current results report raw accuracies across 15 models, highlighting the consistent dense-model advantage. In the revised version, we will include bootstrap-derived 95% confidence intervals for the accuracy differences, appropriate significance tests (such as McNemar's test for paired model comparisons), and a breakdown of error types by question pattern and model architecture to rule out confounds like prompt sensitivity. revision: yes
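The paired significance test this response proposes can be sketched generically. The exact-binomial form below is one standard McNemar variant operating on per-question correctness vectors; it is illustrative, not code from the paper.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact (binomial) McNemar test on paired per-question correctness.

    correct_a / correct_b are booleans for two models on the same benchmark
    items; only discordant pairs (one model right, the other wrong) carry
    information about a difference between the models.
    """
    b = sum(x and not y for x, y in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(y and not x for x, y in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Two-sided exact p-value under Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

On 9,072 questions even a few percentage points of accuracy difference would yield many discordant pairs, so the test has power to separate the dense and MoE families if the gap is real.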

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and benchmark

full rationale

The paper introduces SWE-QA through parsing-based entity extraction and LLM-assisted question generation followed by direct model evaluation on accuracy metrics. No equations, derivations, fitted parameters, or self-referential predictions appear in the described pipeline or results. All claims reduce to measured performance on the constructed questions rather than any input being redefined as output by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that automated parsing and LLM question generation with validated distractors produce valid tests of multi-hop comprehension; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Parsing-based entity extraction accurately identifies declarations, calls, and interacting components in Python code from the selected repositories.
    This underpins the creation of Declaration-and-Call and Interacting-Entity question types.

pith-pipeline@v0.9.0 · 5488 in / 1254 out tokens · 115135 ms · 2026-05-08T03:20:23.708783+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    The evaluated models span from small to large scales and include two SmolLM2 variants, Llama-3.3-70B-Instruct, and DeepSeek-R1

    Experimental Setup. We evaluate a collection of language models on SWE-QA to assess their multi-hop code comprehension capabilities and the pertinence of our corpus. The evaluated models span from small to large scales and include two SmolLM2 variants, Llama-3.3-70B-Instruct, and DeepSeek-R1. To isolate the effect of reasoning on multi-hop question answering...

  2. [2]

    sweet spot,

    Experiments. All results reported below reflect corrected ground truth labels after applying our validation procedure described in Section 4. [Figure 5: Histogram showing question difficulty, measured by the number of models that ans...]

  3. [3]

    Architectural Insights MoE vs

    Discussion 6.1. Architectural Insights. MoE vs. Dense Models. In our experiments, the two MoE model families tested underperform their dense counterparts, potentially pointing to structural limitations in multi-hop reasoning, though caution is warranted given the small sample. DeepSeek-R1 achieves only 60.98%, similar to gpt-oss-20b and below dense co...

  4. [4]

    SWE-QA captures core challenges of large-scale code understanding and mirrors the cognitive demands faced by developers navigating extensive codebases

    Conclusion. We introduced SWE-QA, a benchmark for multi-hop code comprehension that evaluates language models on complex reasoning tasks across real software repositories. SWE-QA captures core challenges of large-scale code understanding and mirrors the cognitive demands faced by developers navigating extensive codebases. The evaluation of fifteen models s...

  5. [5]

    Limitations. Dataset Construction. SWE-QA is restricted to Python repositories from SWE-bench and to 2–3 hop reasoning chains, limiting generalization to other programming languages and to deeper multi-hop scenarios. Complex code patterns such as asynchronous control flow, metaprogramming, and cross-module dynamic dispatch are largely absent, as they fall ...

  6. [6]

    All code processing and analysis respect the original licenses and usage terms of the source repositories

    Ethical Considerations. This work uses publicly available code repositories and focuses on advancing code comprehension capabilities. All code processing and analysis respect the original licenses and usage terms of the source repositories

  7. [7]

    Bibliographical References. Jacob Austin et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732. Mark Chen et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, and Xingyao ...

  8. [8]

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    Codexembed: A generalist embedding model family for multilingual and multi-task code retrieval. Shuai Lu et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664. Wei Ma, Shangqing Liu, Zhihao Lin, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, Li Li, and Yang Liu. 2023. Lms: Under...

  9. [9]

    A Python code snippet

  10. [10]

    A question about that code

  11. [11]

    A detailed answer. Your task is to sanitize the answer: - Remove fluff and redundancy - Keep only what directly answers the question - Make it short, clear, and direct - Do not repeat the question - Do not rephrase the code. Input Code: {code} Question: {question} Original Answer: {answer} Sanitized Answer: B.6. Distractor Generation for MCQs. Type: Single us...

  12. [12]

    Be contextually relevant to the code and question

  13. [13]

    Represent a different level of Bloom’s Taxonomy (e.g., Understanding, Applying, Analyzing)

  14. [14]

    Be plausible -- choices a well-meaning but mistaken student might select

  15. [15]

    Be similar in structure or terminology to the correct answer

  16. [16]

    option":

    Avoid being trivially or obviously incorrect. Return ONLY the distractors as a valid Python list of dictionaries: [ { "option": "Distractor text here", "bloom_level": "Bloom's taxonomy level (e.g., Understanding, Applying, Analyzing)" }, ... ] B.7. Correct Answer Adaptation to Match Distractor Style. Type: Single user prompt. Purpose: Rephrase the correct answ...

  17. [17]

    The answer to the question is A

    User: "The answer to the question is A." Assistant: "A"

  18. [18]

    B." Assistant:

    User: "B." Assistant: "B"

  19. [19]

    C" Assistant:

    User: "C" Assistant: "C" Followed by the actual response to extract: {conclusion} Note:The conclusion is extracted from the benchmark model output, with optional removal of <think>...</think>tags if reasoning extraction is enabled. B.10. Placeholders Explained {entity_name}The name of the specific entity being analyzed {entity_A},{entity_B}Names of two in...