Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Hanwen Du; JianYing Qu; Jun Zhang; Qiao Zhao; Yehua Yang; Zhongkai Sun

arxiv: 2605.29277 · v1 · pith:KEU3R6NWnew · submitted 2026-05-28 · 💻 cs.SE · cs.AI

Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun, Yehua Yang, Qiao Zhao

Pith reviewed 2026-06-29 06:57 UTC · model grok-4.3

keywords repository-level QA · code reasoning · documentation recall · benchmark synthesis · LLM evaluation · memorization effects · code comprehension

The pith

Code access improves repository QA performance much more than documentation does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Code-QA-Bench to create repository-level questions that separate genuine code reasoning from documentation recall or pretraining memorization. It builds tasks through an answer-first process in which a tool-using agent first explores source code to produce verified answers, then derives questions from those answers. Models are then evaluated in three settings: closed-book with no repository, code-only with documentation stripped out, and full documented access. Across 628 tasks drawn from ten Python repositories, code access produces a large performance increase while documentation adds only a small extra benefit on tasks that require it. This design directly measures how much models rely on code structure versus text recall.
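
How the code-only condition strips documentation is not described in this summary. As a minimal sketch of one plausible ingredient, the hypothetical helper `strip_docstrings` below removes docstrings from a Python file with the standard `ast` module (Python 3.9+ for `ast.unparse`). Comments disappear as a side effect of round-tripping through the AST. Removing separate documentation files (README, docs directories) would be an additional step that is not shown, and nothing here is claimed to be the authors' procedure.

```python
import ast

# Node types whose first statement may be a docstring.
_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def strip_docstrings(source: str) -> str:
    """Return `source` without docstrings (hypothetical helper, an assumption).

    Comments are also lost because the AST does not retain them.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, _DOC_OWNERS):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Drop the docstring; insert `pass` if the body would become empty.
            node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)
```

The output is behaviorally equivalent code with identifiers intact, so a model in this condition can still lean on descriptive names, which is one reason code-only is not the same as "no natural-language hints".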

Core claim

The framework produces 528 code-derivable tasks and 100 doc-dependent tasks (628 in total). Four frontier models tested under closed-book, code-only, and documented conditions show that code access accounts for a mean gain of +0.23 over closed-book, that documentation adds a further +0.071 on doc-dependent tasks, and that scores on code-derivable tasks are nearly identical between the code-only and documented conditions.

What carries the argument

The three-condition experimental design (closed-book, code-only, documented) that measures separate effects of code access and documentation by direct performance deltas.
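
To make the delta arithmetic concrete, here is a minimal sketch of how per-task scores under the three conditions could be turned into the two headline quantities. The `TaskScore` record, the `condition_deltas` function, and the toy numbers are all hypothetical; the paper's exact averaging (per model, per repository, or pooled) is not stated in this summary, so pooled per-task means are an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskScore:
    """One task's judge scores in [0, 1] under each condition (hypothetical)."""
    kind: str            # "code_derivable" or "doc_dependent"
    closed_book: float   # no repository access
    code_only: float     # repository with documentation removed
    documented: float    # full repository


def condition_deltas(rows):
    """Pooled mean deltas; assumes both task kinds are present in `rows`."""
    def doc_gain(kind):
        return mean(r.documented - r.code_only for r in rows if r.kind == kind)

    return {
        # Benefit of seeing code at all, relative to memorized knowledge.
        "code_gain": mean(r.code_only - r.closed_book for r in rows),
        # Extra benefit of documentation where it should matter.
        "doc_gain_doc_dependent": doc_gain("doc_dependent"),
        # Should be near zero if the code-derivable labels are sound.
        "doc_gain_code_derivable": doc_gain("code_derivable"),
    }


# Toy data, invented for illustration only.
toy = [
    TaskScore("code_derivable", 0.25, 0.50, 0.50),
    TaskScore("code_derivable", 0.50, 0.75, 0.75),
    TaskScore("doc_dependent", 0.25, 0.50, 0.75),
]
result = condition_deltas(toy)
```

Read this way, the closed-book score is the memorization baseline, the first delta is attributed to code access, and the second to documentation.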

If this is right

Code access is the dominant factor behind improved answers on repository tasks.
Documentation supplies only modest extra value beyond code on tasks that depend on it.
Performance with code alone nearly equals performance with full documentation when questions can be answered from code structure.
The synthesis method applies to any well-documented Python repository.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future benchmarks could adopt code-only conditions as the default to focus evaluation on reasoning rather than recall.
The modest documentation benefit suggests that repository QA systems might prioritize code indexing over full text retrieval.
Similar three-condition tests on other languages or task sources could check whether the pattern holds beyond the current Python repositories.

Load-bearing premise

The three conditions isolate documentation utility and memorization without interference from the agent used to generate answers or from the way tasks were selected from the repositories.

What would settle it

Re-running the evaluation and finding substantially higher scores in the documented condition than in the code-only condition on tasks labeled code-derivable.
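
One way to give "substantially higher" an operational meaning is a paired bootstrap over per-task score differences on the code-derivable subset. The sketch below is not the paper's analysis; `paired_bootstrap_ci`, its defaults, and the percentile interval are assumptions chosen for simplicity.

```python
import random


def paired_bootstrap_ci(code_only, documented, n_resamples=2000, alpha=0.05, seed=0):
    """Mean of (documented - code_only) with a percentile bootstrap interval.

    Both inputs are per-task scores for the same tasks in the same order.
    Hypothetical helper; returns (mean_diff, lower, upper).
    """
    if len(code_only) != len(documented) or not code_only:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [d - c for c, d in zip(code_only, documented)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    # Resample tasks with replacement and record each resample's mean difference.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

If the interval for code-derivable tasks sat clearly above zero, the code-only ≈ documented claim would be in trouble; an interval that straddles zero is consistent with it.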

Figures

Figures reproduced from arXiv: 2605.29277 by Hanwen Du, JianYing Qu, Jun Zhang, Qiao Zhao, Yehua Yang, Zhongkai Sun.

Original abstract

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
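
The abstract does not list the tools the generation agent has. Purely as an illustration of the kind of lookup that grounds a gold answer in code structure rather than in prose, the hypothetical `find_call_sites` helper below uses Python's standard `ast` module to report where a named function is called and from which enclosing function. It matches by name only and does not resolve imports or aliases, so it is a sketch, not a call-graph analysis.

```python
import ast


def find_call_sites(source: str, target: str):
    """List (enclosing_function, line_number) for calls to `target`.

    Hypothetical tool, assumed for illustration; name-based matching only.
    """
    sites = []

    def visit(node, scope):
        # Entering a function definition changes the enclosing scope name.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            scope = node.name
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id        # plain call: target(...)
            elif isinstance(func, ast.Attribute):
                name = func.attr      # attribute call: obj.target(...)
            else:
                name = None
            if name == target:
                sites.append((scope, node.lineno))
        for child in ast.iter_child_nodes(node):
            visit(child, scope)

    visit(ast.parse(source), "<module>")
    return sites


example = (
    "def helper(x):\n"
    "    return x\n"
    "\n"
    "def caller(y):\n"
    "    return helper(y) + 1\n"
    "\n"
    "z = helper(3)\n"
)
call_sites = find_call_sites(example, "helper")
```

An answer built from facts like these can be checked against the repository, which is the property the answer-first ordering is meant to secure before any question is written.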

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The answer-first pipeline plus three-condition deltas are the actual novelty, but the separation claim rests on unvalidated LLM judging and task classification.

The paper introduces an answer-first pipeline that has a tool-using agent produce verified answers from source code before questions are written, then runs models under closed-book, code-only, and full-repo conditions. That setup, plus the open-source release, is the concrete addition. They apply it to 10 SWE-Bench Python repos and report 628 tasks split into code-derivable and doc-dependent, with the expected pattern that code access gives the biggest lift (+0.23) while documentation adds only a small increment on the doc-dependent subset.

The design is straightforward and the numbers line up with intuition. Releasing the framework so others can run it on new repos is useful for anyone who needs diagnostic benchmarks rather than another generic QA set.

The soft spot is that the abstract gives no numbers on judge agreement, calibration against humans, or the exact protocol used to label tasks as code-derivable versus doc-dependent. If the same class of agent that generates the gold answers also influences the later evaluations, or if repository selection tilts the split, the reported deltas could partly reflect those choices instead of pure documentation utility. The paper would be stronger with an ablation or inter-rater check on the judge and a clear description of the classification step.

This is aimed at people who build or evaluate repository-level code agents and want a reusable way to measure what documentation actually contributes. The method is simple enough that a serious referee could check the missing validation details in one round. I would send it to review rather than desk-reject.

Referee Report

3 major / 0 minor

Summary. The manuscript presents Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks. It introduces an answer-first generation pipeline in which a tool-equipped agent explores source code to produce verified gold answers before questions are derived, and a three-condition experimental design (closed-book, code-only with documentation removed, and documented) that is intended to quantify documentation utility and memorization effects. The work generates 528 code-derivable and 100 doc-dependent tasks from 10 Python repositories in SWE-Bench, evaluates four frontier models using an LLM judge on accuracy/completeness/specificity, and reports that code access yields a +0.23 mean gain over closed-book while documentation adds only +0.071 on doc-dependent tasks, with code-only performance approximately matching documented performance on code-derivable tasks.
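
For readers who want to see what scoring on accuracy, completeness, and specificity might look like mechanically, the sketch below parses a judge reply and collapses it to one number. It assumes the judge returns JSON with an integer from 0 to 5 per dimension and that the three dimensions are averaged with equal weight; the function name `parse_judge_reply` and the equal weighting are assumptions, not details confirmed by this summary.

```python
import json

DIMENSIONS = ("accuracy", "completeness", "specificity")


def parse_judge_reply(reply: str, max_score: int = 5) -> float:
    """Convert a judge's JSON reply into a score in [0, 1].

    Hypothetical helper: the 0..max_score scale and the unweighted
    mean over dimensions are assumptions made for this sketch.
    """
    data = json.loads(reply)
    total = 0
    for dim in DIMENSIONS:
        value = data[dim]  # raises KeyError if the judge omitted a dimension
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{dim} is not numeric: {value!r}")
        if not 0 <= value <= max_score:
            raise ValueError(f"{dim} out of range: {value!r}")
        total += value
    return total / (max_score * len(DIMENSIONS))


reply = '{"accuracy": 5, "completeness": 4, "specificity": 3, "explanation": "ok"}'
score = parse_judge_reply(reply)
```

Rejecting malformed or out-of-range replies, rather than silently clamping them, keeps judge failures visible, which matters given that the referee's first comment turns on judge reliability.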

Significance. If the separation claim holds after addressing validation gaps, the benchmark supplies a reproducible, open-source method for isolating code reasoning from documentation recall that could improve evaluation practices for repository-level QA in software engineering. The quantitative deltas and the applicability to any well-documented Python repository constitute concrete, falsifiable contributions that other researchers could directly extend or refute.

major comments (3)

[Abstract] Abstract: the reported deltas (+0.23 code gain, +0.071 documentation benefit) and the claim that the three-condition design 'directly quantif[ies] documentation utility and memorization' rest on LLM-judge scores, yet the manuscript supplies no information on judge calibration, inter-rater agreement, or human validation of the judge; this absence is load-bearing for interpreting the numerical results as evidence of separation.
[Abstract] Abstract (answer-first pipeline description): the gold-answer generation step uses a tool-equipped agent whose capabilities relative to the four evaluated frontier models are not characterized; if the generation agent shares relevant strengths with the evaluated models, the resulting task distribution could artifactually inflate the observed code-only versus closed-book gap and undermine the central separation claim.
[Abstract] Abstract (task split): the classification of tasks into 528 code-derivable and 100 doc-dependent subsets is presented without a detailed selection protocol, inter-annotator agreement statistics, or ablation on classification criteria; without these, it is impossible to rule out that the reported code-only ≈ documented equivalence on code-derivable tasks arises from curation artifacts rather than the intended isolation of documentation utility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify key methodological aspects of Code-QA-Bench. The comments correctly identify areas where additional validation and documentation will strengthen the separation claims. We address each point below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] Abstract: the reported deltas (+0.23 code gain, +0.071 documentation benefit) and the claim that the three-condition design 'directly quantif[ies] documentation utility and memorization' rest on LLM-judge scores, yet the manuscript supplies no information on judge calibration, inter-rater agreement, or human validation of the judge; this absence is load-bearing for interpreting the numerical results as evidence of separation.

Authors: We agree that the absence of judge validation details limits interpretability of the deltas. In the revised manuscript we will add a new subsection (Section 4.3) that specifies the LLM judge prompt, reports calibration against human annotations on a 50-task random sample (including accuracy, completeness, and specificity scores with Cohen's kappa), and provides inter-rater agreement statistics between the judge and two human annotators. These additions will allow readers to assess the reliability of the reported +0.23 and +0.071 gains.

revision: yes

Referee: [Abstract] Abstract (answer-first pipeline description): the gold-answer generation step uses a tool-equipped agent whose capabilities relative to the four evaluated frontier models are not characterized; if the generation agent shares relevant strengths with the evaluated models, the resulting task distribution could artifactually inflate the observed code-only versus closed-book gap and undermine the central separation claim.

Authors: The generation agent uses the same base model as one of the evaluated models but is equipped with repository-specific tools (file search, AST parsing, execution) unavailable to the evaluated models during testing. While this design difference reduces direct comparability concerns, we acknowledge the need for explicit characterization. We will add an appendix comparing the agent's closed-book success rate on the generated tasks against the four evaluated models' closed-book performance to quantify any capability overlap.

revision: partial

Referee: [Abstract] Abstract (task split): the classification of tasks into 528 code-derivable and 100 doc-dependent subsets is presented without a detailed selection protocol, inter-annotator agreement statistics, or ablation on classification criteria; without these, it is impossible to rule out that the reported code-only ≈ documented equivalence on code-derivable tasks arises from curation artifacts rather than the intended isolation of documentation utility.

Authors: Task classification was performed by two authors who independently labeled whether each gold answer could be derived from code alone. We will expand the methods section with the full annotation guidelines, report inter-annotator agreement (Cohen's kappa on a 20% stratified sample), and include a sensitivity analysis that varies the classification criteria to test robustness of the code-only ≈ documented equivalence on the code-derivable subset.

revision: yes
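
Both the first and third responses lean on Cohen's kappa, so it is worth recalling what the statistic computes: observed agreement corrected for the agreement two raters would reach by chance given their own label frequencies. The sketch below is a plain-Python version for two raters and nominal labels; `cohens_kappa` and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined (0/0). Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration only.
rater_1 = ["code", "code", "doc", "doc"]
rater_2 = ["code", "code", "doc", "code"]
kappa = cohens_kappa(rater_1, rater_2)
```

With a split as uneven as 528 to 100, raw percent agreement can look high even for careless labeling, which is why a chance-corrected statistic is the right thing to ask for here.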

Circularity Check

0 steps flagged

No significant circularity; experimental outcomes are independent of inputs

The paper describes an empirical benchmark construction pipeline and reports direct experimental deltas from evaluating four frontier models under three conditions. No equations, fitted parameters, or predictions defined in terms of the inputs appear. No self-citations are invoked to justify uniqueness or load-bearing premises. Task classification and scoring are presented as procedural outcomes rather than quantities that reduce to the generation agent or SWE-Bench selection by construction. The derivation chain consists of observable measurements and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract mentions no free parameters, no additional axioms beyond standard use of LLMs and agents, and no new invented entities; the approach relies on existing SWE-Bench repositories and LLM judges.

pith-pipeline@v0.9.1-grok · 5741 in / 1181 out tokens · 28679 ms · 2026-06-29T06:57:49.021780+00:00 · methodology


[1] [1]

SWE-Bench+: Enhanced Coding Benchmark for LLMs.arXiv preprint arXiv:2410.06992, 2024

Aleithan, R., Kang, M.J., and Kamalloo, E. SWE-Bench+: Enhanced Coding Benchmark for LLMs.arXiv preprint arXiv:2410.06992, 2024

work page arXiv 2024

[2] [2]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program Synthesis with Large Language Models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Un- derstanding.arXiv preprint arXiv:2603.16124, 2025

Cai, S., Lyu, Z., Ni, Y., Chen, X., Zhou, B., Zhu, S., Lu, Y., Wang, H., Ruan, C., Schnei- der, B., Zhang, W., Li, X., Zheng, A., Zhang, Y., Nie, P., and Chen, W. SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Un- derstanding.arXiv preprint arXiv:2603.16124, 2025

work page arXiv 2025

[4] [4]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators.arXiv preprint arXiv:2508.09101, 2025

Chou, J., Liu, A., Deng, Y., Zeng, Z., Zhang, T., Zhu, H., Cai, J., Mao, Y., Zhang, C., Tan, L., Xu, Z., Zhai, B., Liu, H., Zhu, S., Zhou, W., and Lian, F. AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators.arXiv preprint arXiv:2508.09101, 2025

work page arXiv 2025

[6] [6]

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

Ding, Y., Wang, Z., Ahmad, W.U., Ramanathan, M.K., Nallapati, R., Bhatia, P., Roth, D., and Xiang, B. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. InNeurIPS, 2024

2024

[7] [7]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Gu, A., Rozière, B., Leather, H., Solar-Lezama, A., Synnaeve, G., and Wang, S.I. CRUX- Eval: A Benchmark for Code Reasoning, Understanding and Execution.arXiv preprint arXiv:2401.03065, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. CodeSearchNet Challenge: EvaluatingtheStateofSemanticCodeSearch.arXiv preprint arXiv:1909.09436, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[9] Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring Coding Challenge Competence with APPS. In NeurIPS, 2021

[10] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S.I., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974, 2024

[11] Jain, N., Shetty, M., Zhang, T., Han, K., Sen, K., and Stoica, I. R2E: Turning any Github Repository into a Programming Agent Environment. In ICML, 2024

[12] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In ICLR, 2024

[13] Lee, J., Seo, J., Ahn, J., and Seo, M. CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course. arXiv preprint arXiv:2210.14921, 2022

[14] Li, J., Qi, G., Li, Y., Dong, Y., and Guo, D. DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories. In ACL, 2024

[15] Li, B., Fang, T., Cui, Y., Jiang, Y., Wu, J., Gong, Y., Ding, L., Sun, J., and Tao, D. DevBench: A Comprehensive Benchmark for Software Development. arXiv preprint arXiv:2403.08604, 2024

[16] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-Level Code Generation with AlphaCode. Science, 378(6624):1092–1097, 2022

[17] Li, J., Su, Y., and Lyu, M.R. From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level. arXiv preprint arXiv:2601.03731, 2025

[18] Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Zhu, B., Gonzalez, J.E., and Stoica, I. From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv preprint arXiv:2406.11939, 2024

[19] Liu, J., Xia, C.S., Wang, Y., and Zhang, L. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In NeurIPS, 2024

[20] Liu, T., Fang, J., Wen, Y., and Xie, T. CodeMind: A Framework to Challenge Large Language Models for Code Reasoning. arXiv preprint arXiv:2402.09664, 2024

[21] Liu, J., Wan, C., Tao, C., Zhao, K., and Sun, C. CodeQA: A Question Answering Dataset for Source Code Comprehension. In Findings of EMNLP, 2021

[22] Liu, T., Xu, C., and McAuley, J. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In ICLR, 2024

[23] Pennington, N. Stimulus Structures and Mental Representations in Expert Comprehension of Computer Programs. Cognitive Psychology, 19(3):295–341, 1987

[24] Peng, W., Shi, Y., Wang, Y., Zhang, X., Shen, B., and Gu, X. SWE-QA: Can Language Models Answer Repository-level Code Questions? arXiv preprint arXiv:2509.14635, 2025

[25] Schankin, A., Berger, A., Holt, D.V., Hofmeister, J.C., Riedel, T., and Beigl, M. Descriptive Compound Identifier Names Improve Source Code Comprehension. In ICPC, 2018

[26] von Mayrhauser, A. and Vans, A.M. Program Comprehension During Software Maintenance and Evolution. IEEE Computer, 28(8):44–55, 1995

[27] Xia, C.S., Deng, Y., and Zhang, L. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515, 2024

[28] Xia, C.S., Deng, Y., Dunn, S., and Zhang, L. Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint arXiv:2407.01489, 2024

[29] Xie, Y., Liu, Z., Chen, Y., Li, J., et al. FeatureBench: Benchmarking Agentic Coding for Complex Feature Development. arXiv preprint, 2025

[30] Yang, J., Jimenez, C.E., Wettig, A., Liber, K., Yao, S., Narasimhan, K., and Press, O. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv preprint arXiv:2405.15793, 2024

[31] Li, J., Zhang, G., Li, Y., Dong, Y., Luo, L., Zhu, M., Guo, Y., and He, Q. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599, 2024

[32] Yan, W., Liu, H., Wang, Y., Li, Z., Zhao, Q., Wei, F., Liu, T., and Sui, Z. CodeScope: An Execution-Based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation. In ACL, 2024

[33] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023

A Example Tasks

We present four example tasks, one per category, drawn from different repositories.

Where / data_control_flow (sympy)

Question: Where does ipartfrac get call...

Internally it uses functools.reduce to compute total product of denominators, builds complement values, and passes them to migcdex which calls igcdex pairwise

It builds complement values (denom // x for each x) and passes them to migcdex

Gold answer (abridged): The ipartfrac function flows data through the following call chain: it is called in cos._eval_rewrite_as_sqrt() in trigonometric.py. Internally it uses functools.reduce to compute total product of denominators, builds complement values, and passes them to migcdex which call...

Finalizers are popped in LIFO order (using .pop())

Multiple exceptions are wrapped in a BaseExceptionGroup with exceptions reversed

Gold answer (abridged): FixtureDef.finish() runs all finalizers even if some fail. Finalizers are popped in LIFO order. If multiple exceptions occur, they are wrapped in a BaseExceptionGroup with exceptions[::-1]. After finalization, cached_result is set to None and _finalizers.clear...
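The behaviour described in this gold answer can be illustrated with a minimal sketch. It is not pytest's implementation: `run_finalizers` is a hypothetical helper, and as a simplifying assumption it returns the collected exceptions as a reversed list rather than raising them in a BaseExceptionGroup.

```python
from typing import Callable, List


def run_finalizers(finalizers: List[Callable[[], None]]) -> List[BaseException]:
    """Run every finalizer in LIFO order, even if some of them fail.

    Hypothetical helper for illustration only. Collected exceptions are
    returned in reversed order instead of being raised as a group.
    """
    errors: List[BaseException] = []
    while finalizers:
        fin = finalizers.pop()  # .pop() takes the most recently added finalizer
        try:
            fin()
        except BaseException as exc:  # keep going so remaining finalizers still run
            errors.append(exc)
    return errors[::-1]


calls: List[str] = []


def fail_b() -> None:
    calls.append("b")
    raise ValueError("b failed")


registered = [lambda: calls.append("a"), fail_b, lambda: calls.append("c")]
errors = run_finalizers(registered)
```

Here the finalizers run in the order c, b, a, the failure in `fail_b` does not stop `a` from running, and the `registered` list is empty afterwards.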

Defined in common.py with three @overload signatures and a single implementation accepting os.PathLike | str | T

Performs two transformations: os.fspath() for PathLike, and os.path.abspath(os.path.expanduser(path)) for local strings

Remote URIs detected via is_remote_uri() (regex-based in utils.py) are left unmodified

Gold answer (abridged): _normalize_path normalizes file paths throughout xarray’s backend I/O system. It has three @overload signatures: PathLike→str, str→str, and generic T→T. It converts PathLike objects via os.fspath(), expands local strings, and passes remote URIs through u...
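A rough sketch of the overload pattern this gold answer describes is given below. It is not xarray's code: `normalize_path_sketch` and `is_remote_uri_sketch` are hypothetical names, and the regular expression is an assumed stand-in for whatever pattern the real is_remote_uri() uses.

```python
import os
import pathlib
import re
from typing import TypeVar, overload

T = TypeVar("T")

# Assumed pattern for illustration; the real check is xarray's is_remote_uri().
_REMOTE_URI = re.compile(r"^[a-z][a-z0-9+.\-]*://", re.IGNORECASE)


def is_remote_uri_sketch(path: str) -> bool:
    """Return True for strings that look like scheme://... URIs."""
    return bool(_REMOTE_URI.match(path))


@overload
def normalize_path_sketch(path: "os.PathLike[str]") -> str: ...
@overload
def normalize_path_sketch(path: str) -> str: ...
@overload
def normalize_path_sketch(path: T) -> T: ...


def normalize_path_sketch(path):
    # Step 1: convert PathLike objects to plain strings.
    if isinstance(path, os.PathLike):
        path = os.fspath(path)
    # Step 2: expand and absolutize local strings; remote URIs pass through.
    if isinstance(path, str) and not is_remote_uri_sketch(path):
        path = os.path.abspath(os.path.expanduser(path))
    return path


local = normalize_path_sketch(pathlib.Path("data.nc"))
remote = normalize_path_sketch("s3://bucket/data.nc")
other = normalize_path_sketch(42)
```

The three overloads only inform type checkers; at runtime the single untyped implementation handles all inputs, which is why non-path objects such as `42` are returned unchanged.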

The management command calls _ogrinspect directly to collect lines into a list and append the mapping dictionary before joining

""Calculate Euclidean distance between two points. Args: point_a: A tuple (x, y). point_b: A tuple (x, y). Returns: The Euclidean distance

The command uses get_func_args(_ogrinspect) to dynamically filter CLI options to accepted parameters

Gold answer (abridged): The separation serves two purposes: (1) streaming vs. string output: _ogrinspect yields lines one at a time, allowing the management command to append additional output before joining; (2) dynamic argument filtering: the command uses ge...

Accuracy (0-5): Are the factual claims correct?

Completeness (0-5): Does the answer cover all rubric points?

Specificity (0-5): Does the answer reference specific files/functions?

Return JSON: {"accuracy": N, "completeness": N, "specificity": N, "explanation": "..."}

D Comparison with SWE-QA and SWE-QA-Pro

Table 9: Feature comparison of repository-level code QA benchmarks.

Feature | SWE-QA | SWE-QA-Pro | Code-QA-Bench
Repositories | 12 (popular) | 26 (long-tail) | 10 (popu...

Start by locating the code referenced in the documentation chunk

Read the primary file(s) and identify the key functions/classes

Go ONE LEVEL DEEPER: trace at least one callee, one parent class, or one related module to understand how the code connects to the broader system

Your answer must include facts you discovered from code that are NOT stated in the documentation chunk -- this is what makes the benchmark challenging

The user prompt provides the documentation chunk as a topic guide and specifies the target category:

The following documentation chunk from "{repo_name}" identifies the TOPIC for your answer. Use it to know W...

Use tools to find and read the source code

Go deeper: trace at least one callee, parent class, or import

Write an answer that includes specific code details NOT found in the documentation chunk above

When done exploring, respond with JSON (no tool calls): { "answer": "...", "key_files": ["relative/path.py", ...], "code_evidence": [ "specific fact verified by reading code (file + function)", ... ] }

Your answer MUST include at least 3 items in code_evidence. ...
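The response format requested by this prompt could be checked with something like the following sketch. The function name `validate_agent_response` and its error messages are hypothetical; only the three field names and the minimum of three code_evidence items come from the prompt text.

```python
import json
from typing import Any, Dict

REQUIRED_FIELDS = ("answer", "key_files", "code_evidence")


def validate_agent_response(raw: str, min_evidence: int = 3) -> Dict[str, Any]:
    """Parse the agent's final JSON reply and check the stated constraints."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["answer"], str) or not data["answer"].strip():
        raise ValueError("answer must be a non-empty string")
    for key in ("key_files", "code_evidence"):
        items = data[key]
        if not isinstance(items, list) or not all(isinstance(x, str) for x in items):
            raise ValueError(f"{key} must be a list of strings")
    if len(data["code_evidence"]) < min_evidence:
        raise ValueError(f"code_evidence needs at least {min_evidence} items")
    return data


# Placeholder payload, not taken from the benchmark.
example = json.dumps({
    "answer": "placeholder answer",
    "key_files": ["pkg/module.py"],
    "code_evidence": ["fact one", "fact two", "fact three"],
})
parsed = validate_agent_response(example)
```

A reply with fewer than three evidence items raises `ValueError`, which a generation pipeline might treat as a signal to retry or discard the task.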

mean that model-to-model differences in Δ_doc are not individually significant, but the consistent directionality across all four models provides strong evidence for the aggregate effect.

I Per-Repository Breakdown

Table 13 shows code-only scores by repository for all four models. Scores are consistent across repositories: within each model, the range s...