
arxiv: 2605.29277 · v1 · pith:KEU3R6NWnew · submitted 2026-05-28 · 💻 cs.SE · cs.AI

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Pith reviewed 2026-06-29 06:57 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords repository-level QA · code reasoning · documentation recall · benchmark synthesis · LLM evaluation · memorization effects · code comprehension

The pith

Code access improves repository QA performance much more than documentation does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Code-QA-Bench to create repository-level questions that separate genuine code reasoning from documentation recall or pretraining memorization. It builds tasks through an answer-first process in which a tool-using agent first explores source code to produce verified answers, then derives questions from those answers. Models are then evaluated in three settings: closed-book with no repository, code-only with documentation stripped out, and full documented access. Across 628 tasks drawn from ten Python repositories, code access produces a large performance increase while documentation adds only a small extra benefit on tasks that require it. This design directly measures how much models rely on code structure versus text recall.
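One way to picture the code-only condition is a pass that removes docstrings and comments from each Python file. The summary above does not say how documentation is stripped, so the following is only a sketch under assumptions: the helper name `strip_docs` is hypothetical, and it relies on the standard-library `ast` module (`ast.unparse` needs Python 3.9 or later), which drops comments when parsing.

```python
import ast


def strip_docs(source: str) -> str:
    """Return `source` with docstrings and comments removed (illustrative only)."""
    tree = ast.parse(source)  # comments are not kept in the AST
    doc_owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, doc_owners):
            continue
        body = node.body
        first = body[0] if body else None
        is_docstring = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        if is_docstring:
            # Keep the block valid if the docstring was its only statement.
            node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)


example = '''
def area(w, h):
    """Return the area of a rectangle."""
    # multiply the sides
    return w * h
'''
print(strip_docs(example))
```

A real code-only setup would presumably also remove standalone documentation files such as READMEs; that step is outside this sketch.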

Core claim

The framework produces 528 code-derivable tasks and 100 doc-dependent tasks. Frontier models tested under closed-book, code-only, and documented conditions show that code access accounts for a mean gain of 0.23 over closed-book, documentation supplies an additional 0.071 gain only on doc-dependent tasks, and scores on code-derivable tasks are nearly identical between code-only and documented conditions.
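The deltas in the core claim are differences between conditions. A minimal sketch follows, assuming per-task scores on a 0 to 1 scale and assuming each gain is a plain mean of per-task differences (the paper's exact aggregation across models is not stated here). The `scores` values and field names are invented for illustration and are not the paper's data.

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1]; NOT taken from the paper.
scores = [
    {"kind": "code-derivable", "closed_book": 0.40, "code_only": 0.70, "documented": 0.70},
    {"kind": "code-derivable", "closed_book": 0.50, "code_only": 0.60, "documented": 0.65},
    {"kind": "doc-dependent", "closed_book": 0.30, "code_only": 0.50, "documented": 0.60},
    {"kind": "doc-dependent", "closed_book": 0.20, "code_only": 0.50, "documented": 0.55},
]


def mean_delta(rows, after, before):
    """Mean per-task difference between two conditions."""
    return mean(r[after] - r[before] for r in rows)


def summarize(rows):
    """Group tasks by kind and compute the two deltas for each group."""
    by_kind = {}
    for r in rows:
        by_kind.setdefault(r["kind"], []).append(r)
    return {
        kind: {
            # Gain from seeing the code at all.
            "delta_code": mean_delta(group, "code_only", "closed_book"),
            # Extra gain from documentation on top of code.
            "delta_doc": mean_delta(group, "documented", "code_only"),
        }
        for kind, group in by_kind.items()
    }


for kind, deltas in summarize(scores).items():
    print(kind, {name: round(value, 3) for name, value in deltas.items()})
```

Under this reading, a `delta_doc` near zero on code-derivable tasks corresponds to the "nearly identical" result reported above.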

What carries the argument

The argument rests on the three-condition experimental design (closed-book, code-only, documented), which measures the separate effects of code access and documentation as direct performance deltas.

If this is right

  • Code access is the dominant factor behind improved answers on repository tasks.
  • Documentation supplies only modest extra value beyond code on tasks that depend on it.
  • Performance with code alone nearly equals performance with full documentation when questions can be answered from code structure.
  • The synthesis method applies to any well-documented Python repository.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future benchmarks could adopt code-only conditions as the default to focus evaluation on reasoning rather than recall.
  • The modest documentation benefit suggests that repository QA systems might prioritize code indexing over full text retrieval.
  • Similar three-condition tests on other languages or task sources could check whether the pattern holds beyond the current Python repositories.

Load-bearing premise

The three conditions isolate documentation utility and memorization without interference from the agent used to generate answers or from the way tasks were selected from the repositories.

What would settle it

The claim would be undermined if a re-run of the evaluation found substantially higher scores in the documented condition than in the code-only condition on tasks labeled code-derivable.
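A re-run of this kind needs some rule for what counts as "substantially higher". One possible rule, not taken from the paper, is a percentile bootstrap interval on the paired per-task difference (documented minus code-only). The function name and the `diffs` values are hypothetical.

```python
import random
from statistics import mean


def paired_bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    cut = int(n_resamples * alpha / 2)
    return means[cut], means[n_resamples - cut - 1]


# Hypothetical (documented - code_only) differences on code-derivable tasks.
diffs = [0.00, 0.05, -0.05, 0.00, 0.10, -0.10, 0.00, 0.05, -0.05, 0.00]
low, high = paired_bootstrap_ci(diffs)
print(f"mean diff = {mean(diffs):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
# An interval lying clearly above zero would count against the claim.
```

Pairing by task matters here because the same questions are answered under both conditions, so per-task differences remove between-task variance.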

Figures

Figures reproduced from arXiv: 2605.29277 by Hanwen Du, JianYing Qu, Jun Zhang, Qiao Zhao, Yehua Yang, Zhongkai Sun.

Figure 1: The answer-first task generation pipeline. Documentation chunks are extracted and …

Original abstract

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
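The abstract says an LLM judge scores each answer on accuracy, completeness, and specificity. The sketch below shows how such a reply might be validated and combined. It assumes the judge returns JSON with three integer scores from 0 to 5. The equal weighting and the rescaling to 0 to 1 are assumptions, not the paper's stated method, and `parse_judge_reply` is a hypothetical helper.

```python
import json

DIMENSIONS = ("accuracy", "completeness", "specificity")


def parse_judge_reply(reply: str) -> dict:
    """Validate a judge reply and add a combined score in [0, 1]."""
    data = json.loads(reply)
    scores = {}
    for name in DIMENSIONS:
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 0 to 5, got {value!r}")
        scores[name] = value
    # Assumption: equal weighting, rescaled from 0-5 to 0-1.
    scores["combined"] = sum(scores[d] for d in DIMENSIONS) / (5 * len(DIMENSIONS))
    return scores


reply = '{"accuracy": 4, "completeness": 3, "specificity": 5, "explanation": "..."}'
print(parse_judge_reply(reply))
```

Rejecting malformed replies instead of coercing them keeps judge failures visible rather than folding them into the scores.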

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript presents Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks. It introduces an answer-first generation pipeline in which a tool-equipped agent explores source code to produce verified gold answers before questions are derived, and a three-condition experimental design (closed-book, code-only with documentation removed, and documented) that is intended to quantify documentation utility and memorization effects. The work generates 528 code-derivable and 100 doc-dependent tasks from 10 Python repositories in SWE-Bench, evaluates four frontier models using an LLM judge on accuracy/completeness/specificity, and reports that code access yields a +0.23 mean gain over closed-book while documentation adds only +0.071 on doc-dependent tasks, with code-only performance approximately matching documented performance on code-derivable tasks.

Significance. If the separation claim holds after addressing validation gaps, the benchmark supplies a reproducible, open-source method for isolating code reasoning from documentation recall that could improve evaluation practices for repository-level QA in software engineering. The quantitative deltas and the applicability to any well-documented Python repository constitute concrete, falsifiable contributions that other researchers could directly extend or refute.

Major comments (3)
  1. [Abstract] Abstract: the reported deltas (+0.23 code gain, +0.071 documentation benefit) and the claim that the three-condition design 'directly quantif[ies] documentation utility and memorization' rest on LLM-judge scores, yet the manuscript supplies no information on judge calibration, inter-rater agreement, or human validation of the judge; this absence is load-bearing for interpreting the numerical results as evidence of separation.
  2. [Abstract] Abstract (answer-first pipeline description): the gold-answer generation step uses a tool-equipped agent whose capabilities relative to the four evaluated frontier models are not characterized; if the generation agent shares relevant strengths with the evaluated models, the resulting task distribution could artifactually inflate the observed code-only versus closed-book gap and undermine the central separation claim.
  3. [Abstract] Abstract (task split): the classification of tasks into 528 code-derivable and 100 doc-dependent subsets is presented without a detailed selection protocol, inter-annotator agreement statistics, or ablation on classification criteria; without these, it is impossible to rule out that the reported code-only ≈ documented equivalence on code-derivable tasks arises from curation artifacts rather than the intended isolation of documentation utility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify key methodological aspects of Code-QA-Bench. The comments correctly identify areas where additional validation and documentation will strengthen the separation claims. We address each point below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported deltas (+0.23 code gain, +0.071 documentation benefit) and the claim that the three-condition design 'directly quantif[ies] documentation utility and memorization' rest on LLM-judge scores, yet the manuscript supplies no information on judge calibration, inter-rater agreement, or human validation of the judge; this absence is load-bearing for interpreting the numerical results as evidence of separation.

    Authors: We agree that the absence of judge validation details limits interpretability of the deltas. In the revised manuscript we will add a new subsection (Section 4.3) that specifies the LLM judge prompt, reports calibration against human annotations on a 50-task random sample (including accuracy, completeness, and specificity scores with Cohen's kappa), and provides inter-rater agreement statistics between the judge and two human annotators. These additions will allow readers to assess the reliability of the reported +0.23 and +0.071 gains.

    Revision: yes

  2. Referee: [Abstract] Abstract (answer-first pipeline description): the gold-answer generation step uses a tool-equipped agent whose capabilities relative to the four evaluated frontier models are not characterized; if the generation agent shares relevant strengths with the evaluated models, the resulting task distribution could artifactually inflate the observed code-only versus closed-book gap and undermine the central separation claim.

    Authors: The generation agent uses the same base model as one of the evaluated models but is equipped with repository-specific tools (file search, AST parsing, execution) unavailable to the evaluated models during testing. While this design difference reduces direct comparability concerns, we acknowledge the need for explicit characterization. We will add an appendix comparing the agent's closed-book success rate on the generated tasks against the four evaluated models' closed-book performance to quantify any capability overlap.

    Revision: partial

  3. Referee: [Abstract] Abstract (task split): the classification of tasks into 528 code-derivable and 100 doc-dependent subsets is presented without a detailed selection protocol, inter-annotator agreement statistics, or ablation on classification criteria; without these, it is impossible to rule out that the reported code-only ≈ documented equivalence on code-derivable tasks arises from curation artifacts rather than the intended isolation of documentation utility.

    Authors: Task classification was performed by two authors who independently labeled whether each gold answer could be derived from code alone. We will expand the methods section with the full annotation guidelines, report inter-annotator agreement (Cohen's kappa on a 20% stratified sample), and include a sensitivity analysis that varies the classification criteria to test robustness of the code-only ≈ documented equivalence on the code-derivable subset.

    Revision: yes
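The responses above lean on two statistics: Cohen's kappa and a 20% stratified sample. A minimal sketch of both follows. It assumes stratification by repository, which the rebuttal does not specify, and uses unweighted kappa on categorical labels. The task records and field names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (observed - expected) / (1 - expected)


def stratified_sample(items, key, fraction=0.2, seed=0):
    """Draw `fraction` of the items from each stratum (at least one per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for name in sorted(strata):  # sorted for a reproducible order
        group = strata[name]
        k = max(1, math.ceil(fraction * len(group)))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical records: two annotators mark each task as code-derivable or not.
tasks = [
    {"id": i, "repo": f"repo_{i % 2}", "ann_1": i % 3 != 0, "ann_2": i % 4 != 0}
    for i in range(20)
]
subset = stratified_sample(tasks, key=lambda t: t["repo"])
kappa = cohens_kappa([t["ann_1"] for t in subset], [t["ann_2"] for t in subset])
print(len(subset), round(kappa, 3))
```

The same `cohens_kappa` function could be applied to judge-versus-human scores, although for ordinal 0 to 5 ratings a weighted kappa is often preferred because it gives partial credit for near misses.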

Circularity Check

0 steps flagged

No significant circularity; experimental outcomes are independent of inputs

Full rationale

The paper describes an empirical benchmark construction pipeline and reports direct experimental deltas from evaluating four frontier models under three conditions. No equations, fitted parameters, or predictions defined in terms of the inputs appear. No self-citations are invoked to justify uniqueness or load-bearing premises. Task classification and scoring are presented as procedural outcomes rather than quantities that reduce to the generation agent or SWE-Bench selection by construction. The derivation chain consists of observable measurements and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract mentions no free parameters, no axioms beyond the standard use of LLMs and agents, and no newly invented entities; the approach relies on existing SWE-Bench repositories and LLM judges.


