StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.
Y., Liu, C., Gao, H., Thongtanunam, P., and Treude, C.CodeReviewQA: The code review comprehension assessment for large language models
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SE 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.