arxiv: 2605.04845 · v1 · submitted 2026-05-06 · 💻 cs.SE

Recognition: unknown

Agentic Repository Mining: A Multi-Task Evaluation

Johannes H\"artel

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:01 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM agentsrepository miningsoftware classificationcontext robustnessmulti-task evaluationbash command explorationartifact labeling

0 comments

The pith

LLM agents that explore repositories via bash commands match the accuracy of context-provided LLMs across four classification tasks while avoiding context limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM agents that issue their own bash commands to navigate software repositories can classify artifacts such as commits, reviews, code lines, and whole repos at the same quality as standard LLMs that receive pre-selected context. In experiments covering four tasks, eight configurations, and 4943 individual classifications, the agents reach competitive accuracy levels even though they must gather context themselves. The main reported benefit is robustness: agents never overflow fixed context windows and their performance does not degrade with larger artifacts. A follow-up manual review of 100 disagreements shows that some ground-truth labels were created under limited context or contain specification ambiguities, implying that measured accuracy may understate the agents' actual capability when given broader access.

Core claim

Across four tasks, eight approach configurations, and 4943 classifications, agents achieve competitive accuracy despite retrieving their own context. The primary advantage is robustness: agents avoid context-window overflows and scale independently of artifact size. A manual diagnosis of 100 cases where approaches disagree with the ground truth reveals specification ambiguities and labels produced under limited context, suggesting that accuracy against such ground truth may underestimate approaches with broader context access.

What carries the argument

LLM agents that dynamically explore repositories through standard bash commands to retrieve context on demand for classification decisions.

If this is right

Agents can be applied to repositories whose size would exceed fixed context windows of conventional LLM prompts.
Manual context engineering for each new artifact becomes unnecessary because the agent selects what to read.
The same agent loop can be reused across different repository-mining tasks without task-specific prompt redesign.
Classification quality may improve further once ground-truth labels are updated to reflect full-context human judgments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to iterative tasks such as bug localization or patch generation where the agent must decide what to inspect next.
In practice, combining agent exploration with occasional human oversight on ambiguous cases might raise overall label quality while still reducing total human effort.
The robustness property suggests the method would remain stable when applied to very large monorepos or long commit histories that currently break context-based baselines.

Load-bearing premise

The provided ground-truth labels are reliable and complete enough to serve as a fair measure of classification quality even when those labels were created with limited context or ambiguous specifications.

What would settle it

Re-label a random sample of at least 200 disagreed cases by multiple human experts who receive full repository access and then compare the new consensus labels against both the agent and baseline outputs to determine which side matches the expert labels more often.

Figures

Figures reproduced from arXiv: 2605.04845 by Johannes H\"artel.

**Figure 1.** Figure 1: Agent trajectory: Green is LLM reasoning, blue is view at source ↗

**Figure 2.** Figure 2: Mean token usage by approach over all experiments. view at source ↗

**Figure 3.** Figure 3: Cost per experiment vs. engineered context size view at source ↗

**Figure 5.** Figure 5: Tool usage by agent approach (top 10 commands). view at source ↗

**Figure 7.** Figure 7: Posterior of the accuracy difference as violin with view at source ↗

read the original abstract

Mining software repositories often requires classifying artifacts like commits, reviews, code lines, or entire repositories into categories. Human labeling is expensive and error-prone; limited context frequently leads to misclassifications or uncertainty in labels. We investigate whether LLM agents that dynamically explore repositories through standard bash commands can match the classification quality of simple LLMs that receive pre-engineered context. Across four tasks, eight approach configurations, and 4943 classifications, agents achieve competitive accuracy despite retrieving their own context. The primary advantage is robustness: agents avoid context-window overflows and scale independently of artifact size. A manual diagnosis of 100 cases where approaches disagree with the ground truth reveals specification ambiguities and labels produced under limited context, suggesting that accuracy against such ground truth may underestimate approaches with broader context access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates whether LLM agents that dynamically explore software repositories via bash commands can achieve classification accuracy comparable to standard LLMs supplied with pre-engineered context. It reports results across four tasks (commit, review, code-line, and repository classification), eight approach configurations, and 4943 total classifications, claiming competitive accuracy for agents together with a robustness advantage (no context-window overflows, independent scaling with artifact size). A manual diagnosis of 100 disagreements with ground truth is used to argue that specification ambiguities and limited-context labels may cause accuracy metrics to underestimate broader-context methods.

Significance. If the empirical results hold after addressing label-quality concerns, the work would demonstrate a practical advantage for agentic exploration in repository mining, reducing the need for manual context engineering and enabling analysis of arbitrarily large artifacts. The scale of the evaluation (nearly 5000 instances) and the inclusion of disagreement diagnosis add transparency that is valuable for software-engineering tool papers.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the claim of 'competitive accuracy' is measured against ground-truth labels that the manuscript itself flags as containing 'specification ambiguities and labels produced under limited context.' No inter-annotator agreement statistics, re-labeling protocol, or noise estimate is provided, so it is impossible to determine whether the reported parity is genuine or an artifact of label bias against context-rich approaches.
[Results] Results section: while 4943 classifications are cited, the manuscript does not report per-task label provenance, the exact number of human annotators, or any sensitivity analysis showing how accuracy changes under alternative labelings. This information is load-bearing for interpreting the robustness advantage.

minor comments (2)

[Introduction] The four tasks are named only in the abstract; a brief enumerated list with one-sentence definitions in the introduction would improve readability.
[Tables / Figures] Figure captions and table headers should explicitly state the evaluation metric (accuracy) and the exact number of instances per cell to avoid ambiguity when comparing configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing label quality and transparency in our evaluation. We address each major comment below and will incorporate clarifications into the revised manuscript to strengthen the presentation of our results.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'competitive accuracy' is measured against ground-truth labels that the manuscript itself flags as containing 'specification ambiguities and labels produced under limited context.' No inter-annotator agreement statistics, re-labeling protocol, or noise estimate is provided, so it is impossible to determine whether the reported parity is genuine or an artifact of label bias against context-rich approaches.

Authors: We agree that the ground-truth labels contain ambiguities and that this must be carefully considered when interpreting accuracy. The manuscript already flags these issues in the abstract and devotes a dedicated section to a manual diagnosis of 100 disagreements, which explicitly identifies specification ambiguities and labels produced under limited context as sources of error. This diagnosis was performed to provide qualitative evidence that some reported errors may reflect label limitations rather than deficiencies in the agentic or context-provided approaches. We did not collect inter-annotator agreement statistics or conduct a formal re-labeling protocol across the full set of 4943 instances. In the revision we will expand the Evaluation section with an explicit discussion of potential label bias, its implications for the competitive-accuracy claim, and how the robustness advantage (context-window independence and scaling with artifact size) remains observable even under the current labeling regime. revision: partial
Referee: [Results] Results section: while 4943 classifications are cited, the manuscript does not report per-task label provenance, the exact number of human annotators, or any sensitivity analysis showing how accuracy changes under alternative labelings. This information is load-bearing for interpreting the robustness advantage.

Authors: We will add a per-task breakdown of label provenance and the number of human annotators in the revised Results section, using the records from our data-collection process. The 4943 total aggregates four tasks; we will report the exact counts per task and note the primary sources (existing datasets supplemented by targeted human annotation). A full sensitivity analysis under alternative labelings was not performed, owing to the scale of the study. In the revision we will include a limitations paragraph discussing how label noise could affect accuracy figures and why the reported robustness advantage—agents avoiding context-window overflows and scaling independently of artifact size—does not depend on perfect labels. These additions will make the evaluation more transparent without altering the core empirical claims. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation without derivations or self-referential reductions

full rationale

The paper conducts a direct empirical comparison of agentic vs. non-agentic LLM approaches on four classification tasks, reporting accuracy over 4943 instances against provided ground-truth labels. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described structure. Claims rest on measured performance differences and a manual diagnosis of disagreements, with no reduction of results to prior definitions or ansatzes by construction. The analysis is self-contained against external benchmarks (the ground-truth labels), satisfying the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into exact setup; no free parameters or invented entities are evident. Relies on standard assumptions about LLM tool use and label quality.

axioms (2)

domain assumption LLM agents can reliably execute and interpret bash commands to gather repository context without introducing new errors beyond those of direct prompting.
Central to the agent configuration and robustness claim.
domain assumption Ground-truth labels provide a stable benchmark for accuracy despite acknowledged ambiguities from limited context.
Used to compute competitive accuracy; paper itself flags this as potentially underestimating broader-context methods.

pith-pipeline@v0.9.0 · 5415 in / 1307 out tokens · 54214 ms · 2026-05-08T16:01:25.242425+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Samuel Abedu, Ahmad Abdellatif, and Emad Shihab. 2024. LLM-Based Chatbots for Mining Software Repositories: Challenges and Opportunities. InEASE. ACM, 201–210

2024
[2]

Samuel Abedu, Laurine Menneron, SayedHassan Khatoonabadi, and Emad Shi- hab. 2025. RepoChat: An LLM-Powered Chatbot for GitHub Repository Question- Answering. InMSR. IEEE, 255–259

2025
[3]

Mahmoud Alfadel, Nicholas Alexandre Nagy, Diego Elias Costa, Rabe Abdalka- reem, and Emad Shihab. 2023. Empirical analysis of security-related code reviews in npm packages.J. Syst. Softw.203 (2023), 111752

2023
[4]

Gábor Antal, Bence Bogenfürst, Rudolf Ferenc, and Péter Hegedüs. 2025. Identi- fying Helpful Context for LLM-based Vulnerability Repair: A Preliminary Study. InEASE. ACM, 696–700

2025
[5]

Anthropic. 2026. An update on recent Claude Code quality reports. https: //www.anthropic.com/engineering/april-23-postmortem Accessed: 2026-04-24

2026
[6]

Boqi Chen, Kua Chen, José Antonio Hernández López, Gunter Mussbacher, Dániel Varró, and Amir Feizpour. 2025. SHERPA: A Model-Driven Framework for Large Language Model Execution.CoRRabs/2509.00272 (2025)

work page arXiv 2025
[7]

Promptware engineering: Software engineering for prompt-enabled systems,

Zhenpeng Chen, Chong Wang, Weisong Sun, Guang Yang, Xuanzhe Liu, Jie M. Zhang, and Yang Liu. 2025. Promptware Engineering: Software Engineering for LLM Prompt Development.CoRRabs/2503.02400 (2025)

work page arXiv 2025
[8]

Umut Cihan, Arda Içöz, Vahid Haratian, and Eray Tüzün. 2025. Evaluating Large Language Models for Code Review.CoRRabs/2505.20206 (2025)

work page arXiv 2025
[9]

Shamse Tasnim Cynthia, Joy Krishan Das, and Banani Roy. 2026. Are We All Using Agents the Same Way? An Empirical Study of Core and Peripheral Devel- opers Use of Coding Agents. InMSR. IEEE

2026
[10]

Aleksandra Eliseeva, Yaroslav Sokolov, Egor Bogomolov, Yaroslav Golubev, Danny Dig, and Timofey Bryksin. 2023. From Commit Message Generation to History-Aware Commit Message Completion. InASE. IEEE, 723–735

2023
[11]

Xiuting Ge, Chunrong Fang, Meiyuan Qian, Yu Ge, and Mingshuang Qing. 2022. Locality-based security bug report identification via active learning.Inf. Softw. Technol.147 (2022), 106899

2022
[12]

Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. CoRRabs/2501.18160 (2025)

work page arXiv 2025
[13]

Chawla, Olaf Wiest, and Xiangliang Zhang

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large Language Model Based Multi-agents: A Survey of Progress and Challenges. InIJCAI. ijcai.org, 8048– 8057

2024
[14]

Johannes Härtel. 2025. Improved Labeling of Security Defects in Code Review by Active Learning with LLMs. InEASE. ACM, 1014–1023

2025
[15]

Johannes Härtel, Hakan Aksu, and Ralf Lämmel. 2018. Classification of APIs by hierarchical clustering. InICPC. ACM, 233–243

2018
[16]

Johannes Härtel, Marcel Heinz, and Ralf Lämmel. 2018. EMF Patterns of Usage on GitHub. InECMFA (Lecture Notes in Computer Science). Springer, 216–234

2018
[17]

Johannes Härtel and Ralf Lämmel. 2023. Operationalizing validity of empirical software engineering studies.Empir. Softw. Eng.28, 6 (2023), 153

2023
[18]

Fard, Austin Z

Steffen Herbold, Alexander Trautsch, Benjamin Ledel, Alireza Aghamohammadi, Taher Ahmed Ghaleb, Kuljit Kaur Chahal, Tim Bossenmaier, Bhaveet Nagaria, Philip Makedonski, Matin Nili Ahmadabadi, Kristóf Szabados, Helge Spieker, Matej Madeja, Nathaniel Hoy, Valentina Lenarduzzi, Shangwen Wang, Gema Rodríguez-Pérez, Ricardo Colomo-Palacios, Roberto Verdecchia,...

2022
[19]

Hora and Romain Robbes

André C. Hora and Romain Robbes. 2026. Are Coding Agents Generating Over- Mocked Tests? An Empirical Study. InMSR. IEEE

2026
[20]

Why Language Models Hallucinate

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate.CoRRabs/2509.04664 (2025)

work page internal anchor Pith review arXiv 2025
[21]

John Kruschke. 2014. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. (2014)

2014
[22]

John K Kruschke. 2018. Rejecting or accepting parameter values in Bayesian estimation.Advances in methods and practices in psychological science1, 2 (2018), 270–280

2018
[23]

Stanislav Levin and Amiram Yehudai. 2017. Boosting Automatic Commit Clas- sification Into Maintenance Activities By Utilizing Source Code Changes. In PROMISE. ACM, 97–106

2017
[24]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching Models to Express Their Uncertainty in Words.Trans. Mach. Learn. Res.2022 (2022)

2022
[25]

Tim Menzies, Jeremy Greenwald, and Art Frank. 2007. Data Mining Static Code Attributes to Learn Defect Predictors.IEEE Trans. Software Eng.33, 1 (2007), 2–13

2007
[26]

Audris Mockus and Lawrence G. Votta. 2000. Identifying Reasons for Software Changes using Historic Databases. InICSM. IEEE Computer Society, 120–130

2000
[27]

Seyedmoein Mohsenimofidi, Matthias Galster, Christoph Treude, and Sebastian Baltes. 2026. Context Engineering for AI Agents in Open-Source Software. In MSR. IEEE

2026
[28]

Tim P Morris, Ian R White, and Michael J Crowther. 2019. Using simulation studies to evaluate statistical methods.Statistics in medicine38, 11 (2019), 2074– 2102

2019
[29]

Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating GitHub for engineered software projects.Empir. Softw. Eng.22, 6 (2017), 3219–3253

2017
[30]

Nahidul Islam Opu, Shaowei Wang, and Shaiful Chowdhury

Md. Nahidul Islam Opu, Shaowei Wang, and Shaiful Chowdhury. 2025. LLM- Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets.CoRRabs/2505.08263 (2025)

work page arXiv 2025
[31]

Henrik Plate, Serena Elisa Ponta, and Antonino Sabetta. 2015. Impact assessment for vulnerabilities in open-source software libraries. InICSME. IEEE Computer Society, 411–420

2015
[32]

Williams

Akond Rahman, Chris Parnin, and Laurie A. Williams. 2019. The seven sins: security smells in infrastructure as code scripts. InICSE. IEEE / ACM, 164–175

2019
[33]

Akond Rahman, Shazibul Islam Shamim, Dibyendu Brinto Bose, and Rahul Pandita. 2023. Security Misconfigurations in Open Source Kubernetes Manifests: An Empirical Study.ACM Trans. Softw. Eng. Methodol.32, 4 (2023), 99:1–99:36

2023
[34]

Foyzur Rahman and Premkumar T. Devanbu. 2011. Ownership, experience and defects: a fine-grained study of authorship. InICSE. ACM, 491–500

2011
[35]

Hora, and Stefano Zacchiroli

Romain Robbes, Théo Matricon, Thomas Degueule, André C. Hora, and Stefano Zacchiroli. 2026. Promises, Perils, and (Timely) Heuristics for Mining Coding Agent Activity. InMSR. IEEE

2026
[36]

Diomidis Spinellis and Georgios Gousios. 2018. How to analyze git repositories with command line tools: we’re not in kansas anymore. InICSE (Companion Volume). ACM, 540–541

2018
[37]

2025.Stan Reference Manual, 2.35

Stan Development Team. 2025.Stan Reference Manual, 2.35. https://mc-stan.org

2025
[38]

Mahan Tafreshipour, Aaron Imani, Eric Huang, Eduardo Santana de Almeida, Thomas Zimmermann, and Iftekhar Ahmed. 2025. Prompting in the Wild: An Empirical Study of Prompt Evolution in Software Repositories. InMSR. IEEE, 686–698

2025
[39]

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2024. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. InCIKM. ACM, 4966–4974

2024
[40]

Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. 2025. An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities.ACM Trans. Softw. Eng. Methodol.34, 7 (2025), 188:1–188:28

2025
[41]

Jiaxin Yu, Liming Fu, Peng Liang, Amjed Tahir, and Mojtaba Shahin. 2023. Se- curity Defect Detection via Code Review: A Study of the OpenStack and Qt Communities. InESEM. IEEE, 1–12

2023
[42]

Zhaoyang Yu, Minghua Ma, Xiaoyu Feng, Ruomeng Ding, Chaoyun Zhang, Ze Li, Murali Chintalapati, Xuchao Zhang, Rujia Wang, Chetan Bansal, Saravan Rajmohan, Qingwei Lin, Shenglin Zhang, Changhua Pei, and Dan Pei. 2025. Triangle: Empowering Incident Triage with Multi-LLM-Agents.FSE(2025)

2025
[43]

Williams, and Tim Menzies

Zhe Yu, Christopher Theisen, Laurie A. Williams, and Tim Menzies. 2021. Im- proving Vulnerability Inspection Efficiency Using Active Learning.IEEE Trans. Software Eng.47, 11 (2021), 2401–2420

2021
[44]

Hassan, Shane McIntosh, and Ying Zou

Feng Zhang, Ahmed E. Hassan, Shane McIntosh, and Ying Zou. 2017. The Use of Summation to Aggregate Software Metrics Hinders the Performance of Defect Prediction Models.IEEE Trans. Software Eng.43, 5 (2017), 476–491

2017
[45]

Jiale Zhang, Liangqiong Tu, Jie Cai, Xiaobing Sun, Bin Li, Weitong Chen, and Yu Wang. 2022. Vulnerability Detection for Smart Contract via Backward Bayesian Active Learning. InACNS Workshops (Lecture Notes in Computer Science, Vol. 13285). Springer, 66–83

2022
[46]

Thomas Zimmermann and Nachiappan Nagappan. 2008. Predicting defects using network analysis on dependency graphs. InICSE. ACM, 531–540

2008
[47]

Thomas Zimmermann, Rahul Premraj, and Andreas Zeller. 2007. Predicting Defects for Eclipse. InPROMISE@ICSE. IEEE Computer Society, 9

2007