Applying an Agentic Coding Tool for Improving Published Algorithm Implementations

Worasait Suwannik

arxiv: 2604.13109 · v1 · submitted 2026-04-11 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-10 16:28 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords agentic coding tools · algorithm implementation improvement · AI-assisted code enhancement · published research algorithms · two-stage improvement pipeline · human-AI collaboration · peer review implications · academic publishing practices

The pith

An agentic coding tool reported improvements in all eleven published algorithm implementations tested, each within a single working day.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper outlines a two-stage pipeline that uses a large language model to select recently published algorithms meeting explicit experimental criteria and then applies an agentic coding tool to reproduce the reported baseline before iterating on improvements. When run on eleven experiments drawn from varied research domains, the tool reported gains in every instance, with each cycle completed inside one working day. The author details the human tasks that stay essential: choosing targets, confirming experimental validity, judging novelty and impact, supplying compute, and disclosing AI use in writing. The paper also examines how routine application of this approach could shift expectations in peer review and academic publishing.

Core claim

The paper claims that a two-stage pipeline, in which a research-capable language model first identifies published algorithms with suitable experimental setups and an agentic coding tool then reproduces the baseline and iterates improvements, yields improvements, as reported by the tool, across all eleven experiments attempted. Each improvement process finished within a single working day. The work further identifies the human contributions that remain indispensable, including target selection, verification of results, assessment of novelty, provision of resources, and appropriate disclosure of AI assistance.

What carries the argument

The two-stage pipeline that pairs LLM-driven selection of published algorithms meeting explicit experimental criteria with subsequent agentic coding tool prompts to reproduce baselines and iterate enhancements.
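
As a rough illustration of the control flow this describes, the sketch below wires the two stages together in Python. Every name in it (`Candidate`, `Outcome`, `select_targets`, `run_pipeline`, the two selection flags, and the `reproduce` and `improve` callables) is hypothetical and introduced only for illustration. The abstract does not list the paper's selection criteria or prompts, and in the paper the stages are carried out by an LLM and Claude Code rather than by plain functions. The sketch also assumes a metric where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """A published algorithm proposed in stage one (hypothetical record)."""
    title: str
    has_public_code: bool          # placeholder criterion, not from the paper
    reports_primary_metric: bool   # placeholder criterion, not from the paper


@dataclass
class Outcome:
    title: str
    baseline: float   # metric reproduced from the published implementation
    improved: float   # best metric after the improvement iterations


def select_targets(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Stage 1: keep only candidates that meet the explicit criteria."""
    return [c for c in candidates
            if c.has_public_code and c.reports_primary_metric]


def run_pipeline(
    candidates: Iterable[Candidate],
    reproduce: Callable[[Candidate], float],
    improve: Callable[[Candidate, float], float],
) -> List[Outcome]:
    """Stage 2: reproduce the baseline first, then iterate on improvements."""
    outcomes = []
    for target in select_targets(candidates):
        baseline = reproduce(target)          # stands in for the agent's reproduction run
        improved = improve(target, baseline)  # stands in for the agent's improvement loop
        outcomes.append(Outcome(target.title, baseline, improved))
    return outcomes


if __name__ == "__main__":
    demo = [Candidate("algo-a", True, True), Candidate("algo-b", False, True)]
    for row in run_pipeline(demo, lambda c: 0.80, lambda c, b: b + 0.02):
        print(row)
```

Keeping the reproduced baseline and the improved value side by side in each `Outcome` is a deliberate choice: it is the minimum record a reader would need to check a reported gain.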

If this is right

Researchers could routinely apply similar tools to upgrade their own published algorithm code with limited additional effort.
Verification of results, assessment of novelty, and resource allocation would continue to require human judgment.
Peer review standards might incorporate checks for AI-assisted code enhancements and their documentation.
Academic publishing practices could evolve toward higher baseline expectations for implementation quality and reproducibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use might generate improved code versions that later papers adopt as new reference implementations.
The approach could be tested on older publications or additional domains to measure how far the reported gains generalize.
Studies comparing tool outputs against human-only improvement efforts would clarify the specific contribution of the agentic step.

Load-bearing premise

The improvements identified and implemented by the agentic coding tool are genuine, reproducible by others, and meaningfully better than the original published versions rather than artifacts of prompt design or unverified tool self-reports.

What would settle it

An independent replication that applies the identical two-stage pipeline and prompts to the same eleven published algorithms and obtains no improvements on any of them.
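
The test above can be stated mechanically. The following is a minimal sketch, assuming each replicated experiment yields a reproduced baseline and a post-improvement value on a higher-is-better metric. The function name `replication_verdict`, the `min_gain` threshold, and the verdict strings are illustrative assumptions, not part of the paper.

```python
from typing import Dict, Tuple


def replication_verdict(
    runs: Dict[str, Tuple[float, float]],
    min_gain: float = 0.0,
) -> str:
    """Summarize an independent replication.

    `runs` maps an experiment name to (reproduced baseline, value after the
    improvement step). A run counts as improved only if the gain exceeds
    `min_gain`; higher metric values are assumed to be better.
    """
    if not runs:
        raise ValueError("at least one replicated experiment is required")
    improved = [name for name, (base, new) in runs.items()
                if new - base > min_gain]
    if not improved:
        return "refuted: no experiment improved"
    if len(improved) == len(runs):
        return "supported: every experiment improved"
    return f"mixed: {len(improved)} of {len(runs)} improved"
```

A mixed outcome would not settle the question either way, which is why the sketch reports it separately rather than folding it into one of the other two verdicts.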

Original abstract

We present a two-stage pipeline for AI-assisted improvement of published algorithm implementations. In the first stage, a large language model with research capabilities identifies recently published algorithms satisfying explicit experimental criteria. In the second stage, Claude Code is given a prompt to reproduce the reported baseline and then iterate an improvement process. We apply this pipeline to published algorithm implementations spanning multiple research domains. Claude Code reported that all eleven experiments yielded improvements. Each improvement could be achieved within a single working day. We analyse the human contributions that remain indispensable, including selecting the target, verifying experimental validity, assessing novelty and impact, providing computational resources, and writing with appropriate AI-use disclosure. Finally, we discuss implications for peer review and academic publishing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies Claude Code via a two-stage pipeline to tweak 11 published algorithms and claims uniform success in a day each, but supplies no metrics, diffs, or independent checks to back it up.

The main thing to know is that this is a report on using an LLM first to select recent papers with experimental results and then Claude Code to reproduce and iteratively improve the code. The author says every one of the eleven cases got better, all within a single working day, and closes with a list of tasks that still require humans. That pipeline and the human-role breakdown are the concrete pieces that were not already in the cited literature.

The discussion of what remains indispensable (selecting targets, verifying validity, judging novelty, providing resources, and disclosing AI use) reads as straightforward and avoids overpromising full replacement of researchers. The implications section for peer review and publishing is also direct about the current limits.

The soft spot is the evidence. The central claim rests on the tool reporting its own improvements with no quantitative before-and-after numbers, no released code or diffs, no re-execution by the author or others, and no comparison to what a human might achieve in the same time. Because the process is iterative LLM editing, the absence of artifacts means an independent reader cannot confirm the gains are real, reproducible, or meaningful rather than prompt artifacts. That leaves the headline result dependent on trust in the agent's self-assessment.

This is for readers already working with agentic coding tools who want a practical example of applying them to published work and a reminder of the human oversight still needed. It could fit a reading group on AI in research workflows. I would not cite the results as they stand because they lack the backing to build on, but the paper deserves peer review so the author can add verification steps and data; the topic is timely enough that referees could usefully push it toward a more solid empirical report.

Referee Report

3 major / 2 minor

Summary. The paper presents a two-stage pipeline for AI-assisted improvement of published algorithm implementations. Stage one uses a large language model to identify recently published algorithms meeting explicit experimental criteria; stage two prompts Claude Code to reproduce the reported baseline and iteratively improve the implementation. The author applies the pipeline to eleven cases across multiple domains and reports that the tool achieved improvements in every case, each within a single working day. The paper further analyzes the human contributions that remain essential (target selection, validity verification, novelty assessment, resource provision, and disclosure) and discusses implications for peer review and academic publishing.

Significance. If the claimed improvements can be independently verified with quantitative metrics and released artifacts, the work would illustrate a practical workflow for post-publication code refinement using current agentic tools. It could inform discussions on integrating AI assistance into research software practices while underscoring the continued necessity of human oversight for correctness, novelty, and disclosure. The explicit enumeration of indispensable human roles provides a balanced perspective that may help calibrate expectations about AI capabilities in empirical computer science.

major comments (3)

[Results (and abstract)] The central result—that Claude Code produced improvements in all eleven experiments—is stated without any quantitative before/after metrics, performance numbers, correctness checks, or error analysis. No tables, figures, or appendices supply baseline values, improved values, or statistical comparisons, leaving the headline claim unsupported by visible evidence.
[Methods and Results] No code diffs, original or improved implementations, prompt logs, or reproduction packages are released. Because the process is iterative LLM-driven editing, the absence of artifacts prevents independent parties from reproducing, falsifying, or even inspecting the claimed gains.
[Methods] The improvement process is described at a high level only; the manuscript supplies neither the exact prompts, the stopping criteria used by the agent, nor an operational definition of what constitutes a valid 'improvement.' This makes it impossible to evaluate whether the reported successes are robust or sensitive to prompt engineering.
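
To make the artifact request in the second major comment concrete, the sketch below shows one way a before/after diff could be generated for release, using Python's standard `difflib` module. The function name `implementation_diff`, the `original/` and `improved/` path prefixes, and the toy source strings are assumptions for illustration only; nothing here comes from the manuscript.

```python
import difflib


def implementation_diff(original: str, improved: str,
                        name: str = "algorithm.py") -> str:
    """Return a unified diff between two versions of one source file.

    An empty string means the two versions are identical.
    """
    diff_lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        improved.splitlines(keepends=True),
        fromfile=f"original/{name}",   # hypothetical path label
        tofile=f"improved/{name}",     # hypothetical path label
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    # Toy stand-ins for a published implementation and its edited version.
    before = ("def total(xs):\n    s = 0\n    for x in xs:\n"
              "        s += x\n    return s\n")
    after = "def total(xs):\n    return sum(xs)\n"
    print(implementation_diff(before, after), end="")
```

A plain-text diff of this kind, archived per experiment alongside the prompt log, would let a reader inspect exactly what the agent changed without rerunning the tool.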

minor comments (2)

[Abstract] The abstract would be strengthened by naming the research domains or algorithm types represented in the eleven cases, giving readers immediate context for the scope of the evaluation.
[Discussion] The discussion of human contributions is useful but would benefit from concrete examples (e.g., specific verification steps the authors performed) rather than remaining at the level of general categories.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript's evidence and reproducibility.

Referee: [Results (and abstract)] The central result—that Claude Code produced improvements in all eleven experiments—is stated without any quantitative before/after metrics, performance numbers, correctness checks, or error analysis. No tables, figures, or appendices supply baseline values, improved values, or statistical comparisons, leaving the headline claim unsupported by visible evidence.

Authors: We agree that the current manuscript presents the outcome primarily through the agent's reported success without accompanying quantitative metrics. In the revised version we will add a results table that reports, for each of the eleven cases, the original published performance, the reproduced baseline achieved by the agent, the final improved performance, and any author-performed correctness or validity checks. This will directly support the headline claim with visible numerical evidence.

revision: yes

Referee: [Methods and Results] No code diffs, original or improved implementations, prompt logs, or reproduction packages are released. Because the process is iterative LLM-driven editing, the absence of artifacts prevents independent parties from reproducing, falsifying, or even inspecting the claimed gains.

Authors: We acknowledge that the lack of released artifacts limits independent verification. We will add a dedicated reproducibility section and a public repository link containing the code diffs, representative prompt logs, and reproduction instructions for all eleven experiments. Where original source code is restricted by licensing or availability, we will clearly note the limitation and release the modified versions and logs that we control.

revision: yes

Referee: [Methods] The improvement process is described at a high level only; the manuscript supplies neither the exact prompts, the stopping criteria used by the agent, nor an operational definition of what constitutes a valid 'improvement.' This makes it impossible to evaluate whether the reported successes are robust or sensitive to prompt engineering.

Authors: The high-level description was chosen to keep the focus on the overall pipeline and the indispensable human roles. We will expand the Methods section to include representative full prompts, the concrete stopping criteria employed (e.g., no further performance gain after a fixed number of iterations or a one-day time budget), and an explicit operational definition of improvement as a measurable gain on the primary metric reported in the original paper while preserving correctness on the authors' validation tests.

revision: yes
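
The stopping criteria and the definition of improvement named in this response could be made operational along the following lines. This is a minimal sketch under stated assumptions: the primary metric is higher-is-better, a "working day" is taken to be eight hours, and the names `iterate_improvements`, `propose`, `patience`, and `time_budget_s` are hypothetical. In practice the `propose` step would be one round of agent edits followed by evaluation; here it is any callable.

```python
import time
from typing import Callable, Tuple


def iterate_improvements(
    baseline: float,
    propose: Callable[[float], Tuple[float, bool]],
    patience: int = 3,
    time_budget_s: float = 8 * 3600,            # assumed length of one working day
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, int]:
    """Iterate until `patience` consecutive attempts fail or the budget is spent.

    `propose(best)` returns (metric, passes_validation) for one candidate
    change. A candidate counts as an improvement only if it passes validation
    and beats the best metric seen so far. Returns (best metric, iterations).
    """
    start = clock()
    best = baseline
    stale = 0        # consecutive attempts without an accepted gain
    iterations = 0
    while stale < patience and clock() - start < time_budget_s:
        metric, passes_validation = propose(best)
        iterations += 1
        if passes_validation and metric > best:
            best = metric
            stale = 0
        else:
            stale += 1
    return best, iterations
```

Note that a candidate with a higher metric is still rejected when it fails validation, which is the part of the definition that guards against gains obtained by breaking correctness.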

Circularity Check

0 steps flagged

No circularity: empirical report with no derivations or self-referential predictions

The paper describes a two-stage pipeline for applying an agentic coding tool (Claude Code) to published algorithm implementations and reports that the tool identified improvements in all eleven cases. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described content. The central claims rest on the tool's reported outcomes and human analysis of required contributions, with no load-bearing steps that reduce by construction to the inputs via self-definition, self-citation chains, or renaming. The work is self-contained as a descriptive empirical study; any concerns about reproducibility or verification of the reported gains fall under validity rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the unproven assumption that current large language models and agentic coding tools possess sufficient capability to identify suitable targets and produce valid code improvements; no free parameters, invented entities, or additional axioms are introduced.

axioms (1)

Domain assumption: Large language models with research capabilities can reliably identify published algorithms meeting explicit experimental criteria, and agentic coding tools can reproduce baselines and generate genuine improvements.
The entire pipeline depends on these capabilities without independent verification or proof supplied in the abstract.
