HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
Pith reviewed 2026-05-10 04:41 UTC · model grok-4.3
The pith
Hierarchical manager agents coordinate specialized agents to improve paper-to-code reproduction and reduce hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose the Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework that employs supervisory manager agents to coordinate specialised agents across fine-grained stages for end-to-end experiment reproduction from papers. This addresses a limitation of existing sequential approaches — weak global coordination — and thereby improves robustness. They also introduce Paper2Code-Extra (P2C-Ex), a refined evaluation protocol that incorporates repository-level information and aligns better with reference-based metrics. Extensive evaluation shows a more than 10% relative performance gain over the prior state of the art using open-source backbone models, along with reduced hallucination.
What carries the argument
Supervisory manager agents that coordinate specialized agents across fine-grained stages in the Hierarchical Research Agent System (HiRAS)
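The coordination pattern the claim rests on can be sketched abstractly. Everything below — class names, the stage list, the retry policy — is hypothetical and not taken from the paper; it only illustrates the difference between a manager that checks each stage's output before advancing and a fixed fire-and-forget pipeline.

```python
# Hypothetical sketch of hierarchical coordination: a manager agent
# dispatches one specialized agent per stage and verifies each result
# before advancing, retrying a failed stage instead of propagating it.

class SpecializedAgent:
    def __init__(self, name):
        self.name = name

    def run(self, context):
        # Stand-in for an LLM call; returns a labeled artifact.
        return {"stage": self.name, "output": f"{self.name} artifact", "ok": True}


class ManagerAgent:
    """Supervises a fixed sequence of stages with a per-stage check."""

    def __init__(self, stages, max_retries=1):
        self.agents = [SpecializedAgent(s) for s in stages]
        self.max_retries = max_retries

    def reproduce(self, paper):
        context = {"paper": paper, "artifacts": []}
        for agent in self.agents:
            for _ in range(self.max_retries + 1):
                result = agent.run(context)
                if result["ok"]:  # manager-level verification per stage
                    context["artifacts"].append(result)
                    break
            else:
                raise RuntimeError(f"stage {agent.name} failed")
        return context


manager = ManagerAgent(["plan", "analyze", "code", "execute"])
run = manager.reproduce("paper.md")
print([a["stage"] for a in run["artifacts"]])
# → ['plan', 'analyze', 'code', 'execute']
```

In a flat sequential pipeline the per-stage check and retry loop would be absent, which is exactly the coordination gap the paper attributes its gains to.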
If this is right
- End-to-end paper-to-code tasks become more robust through explicit coordination at each stage.
- Hallucination rates drop in both code generation and evaluation outputs.
- Open-source models reach performance levels closer to closed-source systems on reproduction benchmarks.
- Evaluation of such systems becomes more reliable when using repository-aware protocols like Paper2Code-Extra.
Where Pith is reading between the lines
- The same supervisory-layer idea could apply to other multi-step agent workflows such as data analysis pipelines.
- Explicit hierarchy may help agent teams maintain consistency over long documents or multi-file projects.
- Dynamic adjustment of manager scope based on paper complexity could be a natural next step to test.
Load-bearing premise
That fixed sequential agent pipelines suffer from weak global coordination and that inserting supervisory manager agents will reliably improve performance and reduce errors without creating new hierarchy-level failures.
What would settle it
A side-by-side test on identical papers and open-source models in which the hierarchical HiRAS version shows no more than a 5% relative gain over, or higher hallucination rates than, a flat sequential baseline.
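That falsification criterion reduces to a simple comparison. A sketch with made-up numbers — the function name and the 5% threshold are illustrative, not taken from the paper:

```python
def settles_against_hiras(flat_score, hier_score, flat_halluc, hier_halluc,
                          gain_threshold=0.05):
    """Return True if the hierarchical variant fails to justify itself:
    its relative gain over the flat baseline is at most the threshold,
    or it hallucinates more. Threshold is illustrative."""
    relative_gain = (hier_score - flat_score) / flat_score
    return relative_gain <= gain_threshold or hier_halluc > flat_halluc


# Made-up numbers: a 12% relative gain with fewer hallucinations
# would NOT settle the question against HiRAS.
print(settles_against_hiras(0.50, 0.56, 0.20, 0.15))  # → False
```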
Original abstract
Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework for end-to-end experiment reproduction that employs supervisory manager agents to coordinate specialised agents across fine-grained stages. We also identify limitations in the reference-free evaluation of the Paper2Code benchmark and introduce Paper2Code-Extra (P2C-Ex), a refined protocol that incorporates repository-level information and better aligns with the original reference-based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain beyond the previous state-of-the-art using open-source backbone models and significantly reduced hallucination in evaluation. Our work is available on GitHub: https://github.com/KOU-199024/HiRAS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HiRAS, a hierarchical multi-agent framework for paper-to-code generation and experiment reproduction. Supervisory manager agents coordinate specialized agents across fine-grained stages to address weak global coordination in fixed sequential pipelines. The work also introduces P2C-Ex, a refined evaluation protocol incorporating repository-level information, and reports >10% relative performance gains over prior SOTA using open-source backbones plus significantly reduced hallucination.
Significance. If the performance gains and hallucination reductions hold under controlled conditions, the hierarchical coordination approach could meaningfully advance automated research reproduction tasks by improving robustness in multi-agent code generation systems.
major comments (3)
- [Evaluation / Experiments section] The central claim of >10% relative performance gain and reduced hallucination (abstract) attributes the improvement to supervisory manager agents providing global coordination. However, no ablations are described that remove the managers while preserving specialized agents, total agent count, and inference budget; this leaves open whether hierarchy (vs. specialization or protocol change) is load-bearing.
- [Evaluation / Experiments section] P2C-Ex is presented as addressing limitations in reference-free evaluation of the Paper2Code benchmark (abstract). Prior SOTA methods must be re-evaluated under the identical P2C-Ex protocol and open-source backbones to support the cross-method comparison; the manuscript does not confirm this re-evaluation was performed.
- [Introduction and Method] The weakest assumption—that fixed sequential pipelines inherently suffer from weak coordination and that adding managers will reliably improve performance without introducing new coordination failures—requires explicit testing via failure-mode analysis or additional robustness metrics beyond aggregate scores.
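The first major comment asks for a resource-matched ablation. One concrete way to hold confounds fixed is to pin total agent count and inference budget across conditions so any score difference is attributable to the hierarchy itself. A hypothetical configuration sketch — every field name and number below is invented for illustration:

```python
# Hypothetical resource-matched ablation grid: remove the manager layer
# while pinning total agent count and token budget, so hierarchy is the
# only varying factor. All fields are invented for illustration.

conditions = {
    "hierarchical": {"managers": 1, "workers": 4, "token_budget": 200_000},
    "flat":         {"managers": 0, "workers": 5, "token_budget": 200_000},
}


def resource_matched(a, b):
    """Same total agents and same token budget across two conditions."""
    total = lambda c: c["managers"] + c["workers"]
    return total(a) == total(b) and a["token_budget"] == b["token_budget"]


print(resource_matched(conditions["hierarchical"], conditions["flat"]))  # → True
```

An ablation passing this check would let the authors attribute any remaining gap to the supervisory layer rather than to extra agents or extra compute.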
minor comments (2)
- [Benchmark section] Clarify the exact definition and implementation details of the P2C-Ex protocol, including how repository-level information is incorporated and how it aligns with the original reference-based metric.
- [Evaluation] Provide quantitative details on hallucination measurement (e.g., specific metrics, detection method) and statistical significance of the reported gains.
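The second minor comment asks for statistical significance of the reported gains. One standard, lightweight option is a percentile bootstrap over per-paper scores; the sketch below is purely illustrative of that kind of analysis (the scores are made up, and this is not claimed to be the paper's procedure):

```python
import random


def bootstrap_gain_ci(baseline, treatment, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for the relative mean gain of
    treatment over baseline. Illustrative only."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_boot):
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treatment) for _ in treatment]
        gains.append(sum(t) / len(t) / (sum(b) / len(b)) - 1.0)
    gains.sort()
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot)]


# Made-up per-paper scores for a baseline and a hierarchical system.
base = [0.42, 0.50, 0.47, 0.39, 0.55, 0.44, 0.51, 0.46]
hier = [0.50, 0.57, 0.52, 0.45, 0.60, 0.49, 0.58, 0.53]
lo, hi = bootstrap_gain_ci(base, hier)
# An interval excluding zero would support a significant relative gain.
print(round(lo, 3), round(hi, 3))
```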
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important aspects of our evaluation and assumptions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Evaluation / Experiments section] The central claim of >10% relative performance gain and reduced hallucination (abstract) attributes the improvement to supervisory manager agents providing global coordination. However, no ablations are described that remove the managers while preserving specialized agents, total agent count, and inference budget; this leaves open whether hierarchy (vs. specialization or protocol change) is load-bearing.
Authors: We agree that isolating the contribution of the hierarchical managers is necessary to substantiate the central claim. The current manuscript does not include ablations that remove the managers while holding specialized agents, total agent count, and inference budget fixed. We will add these experiments in the revised version, comparing the full HiRAS system against a non-hierarchical multi-agent baseline with equivalent resources. Results will be reported in the Experiments section to clarify whether the hierarchy itself drives the gains beyond specialization or protocol changes. revision: yes
Referee: [Evaluation / Experiments section] P2C-Ex is presented as addressing limitations in reference-free evaluation of the Paper2Code benchmark (abstract). Prior SOTA methods must be re-evaluated under the identical P2C-Ex protocol and open-source backbones to support the cross-method comparison; the manuscript does not confirm this re-evaluation was performed.
Authors: We confirm that all prior SOTA methods were re-implemented and evaluated under the exact P2C-Ex protocol using the same open-source backbones, as this was required to report the >10% relative gains. However, the manuscript does not explicitly state this re-evaluation in the Evaluation section. We will add a clear statement and, if space permits, a table footnote confirming that baselines were run under identical conditions. This will make the comparison fully transparent. revision: yes
Referee: [Introduction and Method] The weakest assumption—that fixed sequential pipelines inherently suffer from weak coordination and that adding managers will reliably improve performance without introducing new coordination failures—requires explicit testing via failure-mode analysis or additional robustness metrics beyond aggregate scores.
Authors: We acknowledge that aggregate scores alone do not fully test the assumption about coordination failures in sequential pipelines. While the paper reports reduced hallucination and overall robustness as supporting evidence, it does not include dedicated failure-mode analysis. We will add a new subsection in the Experiments section providing qualitative and quantitative analysis of coordination failure cases (e.g., error propagation in sequential vs. hierarchical setups) and additional robustness metrics such as consistency across runs. This will directly address the concern. revision: yes
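The consistency-across-runs metric the authors promise can be made precise in several ways; one simple choice is mean pairwise agreement on per-stage binary outcomes. The sketch below is an illustration of that choice, with invented data, not the authors' definition:

```python
# Illustrative run-to-run consistency metric: average fraction of stages
# on which two independent runs agree, over all run pairs. Data invented.

from itertools import combinations


def pairwise_agreement(runs):
    """runs: list of equal-length outcome tuples, one per independent run."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    agree = sum(
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    )
    return agree / len(pairs)


# Three runs over four stages (True = stage reproduced successfully).
runs = [
    (True, True, False, True),
    (True, True, True, True),
    (True, False, False, True),
]
print(round(pairwise_agreement(runs), 3))  # → 0.667
```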
Circularity Check
No circularity: empirical framework proposal with independent benchmark refinement
Full rationale
The paper proposes HiRAS as a hierarchical multi-agent system and separately introduces P2C-Ex to address limitations in an existing benchmark's reference-free evaluation. Performance claims (>10% gain, reduced hallucination) are presented as results of empirical evaluation on open-source models, not as outputs derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems appear in the provided text. The central claim rests on experimental comparison rather than reducing to the method's own inputs by definition. This is a standard non-circular empirical contribution.