pith. machine review for the scientific record.

arxiv: 2604.17745 · v2 · submitted 2026-04-20 · 💻 cs.CL

Recognition: unknown

HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords multi-agent systems · paper-to-code generation · hierarchical agents · experiment reproduction · large language models · code generation · hallucination reduction · benchmark evaluation

The pith

Hierarchical manager agents coordinate specialized agents to improve paper-to-code reproduction and reduce hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that a hierarchical multi-agent framework overcomes weak global coordination in fixed sequential agent pipelines for turning research papers into executable code. Supervisory manager agents oversee specialized agents at each fine-grained stage, providing stronger planning and error handling across the full reproduction process. This matters because reliable automation of computational experiments could speed up research validation and reduce manual effort. The authors also refine the evaluation benchmark with repository-level details to better measure real performance. Tests indicate gains in accuracy and lower hallucination rates when using open-source models.

Core claim

The authors propose the Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework that employs supervisory manager agents to coordinate specialised agents across fine-grained stages for end-to-end experiment reproduction from papers. This addresses limitations in existing sequential approaches by improving global coordination, which leads to better robustness. They introduce Paper2Code-Extra (P2C-Ex), a refined evaluation protocol that incorporates repository-level information and aligns better with reference-based metrics. Extensive evaluation shows over 10% relative performance gain beyond prior state-of-the-art using open-source backbone models, along with reduced hallucination.

What carries the argument

Supervisory manager agents that coordinate specialized agents across fine-grained stages in the Hierarchical Research Agent System (HiRAS)
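The coordination pattern this claim rests on can be sketched in miniature: a manager agent inspects a shared workspace and dispatches subordinate specialized agents until its stage's goal is met. This is an editorial sketch, not the authors' implementation — the agent names, the workspace dict, and the `done` predicates are illustrative stand-ins for LLM-backed components and tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for an LLM-backed agent: reads the shared
# workspace, writes its contribution back.
Agent = Callable[[dict], dict]

@dataclass
class ManagerAgent:
    """Supervises one stage: inspects the shared workspace, dispatches
    subordinate agents, and only yields control once its goal is met."""
    name: str
    subordinates: Dict[str, Agent]
    done: Callable[[dict], bool]  # stage-completion check on the workspace

    def run(self, workspace: dict) -> dict:
        while not self.done(workspace):
            for agent in self.subordinates.values():
                workspace = agent(workspace)
        return workspace

def planning_agent(ws: dict) -> dict:
    ws["plan"] = f"plan for: {ws['paper']}"
    return ws

def coding_agent(ws: dict) -> dict:
    if "plan" in ws:  # coding only proceeds once a plan exists
        ws["code"] = f"code implementing {ws['plan']}"
    return ws

# A two-level hierarchy: a global loop runs stage managers in order,
# each of which supervises its own specialized agents.
planning_mgr = ManagerAgent("planning", {"planner": planning_agent},
                            done=lambda ws: "plan" in ws)
coding_mgr = ManagerAgent("coding", {"coder": coding_agent},
                          done=lambda ws: "code" in ws)

workspace = {"paper": "paper.md"}
for stage in (planning_mgr, coding_mgr):
    workspace = stage.run(workspace)
```

The point of the sketch is the control flow: unlike a fixed sequential pipeline, each manager can re-invoke subordinates until its stage-level check passes, which is where the claimed robustness would come from.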

If this is right

  • End-to-end paper-to-code tasks become more robust through explicit coordination at each stage.
  • Hallucination rates drop in both code generation and evaluation outputs.
  • Open-source models reach performance levels closer to closed-source systems on reproduction benchmarks.
  • Evaluation of such systems becomes more reliable when using repository-aware protocols like Paper2Code-Extra.
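A repository-aware protocol like Paper2Code-Extra presumably scores a generated repository against repository-level expectations rather than judging files in isolation. As a hedged illustration only — the function, file list, and scoring rule below are hypothetical and are not the P2C-Ex metric:

```python
import tempfile
from pathlib import Path

def repo_coverage(repo: Path, expected_files: list[str]) -> float:
    """Fraction of expected repository files that exist -- a toy
    repository-level signal, not the actual P2C-Ex metric."""
    present = sum((repo / f).is_file() for f in expected_files)
    return present / len(expected_files)

# Toy demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    repo = Path(d)
    (repo / "plan.md").write_text("...")
    (repo / "main.py").write_text("...")
    # config.yaml is deliberately missing: two of three expected files exist.
    score = repo_coverage(repo, ["plan.md", "main.py", "config.yaml"])
```

Even this crude signal shows why repository-level information matters: a per-file judge could rate the two existing files highly while missing that the repository as a whole is incomplete.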

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same supervisory-layer idea could apply to other multi-step agent workflows such as data analysis pipelines.
  • Explicit hierarchy may help agent teams maintain consistency over long documents or multi-file projects.
  • Dynamic adjustment of manager scope based on paper complexity could be a natural next step to test.

Load-bearing premise

That fixed sequential agent pipelines suffer from weak global coordination and that inserting supervisory manager agents will reliably improve performance and reduce errors without creating new hierarchy-level failures.

What would settle it

A side-by-side test on identical papers and open-source models in which the hierarchical HiRAS version shows no more than a 5% relative gain over, or higher hallucination rates than, a flat sequential baseline.
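The decisive quantity in that test is the relative gain over the flat baseline. A minimal sketch of the comparison, with made-up scores (the numbers are illustrative, not figures from the paper):

```python
def relative_gain(hierarchical: float, flat: float) -> float:
    """Relative performance gain of the hierarchical system over the
    flat sequential baseline, as a fraction of the baseline score."""
    return (hierarchical - flat) / flat

# Illustrative benchmark scores only -- not results from the paper.
scores = {"flat": 0.40, "hiras": 0.46}
gain = relative_gain(scores["hiras"], scores["flat"])

# A gain at or below 5% on this quantity would undercut the
# hierarchy claim; this illustrative pair sits well above it.
print(f"relative gain: {gain:.1%}")  # → relative gain: 15.0%
```

Pinning the claim to this ratio matters because a 10% relative gain on a low baseline score can be a small absolute difference, which is why the referee's call for matched budgets and significance testing has teeth.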

Figures

Figures reproduced from arXiv: 2604.17745 by Chenghua Lin, Hanhua Hong, Jiaoyan Chen, Jung-jae Kim, Sophia Ananiadou, Xiaoli Li, Yizhi Li.

Figure 1: Overview of the HIRAS framework. The reproduction workflow is decomposed into fine-grained phases, each handled by a specialised agent (blue) equipped with appropriate tools to operate within a shared workspace throughout the process. To enhance coordination, HIRAS introduces hierarchical manager agents (orange) that inspect the workspace to supervise progress and dynamically invoke subordinate agents to p…

Figure 2: An example of the repository structure code development, execution, and result matching, using percentage scores to indicate comprehensiveness of the reproduction. Following previous work (Seo et al., 2025; Zhao et al., 2025), we first evaluate our framework on the PaperBench-CodeDev subset with o3-mini-high (Zhang et al., 2025) as the evaluator to showcase the overall performance. We further use ChatGPT-4…

Figure 3: Repository-level linear regression plots. Points represent the scoring of repositories, with the x-axis …

Figure 4: Structure illustrations of code repositories generated by PaperCoder and …

Figure 5: The prompt from the planning manager agent.

Figure 6: The implementation roadmap section in the plans generated by PaperCoder and …

Figure 7: Structure illustration of the code repository …

Figure 8: The original prompt for reference-based evaluation.

Figure 9: The original prompt for reference-free evaluation.

Figure 10: The prompt for Paper2Code-Extra evaluation. The differences are marked in red. …

Figure 11: The initial context for the global manager agent on PaperBench.

Figure 12: The initial context for the planning manager agent on PaperBench.

Figure 13: The initial context for the overall planning agent on PaperBench.

Figure 14: The initial context for the architecture design agent on PaperBench.

Figure 15: The initial context for the dependency modelling agent on PaperBench.

Figure 16: The initial context for the configuration generation agent on PaperBench.

Figure 17: The initial context for the analysis agent on PaperBench.

Figure 18: The initial context for the coding agent on PaperBench.

Figure 19: The initial context for the execution agent on PaperBench.
read the original abstract

Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework for end-to-end experiment reproduction that employs supervisory manager agents to coordinate specialised agents across fine-grained stages. We also identify limitations in the reference-free evaluation of the Paper2Code benchmark and introduce Paper2Code-Extra (P2C-Ex), a refined protocol that incorporates repository-level information and better aligns with the original reference-based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain beyond the previous state-of-the-art using open-source backbone models and significantly reduced hallucination in evaluation. Our work is available on GitHub: https://github.com/KOU-199024/HiRAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HiRAS, a hierarchical multi-agent framework for paper-to-code generation and experiment reproduction. Supervisory manager agents coordinate specialized agents across fine-grained stages to address weak global coordination in fixed sequential pipelines. The work also introduces P2C-Ex, a refined evaluation protocol incorporating repository-level information, and reports >10% relative performance gains over prior SOTA using open-source backbones plus significantly reduced hallucination.

Significance. If the performance gains and hallucination reductions hold under controlled conditions, the hierarchical coordination approach could meaningfully advance automated research reproduction tasks by improving robustness in multi-agent code generation systems.

major comments (3)
  1. [Evaluation / Experiments section] The central claim of >10% relative performance gain and reduced hallucination (abstract) attributes the improvement to supervisory manager agents providing global coordination. However, no ablations are described that remove the managers while preserving specialized agents, total agent count, and inference budget; this leaves open whether hierarchy (vs. specialization or protocol change) is load-bearing.
  2. [Evaluation / Experiments section] P2C-Ex is presented as addressing limitations in reference-free evaluation of the Paper2Code benchmark (abstract). Prior SOTA methods must be re-evaluated under the identical P2C-Ex protocol and open-source backbones to support the cross-method comparison; the manuscript does not confirm this re-evaluation was performed.
  3. [Introduction and Method] The weakest assumption—that fixed sequential pipelines inherently suffer from weak coordination and that adding managers will reliably improve performance without introducing new coordination failures—requires explicit testing via failure-mode analysis or additional robustness metrics beyond aggregate scores.
minor comments (2)
  1. [Benchmark section] Clarify the exact definition and implementation details of the P2C-Ex protocol, including how repository-level information is incorporated and how it aligns with the original reference-based metric.
  2. [Evaluation] Provide quantitative details on hallucination measurement (e.g., specific metrics, detection method) and statistical significance of the reported gains.
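On minor comment 2, a standard way to attach significance to the reported gains would be a paired bootstrap over per-paper scores. A sketch under stated assumptions — the per-paper scores, paper count, and resample count below are made up for illustration, not data from the paper:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided bootstrap p-value: fraction of paired resamples in
    which system A's mean does not exceed system B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    worse = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample papers with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            worse += 1
    return worse / n_resamples

# Illustrative per-paper reproduction scores, not results from the paper.
hiras = [0.52, 0.47, 0.61, 0.44, 0.58, 0.50]
baseline = [0.45, 0.43, 0.50, 0.46, 0.49, 0.41]
p = paired_bootstrap(hiras, baseline)
```

Pairing by paper matters here because reproduction difficulty varies wildly across papers; resampling papers jointly for both systems keeps that variance from swamping the comparison.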

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects of our evaluation and assumptions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation / Experiments section] The central claim of >10% relative performance gain and reduced hallucination (abstract) attributes the improvement to supervisory manager agents providing global coordination. However, no ablations are described that remove the managers while preserving specialized agents, total agent count, and inference budget; this leaves open whether hierarchy (vs. specialization or protocol change) is load-bearing.

    Authors: We agree that isolating the contribution of the hierarchical managers is necessary to substantiate the central claim. The current manuscript does not include ablations that remove the managers while holding specialized agents, total agent count, and inference budget fixed. We will add these experiments in the revised version, comparing the full HiRAS system against a non-hierarchical multi-agent baseline with equivalent resources. Results will be reported in the Experiments section to clarify whether the hierarchy itself drives the gains beyond specialization or protocol changes. revision: yes

  2. Referee: [Evaluation / Experiments section] P2C-Ex is presented as addressing limitations in reference-free evaluation of the Paper2Code benchmark (abstract). Prior SOTA methods must be re-evaluated under the identical P2C-Ex protocol and open-source backbones to support the cross-method comparison; the manuscript does not confirm this re-evaluation was performed.

    Authors: We confirm that all prior SOTA methods were re-implemented and evaluated under the exact P2C-Ex protocol using the same open-source backbones, as this was required to report the >10% relative gains. However, the manuscript does not explicitly state this re-evaluation in the Evaluation section. We will add a clear statement and, if space permits, a table footnote confirming that baselines were run under identical conditions. This will make the comparison fully transparent. revision: yes

  3. Referee: [Introduction and Method] The weakest assumption—that fixed sequential pipelines inherently suffer from weak coordination and that adding managers will reliably improve performance without introducing new coordination failures—requires explicit testing via failure-mode analysis or additional robustness metrics beyond aggregate scores.

    Authors: We acknowledge that aggregate scores alone do not fully test the assumption about coordination failures in sequential pipelines. While the paper reports reduced hallucination and overall robustness as supporting evidence, it does not include dedicated failure-mode analysis. We will add a new subsection in the Experiments section providing qualitative and quantitative analysis of coordination failure cases (e.g., error propagation in sequential vs. hierarchical setups) and additional robustness metrics such as consistency across runs. This will directly address the concern. revision: yes
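The "consistency across runs" metric the rebuttal promises could be as simple as a coefficient of variation over repeated runs. A hedged sketch — the metric choice and the run scores are illustrative assumptions, not what the authors commit to:

```python
from statistics import mean, pstdev

def run_consistency(run_scores):
    """Coefficient of variation across repeated runs: lower values mean
    the system behaves more consistently from run to run."""
    m = mean(run_scores)
    return pstdev(run_scores) / m if m else float("inf")

# Illustrative repeated-run scores for the two setups (not paper data):
# the hierarchical setup is posited to vary less across runs.
hierarchical_runs = [0.51, 0.53, 0.50, 0.52]
sequential_runs = [0.44, 0.51, 0.38, 0.47]

cv_h = run_consistency(hierarchical_runs)
cv_s = run_consistency(sequential_runs)
```

A robustness claim framed this way is falsifiable in exactly the sense the referee asks for: if `cv_h` were not reliably below `cv_s` across papers, the coordination story would need revision even if aggregate scores improved.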

Circularity Check

0 steps flagged

No circularity: empirical framework proposal with independent benchmark refinement

full rationale

The paper proposes HiRAS as a hierarchical multi-agent system and separately introduces P2C-Ex to address limitations in an existing benchmark's reference-free evaluation. Performance claims (>10% gain, reduced hallucination) are presented as results of empirical evaluation on open-source models, not as outputs derived by construction from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems appear in the provided text. The central claim rests on experimental comparison rather than reducing to the method's own inputs by definition. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Assessment performed on abstract only; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5490 in / 1081 out tokens · 31717 ms · 2026-05-10T04:41:33.838455+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1] Alireza Ghafarollahi and Markus J Buehler. Reviewer2: Optimizing review generation through prompt generation. CoRR, abs/2402.10886.

  2. [2] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.

  3. [3] Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). J. Mach. Learn. Res., 22(1).

  4. [4] Paper2Code: Automating code generation from scientific papers in machine learning. arXiv preprint arXiv:2504.17192, 2025.

  5. [5] Tonghan Wang, Heng Dong, Victor Lesser, and Chongjie Zhang. 2020. ROMA: Multi-agent reinforce… arXiv preprint arXiv:2003.08039.

  6. [6] Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He. 2025. SciReplicate-Bench: Benchmarking LLMs in agent-driven algorithmic reproduction from research papers. In Second Conference on Language Modeling.

  7. [7] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. 2025. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. Preprint, arXiv:2504.08066.

  8. [8] Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. AutoReproduce: Automatic AI experiment reproduction with paper lineage. Preprint, arXiv:2505.20662.
