AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

Duzhen Zhang; Maosong Sun; Qi Shi; Shuo Wang; Weilun Zhao; Xuanle Zhao; Xu Han; Yuxuan Li; Zhiyuan Liu; Zilin Sang

arxiv: 2505.20662 · v4 · submitted 2025-05-27 · 💻 cs.AI

Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, Maosong Sun

Pith reviewed 2026-05-19 14:10 UTC · model grok-4.3

keywords: paper lineage · automatic reproduction · multi-agent framework · AI experiments · code generation · reproduction fidelity · benchmarks

The pith

AutoReproduce autonomously reproduces AI paper experiments by extracting implicit knowledge from citations with a multi-agent framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the paper lineage algorithm to systematically extract implicit knowledge from the literature cited in a research paper. This extraction underpins AutoReproduce, a multi-agent system that generates complete experimental code in an end-to-end autonomous process. The system employs a sampling-based unit testing approach to quickly validate that the code runs and produces expected outputs. A new benchmark called ourbench with verified implementations and specific metrics is created to measure how well reproduction succeeds and how the code performs when executed. Evaluations indicate that this method exceeds current baselines in both reproduction accuracy and execution results.

Core claim

The paper lineage algorithm mines implicit knowledge from cited papers to serve as the backbone for AutoReproduce, a multi-agent framework that autonomously reproduces experimental code end-to-end, incorporating sampling-based unit testing to ensure executability and achieving superior reproduction fidelity and execution performance.

What carries the argument

The paper lineage algorithm that systematically mines implicit knowledge from the cited literature to enable autonomous code reproduction.

If this is right

Substantial improvements occur in reproduction fidelity compared to existing baselines.
Final execution performance of the reproduced code is enhanced.
The approach applies to both PaperBench and the introduced ourbench.
Sampling-based unit testing allows for rapid validation of code executability.
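
The sampling-based validation idea can be illustrated with a minimal Python sketch. The source only says that the system samples in order to check quickly that the code runs and produces expected outputs; the helper names `sample_batch` and `smoke_test`, the toy pipelines, and the specific checks (matching output length, finite numbers) are all hypothetical and are not taken from the authors' implementation.

```python
import math
import random
from typing import Callable, Sequence


def sample_batch(dataset: Sequence, k: int, seed: int = 0) -> list:
    """Draw a small random subset so the pipeline can be exercised quickly."""
    rng = random.Random(seed)
    return rng.sample(list(dataset), min(k, len(dataset)))


def smoke_test(pipeline: Callable[[list], list], dataset: Sequence, k: int = 8) -> dict:
    """Run the pipeline on a sampled batch and report basic sanity checks."""
    batch = sample_batch(dataset, k)
    try:
        outputs = pipeline(batch)
    except Exception as exc:  # any crash counts as a failed test
        return {"ok": False, "error": repr(exc)}
    same_length = len(outputs) == len(batch)
    all_finite = all(
        isinstance(y, (int, float)) and math.isfinite(y) for y in outputs
    )
    return {"ok": same_length and all_finite, "error": None}


# Toy stand-ins for generated experiment code (hypothetical).
def good_pipeline(batch: list) -> list:
    return [2.0 * x + 1.0 for x in batch]


def broken_pipeline(batch: list) -> list:
    return [x / 0 for x in batch]  # raises ZeroDivisionError


data = list(range(1000))
good_report = smoke_test(good_pipeline, data)
bad_report = smoke_test(broken_pipeline, data)
print(good_report)
print(bad_report)
```

Running on a handful of sampled items keeps the check cheap; a passing report only shows that the code executes and returns well-formed values, not that it reproduces the paper's results.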

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of this framework could accelerate scientific progress by lowering the effort needed to verify new AI methods.
The paper lineage idea might extend to reproducing experiments in other research areas.
Further integration with advanced AI agents could address cases where cited literature lacks sufficient details.

Load-bearing premise

The paper lineage algorithm can extract enough implicit knowledge from cited literature to support full autonomous reproduction without additional domain expertise.

What would settle it

Demonstrating a paper where the system produces code that fails to match the original results or cannot execute despite access to all citations would challenge the central claim.

Figures

Figures reproduced from arXiv: 2505.20662 by Duzhen Zhang, Maosong Sun, Qi Shi, Shuo Wang, Weilun Zhao, Xuanle Zhao, Xu Han, Yuxuan Li, Zhiyuan Liu, Zilin Sang.

**Figure 2.** Comparison between the vanilla Transformer (top) and the proposed iTransformer (bottom). Transformer embeds the temporal token, which contains the multivariate representation of each time step. iTransformer embeds each series independently to the variate token, such that the attention module depicts the multivariate correlations and the feed-forward network encodes series representations. …

**Figure 2.** The correlation analysis of papers selected.

**Figure 3.** The input queries of AUTOREPRODUCE. They contain the ARXIV ID used to download the paper, plus the TASK, MODEL and METRIC in the paper that need to be reproduced. Accompanying prompt text (iterative dialogue template of LLM agents): History: {history string} / Current Step: {step}, Phase: {phase} / Task instructions: {current phase prompts} / [Overall Objective] Your overall goal is to follow the instructions to replicate the method pr…
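
The iterative dialogue template quoted in the Figure 3 entry (history, current step, phase, task instructions) can be rendered with ordinary string formatting. The sketch below is only an illustration: the function name `render_turn`, the `max_history` cutoff, the `" | "` separator and the `(none)` placeholder are assumptions, and the truncated "[Overall Objective]" part of the prompt is deliberately left out.

```python
# Field layout follows the template text in the figure entry; everything
# else (names, separator, history cutoff) is a hypothetical choice.
TEMPLATE = (
    "~~~~~~~~~\n"
    "History: {history}\n"
    "~~~~~~~~~\n"
    "Current Step: {step}, Phase: {phase}\n"
    "Task instructions: {instructions}\n"
)


def render_turn(
    history: list[str],
    step: int,
    phase: str,
    instructions: str,
    max_history: int = 5,
) -> str:
    """Fill the dialogue template, keeping only the most recent history items."""
    recent = history[-max_history:]
    history_text = " | ".join(recent) or "(none)"
    return TEMPLATE.format(
        history=history_text,
        step=step,
        phase=phase,
        instructions=instructions,
    )


first_prompt = render_turn([], 1, "paper lineage", "Summarize the cited work.")
later_prompt = render_turn(
    [f"turn {i}" for i in range(7)], 8, "coding", "Implement the model."
)
print(first_prompt)
print(later_prompt)
```

Truncating the history is one simple way to keep the prompt bounded over many steps; the source does not say how the actual system handles long histories.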

**Figure 4.** Prompt for the ADD command of code agent. Accompanying prompt text (abbreviation prompts for Paper Lineage stage of the research agent): Your task is to read the paper and identify the 3 most relevant papers from its references that help in understanding the paper’s contributions, including the proposed model architecture, experimental settings, and other details. These papers need to be in the same research field as the ones that ne…
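
The prompt text in the Figure 4 entry asks an LLM agent to pick the 3 most relevant papers from the reference list. As a rough stand-in for that judgement, the sketch below ranks references by Jaccard token overlap with the target paper's description. This is a swapped-in heuristic, not the authors' method, and every name and string in it (`top_k_references`, the `ref_*` keys, the toy descriptions) is hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_references(paper_text: str, references: dict[str, str], k: int = 3) -> list[str]:
    """Return the k reference keys whose descriptions overlap most with the paper."""
    target = tokens(paper_text)
    ranked = sorted(
        references.items(),
        key=lambda item: jaccard(target, tokens(item[1])),
        reverse=True,
    )
    return [key for key, _ in ranked[:k]]


# Toy inputs, invented for illustration only.
paper = "inverted transformer for multivariate time series forecasting"
refs = {
    "ref_transformer": "attention based transformer for sequence modeling",
    "ref_forecasting": "transformer for long time series forecasting",
    "ref_vision": "large scale image classification benchmark",
    "ref_multivariate": "multivariate time series generation with autoencoders",
    "ref_nlp": "language model pretraining on text corpora",
}
selected = top_k_references(paper, refs)
print(selected)
```

A lexical score like this ignores the prompt's other criteria (same research field, help with architecture and experimental settings), which is presumably why the described system delegates the choice to an LLM.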

**Figure 5.** Abbreviation prompts for Paper Lineage stage of the research agent

**Figure 6.** Abbreviation prompts for Paper Lineage stage of the code agent

**Figure 7.** Prompt for the EDIT command of code agent

**Figure 8.** Prompt for human evaluation instructions.

**Figure 9.** Prompt for human evaluation instructions.

**Figure 10.** Prompt for key points summarization. Accompanying prompt text (prompt for paper-level evaluation): Now, I’m presenting you with a generated code. You need to check whether the details of the code correspond to the key points. The experiment instructions for the generated code are {INSTRUCTION} You just need to consider the model, task, and dataset used in the instructions. There are a total of 5 comparison points, and each point is …

**Figure 11.** Prompt for paper-level score. The Points are the 5 key points generated by the LLM judge, and the Generated Code is the code generated by LLM Agents

**Figure 12.** Prompt for code-level score. The Reference Code and Generated Code are our curated official implementations and Agents’ generated code, respectively

**Figure 13.** Prompt for code-level score. The Reference Code and Generated Code are our curated official implementations and Agents’ generated code, respectively

**Figure 14.** Prompt for mixed-level score. The Reference Code and Generated Code are our curated official implementations and Agents’ generated code, respectively

**Figure 15.** Prompt for mixed-level score. The Reference Code and Generated Code are our curated official implementations and Agents’ generated code, respectively

Original abstract

Efficient reproduction of research papers is pivotal to accelerating scientific progress. However, the increasing complexity of proposed methods often renders reproduction a labor-intensive endeavor, necessitating profound domain expertise. To address this, we introduce the paper lineage, which systematically mines implicit knowledge from the cited literature. This algorithm serves as the backbone of our proposed AutoReproduce, a multi-agent framework designed to autonomously reproduce experimental code in a complete, end-to-end manner. To ensure code executability, AutoReproduce incorporates a sampling-based unit testing strategy for rapid validation. To assess reproduction capabilities, we introduce \ourbench, a benchmark featuring verified implementations, alongside comprehensive metrics for evaluating both reproduction and execution fidelity. Extensive evaluations on PaperBench and \ourbench demonstrate that AutoReproduce consistently surpasses existing baselines across all metrics. Notably, it yields substantial improvements in reproduction fidelity and final execution performance. The code is available at https://github.com/AI9Stars/AutoReproduce.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AutoReproduce mines citation lineage to feed a multi-agent reproduction pipeline and ships a new benchmark, but the abstract gives no isolated test of whether the lineage actually supplies the claimed implicit details.

The main thing to know is that this work builds a multi-agent system around a paper lineage algorithm that pulls implicit details from citations, then uses sampling-based unit testing to produce executable code from AI papers. They also release ourbench with verified implementations and report gains over baselines on reproduction fidelity and final execution metrics. The code is public at the GitHub link they give. That combination of lineage mining plus agents plus a dedicated benchmark is the concrete addition here.

The practical focus on quick validation through unit testing and the decision to include verified reference implementations in the benchmark are useful engineering choices that make the evaluation setup more credible than many agent papers. Open-sourcing helps others check or extend the pipeline.

The soft spot is the missing evidence that the lineage step is doing real work. The abstract presents lineage as the backbone that recovers unstated hyperparameters, data assumptions, and library versions, yet there is no ablation or metric showing what fraction of those details actually comes from the lineage versus what the agents obtain through prompting or external search. If the lineage extraction turns out to be mostly surface-level citation following, the reported improvements could be driven by the multi-agent scaffolding instead. That matches the stress-test concern and stands as a fair question on the current description.

This is aimed at researchers building agents for scientific tasks or working on reproducibility infrastructure. Someone looking for a benchmark or framework ideas in automated experiment reproduction could get value from the artifacts even if the central mechanism needs tighter validation. It deserves peer review so the full methods, any lineage-specific controls, and the evaluation protocols can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoReproduce, a multi-agent framework for end-to-end autonomous reproduction of experimental code from AI research papers. Its core component is the paper lineage algorithm, which extracts implicit knowledge from cited literature. The system adds sampling-based unit testing for executability validation. A new benchmark (ourbench) with verified implementations and metrics for reproduction/execution fidelity is proposed. Evaluations on PaperBench and ourbench report consistent outperformance over baselines in reproduction fidelity and final execution performance, with public code release.

Significance. If the paper lineage mechanism can be shown to systematically recover reproduction-critical details (unstated hyperparameters, data-processing assumptions, library versions) beyond what multi-agent prompting and search already provide, the work could meaningfully lower barriers to reproducing complex AI experiments. The introduction of ourbench and public code release are clear strengths that support reproducibility of the claimed results. However, the significance is currently limited by the absence of evidence isolating the lineage contribution.

major comments (2)

[Abstract / §3] Abstract and §3 (paper lineage description): the central claim states that the paper lineage 'systematically mines implicit knowledge from the cited literature' to enable fully autonomous reproduction. No ablation, metric, or quantitative breakdown is provided showing what fraction of reproduction-critical details (e.g., unstated hyperparameters or library versions) are recovered by lineage extraction versus supplied by the multi-agent loop or external search. Without this, the reported fidelity gains cannot be attributed to the novel lineage component rather than the scaffolding.
[§4] §4 (evaluation): the claim of 'substantial improvements in reproduction fidelity and final execution performance' on PaperBench and ourbench is presented without error bars, statistical significance tests, or details on how many runs were averaged. This makes it difficult to assess whether the superiority over baselines is robust or sensitive to post-hoc choices.
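
The breakdown requested in the first major comment (which reproduction-critical details come from lineage extraction and which from the rest of the system) could be tabulated along the following lines. This is a minimal sketch under stated assumptions: the detail labels, the three input sets, and the function name `attribution_breakdown` are hypothetical, and the numbers are toy values rather than results from the paper.

```python
def attribution_breakdown(
    required: set[str],
    from_lineage: set[str],
    from_other: set[str],
) -> dict[str, float]:
    """Fraction of required details recovered by each source."""
    if not required:
        raise ValueError("required must contain at least one detail")
    groups = {
        "lineage_only": (from_lineage - from_other) & required,
        "other_only": (from_other - from_lineage) & required,
        "both": from_lineage & from_other & required,
        "missing": required - from_lineage - from_other,
    }
    return {name: len(items) / len(required) for name, items in groups.items()}


# Toy annotation for one paper (invented for illustration).
required_details = {"lr", "batch_size", "seed", "normalization", "lib_version"}
lineage_details = {"lr", "normalization", "lib_version"}
other_details = {"lr", "batch_size"}

breakdown = attribution_breakdown(required_details, lineage_details, other_details)
print(breakdown)
```

The four groups partition the required set, so the fractions sum to one; the "lineage_only" share is the quantity that would isolate the lineage contribution.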

minor comments (2)

[Throughout] The notation for 'ourbench' and 'PaperBench' should be standardized (e.g., consistent capitalization and italicization) throughout the manuscript and figures.
[§4 / Figures] Figure captions and the benchmark description should explicitly list the exact metrics used for 'reproduction fidelity' and 'execution performance' so readers can interpret the tables without ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of our work. We address each major comment below and have revised the manuscript accordingly to strengthen the evidence for our claims.

Referee: [Abstract / §3] Abstract and §3 (paper lineage description): the central claim states that the paper lineage 'systematically mines implicit knowledge from the cited literature' to enable fully autonomous reproduction. No ablation, metric, or quantitative breakdown is provided showing what fraction of reproduction-critical details (e.g., unstated hyperparameters or library versions) are recovered by lineage extraction versus supplied by the multi-agent loop or external search. Without this, the reported fidelity gains cannot be attributed to the novel lineage component rather than the scaffolding.

Authors: We agree that an explicit quantitative isolation of the paper lineage's contribution would better attribute the observed gains. In the revised manuscript we have added a dedicated ablation subsection in §4. This compares the full AutoReproduce system against an ablated variant that disables lineage extraction while retaining the multi-agent loop and external search. We manually annotated a representative sample of reproduction-critical details (hyperparameters, data-processing steps, library versions) across the evaluated papers and report the fraction recovered exclusively by lineage versus the other components. The ablation shows that lineage extraction accounts for a substantial share of these details and directly improves fidelity metrics. We have also updated the abstract and §3 to reference these new results. revision: yes
Referee: [§4] §4 (evaluation): the claim of 'substantial improvements in reproduction fidelity and final execution performance' on PaperBench and ourbench is presented without error bars, statistical significance tests, or details on how many runs were averaged. This makes it difficult to assess whether the superiority over baselines is robust or sensitive to post-hoc choices.

Authors: We accept that the original presentation lacked sufficient statistical detail. The revised §4 now includes error bars (standard deviation) on all reported metrics, results of paired t-tests with p-values comparing AutoReproduce to each baseline, and an explicit statement that every metric is averaged over five independent runs using different random seeds for agent sampling and execution. These additions demonstrate that the reported improvements are robust and not sensitive to single-run variability. revision: yes
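
The statistics described in the second response (standard-deviation error bars and a paired t-test over five runs) can be sketched with the Python standard library. The scores below are placeholder values, not numbers from the paper, and the helper names are hypothetical. The sketch stops at the t statistic; a p-value additionally needs the t distribution with n − 1 degrees of freedom, for which a library routine such as SciPy's `scipy.stats.ttest_rel` could be used.

```python
import math
import statistics


def mean_and_std(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation, as used for error bars."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(system: list[float], baseline: list[float]) -> float:
    """t statistic of the per-run differences (system minus baseline)."""
    if len(system) != len(baseline) or len(system) < 2:
        raise ValueError("need two equally long lists with at least 2 runs")
    diffs = [s - b for s, b in zip(system, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder scores for five seeds (invented for illustration).
system_scores = [0.62, 0.60, 0.65, 0.61, 0.63]
baseline_scores = [0.55, 0.56, 0.58, 0.54, 0.57]

system_mean, system_std = mean_and_std(system_scores)
t_value = paired_t_statistic(system_scores, baseline_scores)
print(f"system: {system_mean:.3f} ± {system_std:.3f}, paired t = {t_value:.2f}")
```

Pairing by seed only makes sense when run i of the system and run i of the baseline share the same conditions; otherwise an unpaired test would be the more appropriate choice.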

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent benchmark and code release

full rationale

The paper introduces paper lineage as a novel algorithm that mines implicit knowledge from cited literature and positions it as the backbone of the multi-agent AutoReproduce framework. It further introduces the new benchmark ourbench containing verified implementations and reports performance gains on both PaperBench and ourbench against baselines. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims are supported by external evaluations and public code rather than reducing to self-definition or prior self-citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on the novel paper lineage concept and assumptions about LLM capabilities for knowledge extraction and code generation; no explicit fitted parameters are mentioned in the abstract.

axioms (1)

domain assumption: Large language models can reliably mine and apply implicit knowledge from paper citations to generate executable experimental code.
This underpins the paper lineage backbone but is not justified or detailed in the abstract.

invented entities (1)

paper lineage (no independent evidence)
purpose: Systematically mines implicit knowledge from the cited literature to support code reproduction.
New algorithmic construct introduced as the core of the framework.

pith-pipeline@v0.9.0 · 5716 in / 1218 out tokens · 56065 ms · 2026-05-19T14:10:24.286011+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
cs.AI · 2026-04 · unverdicted · novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
cs.AI · 2026-04 · conditional · novelty 7.0

FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
cs.CR · 2026-05 · conditional · novelty 6.0

Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
cs.CL · 2026-04 · unverdicted · novelty 6.0

HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.

AI for Auto-Research: Roadmap & User Guide
cs.AI · 2026-05 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
