pith. machine review for the scientific record.

arxiv: 2602.11354 · v2 · submitted 2026-02-11 · 💻 cs.AI · cs.CL

Recognition: no theorem link

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 01:52 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM agents · replicability · benchmark · social sciences · data retrieval · computational experiments · ReplicatorBench · replication process

The pith

LLM agents design and run replication experiments effectively but fail to retrieve the new data needed to replicate social and behavioral science claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReplicatorBench to evaluate AI agents on end-to-end replication of research claims, using both replicable and non-replicable examples from social and behavioral sciences. It divides the process into retrieving data, designing and running experiments, and interpreting outcomes, then tests a baseline agent equipped with web search and code execution tools. The central finding is that agents handle the computational design and execution stages well across different models and settings, yet consistently struggle with locating and accessing external resources like fresh datasets. A sympathetic reader would care because this gap shows why current agents cannot yet serve as reliable, scalable replicators in fields where data availability is inconsistent.

Core claim

ReplicatorBench supplies human-verified replicable and non-replicable claims and measures agent performance across three stages: extraction and retrieval of replication data, design and execution of computational experiments, and interpretation of results. When tested with ReplicatorAgent using tools for web search and sandboxed code execution, agents succeed at designing and running experiments but fail at retrieving new data required to replicate claims.

What carries the argument

ReplicatorBench benchmark with human-verified replicable and non-replicable claims, structured around three replication stages, paired with ReplicatorAgent framework that uses web search and iterative sandboxed code execution.
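The staged pipeline described above can be pictured as three functions, where each stage consumes the previous stage's output and a retrieval miss short-circuits the run. This is a minimal sketch, not the paper's ReplicatorAgent: the real framework uses web-search and sandboxed code-execution tools, which are replaced here by toy lookups and a trivial "experiment".

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    stage: str
    ok: bool
    note: str = ""

# Toy stand-in for a web-search tool: which claims have retrievable data.
AVAILABLE_DATA = {"claim-a": [2.1, 2.4, 1.9, 2.2]}  # hypothetical

def extract(claim_id):
    """Stage 1: retrieve new data for the claim (the paper's weak spot)."""
    data = AVAILABLE_DATA.get(claim_id)
    if data is None:
        return StageResult("extraction", False, "no replication data found"), None
    return StageResult("extraction", True), data

def generate(data):
    """Stage 2: design and run a computational 'experiment' (here, a mean)."""
    effect = sum(data) / len(data)
    return StageResult("generation", True), effect

def interpret(effect, claimed_effect, tol=0.5):
    """Stage 3: decide whether the result supports the original claim."""
    replicated = abs(effect - claimed_effect) <= tol
    return StageResult("interpretation", True,
                       "criteria met" if replicated else "criteria unmet")

def replicate(claim_id, claimed_effect):
    r1, data = extract(claim_id)
    if not r1.ok:
        return [r1]  # downstream stages never run without data
    r2, effect = generate(data)
    r3 = interpret(effect, claimed_effect)
    return [r1, r2, r3]
```

Running `replicate("claim-a", 2.0)` walks all three stages and ends with "criteria met"; `replicate("claim-b", 2.0)` stops at extraction, which is the failure mode the paper reports as dominant.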

If this is right

  • Agents achieve higher success when given partial code access than when starting from no code.
  • Performance varies across underlying LLMs, with some models handling experiment execution more reliably.
  • The benchmark distinguishes retrieval failures from execution failures, allowing targeted diagnosis of agent weaknesses.
  • Struggles with new data retrieval limit agents to reproduction tasks rather than true replication.
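The retrieval-versus-execution distinction in the bullets above becomes actionable once each run logs the first stage that failed: per-stage failure rates then localize the bottleneck. A toy aggregation under that assumption (the log format is hypothetical, not the benchmark's):

```python
from collections import Counter

# Hypothetical run logs: each run records the first stage that failed,
# or None if the full pipeline succeeded.
runs = [
    {"claim": "c1", "first_failure": "extraction"},
    {"claim": "c2", "first_failure": "extraction"},
    {"claim": "c3", "first_failure": None},
    {"claim": "c4", "first_failure": "generation"},
    {"claim": "c5", "first_failure": "extraction"},
]

def diagnose(runs):
    """Return the fraction of runs whose first failure was each stage."""
    counts = Counter(r["first_failure"] for r in runs if r["first_failure"])
    total = len(runs)
    return {stage: n / total for stage, n in counts.items()}

print(diagnose(runs))  # extraction dominates: 0.6 vs 0.2
```

An outcome-only benchmark would score all four failed runs identically; the staged log is what separates a retrieval problem from an execution problem.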

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Better integration with external databases or search APIs could close the main performance gap identified.
  • The same three-stage structure could be adapted to test replication in other disciplines with sparse data.
  • Agents that improve on retrieval might enable automated screening of large literatures for questionable findings.

Load-bearing premise

The human-verified labels correctly identify which claims are replicable in practice, and the chosen examples represent the range of social and behavioral science research.

What would settle it

An experiment in which the tested agents successfully retrieve new external datasets for non-replicable claims and correctly conclude that replication fails, using only the benchmark's provided tools.

Figures

Figures reproduced from arXiv: 2602.11354 by Adam Gill, Anna Szabelska, Bang Nguyen, Dominik Soós, Jian Wu, Meng Jiang, Qian Ma, Rochana R. Obadage, Sai Koneru, Sarah Rajtmajer, Shakhlo Nematova, Timothy M. Errington, Zack Ranjan.

Figure 1: Given a research paper and a focal claim, …
Figure 2: The Generation-Execution stage has two main …
Figure 3: Performance of GPT-4o ReplicatorAgent in a Python-only setting compared to the Native setting. LLMEval scores are reported for the design, execution, and interpretation stages, with error bars calculated as a 95% confidence interval of the mean. Macro F1 scores are reported for the final replication outcome (criteria met or unmet).
Figure 4: Performance of GPT-5 ReplicatorAgent with and without access to human-written replication code. LLMEval scores are reported for the design, execution, and interpretation stages, with error bars calculated as a 95% confidence interval of the mean. Macro F1 scores are reported for the final replication outcome (criteria met or unmet).
Figure 5: Execution trace for resolving a Docker build failure caused by an incompatible SciPy version.
Figure 6: Design and execution trace of an incorrect translation attempt from R to Python.
Figure 7: Description of 10 computational execution checkpoints.
Figure 8: LLMEval prompt template for the Extraction, Generation-Design, and Interpretation stages.
Figure 9: LLMEval prompt for the Generation-Execution stage.
Figure 10: Schema for the Extraction Stage, Subtask 1: gathering key information about the focal claim for replication.
Figure 11: Schema for the Extraction Stage, Subtask 2: web search for replication data.
Figure 12: Schema for the Generation Stage, Design Step: preregister the replication research plan.
Figure 13: Schema for the Generation Stage, Execution Step: conduct the replication study.
Figure 14: Schema for the Interpretation Stage: make conclusions regarding the claim's replicability.
Figure 15: System prompt for a ReAct agent.
Figure 16: Agent prompt for the Extraction Stage, post-registration sub-task.
Figure 17: Agent prompt for the Extraction Stage, web-search sub-task.
Figure 18: Agent prompt for the Generation Stage, Design Step.
Figure 19: Agent prompt for the Generation Stage, Execution Step.
Figure 20: Agent prompt for the Interpretation Stage.
read the original abstract

The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm-benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ReplicatorBench, an end-to-end benchmark for LLM agents performing research replication in social and behavioral sciences. It supplies human-verified replicable and non-replicable claims and structures evaluation across three stages—resource extraction and retrieval, computational experiment design and execution, and result interpretation—while releasing ReplicatorAgent as a baseline agent equipped with web-search and sandbox tools. Experiments across four LLMs, programming languages, and code-access levels show agents succeed at design/execution but fail at retrieving new data needed for replication.

Significance. If the ground-truth labels are robust, the work supplies a more realistic testbed than prior reproduction-only benchmarks by including non-replicable claims and retrieval demands that mirror real-world replication. Public release of code, data, and the agent framework enables direct follow-on research and provides concrete comparative data on current LLM limitations.

major comments (1)
  1. [Human-verification section (methods)] The central performance differential (strong design/execution, weak retrieval) rests on the assumption that the human labels correctly isolate retrieval difficulty as the decisive barrier. The manuscript states that claims were human-verified but supplies no protocol, inter-rater reliability statistics, or explicit criteria for attributing non-replicability to missing resources versus statistical, design, or ethical factors. Without this documentation, the reported gap cannot be confidently attributed to agent capability rather than benchmark construction.
minor comments (2)
  1. [Abstract] The claim that the benchmark captures "inconsistent availability of new data" would be strengthened by a single sentence summarizing the verification protocol.
  2. [Evaluation tables] Report the exact number of replicable versus non-replicable claims and their distribution across fields, so readers can assess whether the retrieval-failure result generalizes or is driven by a narrow subset of tasks.
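The class-balance worry behind this comment is concrete because the figures score the final verdict with Macro F1, which averages per-class F1 without weighting by class size. A self-contained sketch (labels illustrative, not the paper's data) shows why a skewed replicable/non-replicable split matters:

```python
def macro_f1(gold, pred):
    """Macro F1: unweighted mean of per-class F1 over the label set."""
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# An agent that always answers "met" looks fine on accuracy when most
# claims replicate, but Macro F1 exposes the missed "unmet" class.
gold = ["met", "met", "met", "unmet"]
pred = ["met", "met", "met", "met"]
print(macro_f1(gold, pred))  # ≈ 0.43: met F1 = 6/7, unmet F1 = 0
```

Without the exact class counts, readers cannot tell how far a reported Macro F1 sits from this always-"met" baseline.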

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that human experts can reliably classify claims as replicable or non-replicable and that the chosen social science papers form a representative sample; no free parameters are fitted to produce the main findings and no new physical or theoretical entities are introduced.

axioms (1)
  • Domain assumption: Human experts can accurately and consistently label research claims as replicable or non-replicable based on available materials.
    The benchmark construction depends on these human-verified labels to provide ground truth for agent performance.

pith-pipeline@v0.9.0 · 5663 in / 1241 out tokens · 104800 ms · 2026-05-16T01:52:40.636861+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

    cs.DL 2026-05 unverdicted novelty 6.0

    ARA extracts workflow graphs from papers and scores reproducibility, reaching 61% accuracy on 213 ReScience C articles and outperforming priors on ReproBench and GoldStandardDB.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Nazanin Alipourfard, Beatrix Arendt, Daniel M Benjamin, Noam Benkler, Michael Bishop, Mark Burstein, Martin Bush, James Caverlee, Yiling Chen, Chae Clark, et al. 2021. Systematizing confidence in open research and evidence (SCORE). SocArXiv (2021)

  2. [2]

    Abel Brodeur, Derek Mikola, Nikolai Cook, Thomas Brailey, Ryan Briggs, Alexandra de Gendre, Yannick Dupraz, Lenka Fiala, Jacopo Gabani, Romain Gauriot, et al. 2024. Mass Reproducibility and Replicability: A New Hope. Technical Report. The Institute for Replication (I4R)

  3. [3]

    Colin F Camerer, Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, et al. 2016. Evaluating replicability of laboratory experiments in economics. Science 351, 6280 (2016), 1433–1436

  4. [4]

    Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349, 6251 (2015), aac4716. doi:10.1126/science.aac4716

  5. [5]

    National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. (2019)

  6. [6]

    Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. 2025. REPRO-BENCH: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? In Findings of the Association for Computational Linguistics: ACL 2025. 23616–23626

  7. [7]

    Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber. 2025. ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code. arXiv preprint arXiv:2506.02314 (2025)

  8. [8]

    Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/

  9. [9]

    Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. 2025. LLM4SR: A survey on large language models for scientific research. arXiv preprint arXiv:2501.04306 (2025)

  10. [10]

    Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2024. DiscoveryBench: Towards data-driven discovery with large language models. arXiv preprint arXiv:2407.01725 (2024)

  11. [11]

    Brian A Nosek and Timothy M Errington. 2020. What is replication? PLoS Biology 18, 3 (2020), e3000691

  12. [12]

    Shuo Ren, Pu Jian, Zhenjiang Ren, Chunlin Leng, Can Xie, and Jiajun Zhang. 2025. Towards scientific intelligence: A survey of LLM-based scientific agents. arXiv preprint arXiv:2503.24047 (2025)

  14. [14]

    Zachary S Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. 2024. CORE-Bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. arXiv preprint arXiv:2409.11363 (2024)

  15. [15]

    Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. 2025. PaperBench: Evaluating AI's Ability to Replicate AI Research. arXiv preprint arXiv:2504.01848 (2025)

  16. [16]

    Minyang Tian, Luyu Gao, Shizhuo Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, et al. 2024. SciCode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems 37 (2024), 30624–30650

  17. [17]

    Lukas Twist, Jie M Zhang, Mark Harman, Don Syme, Joost Noppen, and Detlef Nauck. 2025. LLMs Love Python: A Study of LLMs' Bias for Programming Languages and Libraries. arXiv preprint arXiv:2503.17181 (2025)

  18. [18]

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better LLM agents. In Forty-first International Conference on Machine Learning

  19. [19]

    Emily C Willroth and Olivia E Atherton. 2024. Best laid plans: A guide to reporting preregistration deviations. Advances in Methods and Practices in Psychological Science 7, 1 (2024), 25152459231213802

  20. [20]

    Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He. 2025. SciReplicate-Bench: Benchmarking LLMs in agent-driven algorithmic reproduction from research papers. arXiv preprint arXiv:2504.00255 (2025)

  21. [21]

    Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, et al. 2025. LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research. arXiv preprint arXiv:2506.17335 (2025)

  22. [22]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. [n. d.]. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations

  23. [23]

    Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian LV Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, et al. 2025. ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? arXiv preprint arXiv:2510.24591 (2025)

  24. [24]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)

  25. [25]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623
