Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

Alexander Hoyle; Benjamin Kohler; David Zollikofer; Elliott Ash; Johanna Einsiedler

arxiv: 2604.21965 · v1 · submitted 2026-04-23 · 💻 cs.AI

Pith reviewed 2026-05-09 21:17 UTC · model grok-4.3

keywords LLM agents · reproducibility · social science · agentic systems · empirical reproduction · methods extraction · error attribution

The pith

LLM agents can largely reproduce social science results from a paper's methods description and original data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a system in which LLM agents extract structured methods from a paper, generate and execute reimplementations while seeing neither the original code nor results, and compare outputs at the cell level. It tests this setup across four agent scaffolds and four models on 48 papers that humans had already confirmed are reproducible from text and data. Agents recover most published findings, yet success rates differ sharply by model, scaffold, and paper. Discrepancies trace to two sources: mistakes made by the agents themselves and gaps in how completely the papers describe their procedures. The work shows that agents can serve as an automated check on empirical claims even when full replication code is unavailable.

Core claim

An agentic reproduction pipeline that parses methods text into executable steps, runs reimplementations under strict information isolation, performs deterministic cell-by-cell comparison against original outputs, and attributes errors back through the pipeline enables agents to recover the majority of published social-science results on a set of 48 human-verified papers.

What carries the argument

The agentic reproduction system that extracts structured methods descriptions, generates isolated reimplementations, executes deterministic comparisons, and performs root-cause error attribution.
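The deterministic cell-level comparison can be pictured with a minimal sketch. This is not the paper's implementation: the table representation (a dict keyed by row and column label), the function names `compare_cells` and `match_rate`, and the tolerance values are all assumptions made for illustration, and the numbers are invented.

```python
import math


def compare_cells(original, reproduced, rel_tol=0.01, abs_tol=1e-9):
    """Label every cell of the original table as match, mismatch, or missing.

    Both tables are dicts keyed by (row_label, column_label). The tolerances
    are illustrative assumptions, not values taken from the paper.
    """
    verdicts = {}
    for key, expected in original.items():
        if key not in reproduced:
            verdicts[key] = "missing"
        elif math.isclose(reproduced[key], expected, rel_tol=rel_tol, abs_tol=abs_tol):
            verdicts[key] = "match"
        else:
            verdicts[key] = "mismatch"
    return verdicts


def match_rate(verdicts):
    """Share of original cells that were reproduced within tolerance."""
    if not verdicts:
        return 0.0
    return sum(v == "match" for v in verdicts.values()) / len(verdicts)


# Hypothetical numbers for illustration only.
original = {("treatment", "(1)"): 0.120, ("treatment", "(2)"): 0.095, ("N", "(1)"): 1500.0}
reproduced = {("treatment", "(1)"): 0.1204, ("treatment", "(2)"): 0.150}
verdicts = compare_cells(original, reproduced)
```

Because the comparison is a pure function of the two tables, rerunning it on the same outputs always yields the same verdicts, which is the property the word "deterministic" points at.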

Load-bearing premise

The 48 papers selected allow exact reproduction of results from the methods description and data alone, with no hidden information required and with the isolation protocol preserved throughout.

What would settle it

A larger collection of papers that humans have independently verified as fully reproducible from methods text and data alone, on which the agents nevertheless fail to match the published cell-level results under the same isolation rules.

Figures

Figures reproduced from arXiv: 2604.21965 by Alexander Hoyle, Benjamin Kohler, David Zollikofer, Elliott Ash, Johanna Einsiedler.

**Figure 2.** Performance metrics of agent-reproduced coefficients.

**Figure 3.** Mean percentage differences between original and reproduced, by statistic type.

**Figure 4.** Grade comparison by aggregation level and agent.

**Figure 5.** Token usage, run durations, and run costs, by agent.

**Figure 6.** Effort allocation across tasks. The figures show the number of tool-call actions taken by the agents (left) and the relative volume of text (number of characters) emitted (right), colored by function category. The best-performing agent, OpenCode GPT-5.4, consistently operates at a higher level of effort. It consumes many more tokens and takes much more time per run than competing approaches, and it is also…

**Figure 7.** Error analysis pipeline: Number of divergences by error source.

**Figure 8.** Stability of results between multiple reproduction runs.

**Figure 9.** Pre-training leakage evaluation. Average table grades for sample of papers published before and after the model knowledge cutoff. No statistical difference suggests that performance of models in main analysis is not driven by pre-training leakage.
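The Figure 9 caption reports no statistical difference between papers published before and after the model knowledge cutoff, but it does not name the test used. As one generic way such a comparison could be run, the sketch below implements an exact two-sided permutation test on the difference in mean grades. The function name `exact_permutation_pvalue` and the grade values are hypothetical, and the test choice is an assumption rather than the authors' method.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Exact two-sided permutation test on the absolute difference in means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    at_least_as_extreme = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in indices if i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small slack guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical average table grades, for illustration only.
grades_before_cutoff = [0.80, 0.70, 0.90, 0.60]
grades_after_cutoff = [0.75, 0.85, 0.65, 0.70]
p_value = exact_permutation_pvalue(grades_before_cutoff, grades_after_cutoff)
```

A large p-value here would only mean the data give no evidence of a difference; with small samples it cannot by itself rule out leakage.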

Original abstract

Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.
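The agent instructions quoted further down this page restrict file access to the workspace, with an exception for a `data/` directory that may be a symlink pointing outside it. A minimal sketch of how such a rule could be checked follows. The function `is_allowed` and the directory names are hypothetical, the paper does not say how (or whether) the rule is enforced in code, and the sketch assumes Python 3.9+ on a system that permits symlinks.

```python
import os
import tempfile
from pathlib import Path


def is_allowed(requested, workspace):
    """Return True if `requested` stays inside `workspace`.

    Assumed policy: paths are judged lexically first (so ".." cannot escape),
    the top-level data/ entry is trusted even if it is a symlink, and every
    other path must still be inside the workspace after resolving symlinks.
    """
    workspace = Path(workspace).resolve()
    lexical = Path(os.path.normpath(workspace / requested))
    try:
        relative = lexical.relative_to(workspace)
    except ValueError:
        return False  # escapes via ".." or points at an unrelated absolute path
    if relative.parts and relative.parts[0] == "data":
        return True  # the data/ symlink may legitimately point outside
    return lexical.resolve().is_relative_to(workspace)  # Python 3.9+


# Build a throwaway workspace to exercise the rule.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    external_data = Path(tmp) / "external_data"
    workspace.mkdir()
    external_data.mkdir()
    (workspace / "data").symlink_to(external_data, target_is_directory=True)
    (workspace / "shortcut").symlink_to(external_data, target_is_directory=True)
    checks = {
        path: is_allowed(path, workspace)
        for path in [
            "prepare_data.py",
            "data/survey.csv",
            "../external_data/survey.csv",
            "shortcut/survey.csv",
        ]
    }
```

A check like this covers only the filesystem; the rules against searching for the paper or its results would need separate controls on network access.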

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Agents recover many results from text and data alone under isolation, but the 48-paper test set may not be as clean as claimed.

The main takeaway is that LLM agents can reproduce a decent share of social-science results when limited to methods text plus the original data files, with no access to the published code or tables. Performance still swings a lot by model, scaffold, and paper, and the error-tracing step shows failures come from both the agents and gaps in the papers themselves.

That combination of strict isolation, cell-level matching, and root-cause logging is the clearest addition over earlier reproduction work that handed agents the code up front. The scaffold design and deterministic comparison look like practical engineering that others could build on. The evaluation covers four scaffolds and four models on 48 human-verified papers, which gives a broader picture than single-model demos. The root-cause breakdown is also useful for seeing where the bottlenecks actually sit.

The soft spot is the paper selection. The human verification that picked the 48 cases is described at a high level, and it is not obvious that those verifiers worked under the same information limits the agents faced. If they had access to more of the original paper or results during verification, the reported success rates could be higher than what the isolation protocol truly allows, which would also blur the split between agent errors and underspecification. The abstract itself gives no numbers on overall recovery rates or error distributions, so the strength of the central claim is still hard to judge without the tables.

This is aimed at groups building agent systems for empirical verification or at social scientists who want better tools to check published claims. A reader who cares about reproducibility infrastructure will find the isolation protocol and comparison method worth looking at. The work is coherent enough on its own terms to go to a serious referee, even if the selection details and quantitative breakdowns need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces an agentic reproduction system that extracts structured methods descriptions from social science papers, reimplements the analyses using only the original data files under strict information isolation (agents never see original code, results, or full paper text), and performs deterministic cell-level comparison of outputs to published results. An error attribution step traces discrepancies to root causes. On a set of 48 papers pre-selected via human verification of reproducibility, the evaluation across four agent scaffolds and four LLMs finds that agents can largely recover published results, though performance varies substantially by model, scaffold, and paper; failures are attributed to both agent errors and underspecification in the papers themselves.

Significance. If the central claims hold after addressing the verification protocol, this work would be a meaningful contribution to LLM agent applications in scientific reproducibility. It moves beyond prior code-assisted reproduction to a stricter text-plus-data setting and introduces a traceable error attribution mechanism that could help diagnose both AI limitations and reporting gaps in empirical papers. The use of human-verified baselines and deterministic output matching are explicit strengths that support falsifiable evaluation.

major comments (2)

§4 (Evaluation Setup / Paper Selection): The human verification process used to select the 48 papers is described only at a high level. The manuscript does not report whether verifiers reproduced results using exactly the same inputs and strict isolation constraints given to the agents (structured methods description plus data files only, with no access to the original paper text, code, or published tables). Without an independent check that the same inputs suffice for human reproduction under isolation, the selected set may include papers that are not reproducible under the claimed protocol, which would inflate reported agent success rates and confound the root-cause attribution between agent errors and paper underspecification.
§5 (Results): The central claim that agents 'largely recover' published results is not supported by sufficient quantitative detail. The manuscript does not report overall success rates (e.g., fraction of papers with exact cell-level match), per-paper accuracy distributions, breakdowns by error type or scaffold, or explicit controls for data leakage. This absence makes it impossible to assess the strength of the 'largely recover' statement or the reported variations across the four models and four scaffolds.
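The aggregate quantities requested in the §5 comment (an overall exact-match rate and per-paper accuracy) are simple to derive from per-cell outcomes. The sketch below shows one possible aggregation; the record layout, the function name `summarize`, and the agent and paper identifiers are hypothetical and do not come from the manuscript.

```python
from collections import defaultdict
from statistics import mean


def summarize(cell_records):
    """Aggregate per-cell outcomes into per-agent summaries.

    cell_records holds (agent, paper_id, matched) tuples, one per compared cell.
    """
    per_paper = defaultdict(list)
    for agent, paper_id, matched in cell_records:
        per_paper[(agent, paper_id)].append(bool(matched))

    # Per-paper accuracy: share of that paper's cells that matched.
    per_agent = defaultdict(list)
    for (agent, _paper_id), flags in per_paper.items():
        per_agent[agent].append(sum(flags) / len(flags))

    return {
        agent: {
            "papers": len(accuracies),
            "mean_cell_accuracy": mean(accuracies),
            "fully_matched_share": sum(a == 1.0 for a in accuracies) / len(accuracies),
        }
        for agent, accuracies in per_agent.items()
    }


# Hypothetical outcomes for illustration only.
records = [
    ("agent_a", "paper_1", True), ("agent_a", "paper_1", True),
    ("agent_a", "paper_2", True), ("agent_a", "paper_2", False),
    ("agent_b", "paper_1", False), ("agent_b", "paper_1", False),
]
summary = summarize(records)
```

Reporting both the mean cell accuracy and the share of fully matched papers matters because a few papers with many cells can otherwise dominate a pooled cell-level rate.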

minor comments (2)

Abstract: The summary would be strengthened by including at least one key quantitative result (e.g., overall reproduction rate or range across models) to ground the 'largely recover' claim.
Figure 1 (system diagram): The boundaries of the strict information-isolation protocol could be labeled more explicitly to clarify what information is withheld from agents at each step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide greater precision on the evaluation protocol and to include additional quantitative results.

Referee: §4 (Evaluation Setup / Paper Selection): The human verification process used to select the 48 papers is described only at a high level. The manuscript does not report whether verifiers reproduced results using exactly the same inputs and strict isolation constraints given to the agents (structured methods description plus data files only, with no access to the original paper text, code, or published tables). Without an independent check that the same inputs suffice for human reproduction under isolation, the selected set may include papers that are not reproducible under the claimed protocol, which would inflate reported agent success rates and confound the root-cause attribution between agent errors and paper underspecification.

Authors: We agree that the human verification protocol merits more explicit description. The verifiers were given the full original papers, code, and data to establish that the published results were reproducible in principle; this is distinct from the agents' strict isolation to only the extracted structured methods description and data files. As a result, the 48 papers are pre-screened for general reproducibility rather than guaranteed reproducibility under the exact agent constraints. In the revised manuscript we have expanded the description in §4 to detail the information provided to human verifiers, explicitly note the difference in access, and add a limitations paragraph clarifying that reported agent success rates are conditional on this pre-selection. This revision helps readers distinguish agent limitations from paper underspecification without overstating the findings.
Revision: yes

Referee: §5 (Results): The central claim that agents 'largely recover' published results is not supported by sufficient quantitative detail. The manuscript does not report overall success rates (e.g., fraction of papers with exact cell-level match), per-paper accuracy distributions, breakdowns by error type or scaffold, or explicit controls for data leakage. This absence makes it impossible to assess the strength of the 'largely recover' statement or the reported variations across the four models and four scaffolds.

Authors: We accept that aggregate quantitative summaries are needed to substantiate the central claim. Although the manuscript already presents per-scaffold and per-model comparisons, it lacks overall success rates, accuracy distributions, and explicit error-type breakdowns. In the revised version we have added these elements to §5: an overall exact-match success rate across papers, histograms of per-paper cell-level accuracy, breakdowns by attributed error source (agent versus paper underspecification) and by scaffold/model, plus a statement confirming the isolation protocol precludes data leakage. These additions, supported by new tables, allow a clearer evaluation of the 'largely recover' statement and the observed variations.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against external baselines

This is an empirical systems paper that measures agent performance on 48 human-verified papers using deterministic output comparison under information isolation. No derivations, equations, fitted parameters renamed as predictions, or self-referential claims exist. Results are benchmarked against independently human-verified reproducibility, with root-cause analysis tracing failures to agent behavior or paper underspecification. The selection protocol and isolation claims are experimental design choices, not a derivation chain that reduces to its own inputs by construction. Any potential weakness in verifying the isolation of the human baseline is a validity concern, not circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The evaluation rests on the assumption that human verification correctly identified reproducible papers and that the agent system maintained true isolation from original code and results.

axioms (1)

Domain assumption: Human verification of reproducibility for the 48 papers is accurate and complete.
Used to construct the test set against which agent performance is measured.

invented entities (1)

Agentic reproduction system with error attribution step (no independent evidence)
Purpose: To extract methods, generate code under isolation, compare outputs, and trace root causes of discrepancies.
Core novel component introduced by the paper; no independent evidence provided beyond the evaluation itself.

pith-pipeline@v0.9.0 · 5454 in / 1315 out tokens · 36691 ms · 2026-05-09T21:17:47.881339+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 2 canonical work pages

[1]

Accessed: 2026-03-31. John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8): e124, 2005. doi: 10.1371/journal.pmed.0020124. Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579,...

work page doi:10.1371/journal.pmed.0020124 2026
[2]

arXiv preprint arXiv:2602.16733 , year=

URL https://openreview.net/forum?id=cy8mq7QYae. Yiqing Xu and Leo Yang Yang. Scaling reproducibility: An AI-assisted workflow for large-scale replication and reanalysis, 2026. URL https://arxiv.org/abs/2602.16733. John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interf...

work page arXiv 2026
[3]

Do not access files outside of it or navigate to parent directories

**Workspace only.** Only read and write files inside this workspace directory. Do not access files outside of it or navigate to parent directories. **Exception:** the `data/` directory may be a symlink pointing outside the workspace -- this is intentional and you should use it freely as your dataset
[4]

Do not search for prior replication attempts

**No searching for the paper.** Do not search the internet for this paper, its authors, its published results, or any replication code or packages. Do not search for prior replication attempts
[5]

Your replication must be derived entirely from the methodology summary and the data provided

**No searching for results.** Do not look up expected coefficients, effect sizes, tables, or figures from this paper. Your replication must be derived entirely from the methodology summary and the data provided
[6]

statsmodels, pandas, matplotlib) and general statistical methods

**Allowed web use.** You may search for Python library documentation (e.g. statsmodels, pandas, matplotlib) and general statistical methods
[7]

## Instructions

**Work independently.** Base your replication only on the methodology description in this file and the dataset. ## Instructions
[8]

The methodology summary above describes the variables in detail, so a brief check should suffice

**Quick data check**: Inspect the data files to confirm column names and basic structure. The methodology summary above describes the variables in detail, so a brief check should suffice
[9]

All scripts can import from this module

**Write `prepare_data.py`**: Load and clean the data following the processing steps described above. All scripts can import from this module
[10]

Write the script (see output filename specified for each item above) b

**Write and execute one script at a time**: For each item: a. Write the script (see output filename specified for each item above) b. Execute it c. Fix any errors immediately d. Move on to the next item once the output file is verified {table_instructions}{figure_instructions}
[11]

for specialized estimators), you can call R from Python using `rpy2`

**R packages**: If the paper's methodology requires R-specific packages (e.g. for specialized estimators), you can call R from Python using `rpy2`
[12]

Execute every script and verify the output file exists
[13]

Python library documentation

Save all outputs in the current working directory. **Reasonable assumptions.** Where the methodology description is incomplete or ambiguous, you are free to make reasonable assumptions based on common practice in the field. Document your assumptions briefly in comments. Focus on substance and accuracy. Match the described methodology as closely as possibl...
[14]

missing" and describe the missing file(s). Otherwise set data_available =

DATA AVAILABILITY --- Check whether the data files required to produce {item_id} exist in the data directory inside original_code/. If any required input file is absent from the replication package, set data_available = "missing" and describe the missing file(s). Otherwise set data_available = "available"
[15]

<description of what is missing>

AGENT CODE --- In agent_code/*.py, find the code that computes values for the REMAINING cells listed above (ignore already-attributed sections). If no such code exists, set agent_behavior = "<description of what is missing>"
[16]

ORIGINAL CODE --- In original_code/{original_file_glob}, find the equivalent {original_language} code
[17]

Most outputs have one, but multi-panel tables with different specifications may have several (e.g

DISCREPANCIES --- For the REMAINING cells only, identify all distinct root causes. Most outputs have one, but multi-panel tables with different specifications may have several (e.g. columns 1--3 OLS wrong clustering; columns 4--6 IV wrong instrument). If all remaining cells are already explained, return an empty divergences array. List one entry per disti...
[18]

All columns

SECTION MAPPING --- For each discrepancy, list which rows/columns/panels it explains (e.g. "All columns", "Columns 1--3 (OLS)", "Panel B"). Maps many cells to one cause. Also list each specific affected cell from the cell table above as {{"item_id": "{item_id}", "row_label": "...", "column_label": "..."}} using the exact labels shown. Include all cells yo...
[19]

<item_id>

ALSO EXPLAINS --- For each discrepancy, list which other failures it also explains. Each entry is either: - A plain string "<item_id>" if the discrepancy explains the entire other item, OR - An object {{"item_id": "<item_id>", "sections": "<which part>"}} if only partial. Be conservative: only include items you are confident share the exact same root caus...
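The last few extracted fragments describe the shape of the error-attribution output: a divergences array, affected cells given as `item_id` / `row_label` / `column_label` objects, and "also explains" entries that are either a plain item id or an object with `item_id` and `sections`. The sketch below shows how such records might be normalized and used to find failing cells that no divergence explains. Only those key names come from the fragments; the container keys `root_cause`, `cells`, and `also_explains`, the helper names, and the example values are assumptions.

```python
def normalize_also_explains(entries):
    """Turn mixed string/object entries into uniform dicts."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            # A plain string means the cause explains the entire other item.
            normalized.append({"item_id": entry, "sections": None})
        elif isinstance(entry, dict) and "item_id" in entry:
            normalized.append(
                {"item_id": entry["item_id"], "sections": entry.get("sections")}
            )
        else:
            raise ValueError(f"unrecognised also-explains entry: {entry!r}")
    return normalized


def unexplained_cells(failing_cells, divergences):
    """Return failing (item_id, row_label, column_label) cells no divergence covers."""
    explained = {
        (cell["item_id"], cell["row_label"], cell["column_label"])
        for divergence in divergences
        for cell in divergence["cells"]
    }
    return [cell for cell in failing_cells if cell not in explained]


# Hypothetical attribution output for illustration only.
divergences = [{
    "root_cause": "standard errors clustered at the wrong level",
    "sections": ["Columns 1--3 (OLS)"],
    "cells": [{"item_id": "table_2", "row_label": "treatment", "column_label": "(1)"}],
    "also_explains": ["table_3", {"item_id": "table_4", "sections": "Panel B"}],
}]
failing = [("table_2", "treatment", "(1)"), ("table_2", "treatment", "(4)")]
remaining = unexplained_cells(failing, divergences)
links = normalize_also_explains(divergences[0]["also_explains"])
```

Tracking the unexplained remainder mirrors the prompts' emphasis on the "REMAINING cells": each attribution pass only needs to account for what earlier passes left open.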
