Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results
Pith reviewed 2026-05-09 21:17 UTC · model grok-4.3
The pith
LLM agents can largely reproduce social science results from a paper's methods description and original data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An agentic reproduction pipeline that parses methods text into executable steps, runs reimplementations under strict information isolation, performs deterministic cell-by-cell comparison against original outputs, and attributes errors back through the pipeline enables agents to recover the majority of published social-science results on a set of 48 human-verified papers.
What carries the argument
The agentic reproduction system that extracts structured methods descriptions, generates isolated reimplementations, executes deterministic comparisons, and performs root-cause error attribution.
Load-bearing premise
The 48 papers selected allow exact reproduction of results from the methods description and data alone, with no hidden information required and with the isolation protocol preserved throughout.
What would settle it
A larger collection of papers that humans have independently verified as fully reproducible from methods text and data alone, on which the agents nevertheless fail to match the published cell-level results under the same isolation rules.
Original abstract
Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.
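The deterministic, cell-level comparison described in the abstract can be pictured with a small sketch. The table representation (a dict keyed by row and column label), the function name `compare_cells`, the zero default tolerances, and the example values are all assumptions made for illustration; the paper's actual matching rule is not given on this page.

```python
from math import isclose


def compare_cells(original, reproduced, rel_tol=0.0, abs_tol=0.0):
    """Compare two tables cell by cell.

    Both tables are dicts keyed by (row_label, column_label); this layout
    is an assumption for illustration. Returns a dict mapping every cell
    of the original table to True (match) or False (mismatch or missing).
    """
    results = {}
    for key, expected in original.items():
        actual = reproduced.get(key)
        if actual is None:
            # A cell the reimplementation never produced counts as a miss.
            results[key] = False
        else:
            results[key] = isclose(actual, expected,
                                   rel_tol=rel_tol, abs_tol=abs_tol)
    return results


# Hypothetical values, not taken from any paper in the benchmark.
original = {
    ("treatment", "(1)"): 0.25,
    ("treatment", "(2)"): 0.5,
    ("N", "(1)"): 1200.0,
}
reproduced = {
    ("treatment", "(1)"): 0.25,
    ("treatment", "(2)"): 0.75,
}

matches = compare_cells(original, reproduced)
match_rate = sum(matches.values()) / len(matches)
```

Because the comparison depends only on the two tables and fixed tolerances, rerunning it on the same outputs always yields the same verdict, which is the property the word "deterministic" points at.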
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an agentic reproduction system that extracts structured methods descriptions from social science papers, reimplements the analyses using only the original data files under strict information isolation (agents never see original code, results, or full paper text), and performs deterministic cell-level comparison of outputs to published results. An error attribution step traces discrepancies to root causes. On a set of 48 papers pre-selected via human verification of reproducibility, the evaluation across four agent scaffolds and four LLMs finds that agents can largely recover published results, though performance varies substantially by model, scaffold, and paper; failures are attributed to both agent errors and underspecification in the papers themselves.
Significance. If the central claims hold after addressing the verification protocol, this work would be a meaningful contribution to LLM agent applications in scientific reproducibility. It moves beyond prior code-assisted reproduction to a stricter text-plus-data setting and introduces a traceable error attribution mechanism that could help diagnose both AI limitations and reporting gaps in empirical papers. The use of human-verified baselines and deterministic output matching are explicit strengths that support falsifiable evaluation.
major comments (2)
- [§4 (Evaluation Setup)] §4 (Evaluation Setup / Paper Selection): The human verification process used to select the 48 papers is described only at a high level. The manuscript does not report whether verifiers reproduced results using exactly the same inputs and strict isolation constraints given to the agents (structured methods description plus data files only, with no access to the original paper text, code, or published tables). Without an independent check that the same inputs suffice for human reproduction under isolation, the selected set may include papers that are not reproducible under the claimed protocol, which would inflate reported agent success rates and confound the root-cause attribution between agent errors and paper underspecification.
- [§5 (Results)] §5 (Results): The central claim that agents 'largely recover' published results is not supported by sufficient quantitative detail. The manuscript does not report overall success rates (e.g., fraction of papers with exact cell-level match), per-paper accuracy distributions, breakdowns by error type or scaffold, or explicit controls for data leakage. This absence makes it impossible to assess the strength of the 'largely recover' statement or the reported variations across the four models and four scaffolds.
minor comments (2)
- [Abstract] Abstract: The summary would be strengthened by including at least one key quantitative result (e.g., overall reproduction rate or range across models) to ground the 'largely recover' claim.
- [Figure 1] Figure 1 (system diagram): The boundaries of the strict information-isolation protocol could be labeled more explicitly to clarify what information is withheld from agents at each step.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide greater precision on the evaluation protocol and to include additional quantitative results.
-
Referee: [§4 (Evaluation Setup)] §4 (Evaluation Setup / Paper Selection): The human verification process used to select the 48 papers is described only at a high level. The manuscript does not report whether verifiers reproduced results using exactly the same inputs and strict isolation constraints given to the agents (structured methods description plus data files only, with no access to the original paper text, code, or published tables). Without an independent check that the same inputs suffice for human reproduction under isolation, the selected set may include papers that are not reproducible under the claimed protocol, which would inflate reported agent success rates and confound the root-cause attribution between agent errors and paper underspecification.
Authors: We agree that the human verification protocol merits more explicit description. The verifiers were given the full original papers, code, and data to establish that the published results were reproducible in principle; this is distinct from the agents' strict isolation to only the extracted structured methods description and data files. As a result, the 48 papers are pre-screened for general reproducibility rather than guaranteed reproducibility under the exact agent constraints. In the revised manuscript we have expanded the description in §4 to detail the information provided to human verifiers, explicitly note the difference in access, and add a limitations paragraph clarifying that reported agent success rates are conditional on this pre-selection. This revision helps readers distinguish agent limitations from paper underspecification without overstating the findings.
revision: yes
-
Referee: [§5 (Results)] §5 (Results): The central claim that agents 'largely recover' published results is not supported by sufficient quantitative detail. The manuscript does not report overall success rates (e.g., fraction of papers with exact cell-level match), per-paper accuracy distributions, breakdowns by error type or scaffold, or explicit controls for data leakage. This absence makes it impossible to assess the strength of the 'largely recover' statement or the reported variations across the four models and four scaffolds.
Authors: We accept that aggregate quantitative summaries are needed to substantiate the central claim. Although the manuscript already presents per-scaffold and per-model comparisons, it lacks overall success rates, accuracy distributions, and explicit error-type breakdowns. In the revised version we have added these elements to §5: an overall exact-match success rate across papers, histograms of per-paper cell-level accuracy, breakdowns by attributed error source (agent versus paper underspecification) and by scaffold/model, plus a statement confirming the isolation protocol precludes data leakage. These additions, supported by new tables, allow a clearer evaluation of the 'largely recover' statement and the observed variations.
revision: yes
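The aggregate quantities discussed in this exchange (an overall exact-match rate and per-paper cell-level accuracy) can be sketched as follows. The input layout, the function name `summarize`, and the three example papers are hypothetical; none of the numbers are results from the manuscript.

```python
from statistics import mean


def summarize(per_paper_cells):
    """Aggregate cell-level match flags into paper-level summaries.

    per_paper_cells maps a paper id to a list of booleans, one per
    compared cell (True = reproduced cell matched the original).
    Papers with no compared cells are skipped.
    """
    accuracy = {
        paper_id: sum(cells) / len(cells)
        for paper_id, cells in per_paper_cells.items()
        if cells
    }
    if not accuracy:
        raise ValueError("no papers with compared cells")
    exact = [pid for pid, acc in accuracy.items() if acc == 1.0]
    return {
        # Fraction of papers in which every compared cell matched.
        "exact_match_rate": len(exact) / len(accuracy),
        # Unweighted mean of per-paper accuracies.
        "mean_cell_accuracy": mean(accuracy.values()),
        "per_paper_accuracy": accuracy,
    }


# Hypothetical match flags for three made-up papers.
runs = {
    "paper_a": [True, True, True, True],
    "paper_b": [True, False, True, False],
    "paper_c": [False, False, False, False],
}
summary = summarize(runs)
```

Reporting both numbers matters because they can diverge: a scaffold may match most cells in every paper yet reproduce no paper exactly, so a single headline figure would not settle what "largely recover" means.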
Circularity Check
No circularity: purely empirical evaluation against external baselines
This is an empirical systems paper that measures agent performance on 48 human-verified papers using deterministic output comparison under information isolation. No derivations, equations, fitted parameters renamed as predictions, or self-referential claims exist. Results are benchmarked against independently human-verified reproducibility, with root-cause analysis tracing failures to agent behavior or paper underspecification. The selection protocol and isolation claims are experimental design choices, not a derivation chain that reduces to its own inputs by construction. Any potential weakness in verifying the isolation of the human baseline is a validity concern, not circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human verification of reproducibility for the 48 papers is accurate and complete.
invented entities (1)
- Agentic reproduction system with error attribution step (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Accessed: 2026-03-31. John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8): e124, 2005. doi: 10.1371/journal.pmed.0020124. Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579,...
-
[2]
arXiv preprint arXiv:2602.16733
URL https://openreview.net/forum?id=cy8mq7QYae. Yiqing Xu and Leo Yang Yang. Scaling reproducibility: An AI-assisted workflow for large-scale replication and reanalysis, 2026. URL https://arxiv.org/abs/2602.16733. John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interf...
-
[3]
Do not access files outside of it or navigate to parent directories
**Workspace only.** Only read and write files inside this workspace directory. Do not access files outside of it or navigate to parent directories. **Exception:** the `data/` directory may be a symlink pointing outside the workspace -- this is intentional and you should use it freely as your dataset
-
[4]
Do not search for prior replication attempts
**No searching for the paper.** Do not search the internet for this paper, its authors, its published results, or any replication code or packages. Do not search for prior replication attempts
-
[5]
Your replication must be derived entirely from the methodology summary and the data provided
**No searching for results.** Do not look up expected coefficients, effect sizes, tables, or figures from this paper. Your replication must be derived entirely from the methodology summary and the data provided
-
[6]
statsmodels, pandas, matplotlib) and general statistical methods
**Allowed web use.** You may search for Python library documentation (e.g. statsmodels, pandas, matplotlib) and general statistical methods
-
[7]
## Instructions
**Work independently.** Base your replication only on the methodology description in this file and the dataset. ## Instructions
-
[8]
The methodology summary above describes the variables in detail, so a brief check should suffice
**Quick data check**: Inspect the data files to confirm column names and basic structure. The methodology summary above describes the variables in detail, so a brief check should suffice
-
[9]
All scripts can import from this module
**Write `prepare_data.py`**: Load and clean the data following the processing steps described above. All scripts can import from this module
-
[10]
Write the script (see output filename specified for each item above)
**Write and execute one script at a time**: For each item:
  a. Write the script (see output filename specified for each item above)
  b. Execute it
  c. Fix any errors immediately
  d. Move on to the next item once the output file is verified
{table_instructions}{figure_instructions}
-
[11]
for specialized estimators), you can call R from Python using `rpy2`
**R packages**: If the paper's methodology requires R-specific packages (e.g. for specialized estimators), you can call R from Python using `rpy2`
-
[12]
Execute every script and verify the output file exists
-
[13]
Python library documentation
Save all outputs in the current working directory. **Reasonable assumptions.** Where the methodology description is incomplete or ambiguous, you are free to make reasonable assumptions based on common practice in the field. Document your assumptions briefly in comments. Focus on substance and accuracy. Match the described methodology as closely as possibl...
-
[14]
missing" and describe the missing file(s). Otherwise set data_available =
DATA AVAILABILITY --- Check whether the data files required to produce {item_id} exist in the data directory inside original_code/. If any required input file is absent from the replication package, set data_available = "missing" and describe the missing file(s). Otherwise set data_available = "available"
-
[15]
<description of what is missing>
AGENT CODE --- In agent_code/*.py, find the code that computes values for the REMAINING cells listed above (ignore already-attributed sections). If no such code exists, set agent_behavior = "<description of what is missing>"
-
[16]
ORIGINAL CODE --- In original_code/{original_file_glob}, find the equivalent {original_language} code
-
[17]
Most outputs have one, but multi-panel tables with different specifications may have several (e.g
DISCREPANCIES --- For the REMAINING cells only, identify all distinct root causes. Most outputs have one, but multi-panel tables with different specifications may have several (e.g. columns 1--3 OLS wrong clustering; columns 4--6 IV wrong instrument). If all remaining cells are already explained, return an empty divergences array. List one entry per disti...
-
[18]
All columns
SECTION MAPPING --- For each discrepancy, list which rows/columns/panels it explains (e.g. "All columns", "Columns 1--3 (OLS)", "Panel B"). Maps many cells to one cause. Also list each specific affected cell from the cell table above as {{"item_id": "{item_id}", "row_label": "...", "column_label": "..."}} using the exact labels shown. Include all cells yo...
-
[19]
<item_id>
ALSO EXPLAINS --- For each discrepancy, list which other failures it also explains. Each entry is either:
  - A plain string "<item_id>" if the discrepancy explains the entire other item, OR
  - An object {{"item_id": "<item_id>", "sections": "<which part>"}} if only partial.
Be conservative: only include items you are confident share the exact same root caus...
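The ALSO EXPLAINS fragment above allows two entry shapes: a plain item id, or an object with `item_id` and `sections`. A downstream consumer might normalize both into one shape before aggregating root causes. The sketch below assumes that use; the function name, the choice of `None` to mean "the entire item", and the example ids are hypothetical and not taken from the paper's pipeline.

```python
def normalize_also_explains(entries):
    """Turn mixed also-explains entries into uniform dicts.

    A plain string means the discrepancy explains the whole other item;
    here that is encoded as sections=None (an assumed convention).
    A dict must carry both "item_id" and "sections".
    """
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            normalized.append({"item_id": entry, "sections": None})
        elif isinstance(entry, dict) and "item_id" in entry and "sections" in entry:
            normalized.append(
                {"item_id": entry["item_id"], "sections": entry["sections"]}
            )
        else:
            raise ValueError(f"unrecognized also-explains entry: {entry!r}")
    return normalized


# Hypothetical item ids for illustration only.
example = ["table_2", {"item_id": "table_3", "sections": "Panel B"}]
normalized = normalize_also_explains(example)
```

Rejecting malformed entries rather than guessing keeps the attribution step conservative, in line with the prompt's instruction to include only items that confidently share the same root cause.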