RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers
Pith reviewed 2026-05-10 20:22 UTC · model grok-4.3
The pith
An LLM agentic framework recovers executable simulations for 40.7% of papers in a 500-paper control systems benchmark by iteratively analyzing the paper, generating code, and verifying the outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RESCORE is a three-component LLM agentic framework consisting of an Analyzer that parses paper content, a Coder that produces simulation code, and a Verifier that runs the code and compares outputs visually to the paper's results. Iterative loops of execution feedback allow the system to refine the code until it produces task-coherent simulations. On a curated benchmark of 500 papers from the IEEE Conference on Decision and Control, this process recovers matching simulations for 40.7 percent of instances while outperforming single-pass generation and delivering an estimated 10X speedup relative to manual human replication.
What carries the argument
The RESCORE pipeline that combines LLM-driven paper analysis, code generation, and iterative verification through execution feedback plus visual comparison to resolve ambiguities.
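A minimal sketch of how such an analyze-code-verify loop could be orchestrated is below. Everything in it is illustrative: the function names (analyze, write_code, compare_figures), the iteration budget, and the feedback format are assumptions for exposition, not RESCORE's released implementation.

```python
# Hypothetical sketch of an Analyzer -> Coder -> Verifier loop with execution
# feedback; names, interfaces, and the iteration budget are illustrative, not
# RESCORE's released implementation.
import subprocess
import tempfile
from pathlib import Path

MAX_ITERATIONS = 5  # assumed refinement budget


def recover_simulation(paper_text, paper_figures, analyze, write_code, compare_figures):
    """Try to produce a script whose outputs match the paper's figures.

    `analyze`, `write_code`, and `compare_figures` stand in for the three
    LLM-backed components: paper parsing, code generation, and visual comparison.
    """
    spec = analyze(paper_text)  # extract model, parameters, and plots to reproduce
    feedback = ""               # accumulated execution / comparison feedback
    for _ in range(MAX_ITERATIONS):
        code = write_code(spec, feedback)
        script = Path(tempfile.mkdtemp()) / "sim.py"
        script.write_text(code)
        try:
            run = subprocess.run(["python", str(script)], capture_output=True,
                                 text=True, timeout=300)
        except subprocess.TimeoutExpired:
            feedback = "Execution timed out after 300 s."
            continue
        if run.returncode != 0:
            feedback = f"Execution failed:\n{run.stderr}"  # feed the traceback back
            continue
        verdict = compare_figures(script.parent, paper_figures)  # visual check
        if verdict.matches:
            return code
        feedback = f"Output differs from the paper: {verdict.explanation}"
    return None  # budget exhausted without a task-coherent simulation
```

The design choice mirrored here is that every failure mode, a traceback or a visibly wrong plot, is turned into text the Coder can condition on in the next pass; that feedback channel is what separates the loop from single-pass generation.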
If this is right
- Published control methodologies become easier to verify and build upon without full manual reimplementation.
- The 10X estimated speedup over human replication lowers the barrier to checking claims in the literature.
- The released benchmark of 500 CDC papers provides a standard test set for measuring progress on automated replication.
- Agentic loops that incorporate runtime checks outperform one-shot code generation for recovering complex simulations.
Where Pith is reading between the lines
- Similar iterative LLM pipelines could be tested on papers from neighboring fields that rely on numerical simulations, such as robotics or signal processing.
- The benchmark enables direct comparison of future recovery systems against the 40.7 percent baseline.
- Success on this task suggests LLMs can infer missing details when given concrete execution and visual signals rather than relying on text alone.
- Widespread adoption would shift research practice toward routine automated checks of published results.
Load-bearing premise
Iterative LLM analysis, code generation, and verification via execution feedback and visual comparison can reliably overcome underspecified parameters and ambiguous implementation details in control systems papers.
What would settle it
Applying RESCORE to the 500-paper benchmark and obtaining a success rate below 25 percent for simulations that match both numerical outputs and published figures would falsify the reported recovery performance.
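As an illustration of what "matching numerical outputs" could mean operationally, the sketch below compares recovered control metrics against paper-reported values within a relative tolerance. The metric names, the 2% settling band, and the 10% tolerance are assumptions, not criteria stated in the paper.

```python
# Illustrative only (not from the paper): one way to operationalize a
# "numerical match" between a recovered simulation and the paper's reported
# control metrics, using a relative tolerance.
import numpy as np


def settling_time(t, y, y_final, band=0.02):
    """Time after which the response stays within +/- band of its final value."""
    outside = np.abs(y - y_final) > band * abs(y_final)
    if not outside.any():
        return t[0]
    last_violation = int(np.max(np.where(outside)[0]))
    return t[last_violation + 1] if last_violation + 1 < len(t) else float("inf")


def matches_paper(recovered, reported, rel_tol=0.10):
    """Accept only if every paper-reported metric is reproduced within rel_tol."""
    if not set(reported) <= set(recovered):
        return False
    return all(abs(recovered[k] - v) <= rel_tol * abs(v) for k, v in reported.items())


# Example with made-up numbers: a first-order step response.
t = np.linspace(0, 10, 1000)
y = 1 - np.exp(-t)
recovered = {"settling_time": settling_time(t, y, y_final=1.0), "overshoot": 0.0}
reported = {"settling_time": 4.0, "overshoot": 0.0}
print(matches_paper(recovered, reported))  # True: within 10% of the reported values
```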
Original abstract
Reconstructing numerical simulations from control systems research papers is often hindered by underspecified parameters and ambiguous implementation details. We define the task of Paper to Simulation Recoverability, the ability of an automated system to generate executable code that faithfully reproduces a paper's results. We curate a benchmark of 500 papers from the IEEE Conference on Decision and Control (CDC) and propose RESCORE, a three component LLM agentic framework, Analyzer, Coder, and Verifier. RESCORE uses iterative execution feedback and visual comparison to improve reconstruction fidelity. Our method successfully recovers task coherent simulations for 40.7% of benchmark instances, outperforming single pass generation. Notably, the RESCORE automated pipeline achieves an estimated 10X speedup over manual human replication, drastically cutting the time and effort required to verify published control methodologies. We will release our benchmark and agents to foster community progress in automated research replication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript defines the task of Paper to Simulation Recoverability and introduces RESCORE, a three-component LLM agentic framework consisting of Analyzer, Coder, and Verifier modules. The system employs iterative execution feedback and visual comparison to reconstruct executable code from control systems papers. It evaluates the approach on a new benchmark of 500 papers from the IEEE Conference on Decision and Control, claiming a 40.7% success rate in recovering task-coherent simulations (outperforming single-pass generation) and an estimated 10X speedup relative to manual human replication. The benchmark and agents are to be released.
Significance. If the performance claims hold under precise definitions of success and rigorous baselines, the work could meaningfully advance automated replication and verification in control theory and related engineering fields by lowering the barrier to reproducing published simulations. The curation of a 500-paper benchmark and the agentic iterative design represent constructive steps toward scalable research assistance tools.
major comments (3)
- [Abstract and §4] Abstract and §4 (Method): The central performance claim of 40.7% task-coherent recoveries is load-bearing for the outperformance and speedup assertions, yet the abstract and method description provide no explicit definition or operationalization of 'task coherent,' no quantitative single-pass baseline numbers, no error bars, and no scoring protocol for the visual comparison step. Without these, it is impossible to determine whether the metric requires numerical agreement on key control metrics (e.g., settling time, gain margins) or merely qualitative plot similarity.
- [§5] §5 (Experiments): The reported 40.7% success rate and 10X human speedup lack accompanying details on how benchmark instances were labeled successful, how the single-pass comparator was implemented, or any statistical analysis; this directly affects the validity of the cross-method comparison and the speedup estimate.
- [§4.3] Verifier component (§4.3): Execution feedback confirms runtime and basic behavior, while visual comparison (presumably LLM-mediated) can accept plausible but non-matching simulations generated with inferred or altered parameters. If the benchmark does not enforce exact reproduction of reported numerical results, the recoverability claim overstates fidelity to the original published methodology.
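The second major comment asks for statistical analysis of the headline figure. As a sketch of the simplest such analysis, a Wilson score interval around a 40.7% success rate over 500 papers takes a few lines; the success count below is inferred from 40.7% of 500 and is approximate, since the exact count is not stated in the text shown here.

```python
# Illustrative only: a 95% Wilson score interval for the headline 40.7%
# success rate over 500 benchmark papers, the kind of error bar the report
# asks for.
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half


# 40.7% of 500 papers is roughly 204 successes (exact count not stated).
lo, hi = wilson_interval(successes=204, n=500)
print(f"40.7% over 500 papers -> roughly [{lo:.1%}, {hi:.1%}]")
```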
minor comments (2)
- [Abstract] The abstract would benefit from a one-sentence parenthetical gloss on 'task coherent' to improve immediate readability.
- [§5] Ensure all figures in the results section include clear captions specifying the exact success criterion used for each bar or table entry.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. We address each major comment below, indicating planned revisions where appropriate to strengthen the clarity and rigor of the manuscript.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Method): The central performance claim of 40.7% task-coherent recoveries is load-bearing for the outperformance and speedup assertions, yet the abstract and method description provide no explicit definition or operationalization of 'task coherent,' no quantitative single-pass baseline numbers, no error bars, and no scoring protocol for the visual comparison step. Without these, it is impossible to determine whether the metric requires numerical agreement on key control metrics (e.g., settling time, gain margins) or merely qualitative plot similarity.
Authors: We agree that explicit definitions and metrics are necessary for interpreting the central claims. We will revise the abstract and Section 4 to include a precise operational definition of 'task-coherent,' specifying that it requires the generated simulation to reproduce the primary dynamical behaviors, stability characteristics, and key performance indicators described in the paper (with tolerance thresholds for quantitative metrics where reported). We will also add the single-pass baseline results with error bars from repeated evaluations and detail the LLM-mediated visual comparison scoring protocol, including the prompt criteria used to assess plot similarity and behavioral fidelity. revision: yes
-
Referee: [§5] §5 (Experiments): The reported 40.7% success rate and 10X human speedup lack accompanying details on how benchmark instances were labeled successful, how the single-pass comparator was implemented, or any statistical analysis; this directly affects the validity of the cross-method comparison and the speedup estimate.
Authors: We acknowledge the need for greater transparency in the experimental reporting. In the revised manuscript, we will expand Section 5 to describe the success labeling process (a combination of automated execution checks and human review against paper-reported outcomes), the exact implementation of the single-pass baseline (direct LLM code generation without iterative feedback), and statistical analysis including confidence intervals and significance testing for the performance differences. We will also provide additional details on the basis for the 10X speedup estimate, including the data collection method for human replication times. revision: yes
-
Referee: [§4.3] Verifier component (§4.3): Execution feedback confirms runtime and basic behavior, while visual comparison (presumably LLM-mediated) can accept plausible but non-matching simulations generated with inferred or altered parameters. If the benchmark does not enforce exact reproduction of reported numerical results, the recoverability claim overstates fidelity to the original published methodology.
Authors: The referee accurately notes that our verifier prioritizes executable and behaviorally coherent simulations over exact numerical matches, which is a deliberate design choice given that many control papers omit full parameter sets or implementation specifics. This aligns with the defined recoverability task but can indeed lead to acceptance of inferred parameters. We will revise Section 4.3 to explicitly state this limitation, clarify that the claims refer to task-level coherence rather than bit-for-bit or numerical identity, and include illustrative examples of accepted simulations that involve parameter inference. revision: yes
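To make the trade-off in this exchange concrete, here is a hypothetical sketch of an LLM-mediated visual check of the kind the Verifier is described as performing. `ask_vision_model` is a placeholder for whatever multimodal call the pipeline actually makes, and the prompt wording is invented.

```python
# Hypothetical sketch of the LLM-mediated visual comparison discussed above;
# `ask_vision_model` is a placeholder for whatever multimodal call the
# Verifier actually makes, and the prompt wording is invented.
import base64
import json
from pathlib import Path

PROMPT = (
    "Compare the candidate plot to the figure from the paper. Judge whether "
    "they show the same qualitative behavior (trajectory shape, stability, "
    "number of curves), not pixel-level identity. "
    'Reply as JSON: {"matches": true or false, "explanation": "..."}'
)


def encode_image(path: Path) -> str:
    return base64.b64encode(path.read_bytes()).decode()


def visual_verdict(candidate_plot: Path, paper_figure: Path, ask_vision_model):
    """Return (matches, explanation) for one generated plot vs. one paper figure."""
    reply = ask_vision_model(
        prompt=PROMPT,
        images=[encode_image(paper_figure), encode_image(candidate_plot)],
    )
    verdict = json.loads(reply)
    return bool(verdict["matches"]), verdict.get("explanation", "")
```

Because the verdict targets qualitative behavior rather than numerical identity, a simulation run with plausibly inferred parameters can pass this check, which is exactly the fidelity gap conceded in the exchange above.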
Circularity Check
No circularity: empirical performance on external benchmark
full rationale
The paper defines Paper to Simulation Recoverability as a task, curates an external benchmark of 500 IEEE CDC papers, and reports an empirical success rate of 40.7% for its RESCORE agentic pipeline (Analyzer-Coder-Verifier with execution/visual feedback) versus single-pass baselines, plus a 10X speedup estimate. These are measured outcomes on held-out papers rather than quantities derived from equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text; the central claims remain falsifiable experimental results independent of the method's internal construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Z. Shen, K. Lo, L. L. Wang, B. Kuehl, D. S. Weld, and D. Downey, "VILA: Improving structured content extraction from scientific PDFs using visual layout groups," Transactions of the Association for Computational Linguistics, vol. 10, pp. 376–392, 2022.
- [2] C. Clark and S. Divvala, "PDFFigures 2.0: Mining figures from research papers," in Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries, Newark, NJ, July 2016.
- [3] P. Lopez and L. Romary, "HUMB: Automatic key term extraction from scientific articles in GROBID," in Proc. International Workshop on Semantic Evaluation, Uppsala, Sweden, July 2010.
- [4] Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush, "Image-to-markup generation with coarse-to-fine attention," in Proc. International Conf. on Machine Learning, Sydney, Australia, August 2017.
- [5] L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic, "Nougat: Neural optical understanding for academic documents," arXiv preprint arXiv:2308.13418, 2023.
- [6] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
- [7] M. Tian, L. Gao, S. D. Zhang et al., "SciCode: A research coding benchmark curated by scientists," in Proc. Conf. on Neural Information Processing Systems, Vancouver, Canada, Dec. 2024.
- [8] G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson et al., "PaperBench: Evaluating AI's ability to replicate AI research," arXiv preprint arXiv:2504.01848, 2025.
- [9] M. Seo, J. Baek, S. Lee, and S. J. Hwang, "Paper2Code: Automating code generation from scientific papers in machine learning," in Proc. International Conf. on Learning Representations, Rio de Janeiro, Brazil, April 2026.
- [10] J. P. How, "Control systems reproducibility challenge [From the Editor]," IEEE Control Systems Magazine, vol. 38, no. 4, pp. 3–4, 2018.
- [11] National Academies of Sciences, Engineering, and Medicine et al., Reproducibility and Replicability in Science. National Academies Press, 2019.
- [12] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
- [13] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song et al., "Measuring coding challenge competence with APPS," in Proc. Conf. on Neural Information Processing Systems, Virtual, Dec. 2021.
- [14] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
- [15] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?" in Proc. International Conf. on Learning Representations, Vienna, Austria, May 2024.
- [16] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, "SWE-agent: Agent-computer interfaces enable automated software engineering," in Proc. Conf. on Neural Information Processing Systems, Vancouver, Canada, Dec. 2024.
- [17] T. Liu, C. Xu, and J. McAuley, "RepoBench: Benchmarking repository-level code auto-completion systems," in Proc. International Conf. on Learning Representations, Vienna, Austria, May 2024.
- [18] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou, "Evaluating large language models in class-level code generation," in Proc. IEEE/ACM International Conf. on Software Engineering, Lisbon, Portugal, April 2024.
- [19] I. Badertdinov, A. Golubev, M. Nekrashevich, A. Shevtsov, S. Karasik, A. Andriushchenko, M. Trofimova, D. Litvintseva, and B. Yangel, "SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents," in Proc. Conf. on Neural Information Processing Systems, San Diego, CA, Dec. 2025.
- [20] X. Chen, M. Lin, N. Schärli, and D. Zhou, "Teaching large language models to self-debug," in Proc. International Conf. on Learning Representations, Vienna, Austria, May 2024.
- [21] J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve, "RLEF: Grounding code LLMs in execution feedback with reinforcement learning," in Proc. International Conf. on Machine Learning, Vancouver, Canada, July 2025.
- [22] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," in Proc. International Conf. on Learning Representations, Kigali, Rwanda, May 2023.
- [23] Z. Gou, Z. Shao, Y. Gong, Y. Yang, N. Duan, W. Chen et al., "CRITIC: Large language models can self-correct with tool-interactive critiquing," in Proc. International Conf. on Learning Representations, Vienna, Austria, May 2024.
- [24] V. Bhat, A. U. Kaypak, P. Krishnamurthy, R. Karri, and F. Khorrami, "Grounding large language models for robot task planning using closed-loop state feedback," Advanced Robotics Research, 2025.
- [25] B. Yu, Y. Zhu, P. He, and D. Kang, "UTBoost: Rigorous evaluation of coding agents on SWE-bench," in Proc. Annual Meeting of the Assoc. for Computational Linguistics, Vienna, Austria, July 2025.
- [26] OpenAI, "Introducing GPT-5.2: Announcement and system card," 2025. [Online]. Available: https://openai.com/index/introducing-gpt-5-2/
- [27] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
- [28] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968
- [29] Google DeepMind, "Gemini 3.1 Flash-Lite: Model card," March 2026. [Online]. Available: https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/
- [30] A. Devonport, F. Yang, L. El Ghaoui, and M. Arcak, "Data-driven reachability analysis with Christoffel functions," in Proc. IEEE Conf. on Decision and Control, Austin, TX, Dec. 2021.
- [31] C. N. Mavridis and J. S. Baras, "Identification of piecewise affine systems with online deterministic annealing," in Proc. IEEE Conf. on Decision and Control, Marina Bay Sands, Singapore, Dec. 2023.
- [32] K. Gligorić, T. Piccardi, J. M. Hofman, and R. West, "In-class data analysis replications: Teaching students while testing science," Harvard Data Science Review, vol. 6, no. 3, 2024.
- [33] T. G. Molnar, G. Orosz, and A. D. Ames, "On the safety of connected cruise control: Analysis and synthesis with control barrier functions," in Proc. IEEE Conf. on Decision and Control, Marina Bay Sands, Singapore, Dec. 2023.