pith. sign in

arxiv: 2506.22598 · v3 · submitted 2025-06-27 · 💻 cs.CL

RExBench: Can coding agents autonomously implement AI research extensions?

Pith reviewed 2026-05-19 07:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agentsresearch extensionscoding agentsbenchmarkautonomous implementationAI researchsoftware engineering
0
0 comments X

The pith

LLM coding agents autonomously implement only about a third of realistic AI research extensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RExBench as a benchmark to test whether LLM-based agents can take existing research papers and codebases and implement novel extensions that test new hypotheses. Twelve such extensions were prepared with expert-written instructions and automatic success criteria that run the modified code to check outcomes. Twelve agents built with the aider and OpenHands frameworks were evaluated on these tasks. The best agent succeeded on roughly one third of the extensions without extra help. Adding human-written hints raised performance but kept the best result below 44 percent, showing that current agents still need substantial human guidance for realistic research work.

Core claim

RExBench consists of realistic extensions of 12 research papers that investigate novel hypotheses, each accompanied by domain-expert instructions and an automatic evaluation infrastructure that executes agent outputs to verify success criteria. All evaluated agents fail to autonomously implement the majority of these extensions, with the strongest agent reaching around a 33 percent success rate. Success improves when human-written hints are supplied, yet the best performance in that setting remains below 44 percent. The results indicate that current agents remain short of handling realistic research-extension tasks without substantial human guidance.

What carries the argument

RExBench benchmark of 12 paper extensions with expert instructions and automatic execution-based success evaluation.

Load-bearing premise

The 12 selected extensions and their success criteria represent genuine research-extension difficulty without systematic bias.

What would settle it

A new agent that autonomously completes more than half of the twelve RExBench extensions without hints would challenge the reported performance gap.

Figures

Figures reproduced from arXiv: 2506.22598 by Najoung Kim, Nicholas Edwards, Sebastian Schuster, Yujun Audrey Mao, Yukyung Lee, Yulu Qin.

Figure 1
Figure 1. Figure 1: End-to-end workflow of REXBENCH: (1) An LLM agent receives inputs consisting of the research paper(s), the original codebase, and an extension instruction; (2) the system implements the extension and a patch file is obtained; (3) the patch is applied to the original code and executed via our evaluation infrastructure; and (4) the results are evaluated using specified metrics. for a sample task instruction)… view at source ↗
Figure 2
Figure 2. Figure 2: Agent performance on REXBENCH. The color coding indicates the agent framework and the y axis indicates the the backbone LLM. Results include three runs per task to account for agent random variation. Error bars show standard error of the mean of all runs per model computed using the closed form formula (2σ, no normality assumption). 4.1.1 Baseline agent design We used three different agent frameworks: two … view at source ↗
Figure 3
Figure 3. Figure 3: Final success rates for each agent-LLM combination and hint level. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cost effectiveness and time efficiency of coding agents on [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Regression coefficients with 95% confidence intervals for predictors of final [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Tool usage distribution across OpenHands agent implementations. Percentages indicate the [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of execution errors across Python, Bash, and timeout categories. Errors with [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of realistic extensions of 12 research papers that aim to investigate novel research hypotheses. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate 12 LLM agents implemented using two different frameworks, aider and OpenHands. We find that all agents fail to autonomously implement the majority of the extensions, with the best agent achieving around a 33% success rate. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 44%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RExBench, a benchmark consisting of 12 realistic extensions to existing AI research papers and their codebases, each accompanied by domain-expert instructions and automatic success criteria. It evaluates 12 LLM-based coding agents implemented in the aider and OpenHands frameworks, reporting that the best agent achieves approximately 33% success rate autonomously and below 44% even with additional human-written hints, concluding that current agents cannot handle realistic research-extension tasks without substantial human guidance.

Significance. If the 12 tasks prove representative, the benchmark fills a gap between standard coding benchmarks and full research automation by providing contamination-robust, automatically executable tasks. The automatic evaluation infrastructure and direct execution-based metrics are positive features that could enable reproducible comparisons of future agents on research-like implementation work.

major comments (3)
  1. [§3] §3 (Task Construction): The manuscript does not describe systematic selection criteria, diversity metrics, or controls for the 12 source papers and extensions. Because the central claim (best agent ~33% success) depends on these tasks fairly sampling realistic research-extension difficulty, the absence of such justification leaves open the possibility that the low rates reflect selection bias rather than a general limitation of current agents.
  2. [§4.2] §4.2 (Evaluation Protocol): Exact success criteria and their automatic implementations are described only at a high level. Without per-task details on what constitutes a passing execution (e.g., quantitative thresholds, test cases, or handling of partial implementations), it is difficult to assess whether the criteria are free of systematic bias or overly lenient/strict for certain failure modes.
  3. [Results] Table 1 / Results section: The reported success rates lack error bars, multiple-run statistics, or significance tests. Given that agent performance can vary with temperature and prompt stochasticity, single-run percentages make it hard to determine whether the gap between 33% and 44% (with hints) is reliable.
minor comments (2)
  1. [Abstract] Abstract: The number of agents (12) and the two frameworks (aider, OpenHands) should be stated explicitly for immediate clarity.
  2. [§5] §5 (Discussion): Adding even one or two concrete failure-case examples (e.g., a specific extension where agents consistently produce incorrect API calls) would help readers understand the nature of the remaining difficulties.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Task Construction): The manuscript does not describe systematic selection criteria, diversity metrics, or controls for the 12 source papers and extensions. Because the central claim (best agent ~33% success) depends on these tasks fairly sampling realistic research-extension difficulty, the absence of such justification leaves open the possibility that the low rates reflect selection bias rather than a general limitation of current agents.

    Authors: We agree that explicit documentation of selection criteria would better support the claim that the 12 tasks are representative. In the revised manuscript we will add a dedicated subsection to §3 that describes the selection process: papers were required to have publicly available codebases, to be published within the last three years, and to span distinct AI sub-areas (NLP, vision, RL, etc.). We will also report simple diversity statistics (e.g., distribution across sub-fields and average code-base size) and acknowledge that the set remains a curated starting point rather than a statistically sampled population. revision: yes

  2. Referee: [§4.2] §4.2 (Evaluation Protocol): Exact success criteria and their automatic implementations are described only at a high level. Without per-task details on what constitutes a passing execution (e.g., quantitative thresholds, test cases, or handling of partial implementations), it is difficult to assess whether the criteria are free of systematic bias or overly lenient/strict for certain failure modes.

    Authors: We accept that the current high-level description limits reproducibility and scrutiny. We will expand §4.2 with a table or appendix that lists, for each task, the precise success criterion (including any quantitative thresholds or test scripts), how partial implementations are scored, and the exact command used by the automatic evaluator. This will allow readers to judge potential leniency or bias on a per-task basis. revision: yes

  3. Referee: [Results] Table 1 / Results section: The reported success rates lack error bars, multiple-run statistics, or significance tests. Given that agent performance can vary with temperature and prompt stochasticity, single-run percentages make it hard to determine whether the gap between 33% and 44% (with hints) is reliable.

    Authors: We acknowledge that single-run reporting leaves the results vulnerable to stochastic effects. Because of the substantial compute required to run each agent on the full suite of tasks, we initially reported single executions. In the revision we will re-run the two strongest agents (with and without hints) across three random seeds, report mean success rates together with standard deviations, and add a short discussion of observed variability. We will also note that the 33 %–44 % gap should be interpreted with this variability in mind. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct execution evaluation

full rationale

The paper introduces RExBench as an empirical benchmark consisting of 12 research extensions, expert-written instructions, and automatic success criteria evaluated via agent execution. No derivations, equations, fitted parameters, or predictions appear in the provided abstract or description. Central claims rest on observed success rates (33% best agent, <44% with hints) obtained through direct testing rather than any self-referential construction or self-citation chain. The evaluation infrastructure is falsifiable externally via reproduction on the released tasks. Selection of the 12 papers and criterion validity are potential external-validity concerns but do not constitute circularity under the defined patterns, as no step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper containing no mathematical derivations, fitted constants, or postulated entities.

pith-pipeline@v0.9.0 · 5757 in / 943 out tokens · 29077 ms · 2026-05-19T07:31:01.313856+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What Makes AI Research Replicable? Executable Knowledge Graphs as Scientific Knowledge Representations

    cs.CL 2025-10 unverdicted novelty 6.0

    xKG is a paper-centric knowledge base that extracts code and insights to improve LLM agent performance on AI research replication by 10.9% on PaperBench.

  2. AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

    cs.AI 2026-04 unverdicted novelty 5.0

    AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 2 Pith papers · 8 internal anchors

  1. [1]

    Aider AI. 2023. Aider: AI pair programming in your terminal. https://github.com/ Aider-AI/aider. Accessed: 2025-05-12

  2. [2]

    Anthropic. 2024. Claude 3.7 Sonnet System Card. https://assets.anthropic.com/ m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf . Accessed: 2025-05-14

  3. [3]

    Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and Tushar Khot. 2024. SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12622–12645, Miami, Florida, USA. Association for...

  4. [4]

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. 2023. Autonomous chemical research with large language models. Nature, 624(7992):570–578

  5. [5]

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. 2024. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery, October 2024. arXiv:2410.05080

  6. [6]

    Jonathan H. Choi. 2024. How to use large language models for empirical legal research.Journal of Institutional and Theoretical Economics (JITE) , 180(2):214–233

  7. [7]

    Róbert Csordás, Kazuki Irie, and Juergen Schmidhuber. 2021. The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 619–634, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. 10

  8. [8]

    Singularity Developers. 2021. Singularity

  9. [9]

    Yu, and Wenpeng Yin

    Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi Liu, Renze Lou, Henry Peng Zou, Pranav Narayanan Venkit, Nan Zhang, Mukund Srinath, Haoran Ranran Zhang, Vipul Gupta, Yinghui Li, Tao Li, Fei Wang, Qin Liu, Tianlin Liu, Pengzhi Gao, Congying Xia, Chen Xing, Cheng Jiayang, Zhaowei Wang, Ying Su, Raj Sanjay Shah, Ruohao Guo, Jing Gu, Haoran Li, K...

  10. [10]

    Cole, Fangyu Liu, and William W

    Julian Martin Eisenschlos, Jeremy R. Cole, Fangyu Liu, and William W. Cohen. 2023. WinoDict: Probing language models for in-context word acquisition. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages 94–102, Dubrovnik, Croatia. Association for Computational Linguistics

  11. [11]

    Kanishk Gandhi, Michael Y Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, and Noah D Goodman. 2025. BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery. arXiv:2501.01540

  12. [12]

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. 2025. Towards an AI co-scientist. arXiv:2502.18864

  13. [13]

    Zhang, Lanyi Zhu, Mike A Merrill, Jeffrey Heer, and Tim Althoff

    Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A Merrill, Jeffrey Heer, and Tim Althoff. 2024. BLADE: Benchmarking Language Model Agents for Data-Driven Science. In Findings of the Association for Computational Linguistics: EMNLP 2024, p...

  14. [14]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948

  15. [15]

    Liu, Percy Liang, and Christopher D

    John Hewitt, Nelson F. Liu, Percy Liang, and Christopher D. Manning. 2024. Instruction following without instruction tuning. arXiv:2409.14254

  16. [16]

    Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. 2024. MLAgentBench: evaluating language agents on machine learning experimentation. In Proceedings of the 41st International Conference on Machine Learning, pages 20271–20309

  17. [17]

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. OpenAI o1 system card. arXiv:2412.16720

  18. [18]

    Peter Alexander Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, D

    Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S. Weld, and Peter Clark. 2025. CodeSci- entist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation. arXiv:2503.22708

  19. [19]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations

  20. [20]

    Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. 2024. DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? arXiv:2409.07703

  21. [21]

    Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, and Christopher Potts

  22. [22]

    In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 14691–14714, Bangkok, Thailand

    Mission: Impossible Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 14691–14714, Bangkok, Thailand. Association for Computational Linguistics. 11

  23. [23]

    Najoung Kim and Tal Linzen. 2020. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 9087–9105, Online. Association for Computational Linguistics

  24. [24]

    Najoung Kim, Sebastian Schuster, and Shubham Toshniwal. 2024. Code pretraining improves entity tracking abilities of language models. arXiv:2405.21068

  25. [25]

    Hiroaki Kitano. 2021. Nobel Turing Challenge: creating the engine for scientific discovery. NPJ systems biology and applications , 7(1):29

  26. [26]

    Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, and Ang Chen. 2025. Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents. arXiv:2502.16069

  27. [27]

    LAB-Bench: Measuring Capabilities of Language Models for Biology Research

    Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. 2024. Lab-bench: Measuring capabilities of language models for biology research. arXiv:2407.10362

  28. [28]

    Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Pilsung Kang, and Najoung Kim

  29. [29]

    arXiv:2403.18771

    Checkeval: A reliable LLM-as-a-judge framework for evaluating text generation using checklists. arXiv:2403.18771

  30. [30]

    Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg

    Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In The Eleventh International Conference on Learning Representations

  31. [31]

    Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. 2025. SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding. In The Thirteenth International Conference on Learning Representations

  32. [32]

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292

  33. [33]

    Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. 2025. LLM4SR: A Survey on Large Language Models for Scientific Research. arXiv:2501.04306

  34. [34]

    Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023. Emergent Linear Representations in World Models of Self-Supervised Sequence Models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP , pages 16–30, Singapore. Association for Computational Linguistics

  35. [35]

    Harshith Padigela, Chintan Shah, and Dinkar Juyal. 2025. ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows. arXiv:2502.00964

  36. [36]

    Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. 2025. Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning. arXiv:2504.17192

  37. [37]

    Chan Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. 2024. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv:2410.07095

  38. [38]

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. 2024. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. arXiv:2409.04109

  39. [39]

    Zachary S Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan

  40. [40]

    Transactions on Machine Learning Research

    CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark. Transactions on Machine Learning Research

  41. [41]

    Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

    Michael D. Skarlinski, Sam Cox, Jon M Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. 2024. Lan- guage agents achieve superhuman synthesis of scientific knowledge. arXiv:2409.13740. 12

  42. [42]

    Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. 2025. PaperBench: Evaluating AI’s Ability to Replicate AI Research. arXiv:2504.01848

  43. [43]

    Zilu Tang, Mayank Agarwal, Alexander Shypula, Bailin Wang, Derry Wijaya, Jie Chen, and Yoon Kim. 2023. Explain-then-translate: an analysis on improving program translation with self-generated explanations. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1741–1788, Singapore. Association for Computational Linguistics

  44. [44]

    Minyang Tian, Luyu Gao, Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, HAO TONG, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu A Huerta, and Hao Peng

  45. [45]

    In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track

    SciCode: A Research Coding Benchmark Curated by Scientists. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  46. [46]

    Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Hui Haotian, Liu Weichuan, Zhiyuan Liu, and Maosong Sun. 2024. DebugBench: Evaluating Debugging Capability of Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024 , pages 4173–4198, Bangkok, Thailand. Association for Computational Linguistics

  47. [47]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

  48. [48]

    Leon Weber-Genzel, Siyao Peng, Marie-Catherine De Marneffe, and Barbara Plank. 2024. VariErr NLI: Separating Annotation Error from Human Label Variation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 2256–2269, Bangkok, Thailand. Association for Computational Linguistics

  49. [49]

    Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2024. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: ...

  50. [50]

    Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, and Yulan He. 2025. SciReplicate- Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers. arXiv:2504.00255

  51. [51]

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

    Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. 2024. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks....

  52. [52]

    Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-Guang Lou, and Shuai Ma. 2024. Re-Reading Improves Reasoning in Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 15549–15575, Miami, Florida, USA. Association for Computational Linguistics

  53. [53]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc. 13

  54. [54]

    Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, and Yisong Yue. 2025. Datascibench: An LLM agent benchmark for data science. arXiv:2502.13897

  55. [55]

    Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2024. Can Large Language Models Transform Computational Social Science? Computational Linguistics, 50(1):237–291. 14 A An Example Task Instruction (Extension of WinoDict) WinoDict Task Instruction Problem Description Background The paperWinoDict: Probing language models for i...

  56. [56]

    Top Group: • Verbs, Nouns, Adverbs: Select the top 20% most frequent words • Adjectives: Select the top 35% most frequent adjectives (to match the sample set size)

  57. [57]

    Bottom Group: • Verbs, Nouns, Adverbs: Select the bottom 20% least frequent words • Adjectives: Select the bottom 35% least frequent adjectives

  58. [58]

    Read the instructions in instructions.md and carry out the specified task

    All Group: • Verbs, Nouns, Adjectives, Adverbs: Include all words, no frequency-based filtering Assume that the frequency information will be provided in a form of four files corre- sponding to each POS, named 1_all_rank_noun.txt, 2_all_rank_verb.txt, 3_all_rank_adjective.txt, 4_all_rank_adverb.txt, under the directory./words/. Each file lists words in de...