
arXiv: 2605.17324 · v1 · pith:DLHZDBPJ · submitted 2026-05-17 · cs.CR · cs.AI

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

Pith reviewed 2026-05-19 23:49 UTC · model grok-4.3

classification: cs.CR, cs.AI
keywords: prompt injection, LLM agents, clarification-seeking, vulnerability, security benchmark, ambiguity resolution, ASPI, frontier models

The pith

Seeking clarification on ambiguous tasks makes LLM agents far more vulnerable to prompt injection attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether LLM agents become more susceptible to prompt injection when they enter a clarification-seeking state to resolve task ambiguity. The authors build a benchmark, ASPI, with 728 task-attack scenarios that compare agent behavior under fully specified instructions against behavior when the agent must ask for and use additional user input. Across ten frontier models, attack success rates increase sharply in the clarification setting, for example from 1.8% to 34.0% for o3. This matters because clarification is promoted as a safe way to handle underspecified tasks, yet it appears to open an exploitable channel that standard security tests on fully specified tasks do not capture.

Core claim

The paper establishes that the transition to a clarification-seeking state in LLM agents substantially increases susceptibility to prompt injection attacks. In the ASPI benchmark, agents in the clarification setting must request and incorporate user input before acting, while in the execution setting they receive fully specified instructions and encounter adversarial content only via tool returns. Evaluations show consistent amplification of attack success, such as from 1.8% to 34.0% for o3 and 2.2% to 35.7% for Gemini-3-Flash. The increase stems from both a shift in how models process content in this state and effects from the clarification interface itself.
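To make the size of the reported jump concrete, the short sketch below turns the two quoted pairs of attack success rates into an absolute gap (percentage points) and a relative ratio. The four numbers are the ones quoted above; the function name and the choice to report both gap and ratio are illustrative assumptions, not the paper's own metric.

```python
# ASR values (percent) quoted in the text: (execution, clarification).
reported_asr = {
    "o3": (1.8, 34.0),
    "Gemini-3-Flash": (2.2, 35.7),
}


def amplification(exec_asr: float, clarif_asr: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative ratio)."""
    return clarif_asr - exec_asr, clarif_asr / exec_asr


for model, (exec_asr, clarif_asr) in reported_asr.items():
    gap, ratio = amplification(exec_asr, clarif_asr)
    print(f"{model}: +{gap:.1f} points, {ratio:.1f}x")
```

Reporting both quantities is useful here because the execution-setting baselines are small: a ratio alone can look dramatic on a tiny base rate, while the gap in points shows how many additional scenarios are affected.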

What carries the argument

The ASPI benchmark, which isolates the clarification-seeking state by using matched pairs of execution and clarification conditions for each of the 728 task-attack scenarios.
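A matched-pair design means each scenario contributes one outcome per condition, so per-condition attack success rates can be read off the same set of records. The sketch below shows one way such records might be stored and summarized. The `PairedOutcome` schema, its field names, and the toy data are hypothetical and are not taken from the ASPI code release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedOutcome:
    """One scenario evaluated under both matched conditions (hypothetical schema)."""
    scenario_id: str
    exec_attack_succeeded: bool
    clarif_attack_succeeded: bool


def attack_success_rates(outcomes: list[PairedOutcome]) -> tuple[float, float]:
    """Return (execution ASR, clarification ASR) as percentages."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one paired outcome")
    exec_hits = sum(o.exec_attack_succeeded for o in outcomes)
    clarif_hits = sum(o.clarif_attack_succeeded for o in outcomes)
    return 100.0 * exec_hits / n, 100.0 * clarif_hits / n


# Toy data, invented for illustration only (not results from the paper).
toy = [
    PairedOutcome("s1", False, True),
    PairedOutcome("s2", False, False),
    PairedOutcome("s3", True, True),
    PairedOutcome("s4", False, True),
]
exec_asr, clarif_asr = attack_success_rates(toy)
print(f"execution ASR = {exec_asr:.1f}%, clarification ASR = {clarif_asr:.1f}%")
```

Keeping both outcomes on one record, rather than in two separate tables, preserves the pairing that any paired significance test would need.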

If this is right

  • Standard security evaluations on fully specified tasks will underestimate the attack surface of interactive agents.
  • Robustness under clear instructions does not translate to robustness when agents must request and use clarifying input.
  • The vulnerability gap arises from both state-dependent changes in how models process incoming content and channel-specific effects from the clarification interface.
  • Real-world agent deployments that rely on clarification may expose users to higher prompt injection risks than current testing suggests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might add input validation specific to the clarification channel to offset the increased exposure.
  • Similar amplification could occur in other interactive patterns such as confirming tool outputs or handling follow-up questions.
  • Security benchmarks for agents should include ambiguity-resolution states by default rather than testing only direct execution.
  • Fine-tuning on clarification exchanges could be tested as a way to reduce the observed state-dependent vulnerability.

Load-bearing premise

The benchmark successfully isolates the clarification-seeking state transition as the sole variable without differences in prompt formatting, tool-return handling, or user-input channel independently affecting attack success.

What would settle it

Re-running the benchmark on the same models but finding no significant difference in attack success rates between the execution and clarification settings would indicate that the state transition does not amplify vulnerability.
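Whether a difference between matched conditions is "significant" can be checked with a paired test on the discordant scenarios; the Figure 3 caption states that the paper uses exact McNemar tests. The following is a minimal sketch of a two-sided exact McNemar test using only the standard library. The function name and the example counts are hypothetical, and the paper's own implementation may differ.

```python
from math import comb


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: scenarios where the attack succeeded only in the execution setting.
    c: scenarios where the attack succeeded only in the clarification setting.
    Concordant scenarios (same outcome in both settings) do not enter the test.
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided on one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, for illustration only.
print(exact_mcnemar_p(b=1, c=9))
print(exact_mcnemar_p(b=3, c=3))
```

Under this test, a replication that found roughly balanced discordant counts (as in the second call) would give a p-value near 1, which is the kind of outcome that would count against the amplification claim.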

Figures

Figures reproduced from arXiv: 2605.17324 by Dileepa Lakshan, Heming Liu, Joseph Brandifino, Max Fenkell, Udari Madhushani Sehwag, Zhengyang Shan.

Figure 1. Overview of ASPI, illustrated with a benchmark example.
Figure 2. Attack success rates (ASR) for execution-time tool-channel attacks …
Figure 3. Paired ΔASR by channel. Left: tool (clarif_tool vs. exec_tool). Right: user (clarif_user vs. exec_user). Points show mean differences with 95% CIs; dashed line = zero. Statistical significance is assessed using exact McNemar tests (∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001).
Figure 4. Attack success rates (ASR, %) across conditions for each model. Each panel shows execution conditions …
Figure 5. Task utility (%) across all conditions for each model, including benign baselines and attacked settings. Each panel …
Figure 6. Trade-offs between robustness and capability. (a) ASR versus clean-task utility shows that models with high …
Figure 7. Attack success rates (ASR) for execution-time tool-channel attacks …
Figure 8. Paired clarification–execution differences in attack success rate under defenses. The left panel compares …
Figure 9. Clarification-question quality by model. Each bar shows the distribution of clarification outcomes under ambiguity …
Figure 10. Post-clarification behavior under attacked clarification responses. Each bar shows the share of continuations …
Figure 11. Judge-classified attack compliance by condition. Each bar shows the distribution of attack behavior across execution …
Figure 12. Relationship between clarification-question quality and post-clarification behavior. Rows indicate whether the …
Figure 13. Judge-attributed compliance reason by condition. For trajectories with attack-following behavior, the judge assigns …

Original abstract

Clarification-seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification-seeking state increases an agent's susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous-State Prompt Injection), a benchmark of 728 task-attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool-returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification-seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash. A decomposition analysis reveals that this gap reflects both a state-dependent shift in how models process incoming content and a channel-specific effect arising from the agent-solicited clarification interface. These findings demonstrate that standard execution-time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at https://github.com/scaleapi/aspi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the ASPI benchmark comprising 728 task-attack scenarios to test whether clarification-seeking behavior in LLM agents increases susceptibility to prompt injection. Each scenario is evaluated in matched execution (fully specified instruction, adversarial content via tool returns) and clarification (agent requests and incorporates user input) settings across ten frontier models. Results show large increases in attack success under clarification (e.g., 1.8% to 34.0% for o3; 2.2% to 35.7% for Gemini-3-Flash). A decomposition analysis attributes the gap to both a state-dependent processing shift and a channel-specific effect from the agent-solicited clarification interface. The authors conclude that execution-time security evaluations underestimate the attack surface of interactive agents and release data and code for reproducibility.

Significance. If the central empirical finding is robust, the work identifies a practically relevant security risk tied to a desirable agent capability (clarification-seeking). The explicit release of the benchmark, data, and source code at the provided GitHub repository is a clear strength that supports verification and extension. The results challenge the transferability of robustness from fully specified tasks to ambiguous, interactive settings and therefore bear on the design of secure LLM agent systems.

major comments (2)
  1. §3 (Benchmark Construction): The matched execution and clarification conditions are presented as isolating the clarification state, yet the manuscript does not detail how prompt formatting, tokenization, or presentation order of user-solicited input is aligned with tool-returned content. Because the decomposition analysis already acknowledges a channel-specific effect, the absence of explicit controls or ablations that hold formatting and channel constant while varying only the internal state leaves the attribution of the observed jumps (e.g., ~2% to ~35%) insufficiently supported.
  2. Decomposition Analysis (results section): The paper reports that the gap reflects both state-dependent and channel-specific contributions but provides no quantitative breakdown or controlled ablation that measures the marginal effect of each factor separately. Without such measurements, it remains unclear whether the state transition itself is the dominant driver or whether the channel change accounts for most of the amplification.
minor comments (2)
  1. Abstract: The sentence describing the decomposition analysis could be tightened to state explicitly that both effects are present rather than implying the gap is fully explained by the state transition.
  2. Results (table and figure captions): Ensure that all reported attack-success percentages are accompanied by the exact number of trials or scenarios per cell so readers can assess the statistical reliability of the reported deltas.
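The second minor comment turns on sample size: a percentage is only as reliable as the number of trials behind it. As a rough illustration, the sketch below computes a Wilson score interval for a single attack-success cell. It assumes, hypothetically, that one cell contains all 728 scenarios and that a rate near 1.8% corresponds to 13 successes; neither count is stated in the text above.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 13 successful attacks out of 728 scenarios (about 1.8%).
low, high = wilson_interval(13, 728)
print(f"ASR = {13 / 728:.1%}, 95% interval = [{low:.1%}, {high:.1%}]")
```

The Wilson interval is chosen over the simpler normal approximation because it behaves better for rates close to 0%, which is where the execution-setting baselines sit.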

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the ASPI benchmark and decomposition analysis. The feedback highlights opportunities to strengthen the description of our controls and to make the quantitative attribution more explicit. We address each major comment below and will incorporate clarifications and additional details in the revised manuscript.

  1. Referee: §3 (Benchmark Construction): The matched execution and clarification conditions are presented as isolating the clarification state, yet the manuscript does not detail how prompt formatting, tokenization, or presentation order of user-solicited input is aligned with tool-returned content. Because the decomposition analysis already acknowledges a channel-specific effect, the absence of explicit controls or ablations that hold formatting and channel constant while varying only the internal state leaves the attribution of the observed jumps (e.g., ~2% to ~35%) insufficiently supported.

    Authors: We agree that greater transparency on formatting and presentation details would improve the manuscript. The benchmark was constructed so that the adversarial payload is identical in text and length and always appears as the final message before the agent's turn; the only intentional difference is the message role label (user-solicited clarification versus tool return). In the revised version we will add the exact prompt templates, token-count statistics, and ordering description to §3. The decomposition analysis already includes auxiliary runs that hold the channel fixed while varying state (and vice versa); we will surface these comparisons more clearly rather than treating them as supplementary. revision: yes

  2. Referee: Decomposition Analysis (results section): The paper reports that the gap reflects both state-dependent and channel-specific contributions but provides no quantitative breakdown or controlled ablation that measures the marginal effect of each factor separately. Without such measurements, it remains unclear whether the state transition itself is the dominant driver or whether the channel change accounts for most of the amplification.

    Authors: We accept that the current presentation of the decomposition is insufficiently quantitative. In the revised manuscript we will add a table and accompanying text that reports attack success rates for the four crossed conditions (execution/tool, clarification/user, execution/user-simulated, clarification/tool-simulated). This will permit direct calculation of the marginal state-dependent and channel-specific contributions. We will also state the relative sizes of each component based on those measurements. revision: yes
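With attack success rates for all four crossed conditions, the marginal contributions can be computed by simple differences. The sketch below shows a standard 2x2 difference decomposition. The condition names follow the Figure 3 caption (exec_tool, clarif_tool, exec_user, clarif_user); the assumption that the headline gap compares clarif_user with exec_tool, the function name, and the toy numbers are all hypothetical, and the paper's own decomposition may be defined differently.

```python
def decompose(asr: dict[str, float]) -> dict[str, float]:
    """Split the headline ASR gap into state and channel differences (percentage points)."""
    # State effect: change the agent state while holding the channel fixed.
    state_tool = asr["clarif_tool"] - asr["exec_tool"]
    state_user = asr["clarif_user"] - asr["exec_user"]
    # Channel effect: change the channel while holding the agent state fixed.
    channel_exec = asr["exec_user"] - asr["exec_tool"]
    channel_clarif = asr["clarif_user"] - asr["clarif_tool"]
    return {
        "total_gap": asr["clarif_user"] - asr["exec_tool"],
        "state_effect_tool_channel": state_tool,
        "state_effect_user_channel": state_user,
        "channel_effect_exec_state": channel_exec,
        "channel_effect_clarif_state": channel_clarif,
        # Non-zero interaction means the state effect depends on the channel.
        "interaction": state_user - state_tool,
    }


# Toy ASR values in percent, invented for illustration (not from the paper).
toy_asr = {"exec_tool": 2.0, "clarif_tool": 10.0, "exec_user": 8.0, "clarif_user": 30.0}
for name, value in decompose(toy_asr).items():
    print(f"{name}: {value:+.1f}")
```

Note that the total gap can be reached along two paths (state first, then channel, or the reverse), and the two paths assign different shares to each factor whenever the interaction term is non-zero. Any statement about which factor "dominates" therefore needs to say which path or averaging convention it uses.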

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with measured attack rates


The paper presents an empirical benchmark study (ASPI) that directly measures attack success rates across matched execution and clarification settings on 728 scenarios for ten LLMs. The central findings (e.g., attack success rising from 1.8% to 34.0% for o3) are reported as observed outcomes under controlled conditions rather than derived from equations, fitted parameters, or first-principles predictions. No self-definitional steps, uniqueness theorems, or ansatzes appear; the decomposition analysis is likewise an empirical breakdown of measured gaps. The work is self-contained against external benchmarks and does not reduce its claims to prior self-citations or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the clarification interface and state can be cleanly isolated from other prompt and tool differences; no free parameters or new entities are introduced.

axioms (1)
  • Domain assumption: The clarification-seeking behavior can be isolated as a distinct agent state without altering other aspects of the prompt or tool interactions.
    This premise enables the matched comparison between execution and clarification settings described in the abstract.

pith-pipeline@v0.9.0 · 5828 in / 1318 out tokens · 57127 ms · 2026-05-19T23:49:21.895873+00:00 · methodology

