ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

arxiv: 2605.17324 · v1 · pith:DLHZDBPJnew · submitted 2026-05-17 · 💻 cs.CR · cs.AI

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

Udari Madhushani Sehwag , Zhengyang Shan , Heming Liu , Dileepa Lakshan , Joseph Brandifino , Max Fenkell This is my paper

Pith reviewed 2026-05-19 23:49 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectionLLM agentsclarification-seekingvulnerabilitysecurity benchmarkambiguity resolutionASPIfrontier models

0 comments p. Extension

pith:DLHZDBPJ Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{DLHZDBPJ}

Prints a linked pith:DLHZDBPJ badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Seeking clarification on ambiguous tasks makes LLM agents far more vulnerable to prompt injection attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether LLM agents become more susceptible to prompt injection when they enter a clarification-seeking state to resolve task ambiguity. The authors create a benchmark called ASPI with 728 scenarios that compare agent behavior under fully specified instructions versus when they must ask for and use additional user input. Across ten frontier models, they find that attack success rates increase dramatically in the clarification setting, for example rising from 1.8 percent to 34 percent for one model. This matters because clarification is promoted as a safe way to handle underspecified tasks, yet it appears to create an exploitable channel that standard security tests on clear tasks do not capture.

Core claim

The paper establishes that the transition to a clarification-seeking state in LLM agents substantially increases susceptibility to prompt injection attacks. In the ASPI benchmark, agents in the clarification setting must request and incorporate user input before acting, while in the execution setting they receive fully specified instructions and encounter adversarial content only via tool returns. Evaluations show consistent amplification of attack success, such as from 1.8% to 34.0% for o3 and 2.2% to 35.7% for Gemini-3-Flash. The increase stems from both a shift in how models process content in this state and effects from the clarification interface itself.

What carries the argument

The ASPI benchmark, which isolates the clarification-seeking state by using matched pairs of execution and clarification conditions for each of the 728 task-attack scenarios.

If this is right

Standard security evaluations on fully specified tasks will underestimate the attack surface of interactive agents.
Robustness under clear instructions does not translate to robustness when agents must request and use clarifying input.
The vulnerability gap arises from both state-dependent changes in how models process incoming content and channel-specific effects from the clarification interface.
Real-world agent deployments that rely on clarification may expose users to higher prompt injection risks than current testing suggests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers might add input validation specific to the clarification channel to offset the increased exposure.
Similar amplification could occur in other interactive patterns such as confirming tool outputs or handling follow-up questions.
Security benchmarks for agents should include ambiguity-resolution states by default rather than testing only direct execution.
Fine-tuning on clarification exchanges could be tested as a way to reduce the observed state-dependent vulnerability.

Load-bearing premise

The benchmark successfully isolates the clarification-seeking state transition as the sole variable without differences in prompt formatting, tool-return handling, or user-input channel independently affecting attack success.

What would settle it

Re-running the benchmark on the same models but finding no significant difference in attack success rates between the execution and clarification settings would indicate that the state transition does not amplify vulnerability.

Figures

Figures reproduced from arXiv: 2605.17324 by Dileepa Lakshan, Heming Liu, Joseph Brandifino, Max Fenkell, Udari Madhushani Sehwag, Zhengyang Shan.

**Figure 2.** Figure 2: Attack success rates (ASR) for execution-time tool-channel attacks ( [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Paired ΔASR by channel. Left: tool (clarif_tool vs. exec_tool). Right: user (clarif_user vs. exec_user). Points show mean differences with 95% CIs; dashed line = zero. Statistical significance is assessed using exact McNemar tests ( ∗𝑝 < 0.05, ∗∗𝑝 < 0.01, ∗∗∗𝑝 < 0.001). exec_user, suggesting that the clarification interface exposes a new high-impact attack surface. Overall, vulnerability depends jointly o… view at source ↗

**Figure 4.** Figure 4: Attack success rates (ASR, %) across conditions for each model. Each panel shows execution conditions ( [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Task utility (%) across all conditions for each model, including benign baselines and attacked settings. Each panel [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Trade-offs between robustness and capability. (a) ASR versus clean-task utility shows that models with high [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Attack success rates (ASR) for execution-time tool-channel attacks [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Paired clarification–execution differences in attack success rate under defenses. The left panel compares [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: Clarification-question quality by model. Each bar shows the distribution of clarification outcomes under ambiguity [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗

**Figure 10.** Figure 10: Post-clarification behavior under attacked clarification responses. Each bar shows the share of continuations [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗

**Figure 11.** Figure 11: Judge-classified attack compliance by condition. Each bar shows the distribution of attack behavior across execution [PITH_FULL_IMAGE:figures/full_fig_p031_11.png] view at source ↗

**Figure 12.** Figure 12: Relationship between clarification-question quality and post-clarification behavior. Rows indicate whether the [PITH_FULL_IMAGE:figures/full_fig_p032_12.png] view at source ↗

**Figure 13.** Figure 13: Judge-attributed compliance reason by condition. For trajectories with attack-following behavior, the judge assigns [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗

read the original abstract

Clarification-seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification-seeking state increases an agent's susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous-State Prompt Injection), a benchmark of 728 task-attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool-returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification-seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash. A decomposition analysis reveals that this gap reflects both a state-dependent shift in how models process incoming content and a channel-specific effect arising from the agent-solicited clarification interface. These findings demonstrate that standard execution-time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at https://github.com/scaleapi/aspi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Clarification-seeking boosts prompt injection rates in agents from low single digits to mid-30s percent, but the benchmark needs tighter checks on whether channel and formatting differences drive part of the gap.

read the letter

The main takeaway is that clarification-seeking in LLM agents appears to amplify prompt injection vulnerability substantially. Attack success jumps from 1.8% to 34% for o3 and similar gaps show up for Gemini-3-Flash, based on their 728 scenarios across ten models. This extends prior prompt injection work into interactive agent states rather than just static prompts. The paper introduces the ASPI benchmark to compare a fully specified execution setting against one where the agent must request and use additional user input. They also run a decomposition that attributes the gap to both a shift in how models process content and a channel effect from the clarification interface. Releasing code and data on GitHub is useful for anyone who wants to inspect or rerun the experiments. The empirical scale and the focus on real deployment patterns are the stronger parts here. The softer spot is the isolation of the clarification state itself. The abstract acknowledges both state-dependent and channel-specific contributions, which aligns with the stress-test concern that differences in input handling or prompt formatting between tool returns and user clarification could affect results on their own. Without seeing the exact prompt constructions and any quantitative breakdown of how much each factor contributes, it is hard to pin the full effect on the internal state transition. This paper is for security researchers and developers working on LLM agents that operate in ambiguous settings. Readers who care about prompt injection defenses or agent evaluation practices would get concrete numbers to think about. It has enough new empirical content and reproducibility effort to deserve a serious referee, mainly to verify the controls and the decomposition details. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the ASPI benchmark comprising 728 task-attack scenarios to test whether clarification-seeking behavior in LLM agents increases susceptibility to prompt injection. Each scenario is evaluated in matched execution (fully specified instruction, adversarial content via tool returns) and clarification (agent requests and incorporates user input) settings across ten frontier models. Results show large increases in attack success under clarification (e.g., 1.8% to 34.0% for o3; 2.2% to 35.7% for Gemini-3-Flash). A decomposition analysis attributes the gap to both a state-dependent processing shift and a channel-specific effect from the agent-solicited clarification interface. The authors conclude that execution-time security evaluations underestimate the attack surface of interactive agents and release data and code for reproducibility.

Significance. If the central empirical finding is robust, the work identifies a practically relevant security risk tied to a desirable agent capability (clarification-seeking). The explicit release of the benchmark, data, and source code at the provided GitHub repository is a clear strength that supports verification and extension. The results challenge the transferability of robustness from fully specified tasks to ambiguous, interactive settings and therefore bear on the design of secure LLM agent systems.

major comments (2)

[§3] §3 (Benchmark Construction): The matched execution and clarification conditions are presented as isolating the clarification state, yet the manuscript does not detail how prompt formatting, tokenization, or presentation order of user-solicited input is aligned with tool-returned content. Because the decomposition analysis already acknowledges a channel-specific effect, the absence of explicit controls or ablations that hold formatting and channel constant while varying only the internal state leaves the attribution of the observed jumps (e.g., ~2% to ~35%) insecure.
[Decomposition Analysis] Decomposition Analysis (results section): The paper reports that the gap reflects both state-dependent and channel-specific contributions but provides no quantitative breakdown or controlled ablation that measures the marginal effect of each factor separately. Without such measurements, it remains unclear whether the state transition itself is the dominant driver or whether the channel change accounts for most of the amplification.

minor comments (2)

[Abstract] Abstract: The sentence describing the decomposition analysis could be tightened to state explicitly that both effects are present rather than implying the gap is fully explained by the state transition.
[Results] Table or figure captions (results): Ensure that all reported attack-success percentages are accompanied by the exact number of trials or scenarios per cell so readers can assess statistical reliability of the reported deltas.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the ASPI benchmark and decomposition analysis. The feedback highlights opportunities to strengthen the description of our controls and to make the quantitative attribution more explicit. We address each major comment below and will incorporate clarifications and additional details in the revised manuscript.

read point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The matched execution and clarification conditions are presented as isolating the clarification state, yet the manuscript does not detail how prompt formatting, tokenization, or presentation order of user-solicited input is aligned with tool-returned content. Because the decomposition analysis already acknowledges a channel-specific effect, the absence of explicit controls or ablations that hold formatting and channel constant while varying only the internal state leaves the attribution of the observed jumps (e.g., ~2% to ~35%) insecure.

Authors: We agree that greater transparency on formatting and presentation details would improve the manuscript. The benchmark was constructed so that the adversarial payload is identical in text and length and always appears as the final message before the agent's turn; the only intentional difference is the message role label (user-solicited clarification versus tool return). In the revised version we will add the exact prompt templates, token-count statistics, and ordering description to §3. The decomposition analysis already includes auxiliary runs that hold the channel fixed while varying state (and vice versa); we will surface these comparisons more clearly rather than treating them as supplementary. revision: yes
Referee: [Decomposition Analysis] Decomposition Analysis (results section): The paper reports that the gap reflects both state-dependent and channel-specific contributions but provides no quantitative breakdown or controlled ablation that measures the marginal effect of each factor separately. Without such measurements, it remains unclear whether the state transition itself is the dominant driver or whether the channel change accounts for most of the amplification.

Authors: We accept that the current presentation of the decomposition is insufficiently quantitative. In the revised manuscript we will add a table and accompanying text that reports attack success rates for the four crossed conditions (execution/tool, clarification/user, execution/user-simulated, clarification/tool-simulated). This will permit direct calculation of the marginal state-dependent and channel-specific contributions. We will also state the relative sizes of each component based on those measurements. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with measured attack rates

full rationale

The paper presents an empirical benchmark study (ASPI) that directly measures attack success rates across matched execution and clarification settings on 728 scenarios for ten LLMs. The central findings (e.g., attack success rising from 1.8% to 34.0% for o3) are reported as observed outcomes under controlled conditions rather than derived from equations, fitted parameters, or first-principles predictions. No self-definitional steps, uniqueness theorems, or ansatzes appear; the decomposition analysis is likewise an empirical breakdown of measured gaps. The work is self-contained against external benchmarks and does not reduce its claims to prior self-citations or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the clarification interface and state can be cleanly isolated from other prompt and tool differences; no free parameters or new entities are introduced.

axioms (1)

domain assumption The clarification-seeking behavior can be isolated as a distinct agent state without altering other aspects of the prompt or tool interactions.
This premise enables the matched comparison between execution and clarification settings described in the abstract.

pith-pipeline@v0.9.0 · 5828 in / 1318 out tokens · 57127 ms · 2026-05-19T23:49:21.895873+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 10 internal anchors

[1]

2026 , eprint=

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection , author=. 2026 , eprint=

work page 2026
[2]

2026 , eprint=

AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification , author=. 2026 , eprint=

work page 2026
[3]

2026 , eprint=

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations , author=. 2026 , eprint=

work page 2026
[4]

2026 , eprint=

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition , author=. 2026 , eprint=

work page 2026
[5]

2026 , url=

Philippe Laban and Hiroaki Hayashi and Yingbo Zhou and Jennifer Neville , booktitle=. 2026 , url=

work page 2026
[6]

Sanidhya Vijayvargiya and Xuhui Zhou and Akhila Yerukola and Maarten Sap and Graham Neubig , booktitle=. Ambig-. 2026 , url=

work page 2026
[7]

2026 , eprint=

Value of Information: A Framework for Human-Agent Communication , author=. 2026 , eprint=

work page 2026
[8]

2026 , eprint=

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents , author=. 2026 , eprint=

work page 2026
[9]

2024 , publisher =

ProtectAI.com , title =. 2024 , publisher =

work page 2024
[10]

2023 , howpublished =

The Dual LLM Pattern for Building AI Assistants that Can Resist Prompt Injection , author =. 2023 , howpublished =

work page 2023
[11]

The Twelfth International Conference on Learning Representations , year=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. The Twelfth International Conference on Learning Representations , year=

work page
[12]

Clarify When Necessary: Resolving Ambiguity Through Interaction with LM s

Zhang, Michael JQ and Choi, Eunsol. Clarify When Necessary: Resolving Ambiguity Through Interaction with LM s. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.306

work page doi:10.18653/v1/2025.findings-naacl.306 2025
[13]

I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.624

work page doi:10.18653/v1/2024.findings-acl.624 2024
[14]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =

Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao , title =. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =. 2025 , isbn =. doi:10.1145/3690624.3709179 , abstract =

work page doi:10.1145/3690624.3709179 2025
[15]

Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik R Narasimhan , booktitle=. \ \. 2025 , url=

work page 2025
[16]

Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Zhiruo Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Melroy Maben and Raj Mehta and Wayne Chi and Lawrence Keunho Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig , bookti...

work page 2025
[17]

2026 , eprint=

ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context , author=. 2026 , eprint=

work page 2026
[18]

2025 , eprint=

IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems , author=. 2025 , eprint=

work page 2025
[19]

2024 , eprint=

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation , author=. 2024 , eprint=

work page 2024
[20]

2025 , eprint=

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows , author=. 2025 , eprint=

work page 2025
[21]

2024 , eprint=

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , author=. 2024 , eprint=

work page 2024
[22]

2026 , eprint=

Structured Uncertainty guided Clarification for LLM Agents , author=. 2026 , eprint=

work page 2026
[23]

Maddison and Tatsunori Hashimoto , booktitle=

Yangjun Ruan and Honghua Dong and Andrew Wang and Silviu Pitis and Yongchao Zhou and Jimmy Ba and Yann Dubois and Chris J. Maddison and Tatsunori Hashimoto , booktitle=. Identifying the Risks of. 2024 , url=

work page 2024
[24]

2025 , eprint=

UserBench: An Interactive Gym Environment for User-Centric Agents , author=. 2025 , eprint=

work page 2025
[25]

2026 , eprint=

LHAW: Controllable Underspecification for Long-Horizon Tasks , author=. 2026 , eprint=

work page 2026
[26]

2022 , eprint=

Ignore Previous Prompt: Attack Techniques For Language Models , author=. 2022 , eprint=

work page 2022
[27]

A mbig QA : Answering Ambiguous Open-domain Questions

Min, Sewon and Michael, Julian and Hajishirzi, Hannaneh and Zettlemoyer, Luke. A mbig QA : Answering Ambiguous Open-domain Questions. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.466

work page doi:10.18653/v1/2020.emnlp-main.466 2020
[28]

33rd USENIX Security Symposium (USENIX Security 24) , year =

Yupei Liu and Yuqi Jia and Runpeng Geng and Jinyuan Jia and Neil Zhenqiang Gong , title =. 33rd USENIX Security Symposium (USENIX Security 24) , year =

work page
[29]

AgentBench: Evaluating

Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang , booktitle=. AgentBench: Evaluat...

work page 2024
[30]

Li and Been Kim and Zi Wang , booktitle=

Belinda Z. Li and Been Kim and Zi Wang , booktitle=. QuestBench: Can. 2025 , url=

work page 2025
[31]

2024 , eprint=

Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

work page 2024
[32]

Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , title =. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =. 2023 , isbn =. doi:10.1145/3605764.3623985 , abstract =

work page doi:10.1145/3605764.3623985 2023
[33]

Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security , pages =

Chen, Sizhe and Wang, Yizhu and Carlini, Nicholas and Sitawarin, Chawin and Wagner, David , title =. Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security , pages =. 2026 , isbn =. doi:10.1145/3733799.3762982 , abstract =

work page doi:10.1145/3733799.3762982 2026
[34]

2026 , eprint=

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers , author=. 2026 , eprint=

work page 2026
[35]

2020 , eprint=

ConvAI3: Generating Clarifying Questions for Open-Domain Dialogue Systems (ClariQ) , author=. 2020 , eprint=

work page 2020
[36]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , url =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram\`. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , url =. Advances in Neural Information Processing Systems , doi =

work page
[37]

VitaBench: Benchmarking

Wei He and Yueqing Sun and Hongyan Hao and Xueyuan Hao and Zhikang Xia and Qi GU and Hui Su and Xunliang Cai , booktitle=. VitaBench: Benchmarking. 2026 , url=

work page 2026
[38]

2019 , eprint=

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. 2019 , eprint=

work page 2019
[39]

2020 , eprint=

SuperGlue: Learning Feature Matching with Graph Neural Networks , author=. 2020 , eprint=

work page 2020
[40]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

work page 2021
[41]

2025 , eprint=

BIGbench: A Unified Benchmark for Evaluating Multi-dimensional Social Biases in Text-to-Image Models , author=. 2025 , eprint=

work page 2025
[42]

2022 , eprint=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. 2022 , eprint=

work page 2022
[43]

2024 , eprint=

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? , author=. 2024 , eprint=

work page 2024
[44]

2021 , eprint=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. 2021 , eprint=

work page 2021
[45]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[46]

2024 , eprint=

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems , author=. 2024 , eprint=

work page 2024
[47]

2018 , eprint=

Winograd Schema - Knowledge Extraction Using Narrative Chains , author=. 2018 , eprint=

work page 2018
[48]

2019 , eprint=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

work page 2019
[49]

2025 , eprint=

ARC Prize 2024: Technical Report , author=. 2025 , eprint=

work page 2024
[50]

2021 , eprint=

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies , author=. 2021 , eprint=

work page 2021
[51]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

work page 2021
[52]

2021 , eprint=

Program Synthesis with Large Language Models , author=. 2021 , eprint=

work page 2021
[53]

2023 , eprint=

Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks , author=. 2023 , eprint=

work page 2023
[54]

2021 , eprint=

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , author=. 2021 , eprint=

work page 2021
[55]

CoRR , volume =

Marc-Alexandre C\^ot\'e and \'Akos K\'ad\'ar and Xingdi Yuan and Ben Kybartas and Tavian Barnes and Emery Fine and James Moore and Ruo Yu Tao and Matthew Hausknecht and Layla El Asri and Mahmoud Adada and Wendy Tay and Adam Trischler , title =. CoRR , volume =

work page
[56]

2022 , eprint=

ScienceWorld: Is your Agent Smarter than a 5th Grader? , author=. 2022 , eprint=

work page 2022
[57]

2018 , eprint=

Medical Exam Question Answering with Large-scale Reading Comprehension , author=. 2018 , eprint=

work page 2018
[58]

2022 , eprint=

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering , author=. 2022 , eprint=

work page 2022
[59]

2019 , eprint=

PubMedQA: A Dataset for Biomedical Research Question Answering , author=. 2019 , eprint=

work page 2019
[60]

Proceedings of AAAI Information Retrieval and Knowledge Discovery in Biomedical Text , interhash =

Tsatsaronis, George and Schroeder, Michael and Paliouras, Georgios and Almirantis, Yannis and Androutsopoulos, Ion and Gaussier, Eric and Gallinari, Patrick and Artieres, Thierry and Alvers,. Proceedings of AAAI Information Retrieval and Knowledge Discovery in Biomedical Text , interhash =

work page
[61]

2022 , eprint=

FinQA: A Dataset of Numerical Reasoning over Financial Data , author=. 2022 , eprint=

work page 2022
[62]

2023 , eprint=

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023
[63]

2025 , eprint=

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research , author=. 2025 , eprint=

work page 2025
[64]

2025 , eprint=

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? , author=. 2025 , eprint=

work page 2025
[65]

2024 , eprint=

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation , author=. 2024 , eprint=

work page 2024
[66]

2025 , eprint=

AIDE: AI-Driven Exploration in the Space of Code , author=. 2025 , eprint=

work page 2025
[67]

2025 , eprint=

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering , author=. 2025 , eprint=

work page 2025
[68]

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics , author=. arXiv preprint arXiv:2509.20374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[69]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Healthbench: Evaluating large language models towards improved human health , author=. arXiv preprint arXiv:2505.08775 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[70]

arXiv preprint arXiv:2505.06108 , year=

LLMs Outperform Experts on Challenging Biology Benchmarks , author=. arXiv preprint arXiv:2505.06108 , year=

work page arXiv
[71]

PaperBench: Evaluating AI’s Ability to Replicate AI Research , author=

work page
[72]

AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

Autoreproduce: Automatic ai experiment reproduction with paper lineage , author=. arXiv preprint arXiv:2505.20662 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[73]

arXiv preprint arXiv:2506.02314 , year=

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code , author=. arXiv preprint arXiv:2506.02314 , year=

work page arXiv
[74]

arXiv preprint arXiv:2505.24785 , year=

EXP-Bench: Can AI Conduct AI Research Experiments? , author=. arXiv preprint arXiv:2505.24785 , year=

work page arXiv
[75]

arXiv preprint arXiv:2506.17335 , year=

LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research , author=. arXiv preprint arXiv:2506.17335 , year=

work page arXiv
[76]

arXiv preprint arXiv:2504.20115 , year=

AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers , author=. arXiv preprint arXiv:2504.20115 , year=

work page arXiv
[77]

Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pages=

FrontierScience Bench: Evaluating AI Research Capabilities in LLMs , author=. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pages=

work page 2025
[78]

bioRxiv , pages=

Replicating a high-impact scientific publication using systems of large language models , author=. bioRxiv , pages=. 2024 , publisher=

work page 2024
[79]

2024 , eprint=

LAB-Bench: Measuring Capabilities of Language Models for Biology Research , author=. 2024 , eprint=

work page 2024
[80]

2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle , year=

Large language models for rediscovering unseen chemistry scientific hypotheses , author=. 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle , year=

work page

Showing first 80 references.

[1] [1]

2026 , eprint=

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection , author=. 2026 , eprint=

work page 2026

[2] [2]

2026 , eprint=

AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification , author=. 2026 , eprint=

work page 2026

[3] [3]

2026 , eprint=

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations , author=. 2026 , eprint=

work page 2026

[4] [4]

2026 , eprint=

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition , author=. 2026 , eprint=

work page 2026

[5] [5]

2026 , url=

Philippe Laban and Hiroaki Hayashi and Yingbo Zhou and Jennifer Neville , booktitle=. 2026 , url=

work page 2026

[6] [6]

Sanidhya Vijayvargiya and Xuhui Zhou and Akhila Yerukola and Maarten Sap and Graham Neubig , booktitle=. Ambig-. 2026 , url=

work page 2026

[7] [7]

2026 , eprint=

Value of Information: A Framework for Human-Agent Communication , author=. 2026 , eprint=

work page 2026

[8] [8]

2026 , eprint=

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents , author=. 2026 , eprint=

work page 2026

[9] [9]

2024 , publisher =

ProtectAI.com , title =. 2024 , publisher =

work page 2024

[10] [10]

2023 , howpublished =

The Dual LLM Pattern for Building AI Assistants that Can Resist Prompt Injection , author =. 2023 , howpublished =

work page 2023

[11] [11]

The Twelfth International Conference on Learning Representations , year=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. The Twelfth International Conference on Learning Representations , year=

work page

[12] [12]

Clarify When Necessary: Resolving Ambiguity Through Interaction with LM s

Zhang, Michael JQ and Choi, Eunsol. Clarify When Necessary: Resolving Ambiguity Through Interaction with LM s. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.306

work page doi:10.18653/v1/2025.findings-naacl.306 2025

[13] [13]

I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.624

work page doi:10.18653/v1/2024.findings-acl.624 2024

[14] [14]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =

Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao , title =. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =. 2025 , isbn =. doi:10.1145/3690624.3709179 , abstract =

work page doi:10.1145/3690624.3709179 2025

[15] [15]

Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik R Narasimhan , booktitle=. \ \. 2025 , url=

work page 2025

[16] [16]

Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Zhiruo Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Melroy Maben and Raj Mehta and Wayne Chi and Lawrence Keunho Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig , bookti...

work page 2025

[17] [17]

2026 , eprint=

ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context , author=. 2026 , eprint=

work page 2026

[18] [18]

2025 , eprint=

IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems , author=. 2025 , eprint=

work page 2025

[19] [19]

2024 , eprint=

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation , author=. 2024 , eprint=

work page 2024

[20] [20]

2025 , eprint=

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows , author=. 2025 , eprint=

work page 2025

[21] [21]

2024 , eprint=

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , author=. 2024 , eprint=

work page 2024

[22] [22]

2026 , eprint=

Structured Uncertainty guided Clarification for LLM Agents , author=. 2026 , eprint=

work page 2026

[23] [23]

Maddison and Tatsunori Hashimoto , booktitle=

Yangjun Ruan and Honghua Dong and Andrew Wang and Silviu Pitis and Yongchao Zhou and Jimmy Ba and Yann Dubois and Chris J. Maddison and Tatsunori Hashimoto , booktitle=. Identifying the Risks of. 2024 , url=

work page 2024

[24] [24]

2025 , eprint=

UserBench: An Interactive Gym Environment for User-Centric Agents , author=. 2025 , eprint=

work page 2025

[25] [25]

2026 , eprint=

LHAW: Controllable Underspecification for Long-Horizon Tasks , author=. 2026 , eprint=

work page 2026

[26] [26]

2022 , eprint=

Ignore Previous Prompt: Attack Techniques For Language Models , author=. 2022 , eprint=

work page 2022

[27] [27]

A mbig QA : Answering Ambiguous Open-domain Questions

Min, Sewon and Michael, Julian and Hajishirzi, Hannaneh and Zettlemoyer, Luke. A mbig QA : Answering Ambiguous Open-domain Questions. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.466

work page doi:10.18653/v1/2020.emnlp-main.466 2020

[28] [28]

33rd USENIX Security Symposium (USENIX Security 24) , year =

Yupei Liu and Yuqi Jia and Runpeng Geng and Jinyuan Jia and Neil Zhenqiang Gong , title =. 33rd USENIX Security Symposium (USENIX Security 24) , year =

work page

[29] [29]

AgentBench: Evaluating

Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang , booktitle=. AgentBench: Evaluat...

work page 2024

[30] [30]

Li and Been Kim and Zi Wang , booktitle=

Belinda Z. Li and Been Kim and Zi Wang , booktitle=. QuestBench: Can. 2025 , url=

work page 2025

[31] [31]

2024 , eprint=

Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

work page 2024

[32] [32]

Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , title =. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =. 2023 , isbn =. doi:10.1145/3605764.3623985 , abstract =

work page doi:10.1145/3605764.3623985 2023

[33] [33]

Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security , pages =

Chen, Sizhe and Wang, Yizhu and Carlini, Nicholas and Sitawarin, Chawin and Wagner, David , title =. Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security , pages =. 2026 , isbn =. doi:10.1145/3733799.3762982 , abstract =

work page doi:10.1145/3733799.3762982 2026

[34] [34]

2026 , eprint=

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers , author=. 2026 , eprint=

work page 2026

[35] [35]

2020 , eprint=

ConvAI3: Generating Clarifying Questions for Open-Domain Dialogue Systems (ClariQ) , author=. 2020 , eprint=

work page 2020

[36] [36]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , url =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram\`. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , url =. Advances in Neural Information Processing Systems , doi =

work page

[37] [37]

VitaBench: Benchmarking

Wei He and Yueqing Sun and Hongyan Hao and Xueyuan Hao and Zhikang Xia and Qi GU and Hui Su and Xunliang Cai , booktitle=. VitaBench: Benchmarking. 2026 , url=

work page 2026

[38] [38]

2019 , eprint=

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. 2019 , eprint=

work page 2019

[39] [39]

2020 , eprint=

SuperGlue: Learning Feature Matching with Graph Neural Networks , author=. 2020 , eprint=

work page 2020

[40] [40]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

work page 2021

[41] [41]

2025 , eprint=

BIGbench: A Unified Benchmark for Evaluating Multi-dimensional Social Biases in Text-to-Image Models , author=. 2025 , eprint=

work page 2025

[42] [42]

2022 , eprint=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. 2022 , eprint=

work page 2022

[43] [43]

2024 , eprint=

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? , author=. 2024 , eprint=

work page 2024

[44] [44]

2021 , eprint=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. 2021 , eprint=

work page 2021

[45] [45]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021

[46] [46]

2024 , eprint=

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems , author=. 2024 , eprint=

work page 2024

[47] [47]

2018 , eprint=

Winograd Schema - Knowledge Extraction Using Narrative Chains , author=. 2018 , eprint=

work page 2018

[48] [48]

2019 , eprint=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

work page 2019

[49] [49]

2025 , eprint=

ARC Prize 2024: Technical Report , author=. 2025 , eprint=

work page 2024

[50] [50]

2021 , eprint=

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies , author=. 2021 , eprint=

work page 2021

[51] [51]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

work page 2021

[52] [52]

2021 , eprint=

Program Synthesis with Large Language Models , author=. 2021 , eprint=

work page 2021

[53] [53]

2023 , eprint=

Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks , author=. 2023 , eprint=

work page 2023

[54] [54]

2021 , eprint=

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , author=. 2021 , eprint=

work page 2021

[55] [55]

CoRR , volume =

Marc-Alexandre C\^ot\'e and \'Akos K\'ad\'ar and Xingdi Yuan and Ben Kybartas and Tavian Barnes and Emery Fine and James Moore and Ruo Yu Tao and Matthew Hausknecht and Layla El Asri and Mahmoud Adada and Wendy Tay and Adam Trischler , title =. CoRR , volume =

work page

[56] [56]

2022 , eprint=

ScienceWorld: Is your Agent Smarter than a 5th Grader? , author=. 2022 , eprint=

work page 2022

[57] [57]

2018 , eprint=

Medical Exam Question Answering with Large-scale Reading Comprehension , author=. 2018 , eprint=

work page 2018

[58] [58]

2022 , eprint=

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering , author=. 2022 , eprint=

work page 2022

[59] [59]

2019 , eprint=

PubMedQA: A Dataset for Biomedical Research Question Answering , author=. 2019 , eprint=

work page 2019

[60] [60]

Proceedings of AAAI Information Retrieval and Knowledge Discovery in Biomedical Text , interhash =

Tsatsaronis, George and Schroeder, Michael and Paliouras, Georgios and Almirantis, Yannis and Androutsopoulos, Ion and Gaussier, Eric and Gallinari, Patrick and Artieres, Thierry and Alvers,. Proceedings of AAAI Information Retrieval and Knowledge Discovery in Biomedical Text , interhash =

work page

[61] [61]

2022 , eprint=

FinQA: A Dataset of Numerical Reasoning over Financial Data , author=. 2022 , eprint=

work page 2022

[62] [62]

2023 , eprint=

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023

[63] [63]

2025 , eprint=

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research , author=. 2025 , eprint=

work page 2025

[64] [64]

2025 , eprint=

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? , author=. 2025 , eprint=

work page 2025

[65] [65]

2024 , eprint=

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation , author=. 2024 , eprint=

work page 2024

[66] [66]

2025 , eprint=

AIDE: AI-Driven Exploration in the Space of Code , author=. 2025 , eprint=

work page 2025

[67] [67]

2025 , eprint=

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering , author=. 2025 , eprint=

work page 2025

[68] [68]

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics , author=. arXiv preprint arXiv:2509.20374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[69] [69]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Healthbench: Evaluating large language models towards improved human health , author=. arXiv preprint arXiv:2505.08775 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[70] [70]

arXiv preprint arXiv:2505.06108 , year=

LLMs Outperform Experts on Challenging Biology Benchmarks , author=. arXiv preprint arXiv:2505.06108 , year=

work page arXiv

[71] [71]

PaperBench: Evaluating AI’s Ability to Replicate AI Research , author=

work page

[72] [72]

AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

Autoreproduce: Automatic ai experiment reproduction with paper lineage , author=. arXiv preprint arXiv:2505.20662 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[73] [73]

arXiv preprint arXiv:2506.02314 , year=

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code , author=. arXiv preprint arXiv:2506.02314 , year=

work page arXiv

[74] [74]

arXiv preprint arXiv:2505.24785 , year=

EXP-Bench: Can AI Conduct AI Research Experiments? , author=. arXiv preprint arXiv:2505.24785 , year=

work page arXiv

[75] [75]

arXiv preprint arXiv:2506.17335 , year=

LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research , author=. arXiv preprint arXiv:2506.17335 , year=

work page arXiv

[76] [76]

arXiv preprint arXiv:2504.20115 , year=

AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers , author=. arXiv preprint arXiv:2504.20115 , year=

work page arXiv

[77] [77]

Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pages=

FrontierScience Bench: Evaluating AI Research Capabilities in LLMs , author=. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pages=

work page 2025

[78] [78]

bioRxiv , pages=

Replicating a high-impact scientific publication using systems of large language models , author=. bioRxiv , pages=. 2024 , publisher=

work page 2024

[79] [79]

2024 , eprint=

LAB-Bench: Measuring Capabilities of Language Models for Biology Research , author=. 2024 , eprint=

work page 2024

[80] [80]

2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle , year=

Large language models for rediscovering unseen chemistry scientific hypotheses , author=. 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle , year=

work page