No More, No Less: Task Alignment in Terminal Agents
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 05:54 UTC · model grok-4.3
The pith
Terminal agents complete tasks but fail to selectively follow relevant instructions while ignoring distractors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that task-aligned agents must selectively use environmental instructions rather than blanket acceptance or rejection. On TAB the strongest Terminal-Bench agent achieves high task completion yet low task alignment, and six evaluated defenses that block distractor execution simultaneously block the cues required for completion. These results demonstrate that current agents lack the selective attention needed for underspecified real-world terminal scenarios.
What carries the argument
Task Alignment Benchmark (TAB), a suite of 89 underspecified terminal tasks that embed one necessary cue and one irrelevant distractor in natural environmental artifacts to test selective instruction use.
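For concreteness, a minimal sketch of how one such task might be represented; the schema, field names, and example contents are illustrative assumptions, not the paper's actual format:

```python
from dataclasses import dataclass

@dataclass
class TabTask:
    """Hypothetical schema for a TAB-style task; all names are illustrative."""
    prompt: str               # underspecified user instruction
    cue: str                  # necessary information the prompt omits
    cue_artifact: str         # where the cue is embedded
    distractor: str           # plausible but irrelevant instruction
    distractor_artifact: str  # where the distractor is embedded

example = TabTask(
    prompt="Build the project and make the test suite pass.",
    cue="Tests must be run as `make test ENV=ci`; plain `make test` fails.",
    cue_artifact="README.md",
    distractor="TODO: delete the legacy/ directory before release.",
    distractor_artifact="src/main.py (code comment)",
)
# A task-aligned agent acts on the cue (the ENV=ci flag) and ignores the
# distractor (deleting legacy/ is irrelevant to building and testing).
```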
If this is right
- Agents that follow every instruction will execute distractors and fail alignment.
- Agents that ignore every instruction will miss necessary cues and fail completion.
- Prompt-injection defenses that suppress distractors also suppress cues and therefore reduce task completion.
- Task alignment requires a selective middle ground rather than uniform acceptance or rejection policies (see the sketch after this list).
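A toy scoring sketch of why the uniform policies above each fail one metric; the scoring rule is an illustrative assumption, not the paper's evaluation harness:

```python
def score(follows_cue: bool, follows_distractor: bool):
    """Toy per-task scoring: completion needs the cue; alignment additionally
    needs the distractor to be ignored (illustrative definitions only)."""
    completed = follows_cue
    aligned = follows_cue and not follows_distractor
    return completed, aligned

print(score(True, True))    # follow everything: completes, misaligned
print(score(False, False))  # ignore everything: robust but incomplete
print(score(True, False))   # selective: completes AND aligns, the target
```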
Where Pith is reading between the lines
- Training regimes focused on relevance classification inside terminal environments could close the observed gap.
- Real-world agent deployments in codebases with mixed documentation may encounter similar cue-distractor conflicts.
- Future benchmarks could extend TAB by varying cue placement and distractor plausibility to test robustness further.
Load-bearing premise
The 89 constructed tasks with their specific cues and distractors accurately represent the challenges of real-world underspecified terminal scenarios without introducing evaluation artifacts or biases in relevance judgments.
What would settle it
An agent or defense that maintains high task-completion rates on TAB while also achieving high task-alignment scores by using cues and ignoring distractors.
Original abstract
Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor. Solving these tasks requires selectively using the cue while ignoring the distractor. Applying TAB to ten frontier agents reveals a systematic gap between task capability and task alignment. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. Evaluating six prompt-injection defenses further shows that suppressing distractor execution also suppresses the cues required for task completion. These results demonstrate that task-aligned agents require selective use of environmental instructions rather than blanket acceptance or rejection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TAB, a benchmark of 89 underspecified terminal tasks derived from Terminal-Bench 2.1. Each task embeds a necessary cue in a natural environmental artifact alongside a plausible irrelevant distractor; success requires selectively following the cue while ignoring the distractor. Evaluation of ten frontier agents shows a systematic gap between high task-completion capability (on standard benchmarks) and low task alignment on TAB. The paper further evaluates six prompt-injection defenses and finds that suppressing distractor execution also suppresses required cues, concluding that task-aligned agents need selective rather than blanket processing of environmental instructions.
Significance. If the benchmark construction is sound, the results identify a practically important limitation in frontier agents: the inability to distinguish relevant from irrelevant environmental instructions without also harming performance. The interaction between alignment and existing safety defenses is a concrete, actionable finding that could inform both agent design and evaluation practices in long-horizon autonomous systems.
Major comments (3)
- [§3] §3 (Task Construction): The manuscript provides no details on the process used to generate the 89 tasks, select artifacts, embed cues and distractors, or validate that the 'necessary cue' is actually required and the distractor is truly irrelevant. Without inter-rater agreement statistics, ablation of cue necessity, or explicit criteria for underspecification, it is impossible to assess whether the observed capability-alignment gap is an intrinsic agent property or an artifact of how the benchmark was constructed.
- [§4] §4 (Agent Evaluation): The definitions of the quantitative metrics for 'task completion' and 'task alignment,' the exact prompting and environment setup for the ten agents, and any statistical tests or confidence intervals supporting the 'systematic gap' claim are not described. The abstract states that the strongest Terminal-Bench agent shows high completion but low alignment, yet without these details the magnitude and robustness of the gap cannot be evaluated.
- [§5] §5 (Defense Evaluation): The claim that suppressing distractors also suppresses necessary cues is central to the argument against blanket defenses. The paper must report per-defense success rates on both cues and distractors, including any control conditions where cues are presented without distractors, to substantiate that the suppression is not simply an overall performance drop.
Minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the total number of tasks (89) and the number of agents evaluated (10) earlier for clarity.
- [§4] Notation for task alignment score versus raw completion rate should be defined consistently in the text and figures.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback, which highlights important areas for improving the clarity and rigor of our manuscript. We address each major comment below and will make corresponding revisions to strengthen the paper.
Point-by-point responses
Referee: [§3] §3 (Task Construction): The manuscript provides no details on the process used to generate the 89 tasks, select artifacts, embed cues and distractors, or validate that the 'necessary cue' is actually required and the distractor is truly irrelevant. Without inter-rater agreement statistics, ablation of cue necessity, or explicit criteria for underspecification, it is impossible to assess whether the observed capability-alignment gap is an intrinsic agent property or an artifact of how the benchmark was constructed.
Authors: We agree that additional details on task construction are necessary for readers to fully evaluate the benchmark. In the revised manuscript, we will expand §3 with: explicit criteria for selecting and underspecifying tasks from Terminal-Bench 2.1; the process for identifying natural environmental artifacts and embedding necessary cues versus plausible distractors; author-driven validation steps confirming cue necessity (via manual inspection that the task cannot be solved without the cue) and distractor irrelevance; and concrete examples of cue/distractor pairs. We will also include an ablation on a subset of tasks demonstrating that removing the cue drops completion rates to near zero while the distractor alone does not enable solutions. Although the construction was performed by the authors with internal consistency checks rather than external raters, these additions will address concerns about potential artifacts. revision: yes
Referee: [§4] §4 (Agent Evaluation): The definitions of the quantitative metrics for 'task completion' and 'task alignment,' the exact prompting and environment setup for the ten agents, and any statistical tests or confidence intervals supporting the 'systematic gap' claim are not described. The abstract states that the strongest Terminal-Bench agent shows high completion but low alignment, yet without these details the magnitude and robustness of the gap cannot be evaluated.
Authors: We acknowledge that precise metric definitions, experimental protocols, and statistical support were insufficiently detailed. In the revision, we will: formally define task completion as the fraction of tasks where the agent reaches the expected terminal state or produces the correct output; define task alignment as the fraction of tasks where the agent selectively executes the necessary cue while ignoring the distractor (with explicit formulas and scoring rubric); describe the exact prompting template, environment initialization, and artifact presentation for all ten agents; and add confidence intervals (via bootstrapping over tasks) plus paired statistical tests (e.g., McNemar's test) quantifying the significance of the capability-alignment gap. These changes will allow readers to assess the robustness of the reported results. revision: yes
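A hedged sketch of how such metrics and bootstrap intervals could be computed, assuming per-task boolean records; the field names (`completed`, `used_cue`, `ran_distractor`) are placeholders, not the paper's schema:

```python
import random

def completion_rate(results):
    """Fraction of tasks where the agent reached the expected terminal state."""
    return sum(r["completed"] for r in results) / len(results)

def alignment_rate(results):
    """Fraction of tasks where the cue was used AND the distractor ignored."""
    return sum(r["used_cue"] and not r["ran_distractor"] for r in results) / len(results)

def bootstrap_ci(results, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap over tasks, as the rebuttal proposes."""
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(results) for _ in results]) for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot)]

# Toy per-task records for one agent (three tasks):
results = [
    {"completed": True,  "used_cue": True,  "ran_distractor": True},
    {"completed": True,  "used_cue": True,  "ran_distractor": False},
    {"completed": False, "used_cue": False, "ran_distractor": False},
]
print(completion_rate(results), alignment_rate(results))  # ~0.67 vs ~0.33 gap
print(bootstrap_ci(results, completion_rate))
# A paired test such as McNemar's would then ask whether per-task
# completion and alignment outcomes differ significantly.
```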
Referee: [§5] §5 (Defense Evaluation): The claim that suppressing distractors also suppresses necessary cues is central to the argument against blanket defenses. The paper must report per-defense success rates on both cues and distractors, including any control conditions where cues are presented without distractors, to substantiate that the suppression is not simply an overall performance drop.
Authors: We agree that granular per-defense reporting and controls are required to substantiate the central claim. In the revised §5, we will add tables reporting, for each of the six defenses: (i) success rate on tasks with both cue and distractor (showing cue following), (ii) rate of distractor execution (showing suppression), and (iii) control success rates on cue-only versions of the same tasks (no distractor present). This will isolate whether observed drops reflect over-suppression rather than general performance degradation. We will also briefly discuss how these results motivate selective rather than blanket instruction processing. revision: yes
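A minimal sketch of how the proposed three-condition table might be aggregated, assuming per-defense task records; all field names are hypothetical:

```python
def defense_report(records):
    """Aggregate the three per-defense rates the rebuttal proposes.

    `records` maps defense name -> list of per-task dicts with boolean
    fields (field names are assumptions, not the paper's schema):
      completed_both     -- solved with cue AND distractor present
      ran_distractor     -- distractor executed in that same condition
      completed_cue_only -- solved in the cue-only control
    """
    rate = lambda rows, key: sum(r[key] for r in rows) / len(rows)
    return {
        name: {
            "success (cue+distractor)": rate(rows, "completed_both"),
            "distractor execution": rate(rows, "ran_distractor"),
            "success (cue-only control)": rate(rows, "completed_cue_only"),
        }
        for name, rows in records.items()
    }

# If cue-only control success stays high while cue+distractor success drops,
# the loss is over-suppression of cues rather than a general capability drop,
# which is exactly the distinction the referee asks the authors to isolate.
```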
Circularity Check
Empirical benchmark study with no derivations or self-referential definitions
Full rationale
This paper introduces the TAB benchmark by constructing 89 tasks from Terminal-Bench 2.1, each embedding a necessary cue and irrelevant distractor, then evaluates ten frontier agents on task completion versus alignment. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the claimed results. The central finding of a systematic capability-alignment gap rests on direct empirical measurements rather than any reduction to inputs by construction. The study is grounded in external benchmark measurements and contains no load-bearing steps matching the enumerated circularity patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
  Paper passage: "We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks... Solving these tasks requires selectively using the cue while ignoring the distractor."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
  Paper passage: "Task alignment T(a) = U(a) · R(a)"
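Reading the quoted formula literally (the factor definitions below are an assumption; the excerpt does not define U and R):

```latex
% Assumed reading of the quoted per-task score (U and R are not defined
% in the excerpt; the indicator interpretation below is a guess):
%   U(a) = 1 if agent a uses the necessary cue, else 0
%   R(a) = 1 if agent a refrains from executing the distractor, else 0
T(a) = U(a) \cdot R(a)
% T(a) = 1 only for the selective behavior: cue used and distractor ignored.
```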
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Anthropic. Claude Code. Anthropic, 2026. URL https://www.anthropic.com/product/claude-code
work page 2026
-
[2]
OpenAI. Codex. OpenAI, 2026. URL https://openai.com/codex
work page 2026
-
[3]
Measuring AI agent autonomy in practice
Miles McCain, Thomas Millar, Saffron Huang, Jake Eaton, Kunal Handa, Michael Stern, Alex Tamkin, Matt Kearney, Esin Durmus, Judy Shen, Jerry Hong, Brian Calvert, Jun Shern Chan, Francesco Mosconi, David Saunders, Tyler Neylon, Sarah Pollack, Jack Clark, and Deep Ganguli. Measuring AI agent autonomy in practice. Anthropic Research, 2026. URL https://www.an...
work page 2026
-
[4]
Mike A Merrill, Alexander Glenn Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap...
work page 2026
-
[5]
Measuring AI ability to complete long software tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Roa Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M Ziegler, Elizabeth Barnes, and Lawre...
work page 2025
-
[6]
Ambig-SWE: Interactive agents to overcome underspecificity in software engineering
Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, and Graham Neubig. Ambig-SWE: Interactive agents to overcome underspecificity in software engineering. In International Conference on Learning Representations (ICLR), 2026
work page 2026
-
[7]
Marcus Williams, Hao Sun, Swetha Sekhar, Micah Carroll, David G. Robinson, and Ian Kivlichan. How we monitor internal coding agents for misalignment. OpenAI, March 2026. URL https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/
work page 2026
-
[8]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[9]
OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, Joel Tao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[10]
AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[11]
InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics (ACL), 2024
work page 2024
-
[12]
Dataset and lessons learned from the 2024 SaTML LLM capture-the-flag competition
Edoardo Debenedetti, Javier Rando, Daniel Paleka, Fineas Silaghi, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, and Lea Schönherr. Dataset and lessons learned fr...
work page 2024
-
[13]
OS-Harm: A benchmark for measuring safety of computer use agents
Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. OS-Harm: A benchmark for measuring safety of computer use agents. In Advances in Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[14]
David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, and Maksym Andriushchenko. SkillInject: Measuring agent vulnerability to skill file attacks. arXiv preprint arXiv:2602.20156, 2026
-
[15]
Google DeepMind. Gemini 3.1 Pro Model Card. Model card, Google DeepMind. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf
-
[17]
Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026
work page 2026
-
[18]
MiniMax. MiniMax-M2.7. Hugging Face Model Card, 2026. URL https://huggingface.co/MiniMaxAI/MiniMax-M2.7
work page 2026
-
[19]
Qwen Team. Qwen3.6 model series. Hugging Face Model Card, 2026. URL https://huggingface.co/Qwen/Qwen3.6-35B-A3B
work page 2026
-
[20]
Defending Against Indirect Prompt Injection Attacks With Spotlighting
Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720, 2024
work page · Pith review · arXiv 2024
-
[21]
Soft instruction de-escalation defense
Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, and Ilia Shumailov. Soft instruction de-escalation defense. arXiv preprint arXiv:2510.21057, 2026
-
[22]
Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. PromptArmor: Simple yet effective prompt injection defenses. arXiv preprint arXiv:2507.15219, 2025
-
[23]
Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. Task Shield: Enforcing task alignment to defend against indirect prompt injection in LLM agents. arXiv preprint arXiv:2412.16682, 2024
-
[24]
Indirect prompt injections: Are firewalls all you need, or stronger benchmarks?
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, and Alexandre Lacoste. Indirect prompt injections: Are firewalls all you need, or stronger benchmarks? arXiv preprint arXiv:2510.05244, 2026
-
[25]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....
work page 2022
-
[26]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[27]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...
work page · Pith review · arXiv 2024
-
[28]
τ-bench: A benchmark for tool-agent-user interaction in real-world domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. In International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[29]
Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. CyberGym: Evaluating AI agents' real-world cybersecurity capabilities at scale. arXiv preprint arXiv:2506.02548, 2025
-
[30]
BIPIA: Benchmarking and defending against indirect prompt injection attacks on large language models
Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. BIPIA: Benchmarking and defending against indirect prompt injection attacks on large language models. In International Conference on Knowledge Discovery and Data Mining (KDD), 2025
work page 2025
-
[31]
Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents
Hanrong Zhang, Jingyuan Huang, Kai Mei, et al. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. In International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[32]
ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents
Segev Shlomov, Ido Ben David, Avi Berger, Dafna Hartman, Guy Katz, Yuval Pinter, and Shauli Ravfogel. ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents. In International Conference on Machine Learning (ICML), 2025
work page 2025
-
[33]
Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H. Lampert. Can LLMs separate instructions from data? And what do we even mean by that? In International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[34]
Meta SecAlign: A secure foundation LLM against prompt injection attacks
Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta SecAlign: A secure foundation LLM against prompt injection attacks. arXiv preprint arXiv:2507.02735, 2025
-
[35]
Defending against prompt injection with a few defensive tokens
Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, and David Wagner. Defending against prompt injection with a few defensive tokens. In ICML Workshop on Reliable and Responsible Foundation Models, 2025
work page 2025
-
[36]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024
work page · Pith review · arXiv 2024
-
[37]
Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, and Christoph H. Lampert. ASIDE: Architectural separation of instructions and data in language models. In International Conference on Learning Representations (ICLR), 2026
work page 2026
-
[38]
Get My Drift? Catching LLM Task Drift with Activation Deltas
Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew Paverd. Get My Drift? Catching LLM Task Drift with Activation Deltas. In IEEE Conference on Secure and Trustworthy Machine Learning, 2025
work page 2025
-
[39]
Defeating prompt injections by design
Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. In IEEE Conference on Secure and Trustworthy Machine Learning, 2026
work page 2026
-
[40]
Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, and Yiren Zhao. CaMeLs can use computers too: System-level security for computer use agents. arXiv preprint arXiv:2601.09923, 2026
-
[41]
Harbor: A framework for evaluating and optimizing agents and models in container environments, 2026
Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, 2026. URL https://github.com/laude-institute/harbor
work page 2026