Willful Disobedience: Automatically Detecting Failures in Agentic Traces
Pith reviewed 2026-05-15 01:17 UTC · model grok-4.3
The pith
AgentPex extracts behavioral rules from agent prompts to automatically flag procedural violations in traces that outcome-only scores overlook.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentPex extracts behavioral rules from agent prompts and system instructions, then uses these specifications to automatically evaluate traces for compliance, distinguishing agent behavior across models and surfacing specification violations not captured by outcome-only scoring.
What carries the argument
Rule extraction from prompts and instructions that converts them into executable compliance checks for full traces.
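In rough terms, the pipeline could look like the sketch below: one LLM call turns a system prompt into a list of rule strings, and a second pass judges a full trace against each rule. Every name here (extract_rules, check_trace, the llm callable, the prompt wording) is illustrative; AgentPex's actual prompts are given in the paper's appendix and repository, not reproduced here.

```python
# Illustrative sketch of a rule-extraction -> compliance-check loop.
# All names and prompts are hypothetical, not AgentPex's implementation.
import json

def extract_rules(llm, system_prompt: str) -> list[str]:
    """Ask an LLM to turn a system prompt into explicit behavioral rules."""
    response = llm(
        "Extract every explicit behavioral constraint from the system "
        "prompt below. Return only a JSON array of rule strings.\n\n"
        + system_prompt
    )
    return json.loads(response)

def check_trace(llm, rules: list[str], trace: str) -> dict:
    """Judge a full agentic trace against each extracted rule."""
    results = {}
    for rule in rules:
        verdict = llm(
            f"Rule: {rule}\n\nTrace:\n{trace}\n\n"
            "Does the agent comply with this rule? Answer COMPLY or VIOLATE."
        )
        results[rule] = "COMPLY" in verdict.upper()
    return results
```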
If this is right
- AgentPex identifies incorrect workflow routing and unsafe tool usage that final success scores ignore.
- It separates models run on identical tasks by compliance rate, not just by final task success.
- It supplies per-domain and per-metric breakdowns that point to specific agent weaknesses.
- Developers gain scalable inspection of multi-step traces without manual review of every turn.
- The method highlights prompt-specified rules that current benchmarks do not test.
Where Pith is reading between the lines
- The extraction step could be applied to live agent logs for ongoing compliance monitoring in production.
- Explicit rule statements in prompts may become a standard practice to improve automatic verifiability.
- Teams might select or fine-tune models based on measured compliance to the extracted rule set rather than task accuracy alone.
- Integration into agent development pipelines could catch violations during testing before deployment.
Load-bearing premise
Rules pulled from prompts and instructions accurately and completely represent the intended behavioral specifications without meaningful loss or misreading during extraction.
What would settle it
Human reviewers examine the same set of traces and find that AgentPex either misses major violations or flags behaviors that humans judge as compliant with the original prompt intent.
Original abstract
AI agents are increasingly embedded in real software systems, where they execute multi-step workflows through multi-turn dialogue, tool invocations, and intermediate decisions. These long execution histories, called agentic traces, make validation difficult. Outcome-only benchmarks can miss critical procedural failures, such as incorrect workflow routing, unsafe tool usage, or violations of prompt-specified rules. This paper presents AgentPex, an AI-powered tool designed to systematically evaluate agentic traces. AgentPex extracts behavioral rules from agent prompts and system instructions, then uses these specifications to automatically evaluate traces for compliance. We evaluate AgentPex on 424 traces from ${\tau}^2$-bench across models in telecom, retail, and airline customer service. Our results show that AgentPex distinguishes agent behavior across models and surfaces specification violations that are not captured by outcome-only scoring. It also provides fine-grained analysis by domain and metric, enabling developers to understand agent strengths and weaknesses at scale. The source code of AgentPex is available at https://github.com/microsoft/agentpex.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgentPex, an AI-powered tool that extracts behavioral rules from agent prompts and system instructions, then automatically evaluates compliance in multi-turn agentic traces. On 424 traces from the τ²-bench benchmark spanning telecom, retail, and airline customer-service domains, the authors claim that AgentPex distinguishes behavioral differences across models and identifies specification violations missed by outcome-only scoring, while also enabling fine-grained per-domain and per-metric analysis.
Significance. If the rule-extraction step proves reliable, the work would offer a practical advance for agent evaluation by moving beyond binary outcome metrics to procedural and specification-based checks. This could help developers diagnose specific failure modes (e.g., unsafe tool use or workflow routing errors) at scale, especially as agentic systems are deployed in production software.
major comments (2)
- [Methods / Evaluation] The central claim rests on the fidelity of automatically extracted behavioral rules, yet the manuscript provides no human-annotated ground truth, coverage metrics, or error rates for the extraction process (e.g., handling of implicit constraints or ambiguous phrasing). Because downstream compliance scoring and the reported model distinctions on the 424 traces depend entirely on these rules, the absence of validation directly weakens the evidence for the strongest claim.
- [Results] No statistical significance tests, confidence intervals, or inter-model comparison details are reported for the claimed distinctions across models. The abstract states that AgentPex “distinguishes agent behavior,” but without these analyses it is unclear whether observed differences exceed what would be expected from extraction noise or sampling variation.
minor comments (2)
- [Implementation] The GitHub link is provided, but the manuscript does not describe the exact prompting strategy or LLM used for rule extraction, which would aid reproducibility.
- [Figures/Tables] Figure captions and table headers could more explicitly link each metric to the extracted rules that produced it.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We will revise the manuscript to include validation for the rule extraction process and statistical analyses for the model comparisons as detailed in our point-by-point responses below.
Point-by-point responses
Referee: [Methods / Evaluation] The central claim rests on the fidelity of automatically extracted behavioral rules, yet the manuscript provides no human-annotated ground truth, coverage metrics, or error rates for the extraction process (e.g., handling of implicit constraints or ambiguous phrasing). Because downstream compliance scoring and the reported model distinctions on the 424 traces depend entirely on these rules, the absence of validation directly weakens the evidence for the strongest claim.
Authors: We acknowledge this limitation in the current version. To address it, we will conduct a human evaluation of the extracted rules. We plan to sample 50 prompts from each domain, extract rules using AgentPex, and have two independent annotators label whether the rules accurately capture the behavioral constraints in the prompt (including implicit ones). We will report agreement metrics (Cohen's kappa) and accuracy. This new analysis will be added to the Methods section and will support the reliability of the downstream results. revision: yes
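A minimal sketch of the proposed agreement computation, with invented annotator labels; in practice a library routine such as scikit-learn's cohen_kappa_score would do the same job:

```python
# Cohen's kappa for two annotators labeling extracted rules as
# faithful (1) or unfaithful (0). The labels below are invented.
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the two annotators labeled independently.
    ca, cb = Counter(a), Counter(b)
    p_expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

annotator_1 = [1, 1, 0, 1, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 1, 0, 0, 1, 1]
print(cohens_kappa(annotator_1, annotator_2))  # ~0.71, substantial agreement
```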
Referee: [Results] No statistical significance tests, confidence intervals, or inter-model comparison details are reported for the claimed distinctions across models. The abstract states that AgentPex “distinguishes agent behavior,” but without these analyses it is unclear whether observed differences exceed what would be expected from extraction noise or sampling variation.
Authors: We agree that statistical rigor is necessary. In the revision, we will add bootstrap confidence intervals for the violation rates and perform pairwise statistical tests (e.g., Fisher's exact test) between models to determine if the observed differences are significant. These will be reported in the Results section with p-values and effect sizes. We believe this will clarify that the distinctions are not due to noise. revision: yes
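A sketch of the planned analyses under invented counts (the per-model violation numbers below are placeholders, not results from the paper):

```python
# Pairwise significance test and bootstrap CI for violation rates,
# as proposed in the rebuttal. Counts are invented for illustration.
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical: Model A violates in 34/212 traces, Model B in 61/212.
table = [[34, 212 - 34], [61, 212 - 61]]
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact: OR={odds_ratio:.2f}, p={p_value:.4f}")

# 95% bootstrap confidence interval for Model A's violation rate.
rng = np.random.default_rng(0)
outcomes = np.array([1] * 34 + [0] * (212 - 34))
rates = [rng.choice(outcomes, size=len(outcomes), replace=True).mean()
         for _ in range(10_000)]
lo, hi = np.percentile(rates, [2.5, 97.5])
print(f"Violation rate: {outcomes.mean():.3f} (95% CI {lo:.3f}-{hi:.3f})")
```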
Circularity Check
No circularity in derivation chain
full rationale
The paper describes AgentPex as a tool that extracts behavioral rules from agent prompts and system instructions, then applies them to evaluate traces against an external benchmark (τ²-bench, 424 traces). No equations, fitted parameters, self-citations, or uniqueness theorems reduce the central claims (distinguishing model behaviors and surfacing violations) to inputs by construction. The evaluation reports empirical distinctions across models and domains that are not statistically forced by the extraction method itself. The derivation chain is grounded in the external benchmark and does not rely on renaming known results or smuggling in ansatzes via citation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Prompts and system instructions contain extractable behavioral rules that fully capture intended agent constraints.
Forward citations
Cited by 2 Pith papers
- ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
  ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
- Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents
  A new algorithm learns correct agent behavior models from few traces by combining dominator analysis, LLMs, and automata to validate sequential executions with high accuracy.
Reference graph
Works this paper leans on
[1] LangSmith: AI Agent & LLM Observability Platform. langchain.com. https://www.langchain.com/langsmith/observability. [Accessed 24-03-2026].
[2] Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine. Agentic AI: a comprehensive survey of architectures, applications, and future directions. Artificial Intelligence Review, 59(1), November 2025.
[3] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K..., 2022.
[4] Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. AgentRx: Diagnosing AI agent failures from execution trajectories, 2026.
[5] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment, 2025.
[6] Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? A study on judgement biases, 2024.
[7] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023.
[8] Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. TRAIL: Trace reasoning and agentic issue localization, 2025.
[9] GitHub. GitHub Copilot: Your AI pair programmer. https://github.com/features/copilot, 2026. Product page for GitHub Copilot.
[10] Tommy Guy, Peli de Halleux, Reshabh K Sharma, and Ben Zorn. Prompts are programs. https://blog.sigplan.org/2024/10/22/prompts-are-programs/, 2024. SIGPLAN Perspectives Blog. [Accessed 28-02-2026].
[11] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework, 2024.
[12] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025.
[13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? 2024.
[14] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines, 2023.
[15] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2025.
[16] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.
[17] B. Meyer. Applying "design by contract". Computer, 25(10):40–51, 1992.
[18] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants, 2023.
[19] Microsoft. Microsoft 365 Copilot: Your everyday AI companion. https://www.microsoft.com/en-us/microsoft-365/copilot, 2026. Product page for Microsoft 365 Copilot.
[20] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation, 2023.
[21] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023.
[22] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025.
[23] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs, 2023.
[24] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.
[25] Microsoft Research. The future of AI agents: Market projections and industry trends. https://www.microsoft.com/en-us/microsoft-365/blog/2025/11/18/microsoft-ignite-2025-copilot-and-agents-built-to-power-the-frontier-firm/, 2025.
[26] Industry projection for AI agent adoption.
[27] Marco Tulio Ribeiro and Scott Lundberg. Adaptive testing and debugging of NLP models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3253–3267, 2022.
[28] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList, 2020.
[29] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
[30] Reshabh K Sharma, Vinayak Gupta, and Dan Grossman. Defending language models against image-based prompt attacks via user-provided specifications. In 2024 IEEE Security and Privacy Workshops (SPW), pages 112–131. IEEE, 2024.
[31] Reshabh K Sharma, Vinayak Gupta, and Dan Grossman. SPML: A DSL for defending language models against prompt attacks, 2024.
[32] Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, Dan Grossman, and Benjamin Zorn. PromptPex: Automatic test generation for language model prompts, 2026.
[33] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
[34] Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Koushik Sen, Hao Zhang, Anastasios Angelopoulos, Shishir G. Patil, Lingjiao Chen, Wei-Lin Chiang, and Jared Q. Davis. Specifications: The missing link to making the development of LLM systems an engineering discipline, 2024.
[35] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023.
[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[37] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023.
[38] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. NeurIPS 2024, 2024.
[39] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
[40] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
[41] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig, 2024.
Appendix A: Prompts Used in AgentPex (excerpt)
This appendix provides the verbatim prompts used...
Output specification rules (rule-extraction prompt):
- Read the system prompt carefully, sentence by sentence.
- Look for ALL explicit directives about how the agent should respond or behave.
- Include behavioral constraints that directly affect the output.
- Convert each constraint into a clear output specification rule.
- For confirmation/consent requirements, phrase the rule to accept FUNCTIONALLY EQUIVALENT expressions of clear intent.
If no explicit output constraints are found, return an empty array []. Generate a comprehensive list of output specification rules based on ALL explicitly mentioned constraints. Return only a JSON array of rule strings and nothing else. <SY...
Transition constraints (rule-extraction prompt):
- Scan for EXPLICIT transition constraints that satisfy Transition-only.
- Extract each as a simple sentence using the exact wording from the text.
- Exclude any single-output or question/policy-only statements.
- If none exist, return an empty array.
Output format: JSON array of strings (each a transition constraint). Only return JSON.
Transition adherence (evaluation prompt): You are evaluating how well an AI assistant adhered to transition specifications during a conversation using a 3-TIER PENALTY SYSTEM. MANDATORY EVALUATION PROCESS: STEP 1: Create a checklist for EACH transition specification: - Extra...
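The excerpt names a 3-tier penalty system, but the tier weights are not recoverable from this fragment. A hypothetical aggregation, with assumed tier names and penalties, might look like:

```python
# Hypothetical 3-tier penalty aggregation: an LLM judge (not shown)
# assigns each transition specification a tier, and tiers map to
# penalties subtracted from a 0-100 adherence score. The tier names
# and weights are assumptions, not AgentPex's published values.
PENALTIES = {"met": 0, "partially_met": 10, "violated": 25}

def adherence_score(tier_per_spec: dict[str, str]) -> float:
    score = 100.0 - sum(PENALTIES[tier] for tier in tier_per_spec.values())
    return max(score, 0.0)

print(adherence_score({
    "confirm before account changes": "met",
    "transfer to human after two failed verifications": "partially_met",
}))  # 90.0
```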
Forbidden tool edges (rule-extraction prompt):
- Scan the system prompt for EXPLICIT statements about forbidden tool combinations or sequences.
- Look for phrases like: "Never call X after Y", "Tool A and Tool B are mutually exclusive", "Do not use X and Y together", "Tool X cannot be followed by Tool Y".
- Examine tool descriptions for explicit incompatibility statements.
- ONLY generate forbidden edges when explicit restrictions are found.
- If NO explicit restrictions exist, return an empty list.
Generate a JSON array of forbidden edges in this format: [{"from": "tool1", "to": "tool2", "reason": "explicit reason from system prompt"}]. Important: only include edges that are EXPLICITLY forbidden; include the specific reason/constraint that forbids each edge; if no explicit forbiddens are f...
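The forbidden-edge format above is concrete enough to check mechanically. A sketch follows, with an invented trace and edge list; note that it only inspects adjacent call pairs, while phrases like "Never call X after Y" may be read more broadly:

```python
# Checking a trace's tool-call sequence against forbidden edges in
# the JSON format shown above. Trace and edges are invented examples.
forbidden_edges = [
    {"from": "cancel_order", "to": "issue_refund",
     "reason": "refunds are handled automatically after cancellation"},
]

def find_forbidden_transitions(tool_calls: list[str],
                               edges: list[dict]) -> list[dict]:
    hits = []
    for earlier, later in zip(tool_calls, tool_calls[1:]):
        for edge in edges:
            if edge["from"] == earlier and edge["to"] == later:
                hits.append({"at": (earlier, later),
                             "reason": edge["reason"]})
    return hits

trace_calls = ["get_order", "cancel_order", "issue_refund"]
print(find_forbidden_transitions(trace_calls, forbidden_edges))
```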
Tool-call parameter checking (evaluation prompt):
- You MUST check EVERY parameter in EVERY tool call.
- You MUST use the ACTUAL arguments provided.
- You MUST list ALL violations.
- Type mismatches are VIOLATIONS.
- Missing required parameters are CRITICAL VIOLATIONS.
- Constraint violations are VIOLATIONS.
TOOL SCHEMAS: {tool_schemas_json} TOOL CALLS WITH SCHEMA MATCHING: {matched_calls_json} Respond with a JSON object containing reasoning and score (0-100).
A.5 Predicted Plan: You are a tool-centric plan generation expert. Given a user intent, and available tools, you need to generate a step-by-step plan as call to the a...
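A minimal illustration of the parameter checks listed above, loosely following JSON Schema conventions; the schema and tool call are invented, and this deterministic check stands in for what the prompt asks an LLM to do:

```python
# Flags missing required parameters and type mismatches in a tool
# call, mirroring the violation categories in the prompt excerpt.
TYPES = {"string": str, "integer": int, "boolean": bool,
         "number": (int, float)}

def check_call(call: dict, schema: dict) -> list[str]:
    violations = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in call["arguments"]:
            violations.append(f"CRITICAL: missing required parameter '{name}'")
    for name, value in call["arguments"].items():
        expected = props.get(name, {}).get("type")
        if expected and not isinstance(value, TYPES[expected]):
            violations.append(f"type mismatch: '{name}' should be {expected}")
    return violations

schema = {"required": ["user_id"],
          "properties": {"user_id": {"type": "string"},
                         "limit": {"type": "integer"}}}
print(check_call({"name": "get_orders", "arguments": {"limit": "5"}}, schema))
```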