arxiv: 2605.14678 · v3 · pith:OX5AIH6Bnew · submitted 2026-05-14 · 💻 cs.AI

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Pith reviewed 2026-05-20 21:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords: proactive assistance, personal assistant agents, long-horizon workflows, hidden user intents, multi-turn interactions, benchmark evaluation, task completion, cross-session continuity

The pith

Current personal assistant agents struggle to provide proactive help in long-horizon workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces π-Bench to test how well AI agents anticipate and address users' unstated needs during extended interactions. It creates 100 multi-turn tasks for five personas that include hidden intents emerging gradually along with dependencies across tasks and sessions. A sympathetic reader would care because real assistance often requires figuring out what users want before they state it rather than only following explicit instructions. Experiments indicate that proactivity remains difficult for agents, that success at completing stated tasks does not guarantee proactive behavior, and that access to prior interactions improves handling of later hidden intents.

Core claim

The central claim is that proactive assistance requires agents to identify hidden user intents and leverage inter-task dependencies over long horizons. By building tasks where needs are underspecified at the start and develop across turns and sessions, the benchmark separates the ability to fulfill explicit requests from the ability to act ahead on inferred preferences. Testing current agents on this setup shows that proactive assistance stays challenging, that task completion and proactivity are distinct, and that continuity from past sessions aids intent resolution in later tasks.

What carries the argument

π-Bench, a collection of 100 multi-turn tasks with hidden intents, inter-task dependencies, and cross-session continuity across five personas that jointly measures proactivity and task completion.
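
To make the separation between task completion and proactivity concrete, here is a minimal Python sketch of one task record scored on both axes. It is an illustration under stated assumptions, not the paper's implementation: the class names, the status labels (`satisfied`, `elicited`, `missed`), the checklist items, and both scoring formulas are hypothetical. The visible request and the five hidden intents are taken from the meal-plan case study quoted with Figure 9; the statuses assigned to them are invented for illustration and are not reported results.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenIntent:
    text: str
    # Hypothetical terminal status: "satisfied" (addressed unprompted),
    # "elicited" (agent asked a targeted question), or "missed".
    status: str = "missed"


@dataclass
class Task:
    persona: str
    visible_request: str
    hidden_intents: list[HiddenIntent] = field(default_factory=list)
    # Explicit requirement -> whether the agent met it (hypothetical items).
    checklist: dict[str, bool] = field(default_factory=dict)


def completion_score(task: Task) -> float:
    """Fraction of explicit checklist items the agent satisfied."""
    if not task.checklist:
        return 0.0
    return sum(task.checklist.values()) / len(task.checklist)


def proactivity_score(task: Task) -> float:
    """Fraction of hidden intents the agent satisfied or elicited itself."""
    if not task.hidden_intents:
        return 0.0
    hits = sum(h.status in {"satisfied", "elicited"} for h in task.hidden_intents)
    return hits / len(task.hidden_intents)


# Illustrative statuses only; they are assumptions, not results from the paper.
meal_plan = Task(
    persona="Researcher",
    visible_request="Design a one-week meal plan with per-meal prices within RMB 20-30.",
    hidden_intents=[
        HiddenIntent("body-profile conditioning", "satisfied"),
        HiddenIntent("macro accounting", "elicited"),
        HiddenIntent("table structure"),
        HiddenIntent("obtainable foods"),
        HiddenIntent("fallback meals"),
    ],
    checklist={"meal plan delivered": True, "prices within RMB 20-30": True},
)

print(completion_score(meal_plan))   # every explicit item is met
print(proactivity_score(meal_plan))  # only two of five hidden intents resolved
```

In this sketch an agent can meet every explicit item while resolving only a minority of the hidden intents, which is the kind of gap the benchmark is designed to expose.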

Load-bearing premise

The 100 tasks and five personas with their predefined hidden intents and inter-task dependencies accurately capture how real user needs emerge gradually in sustained multi-turn interactions.

What would settle it

A real-user study in which participants engage with agents over multiple sessions and the agents' proactive actions are checked against actual unprompted needs would show whether the benchmark's observed distinctions hold outside the constructed scenarios.

Figures

Figures reproduced from arXiv: 2605.14678 by Bingsu He, Chicheng Qin, Haodi Lei, Haoran Zhang, Luxin Xu, Runquan Gui, Shunkai Zhang, Tong Zhu, Xiaoye Qu, Yafu Li, Yang Yang, Yu Cheng, Zhilin Wang, Zihao He.

Figure 1: Overview of π-BENCH.
    … use memory to detect underspecified requirements and resolve hidden intents as workflows evolve through interaction. π-BENCH addresses this gap with a broader evaluation setting that combines memory, workspace state, and interaction history to assess proactivity and task completeness in long-horizon personal assistant workflows. Proactive Evaluation. Proactive benchmarks mainly study m…
Figure 2: Overview of one benchmark session. The evaluated agent interacts with a simulated user
Figure 3: COMP and PROC for (a, b, c) three representative task workflow categories following the fine-grained taxonomy in Tab. 5 of App. A.3, and (d) the overall average across all tasks. The gray dashed line indicates COMP = PROC.
Figure 4: Relationship between average interaction
Figure 5: Ablation on the final task of each strong dependency group: six per user, aggregated across five roles. Ours uses the original trajectories, while w/o dependencies removes preceding sessions from the same group.
    Prior interactions support proactive intent resolution. We ablate each strong dependency group to test whether earlier sessions help agents resolve later hidden intents. For each group, we remove…
Figure 6: Task trigger and hidden intents.
Figure 7: Trajectory and scores; colored intent tags indicate terminal status, e.g.,
Figure 8: Task trigger and partial hidden intents.
Figure 9: DeepSeek trajectory and scores.
    Case Study – Researcher – Claude 4.6 Opus: Task and Hidden Intents ▶ TASK Design a one-week meal plan with per-meal prices controlled within RMB 20–30. The visible request only asks for a meal plan and prices, while the full task also requires body-profile conditioning, macro accounting, table structure, obtainable foods, and fallback meals. ▶ PRIOR CONTEXT In a prior muscle…
Figure 10: Task trigger and partial hidden intents.
Figure 11: Claude trajectory and scores.
    Case Study – Marketer – GPT-5.4: Task and Hidden Intents ▶ TASK Generate the final X apology letter for MeowConnect after a crisis-management approval phase. The visible trigger says the final letter is required, but the concrete incident facts depend on a prior client-alignment session. → INITIAL REQUEST [INCOMING SYSTEM WEBHOOK: CRISIS_MANAGEMENT_PLATFORM] client=MeowConnec…
Figure 12: Task trigger and hidden intents.
Figure 13: GPT-5.4 trajectory and scores.
    Case Study – Pharmacist – Kimi K2.5: Task and Hidden Intents ▶ TASK Send a sandbox Gmail follow-up for an LC-MS instrument booking request, then verify the send through the outbox or matching thread. The visible request gives the synthetic login credentials and final response format, while the hidden requirements specify the correct Gmail workflow and brief-derived message d…
Figure 14: Task trigger and partial hidden intents.
Figure 15: Kimi K2.5 trajectory and scores.
Figure 16: Task overview and checklist targets.
Figure 17: GPT-5.4 trajectory and scores.
Figure 18: Kimi K2.5 trajectory and scores.
    Case Study – Researcher – Cross-Session Dependency: Task and Hidden Intents ▶ TASK Organize accepted ICLR papers from a local list for the user’s research theme. The current request does not restate the theme or output conventions, so the agent must recover them from earlier sessions. → INITIAL REQUEST I wrote a local file at paper_list.txt that contains some accepted ICLR…
Figure 19: Cross-session setup and hidden intents.
Figure 20: Claude 4.6 Opus trajectory and scores.
    Case Study – Researcher – Kimi K2.5: Trajectory and Scores → INPUT Turn 1. Same request to organize ICLR papers related to “my research theme.” ⇐ OUTPUT SUMMARY Turn 1. The agent reads the paper list but asks the user to specify the research theme instead of carrying it over from earlier sessions. → INPUT Turn 2. Primarily recommend think-with-image papers. provided:…
Figure 21: Kimi K2.5 trajectory and scores.
Original abstract

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce π-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, π-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show (1) proactive assistance remains challenging, (2) a clear distinction between task completion and proactivity, and (3) the value of prior interaction for proactive intent resolution in later tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces π-Bench, a benchmark for proactive personal assistant agents consisting of 100 multi-turn tasks across 5 domain-specific user personas. The benchmark incorporates author-defined hidden intents, explicit inter-task dependencies, and cross-session continuity to evaluate agents' ability to anticipate unstated user needs in long-horizon workflows. Reported experiments indicate that proactive assistance remains challenging, that proactivity is distinct from task completion, and that prior interaction improves proactive intent resolution in subsequent tasks.

Significance. If the task construction faithfully reproduces the gradual emergence of real-world user needs rather than introducing artificial signals, the benchmark would provide a valuable tool for measuring and improving proactivity in LLM-based agents, filling a gap left by existing evaluations that focus primarily on explicit or reactive requests. The joint measurement of proactivity and completion over sustained trajectories is a constructive contribution.

major comments (1)
  1. [Section 3] Section 3 (Benchmark Design and Task Construction): The central experimental claims rest on the 100 tasks and 5 personas with their predefined hidden intents and inter-task dependencies. The manuscript provides no validation (e.g., user studies, comparison to real interaction logs, or ablation on intent emergence) that these author-specified elements capture gradual, open-ended need surfacing in real-world sustained interactions rather than creating exploitable patterns via memory or pattern matching. This directly affects whether the reported distinction between task completion and proactivity, and the benefit of prior interaction, can be interpreted as general properties of proactive agents.
minor comments (1)
  1. [Abstract and Section 4] The abstract and experimental section would benefit from explicit mention of the number of agents evaluated, the statistical tests applied to support the three findings, and any inter-annotator agreement on intent labeling to strengthen reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments and recommendation for major revision. We address the primary concern about validation of the benchmark design below and outline planned changes to strengthen the manuscript.

point-by-point responses
  1. Referee: [Section 3] Section 3 (Benchmark Design and Task Construction): The central experimental claims rest on the 100 tasks and 5 personas with their predefined hidden intents and inter-task dependencies. The manuscript provides no validation (e.g., user studies, comparison to real interaction logs, or ablation on intent emergence) that these author-specified elements capture gradual, open-ended need surfacing in real-world sustained interactions rather than creating exploitable patterns via memory or pattern matching. This directly affects whether the reported distinction between task completion and proactivity, and the benefit of prior interaction, can be interpreted as general properties of proactive agents.

    Authors: We appreciate the referee highlighting this important aspect of ecological validity. The tasks and personas were constructed iteratively by the authors based on realistic long-horizon workflows drawn from the five domains, with hidden intents and dependencies engineered to emerge gradually through logical connections (e.g., a scheduling task revealing unstated preferences that affect a later travel task). This controlled specification enables precise, reproducible measurement of proactivity separate from completion, which is difficult with raw logs that lack explicit labels for hidden intents. The reported experiments demonstrate consistent distinctions across models and the benefit of prior interaction, suggesting the setup does not reduce to simple pattern matching. We agree that additional grounding would be valuable. In the revised manuscript we will expand Section 3 to include a detailed rationale for task construction, concrete examples of dependency design intended to avoid artificial cues, and a new limitations subsection that explicitly discusses the synthetic nature of the benchmark and calls for future user studies or log-based validation. We will also add a brief ablation on dependency strength if feasible. revision: partial
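
The dependency ablation already described in the Figure 5 caption (original trajectories versus a variant that removes preceding sessions from the same dependency group) can be sketched as a history-building step. This is a hedged illustration only: the function name, the session dictionaries, and the `group` field are hypothetical, and keeping sessions from other groups in the ablated condition is an assumption, since the caption only states that same-group predecessors are removed.

```python
def build_history(sessions, final_id, keep_dependencies=True):
    """Return ids of the earlier sessions shown to the agent before the final task."""
    final = next(s for s in sessions if s["id"] == final_id)
    history = []
    for s in sessions:
        if s["id"] == final_id:
            break  # sessions are assumed to be listed in chronological order
        if not keep_dependencies and s["group"] == final["group"]:
            continue  # ablation: drop preceding sessions from the same group
        history.append(s["id"])
    return history


# Hypothetical session log; "s4" is the final task of dependency group "A".
sessions = [
    {"id": "s1", "group": "A"},
    {"id": "s2", "group": "B"},
    {"id": "s3", "group": "A"},
    {"id": "s4", "group": "A"},
]

full = build_history(sessions, "s4")
ablated = build_history(sessions, "s4", keep_dependencies=False)
print(full)     # history in the original condition
print(ablated)  # history with same-group predecessors removed
```

Comparing an agent's proactivity on the final task under `full` versus `ablated` isolates how much the earlier same-group sessions contribute to resolving later hidden intents.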

Circularity Check

0 steps flagged

No circularity: benchmark construction is explicit and results are empirical

full rationale

The paper defines π-Bench explicitly as a collection of 100 multi-turn tasks across 5 personas that incorporate author-specified hidden intents, inter-task dependencies, and cross-session continuity. The reported experimental findings—that proactive assistance is challenging, distinct from task completion, and aided by prior interaction—are direct measurements of agent behavior on this fixed benchmark rather than any derivation, fitted parameter, or self-referential reduction. No equations, ansatzes, uniqueness theorems, or self-citation chains appear in the provided text, and the outcomes remain contingent on external agent performance instead of being forced by the benchmark definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that proactive behavior can be operationalized through predefined hidden intents and inter-task dependencies that emerge gradually across sessions.

axioms (1)
  • domain assumption: Proactive assistance requires agents to identify and act on hidden user intents before they are explicitly stated, especially in sustained multi-turn interactions.
    This definition is used to construct the benchmark tasks and to distinguish proactivity from mere task completion.

pith-pipeline@v0.9.0 · 5769 in / 1142 out tokens · 38585 ms · 2026-05-20T21:09:34.280051+00:00 · methodology
