
arxiv: 2605.26086 · v1 · pith:DFRKBQXOnew · submitted 2026-05-25 · 💻 cs.AI

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Pith reviewed 2026-06-29 21:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agents · personal assistants · benchmarks · LLM evaluation · always-on assistants · digital context · agent simulation · proactive assistance

The pith

A new benchmark shows that GPT-5.5 passes only 34.5 percent of tasks on the first attempt when given broad access to months of a user's simulated digital activity across services and devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Claw-Anything to test always-on personal assistants that can reach long user histories, linked backend services, and both screen and command interfaces on multiple devices at once. It builds test cases by injecting events over simulated months to create realistic states filled with noise such as irrelevant items and conflicting signals. GPT-5.5 reaches just 34.5 percent pass@1, well below its scores on earlier, narrower tests. The authors also release a pipeline that auto-generates two thousand training environments and raises base model performance by 23.7 percent. The work argues that current agent limits become visible when evaluation matches the broad, continuous access an always-on assistant would need.

Core claim

Claw-Anything expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. Multi-round event injection produces complex world states and realistic noise. GPT-5.5 reaches only 34.5 percent pass@1, and an automated pipeline that yields 2,000 training environments improves the base model by 23.7 percent.
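The headline metric can be made concrete with a small sketch. The text available here does not define pass@1, so the code below assumes the standard unbiased pass@k estimator commonly used in code-generation evaluation; the function names and the sample outcomes are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: attempts sampled for the task, c: attempts that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of attempts contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over tasks; each item is (attempts, passes)."""
    if not results:
        raise ValueError("no task results given")
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical outcomes for four tasks (not data from the paper).
results = [(1, 1), (1, 0), (4, 1), (4, 0)]
score = benchmark_pass_at_1(results)
```

With a single attempt per task, this reduces to the plain fraction of tasks solved on the first try, which is the most likely reading of the 34.5 percent figure.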

What carries the argument

The Claw-Anything benchmark, which measures agent performance over simulated months of user activity with noise and requires proactive recommendations from rich, interdependent context.
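A minimal sketch of what multi-round event injection with noise might look like, under stated assumptions. The `Event` type, the `inject_rounds` function, the rate parameters, and the service names are all hypothetical illustrations; the paper's actual pipeline is not specified in the text available here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    round_index: int
    service: str
    kind: str  # "relevant", "irrelevant", or "conflicting"
    payload: str


def inject_rounds(num_rounds, events_per_round, noise_rate, conflict_rate,
                  services, seed=0):
    """Build a world state by appending events round by round.

    noise_rate and conflict_rate are per-event probabilities of injecting
    an irrelevant or a conflicting event (assumed parameters).
    """
    if noise_rate < 0 or conflict_rate < 0 or noise_rate + conflict_rate > 1:
        raise ValueError("rates must be non-negative and sum to at most 1")
    rng = random.Random(seed)  # seeded so environments are reproducible
    world = []
    for r in range(num_rounds):
        for i in range(events_per_round):
            draw = rng.random()
            if draw < conflict_rate:
                kind = "conflicting"
            elif draw < conflict_rate + noise_rate:
                kind = "irrelevant"
            else:
                kind = "relevant"
            service = rng.choice(services)
            world.append(Event(r, service, kind, f"{kind} event {i} in {service}"))
    return world


# Hypothetical configuration: 12 rounds standing in for simulated months.
world = inject_rounds(num_rounds=12, events_per_round=5, noise_rate=0.3,
                      conflict_rate=0.1, services=["email", "calendar", "crm"],
                      seed=7)
```

Seeding the generator keeps each environment reproducible, which matters if the same world state must be replayed for several agents.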

If this is right

  • Agents must maintain robustness to irrelevant events and conflicting signals while reasoning over long histories.
  • Proactive assistance can now be measured because full histories allow anticipation of needs.
  • Narrower prior benchmarks do not predict success once context expands to multiple services and devices.
  • Automated generation of thousands of training environments can measurably raise model scores on the new tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may need explicit long-term memory mechanisms to handle the scale of activity histories shown here.
  • Cross-service consistency checks could become a standard requirement for any deployed personal assistant.
  • The data pipeline might extend to other domains where agents need months-scale context, such as project management or health tracking.
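The cross-service consistency idea in the list above can be illustrated with a short sketch. The record layout (`id`, `customer_ids`) and the function name are assumptions made for this example only; they do not describe the benchmark's actual schemas.

```python
def find_dangling_references(emails, crm_customers):
    """Return (email_id, customer_id) pairs where an email mentions a
    customer that has no matching record in the CRM."""
    known_ids = {customer["id"] for customer in crm_customers}
    dangling = []
    for email in emails:
        for customer_id in email.get("customer_ids", []):
            if customer_id not in known_ids:
                dangling.append((email["id"], customer_id))
    return dangling


# Hypothetical records for two services.
crm_customers = [{"id": "c1", "name": "Acme"}, {"id": "c2", "name": "Globex"}]
emails = [
    {"id": "e1", "customer_ids": ["c1"]},
    {"id": "e2", "customer_ids": ["c2", "c9"]},  # c9 is missing from the CRM
    {"id": "e3"},                                # mentions no customer
]
problems = find_dangling_references(emails, crm_customers)
```

An empty result means every reference resolves; any returned pair marks an inconsistency that an assistant, or an environment generator, would need to handle.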

Load-bearing premise

The simulated months of injected user events create world states and noise levels that match those an always-on assistant would meet in real digital lives.

What would settle it

Measure the same agents on months of actual user digital traces and check whether the 34.5 percent pass rate and the noise sensitivity remain similar to the simulated results.
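One rough way to judge whether two pass rates "remain similar" is to compare confidence intervals. The sketch below uses the Wilson score interval; the task counts are hypothetical placeholders, since neither the benchmark's task count nor any real-trace result is given in the text available here. Overlapping intervals are only a coarse screen, not a formal hypothesis test.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def intervals_overlap(a, b) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: 69/200 matches a 34.5% rate; 80/200 is invented.
simulated = wilson_interval(69, 200)
real_traces = wilson_interval(80, 200)
similar = intervals_overlap(simulated, real_traces)
```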

Figures

Figures reproduced from arXiv: 2605.26086 by Dandan Tu, Feiyang Pan, Haiyang Wang, Jiangui Chen, Lue Fan, Qipeng Gu, Sanyuan Zhao, Shuzhe Wu, Siqi Cheng, Xinyuan Liang, Yusong Lin.

Figure 1: Overview of Claw-Anything and its empirical value.
Figure 2: Three dimensions along which Claw-Anything expands agent context.
Figure 3: Claw-Anything environment and automated data pipeline.
Figure 4: Benchmark statistics of Claw-Anything. Left: Comparison with Claw-Eval in size, context length, services per task, and supported devices. Right: Category distribution of evaluation instances.
Figure 5: Ablation of contextual scale, showing the effects of event-stream volume and the number …
Figure 6: Trajectory scaling.
Figure 7: Ablation of the automatic data-generation pipeline, showing the effects of the noise-round …
Figure 8: Visualization of failure modes.
Figure 9: Example of an initial persona.
Figure 10: Example of persona enrichment.
Figure 11: Example of a task that characterizes conflicting patterns.
Figure 12: Persona-specific event instantiation: prompt for adapting a seed task.
Figure 13: Persona-specific event instantiation: prompt for generating app data.
Figure 14: Example of the system prompt used with OpenHarness.
Figure 15: Example of the App backend used.
Original abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility of scalable data infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Claw-Anything, a benchmark that extends agent evaluation to always-on personal assistants with long-horizon activity histories, interdependent backend services, and integrated GUI/CLI access across devices. It instantiates this setting via multi-round event injection that simulates months of user activity, generating complex world states with noise, including irrelevant events and conflicting signals. The experiments report GPT-5.5 at 34.5% pass@1 (below prior benchmarks). The authors also release an automated pipeline that generates 2,000 training environments and improves the base model by 23.7%.

Significance. If the simulated environments are shown to be representative, the work would usefully quantify a capability gap for always-on assistance and provide a scalable data-generation pipeline as a concrete resource for training. The pipeline's reported 23.7% improvement is a positive contribution, and a reproducible one if the evaluation protocol is fully specified.

major comments (2)
  1. [Abstract and benchmark instantiation] The central claim that 34.5% pass@1 demonstrates a gap between current agents and always-on demands rests on the assumption that multi-round event injection produces world states whose difficulty and noise distribution match real deployments. Yet the manuscript supplies no quantitative checks (event-rate histograms, dependency-depth distributions, noise-to-signal ratios, or inter-rater realism scores) against real user logs or other real-world data. This validation gap is load-bearing for interpreting the headline result.
  2. [Experiments] Headline result: pass@1, task construction details, baseline comparisons, and statistical significance are referenced in the abstract, but the provided text gives no definition of pass@1, no description of how tasks are sampled or validated, and no comparison protocol. This prevents assessment of whether the 34.5% figure is comparable to prior benchmarks.
minor comments (1)
  1. [Abstract] The 23.7% training improvement is stated without specifying the base model, the evaluation split, or whether the improvement is measured on the same benchmark tasks.
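
Two of the quantitative checks named in the first major comment, event-rate histograms and noise-to-signal ratios, could be computed along the following lines. This is a sketch under assumed inputs: events are represented as hypothetical `(day_index, label)` pairs, and the label names are illustrative rather than drawn from the paper.

```python
from collections import Counter


def event_rate_histogram(events):
    """Map 'events per day' -> number of days with that many events.

    Days with zero events do not appear in the input, so they are not
    counted here.
    """
    per_day = Counter(day for day, _ in events)
    return Counter(per_day.values())


def noise_to_signal_ratio(events, noise_labels=("irrelevant", "conflicting")):
    """Ratio of noise events to all remaining (signal) events."""
    noise = sum(1 for _, label in events if label in noise_labels)
    signal = len(events) - noise
    if signal == 0:
        raise ValueError("no signal events; ratio is undefined")
    return noise / signal


# Hypothetical three-day event log.
events = [
    (0, "relevant"), (0, "irrelevant"),
    (1, "relevant"),
    (2, "conflicting"), (2, "relevant"), (2, "relevant"),
]
histogram = event_rate_histogram(events)
ratio = noise_to_signal_ratio(events)
```

Computing the same statistics on simulated and real logs would give the side-by-side comparison the referee asks for.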

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. Below we respond point-by-point to the two major comments, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract and benchmark instantiation] The central claim that 34.5% pass@1 demonstrates a gap between current agents and always-on demands rests on the assumption that multi-round event injection produces world states whose difficulty and noise distribution match real deployments, yet the manuscript supplies no quantitative checks (event-rate histograms, dependency-depth distributions, noise-to-signal ratios, or inter-rater realism scores) against real user logs or data. This validation gap is load-bearing for interpreting the headline result.

    Authors: We agree that direct quantitative validation against real user logs would strengthen the interpretation of the headline result. Such logs are not available to us due to privacy constraints, so the environments were constructed from domain-expert specifications of typical long-horizon activity patterns, service interdependencies, and noise sources. We will revise the benchmark instantiation section to report the concrete simulation parameters (event-rate distributions, dependency depths, and noise injection rules) and add an explicit limitations paragraph discussing the absence of real-log calibration. Revision: partial

  2. Referee: [Experiments] Headline result: pass@1, task construction details, baseline comparisons, and statistical significance are referenced in the abstract but the provided text gives no definition of pass@1, no description of how tasks are sampled or validated, and no comparison protocol, preventing assessment of whether the 34.5% figure is comparable to prior benchmarks.

    Authors: The full manuscript contains these definitions and protocols in the Experiments section, but they are not stated with sufficient prominence. We will revise the Experiments section to define pass@1 explicitly at the outset, describe the task-sampling procedure from the generated environments, detail the validation steps, and specify the exact comparison protocol with prior benchmarks. The abstract will be updated to point readers to these definitions. Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces a benchmark via simulated multi-round event injection and reports direct experimental metrics (e.g., GPT-5.5 pass@1) without any equations, parameter fitting, or self-citations that reduce results to inputs by construction. The simulation is presented as an independent generation pipeline rather than a fitted model whose outputs are then relabeled as predictions; no load-bearing self-citation chains or ansatzes appear in the provided text. The central claim rests on external measurement against the generated environments, satisfying the criteria for a self-contained benchmark presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a new simulated benchmark rather than a derivation; the main assumption is that the constructed environments are representative of real always-on use.

axioms (1)
  • Domain assumption: simulated multi-round event injection with irrelevant and conflicting signals produces evaluation conditions representative of real user digital worlds.
    Invoked when claiming the benchmark captures the demands of always-on personal assistance (abstract section on benchmark instantiation).

pith-pipeline@v0.9.1-grok · 5800 in / 1243 out tokens · 34728 ms · 2026-06-29T21:30:37.455921+00:00 · methodology

