Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

2026 · cs.AI · arXiv 2605.03596

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.

representative citing papers

AI Snitches Get Glitches: Towards Evading Agentic Surveillance

cs.AI · 2026-06-24 · unverdicted · novelty 6.0

Formalizes agentic surveillance, releases SurveilBench for testing AI reporting behaviors across corporate, education, and police scenarios, and develops three prompt-injection evasion techniques.

AgenticDataBench: A Comprehensive Benchmark for Data Agents

cs.DB · 2026-07-02 · unverdicted · novelty 5.0

AgenticDataBench is a new benchmark covering realistic data science tasks across 15 domains using extracted skills and LLM-generated workflows to evaluate data agents at fine granularity.

citing papers explorer

AI Snitches Get Glitches: Towards Evading Agentic Surveillance
cs.AI · 2026-06-24 · unverdicted · none · ref 80 · internal anchor

AgenticDataBench: A Comprehensive Benchmark for Data Agents
cs.DB · 2026-07-02 · unverdicted · none · ref 53 · internal anchor
