π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
Proactive agent research environment: Simulating active users to evaluate proactive assistants
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.
citing papers explorer
-
$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
Agentic Coding Needs Proactivity, Not Just Autonomy
Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.