Proactive agent research environment: Simulating active users to evaluate proactive assistants
6 Pith papers cite this work, all from 2026.
citing papers explorer
- Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents
  ATRBench is the first benchmark for the Ask-to-Remember task, showing that eight frontier LLM agents fall at least 62 points below an oracle that receives the relevant preference and that prompting closes little of the gap.
- SentinelBench: A Benchmark for Long-Running Monitoring Agents
  SentinelBench is a new benchmark for time-evolving monitoring tasks in web environments, measuring task completion, reaction time, and resource use with baselines from three models and two harnesses.
- π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
  π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- Audio Interaction Model
  Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
- Agentic Coding Needs Proactivity, Not Just Autonomy
  Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.