arxiv: 2605.14678 · 2 revisions
$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows