Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

2026 · cs.AI · arXiv 2603.14248

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

representative citing papers

Multi-Agent Computer Use

cs.MA · 2026-06-01 · unverdicted · novelty 6.0

A manager-driven DAG decomposition with parallel subagents improves computer use agent success rates by 3.4-25.5% and reduces wall-clock time on long-horizon benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

Multi-Agent Computer Use

cs.MA · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
