Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective
Pith reviewed 2026-05-15 11:57 UTC · model grok-4.3
The pith
LLM web agents fail mainly at low-level execution, not high-level planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating LLM web agents through a three-layer hierarchy of high-level planning, low-level execution, and replanning reveals that structured PDDL plans produce more concise and goal-directed strategies than natural language plans, but low-level execution errors constitute the dominant bottleneck preventing human-level reliability on long-horizon web tasks.
What carries the argument
A three-layer hierarchical planning framework that isolates high-level planning, low-level execution, and replanning, so that reasoning, grounding, and recovery can each be evaluated separately.
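A minimal sketch of how such a three-layer loop might be wired, so that every event in a run is attributed to exactly one layer. The `plan`, `execute`, and `verify` callables, the `Trace` record, and the `max_replans` limit are hypothetical names chosen for illustration; they are not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Ordered log of events, each tagged with the layer that produced it."""
    events: List[tuple] = field(default_factory=list)


def run_hierarchical(
    task: str,
    plan: Callable[[str, List[str]], List[str]],  # (task, completed steps) -> remaining steps
    execute: Callable[[str], bool],               # low-level executor for one step
    verify: Callable[[str], bool],                # postcondition check for one step
    max_replans: int = 2,
) -> Trace:
    trace = Trace()
    done: List[str] = []
    steps = plan(task, done)
    trace.events.append(("high_level_planning", list(steps)))
    replans = 0
    while steps:
        step = steps[0]
        ok = execute(step) and verify(step)
        trace.events.append(("low_level_execution", step, ok))
        if ok:
            done.append(step)
            steps = steps[1:]
            continue
        if replans == max_replans:
            # Replanning budget exhausted: the episode ends in failure.
            trace.events.append(("failed", step))
            return trace
        replans += 1
        steps = plan(task, done)  # replanning sees what has already succeeded
        trace.events.append(("replanning", list(steps)))
    trace.events.append(("succeeded",))
    return trace
```

Logging the layer at each event is what makes a process-based breakdown possible later; an end-to-end harness would record only the final event.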
If this is right
- Structured PDDL plans yield shorter and more goal-directed high-level strategies than natural language alternatives.
- Low-level execution errors exceed failures in planning or replanning as the main performance limiter.
- Improvements in perceptual grounding and adaptive control are needed in addition to high-level reasoning gains.
- Process-based evaluation across the three layers provides clearer diagnosis than end-to-end success rates alone.
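As a rough illustration of what a process-based breakdown looks like in practice, the sketch below turns per-episode failure labels into per-layer shares. The label strings and the `failure_shares` helper are assumptions made for this example, and the sample labels are invented, not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable

LAYERS = ("high_level_planning", "low_level_execution", "replanning")


def failure_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of failed episodes attributed to each layer."""
    counts = Counter(labels)
    unknown = set(counts) - set(LAYERS)
    if unknown:
        # A label such as "memory" falls outside the three-layer taxonomy.
        raise ValueError(f"labels outside the taxonomy: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {layer: 0.0 for layer in LAYERS}
    return {layer: counts[layer] / total for layer in LAYERS}


# Invented labels for illustration only; not results from the paper.
example = ["low_level_execution"] * 3 + ["high_level_planning"]
shares = failure_shares(example)
```

Rejecting unknown labels, rather than silently folding them into an existing layer, keeps the taxonomy's coverage assumption visible in the data.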
Where Pith is reading between the lines
- Agent architectures could improve by coupling high-level planners more tightly with low-level controllers that react to live page changes.
- The same layered breakdown might reveal analogous execution bottlenecks in other LLM agent settings such as code or robot control.
- Specialized vision-language models for web interfaces may address the execution layer more directly than further scaling of general models.
Load-bearing premise
The proposed three-layer hierarchy captures all primary sources of failure in web agents without overlooking other factors such as memory or tool use.
What would settle it
Give agents an oracle for perfect low-level execution on the same tasks and measure overall success. If success rates remain low, execution does not dominate the failure distribution; if they rise sharply, it does.
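One way to score such an oracle comparison, sketched under the assumption that per-task outcomes are available as booleans for both conditions. The `oracle_gap` helper and its output keys are hypothetical.

```python
from typing import Dict


def oracle_gap(baseline: Dict[str, bool], oracle: Dict[str, bool]) -> Dict[str, float]:
    """Compare per-task success with the agent's own executor vs. an oracle executor."""
    if baseline.keys() != oracle.keys():
        raise ValueError("both conditions must cover the same tasks")
    n = len(baseline)
    if n == 0:
        raise ValueError("no tasks given")
    failed = [task for task, ok in baseline.items() if not ok]
    recovered = sum(oracle[task] for task in failed)
    return {
        "baseline_success": sum(baseline.values()) / n,
        "oracle_success": sum(oracle.values()) / n,
        # Share of baseline failures that disappear once execution is perfect.
        "recovered_share": recovered / len(failed) if failed else 0.0,
    }
```

A `recovered_share` near 1 would support the claim that execution dominates; a value near 0 would point back at planning or replanning.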
Original abstract
Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.
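To make the PDDL versus NL contrast more concrete, here is a small sketch that parses PDDL-like plan steps of the form `(action "arg")` and reports the plan length. The step syntax accepted by the regular expression and the sample action names are illustrative assumptions, not a full PDDL parser.

```python
import re
from typing import List, Tuple

# One step: "(" name, then zero or more double-quoted arguments, then ")".
STEP = re.compile(r'\(\s*([A-Za-z_][\w-]*)((?:\s+"[^"]*")*)\s*\)')


def parse_plan(text: str) -> List[Tuple[str, List[str]]]:
    """Extract (action_name, [arguments]) pairs from PDDL-like plan text."""
    steps = []
    for match in STEP.finditer(text):
        name = match.group(1)
        args = re.findall(r'"([^"]*)"', match.group(2))
        steps.append((name, args))
    return steps


# Illustrative plan text; the actions are examples, not a fixed vocabulary.
plan_text = '''
(search_product "iphone")
(filter_by_price "low")
(select_top_result)
'''
steps = parse_plan(plan_text)
plan_length = len(steps)
```

Because each step is a named action with explicit arguments, plan length and action vocabulary can be counted mechanically, which is harder to do for free-form natural language plans.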
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a three-layer hierarchical planning framework (high-level planning, low-level execution, and replanning) to diagnose failures in LLM-based web agents beyond end-to-end success rates. Experiments compare structured PDDL plans against natural language plans, finding the former more concise and goal-directed, while concluding that low-level execution (perceptual grounding and adaptive control) is the dominant bottleneck and the key target for improvement.
Significance. If the taxonomy holds, the work supplies a process-oriented diagnostic lens that could usefully redirect research from pure high-level reasoning toward grounding and recovery mechanisms in long-horizon web tasks. The PDDL-vs-NL comparison offers a concrete, falsifiable contrast that strengthens the framework's utility.
major comments (2)
- [Framework and experimental annotation procedure] The central claim that low-level execution is the dominant bottleneck rests on the three-layer partition cleanly isolating failure sources. If memory-state tracking or tool-integration errors are annotated into the 'execution' category (common in long-horizon web tasks), the reported dominance may be an artifact of the taxonomy rather than an empirical result. The manuscript should provide explicit annotation guidelines and inter-annotator agreement for layer boundaries, plus an ablation that isolates memory/tool modules.
- [Experiments and results] The abstract states that PDDL plans are more concise and goal-directed than NL plans and that execution remains the bottleneck, yet supplies no quantitative metrics (success rates, plan lengths, error distributions, task counts, or controls). Without these numbers and statistical tests in the results section, the load-bearing claim cannot be verified.
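For the inter-annotator agreement requested in the first major comment, Cohen's kappa is one common choice when two annotators label the same episodes. The sketch below is a plain standard-library implementation; the layer labels in the usage note are invented, and returning 1.0 in the degenerate single-label case is a convention chosen here, not a standard.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined, and 1.0 is returned here by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with annotator labels `["exec", "exec", "plan", "replan"]` and `["exec", "plan", "plan", "replan"]`, observed agreement is 3/4 and chance agreement is 5/16, giving a kappa of 7/11 (about 0.64).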
minor comments (2)
- [Evaluation protocol] Clarify how replanning is triggered and distinguished from initial high-level planning in the evaluation protocol.
- [Discussion] Add a limitations paragraph discussing whether the hierarchy omits other factors such as long-term memory or API/tool-use errors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving the clarity and rigor of our hierarchical framework and experimental reporting. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Framework and experimental annotation procedure] The central claim that low-level execution is the dominant bottleneck rests on the three-layer partition cleanly isolating failure sources. If memory-state tracking or tool-integration errors are annotated into the 'execution' category (common in long-horizon web tasks), the reported dominance may be an artifact of the taxonomy rather than an empirical result. The manuscript should provide explicit annotation guidelines and inter-annotator agreement for layer boundaries, plus an ablation that isolates memory/tool modules.
  Authors: We agree that reproducible layer boundaries are essential. The revised manuscript will include a new subsection with explicit annotation guidelines that define high-level planning as goal decomposition, low-level execution as perceptual grounding and action issuance (including tool calls), and replanning as recovery from detected failures. We will report inter-annotator agreement (e.g., Cohen's kappa) from our annotation process. Memory-state tracking errors are assigned to replanning when they necessitate replanning; tool-integration errors during execution are kept in the low-level category. To directly test the taxonomy, we will add an ablation comparing agents with and without explicit memory modules on the same task set.
  revision: yes
- Referee: [Experiments and results] The abstract states that PDDL plans are more concise and goal-directed than NL plans and that execution remains the bottleneck, yet supplies no quantitative metrics (success rates, plan lengths, error distributions, task counts, or controls). Without these numbers and statistical tests in the results section, the load-bearing claim cannot be verified.
  Authors: The results section already reports aggregate success rates, average plan lengths, and error-type distributions for PDDL versus NL conditions across the evaluated tasks. However, we accept that these figures should be more prominent and statistically supported. In revision we will add a dedicated results table with exact counts, per-task breakdowns, plan-length statistics, and significance tests (e.g., paired t-tests or Wilcoxon tests) for the key comparisons, ensuring every claim in the abstract is directly traceable to verifiable numbers.
  revision: yes
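A sketch of the kind of paired significance test mentioned in the response, using an exact Wilcoxon signed-rank test written with the standard library only. It enumerates all sign assignments, so it is practical only for small samples, and the `wilcoxon_exact` name and the plan lengths in the usage note are made up for illustration.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, exact two-sided p-value) for paired samples x and y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d| in ascending order; tied magnitudes share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = sum(ranks) / 2  # expected W_plus under the null hypothesis
    observed = abs(w_plus - mean)
    # Under the null each difference is equally likely to be positive or negative.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - mean) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n
```

With invented NL plan lengths `[12, 9, 15, 11, 14]` and PDDL plan lengths `[8, 7, 10, 10, 9]`, every difference is positive, so `W_plus` is 15 and the two-sided p-value is 2/32 = 0.0625. That is the smallest p-value five pairs can produce, which shows why per-task counts matter when reporting significance.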
Circularity Check
No significant circularity detected in the proposed framework
The paper introduces a three-layer hierarchical planning framework (high-level planning, low-level execution, replanning) as an independent diagnostic tool for analyzing LLM web agent failures. The central experimental claim—that PDDL plans yield more concise strategies than NL plans while low-level execution remains the dominant bottleneck—arises from direct process-based evaluation of agent traces rather than any self-referential derivation, fitted parameter, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce the reported results to the framework's own inputs by construction. The hierarchy functions as an external lens applied to observed behaviors, rendering the analysis self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three layers (high-level planning, low-level execution, and replanning) cover the primary sources of failure in LLM web agents.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.