arxiv: 2603.14248 · v2 · submitted 2026-03-15 · 💻 cs.AI · cs.CL

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Pith reviewed 2026-05-15 11:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM web agents · hierarchical planning · PDDL plans · web navigation · agent failure diagnosis · low-level execution · high-level planning · replanning

The pith

LLM web agents fail mainly at low-level execution, not high-level planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a three-layer hierarchy to diagnose failures in LLM web agents on navigation tasks: high-level planning, low-level execution, and replanning. Experiments show that structured PDDL plans create shorter and more focused strategies than natural language plans. Despite this, execution errors on individual web actions remain the main source of overall failure. The findings indicate that progress requires stronger perceptual grounding and adaptive control, not only better reasoning at the planning stage.

Core claim

Evaluating LLM web agents through a three-layer hierarchy of high-level planning, low-level execution, and replanning reveals that structured PDDL plans produce more concise and goal-directed strategies than natural language plans, but low-level execution errors constitute the dominant bottleneck preventing human-level reliability on long-horizon web tasks.

What carries the argument

A three-layer hierarchical planning framework that isolates high-level planning, low-level execution, and replanning, so that reasoning, grounding, and recovery can each be evaluated separately.
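
As a rough illustration of how the three layers interact, the control loop can be sketched as below. This is a minimal sketch, not the authors' implementation: `run_hierarchical`, `Outcome`, the `plan` and `execute` callables, the `max_replans` budget, the toy plan and executor, and the rule used to attribute a failure to a layer are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    success: bool
    failed_layer: Optional[str]  # "planning", "execution", "replanning", or None
    replans_used: int


def run_hierarchical(
    task: str,
    plan: Callable[[str, List[str]], List[str]],  # hypothetical high-level planner
    execute: Callable[[str], bool],  # hypothetical executor plus postcondition check
    max_replans: int = 1,
) -> Outcome:
    """Run plan -> execute -> replan and report which layer ended the episode."""
    completed: List[str] = []
    replans = 0
    subgoals = plan(task, completed)
    if not subgoals:
        return Outcome(False, "planning", replans)
    while subgoals:
        subgoal = subgoals[0]
        if execute(subgoal):
            completed.append(subgoal)
            subgoals = subgoals[1:]
            continue
        # Assumed attribution rule: a failure with no replanning budget used is
        # charged to execution; a failure after replanning is charged to replanning.
        if replans >= max_replans:
            layer = "execution" if replans == 0 else "replanning"
            return Outcome(False, layer, replans)
        replans += 1
        subgoals = plan(task, completed)
        if not subgoals:
            return Outcome(False, "replanning", replans)
    return Outcome(True, None, replans)


def toy_plan(task: str, completed: List[str]) -> List[str]:
    # Illustrative PDDL-style steps; only the steps not yet completed are returned.
    steps = ['(search "red t-shirt")', '(filter_by_price "low")', "(select_top_result)"]
    return [s for s in steps if s not in completed]


def make_flaky_executor() -> Callable[[str], bool]:
    attempts = {}

    def execute(subgoal: str) -> bool:
        attempts[subgoal] = attempts.get(subgoal, 0) + 1
        # The filter step fails on its first attempt only.
        return not (subgoal.startswith("(filter_by_price") and attempts[subgoal] == 1)

    return execute


result = run_hierarchical("Find the cheapest red t-shirt", toy_plan, make_flaky_executor())
print(result)
```

In this toy run the filter step fails once, a single replan resumes from the completed search step, and the episode then succeeds. Real agents would replace the two callables with LLM calls and a browser environment.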

If this is right

  • Structured PDDL plans yield shorter and more goal-directed high-level strategies than natural language alternatives.
  • Low-level execution errors exceed failures in planning or replanning as the main performance limiter.
  • Improvements in perceptual grounding and adaptive control are needed in addition to high-level reasoning gains.
  • Process-based evaluation across the three layers provides clearer diagnosis than end-to-end success rates alone.
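
To make the contrast between an end-to-end success rate and a per-layer diagnosis concrete, here is a small sketch that aggregates per-episode failure labels. The function name `layer_report`, the label strings, and the eight example labels are hypothetical and are not results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

LAYERS = ("planning", "execution", "replanning")


def layer_report(failed_layers: Iterable[Optional[str]]) -> Dict[str, float]:
    """Summarise episodes labelled with the layer that failed (None = success)."""
    labels = list(failed_layers)
    if not labels:
        raise ValueError("need at least one episode")
    counts = Counter(label for label in labels if label is not None)
    total = len(labels)
    report = {"success_rate": (total - sum(counts.values())) / total}
    for layer in LAYERS:
        # Share of all episodes that ended with a failure in this layer.
        report[f"{layer}_failure_share"] = counts[layer] / total
    return report


# Hypothetical labels for eight episodes, for illustration only.
labels = [None, "execution", "execution", "planning", None, "replanning", "execution", None]
report = layer_report(labels)
for key, value in report.items():
    print(f"{key}: {value:.3f}")
```

The single `success_rate` number is all an end-to-end evaluation would report; the three share values are the extra information a layered evaluation adds.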

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent architectures could improve by coupling high-level planners more tightly with low-level controllers that react to live page changes.
  • The same layered breakdown might reveal analogous execution bottlenecks in other LLM agent settings such as code or robot control.
  • Specialized vision-language models for web interfaces may address the execution layer more directly than further scaling of general models.

Load-bearing premise

The proposed three-layer hierarchy captures all primary sources of failure in web agents without overlooking other factors such as memory or tool use.

What would settle it

Providing agents with an oracle for perfect low-level execution on the same tasks and measuring whether overall success rates still remain low would test whether execution truly dominates the failure distribution.
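
One way to picture this oracle test is to run the same plans through a real executor and through an executor that never fails, then compare success rates. The sketch below is illustrative only: `Task`, `success_rate`, `real_executor`, `oracle_executor`, the `plan_is_valid` flag, and the four toy tasks are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Executor = Callable[[str], bool]


@dataclass(frozen=True)
class Task:
    name: str
    plan: Tuple[str, ...]  # subgoals proposed by the planner
    plan_is_valid: bool  # assumed label: would perfect execution of this plan solve the task?


def success_rate(tasks: Sequence[Task], executor: Executor) -> float:
    """Fraction of tasks with a valid plan whose every subgoal the executor completes."""
    if not tasks:
        raise ValueError("need at least one task")
    solved = sum(
        task.plan_is_valid and all(executor(sg) for sg in task.plan)
        for task in tasks
    )
    return solved / len(tasks)


def real_executor(subgoal: str) -> bool:
    # Hypothetical stand-in for a real agent: it cannot operate filter widgets.
    return "filter" not in subgoal


def oracle_executor(subgoal: str) -> bool:
    # Oracle: every subgoal is executed perfectly.
    return True


# Hypothetical tasks, for illustration only.
tasks = [
    Task("t1", ("search", "filter"), True),
    Task("t2", ("search", "select"), True),
    Task("t3", ("filter", "select"), True),
    Task("t4", ("search",), False),
]

real = success_rate(tasks, real_executor)
oracle = success_rate(tasks, oracle_executor)
print(f"real={real:.2f} oracle={oracle:.2f} gap={oracle - real:.2f}")
```

A large gap between the two rates would point to execution as the bottleneck; a low rate even under the oracle would point back to planning.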

Figures

Figures reproduced from arXiv: 2603.14248 by Gregory J. Stein, Mohamed Aghzal, Ziyu Yao.

Figure 1: Overview of the hierarchical planning evaluation framework we propose. The pipeline consists of 3 stages: 1) High-level Planning: The LLM proposes high-level subgoals, 2) Low-level Execution: each high-level subgoal is translated into a set of low-level actions, and a postcondition checker verifies whether the low-level actions lead to successful completion of the subgoal. If the subgoal fails after multiple i…
Figure 2: Execution results of different representations
Figure 3: Performance with and without replanning
Figure 5: Evaluation trees for high-level alignment.
Figure 6: Motivating example highlighting the importance of high-level representation.
Figure 10: Replanning results claude-haiku-4.5
Figure 9: Execution Results using different high-level …
Figure 12: Replanning results gemini-flash-2.5
Figure 4: Examples of the high-level step annotation process. We begin by prompting gpt-5-nano to produce a high-level step based on the evaluation key-node object, then manually correct issues such as overspecification and steps framed as evaluation functions. Panel text: Subgoal: Navigate to the page for used Audi cars. Action: goto https://www.kbb.com/used-cars/audi/ Hallucinated Links. Task: Find the statistics of the player Crist…
Figure 8: Examples of some of the most common errors encountered by the model (…
Figure 13: Example of our hierarchical evaluation pipeline for a single task, showing high-level planning (Human …
Original abstract

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a three-layer hierarchical planning framework (high-level planning, low-level execution, and replanning) to diagnose failures in LLM-based web agents beyond end-to-end success rates. Experiments compare structured PDDL plans against natural language plans, finding the former more concise and goal-directed, while concluding that low-level execution (perceptual grounding and adaptive control) is the dominant bottleneck and the key target for improvement.

Significance. If the taxonomy holds, the work supplies a process-oriented diagnostic lens that could usefully redirect research from pure high-level reasoning toward grounding and recovery mechanisms in long-horizon web tasks. The PDDL-vs-NL comparison offers a concrete, falsifiable contrast that strengthens the framework's utility.

major comments (2)
  1. [Framework and experimental annotation procedure] The central claim that low-level execution is the dominant bottleneck rests on the three-layer partition cleanly isolating failure sources. If memory-state tracking or tool-integration errors are annotated into the 'execution' category (common in long-horizon web tasks), the reported dominance may be an artifact of the taxonomy rather than an empirical result. The manuscript should provide explicit annotation guidelines and inter-annotator agreement for layer boundaries, plus an ablation that isolates memory/tool modules.
  2. [Experiments and results] The abstract states that PDDL plans are more concise and goal-directed than NL plans and that execution remains the bottleneck, yet supplies no quantitative metrics (success rates, plan lengths, error distributions, task counts, or controls). Without these numbers and statistical tests in the results section, the load-bearing claim cannot be verified.
minor comments (2)
  1. [Evaluation protocol] Clarify how replanning is triggered and distinguished from initial high-level planning in the evaluation protocol.
  2. [Discussion] Add a limitations paragraph discussing whether the hierarchy omits other factors such as long-term memory or API/tool-use errors.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving the clarity and rigor of our hierarchical framework and experimental reporting. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Framework and experimental annotation procedure] The central claim that low-level execution is the dominant bottleneck rests on the three-layer partition cleanly isolating failure sources. If memory-state tracking or tool-integration errors are annotated into the 'execution' category (common in long-horizon web tasks), the reported dominance may be an artifact of the taxonomy rather than an empirical result. The manuscript should provide explicit annotation guidelines and inter-annotator agreement for layer boundaries, plus an ablation that isolates memory/tool modules.

    Authors: We agree that reproducible layer boundaries are essential. The revised manuscript will include a new subsection with explicit annotation guidelines that define high-level planning as goal decomposition, low-level execution as perceptual grounding and action issuance (including tool calls), and replanning as recovery from detected failures. We will report inter-annotator agreement (e.g., Cohen's kappa) from our annotation process. Memory-state tracking errors are assigned to replanning when they force the agent to revise its plan; tool-integration errors that occur during execution stay in the low-level category. To directly test the taxonomy, we will add an ablation comparing agents with and without explicit memory modules on the same task set. revision: yes

  2. Referee: [Experiments and results] The abstract states that PDDL plans are more concise and goal-directed than NL plans and that execution remains the bottleneck, yet supplies no quantitative metrics (success rates, plan lengths, error distributions, task counts, or controls). Without these numbers and statistical tests in the results section, the load-bearing claim cannot be verified.

    Authors: The results section already reports aggregate success rates, average plan lengths, and error-type distributions for PDDL versus NL conditions across the evaluated tasks. However, we accept that these figures should be more prominent and statistically supported. In revision we will add a dedicated results table with exact counts, per-task breakdowns, plan-length statistics, and significance tests (e.g., paired t-tests or Wilcoxon tests) for the key comparisons, ensuring every claim in the abstract is directly traceable to verifiable numbers. revision: yes
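
The two statistical commitments in the responses above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' analysis: the annotator labels and plan lengths are hypothetical, the function names are introduced here, and the second function is an exact two-sided sign test, a simpler stand-in for the paired t-test or Wilcoxon test that the rebuttal names.

```python
from collections import Counter
from math import comb
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both annotators use one identical label
        # throughout; this sketch returns 1.0 in that case.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def sign_test_p_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact two-sided sign test on paired samples; tied pairs are dropped."""
    if len(x) != len(y):
        raise ValueError("samples must be paired")
    diffs = [p - q for p, q in zip(x, y) if p != q]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = min(sum(d > 0 for d in diffs), sum(d < 0 for d in diffs))
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical failure-layer labels from two annotators, for illustration only.
annotator_a = ["execution", "execution", "planning", "replanning", "execution", "planning"]
annotator_b = ["execution", "planning", "planning", "replanning", "execution", "execution"]
kappa = cohens_kappa(annotator_a, annotator_b)

# Hypothetical per-task plan lengths under the two plan representations.
pddl_lengths = [3, 4, 3, 5, 4, 3, 4, 2]
nl_lengths = [6, 7, 5, 8, 4, 6, 7, 5]
p_value = sign_test_p_value(pddl_lengths, nl_lengths)

print(f"kappa={kappa:.3f} sign-test p={p_value:.4f}")
```

The sign test uses only the direction of each paired difference, so it is less powerful than the tests named in the rebuttal, but it is exact and needs no distributional assumptions.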

Circularity Check

0 steps flagged

No significant circularity detected in the proposed framework

Full rationale

The paper introduces a three-layer hierarchical planning framework (high-level planning, low-level execution, replanning) as an independent diagnostic tool for analyzing LLM web agent failures. The central experimental claim—that PDDL plans yield more concise strategies than NL plans while low-level execution remains the dominant bottleneck—arises from direct process-based evaluation of agent traces rather than any self-referential derivation, fitted parameter, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce the reported results to the framework's own inputs by construction. The hierarchy functions as an external lens applied to observed behaviors, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that the three named layers are sufficient to localize failures; no free parameters, invented physical entities, or additional axioms are stated in the abstract.

axioms (1)
  • domain assumption: The three layers (high-level planning, low-level execution, and replanning) cover the primary sources of failure in LLM web agents.
    Invoked when the authors state that the framework enables process-based evaluation of reasoning, grounding, and recovery.

pith-pipeline@v0.9.0 · 5438 in / 1192 out tokens · 38094 ms · 2026-05-15T11:57:06.618540+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
