Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Alejandra Zambrano; Imene Kerboua; Leila Kosseim; Sara Vera Marjanovic; Xing Han L\`u

arxiv: 2605.29927 · v1 · pith:2VCUO4YCnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI· cs.LG

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Alejandra Zambrano , Sara Vera Marjanovic , Imene Kerboua , Xing Han L\`u , Leila Kosseim This is my paper

Pith reviewed 2026-06-29 07:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords planning representationsLLM web agentsPlanAheadWebArenaplan formulationtask successagent robustnessevaluation metrics

0 comments

The pith

Both how plans are written and which LLM writes them shape web-agent success on hard tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PlanAhead, a static planner-executor setup that tests four natural-language plan styles on automatically graded hard tasks from WebArena. It runs the same tasks with plans generated by models from OpenAI, Alibaba, and Google, then measures outcomes with two new metrics that capture both overall achievement and consistency across runs. The central finding is that the choice of plan representation and the model producing the plan each change how often agents complete tasks and how reliably they do so.

Core claim

Both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

What carries the argument

PlanAhead, a static planner-executor framework that compares four plan representations (sequential subgoals, narrative, pseudocode, checklist) on automatically partitioned hard tasks.

If this is right

Sequential subgoals, narrative, pseudocode, and checklist plans produce measurably different success rates on the same tasks.
Switching the LLM that generates the plan can raise or lower both achievement rate and solved-task consistency.
The two new metrics (Achievement Rate and Solved-Task Consistency) give a more stable picture of stochastic agent behavior than single-run success.
Static planning separates plan quality from execution noise, allowing direct comparison of representation effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Web-agent builders may gain more by tuning plan style to a specific model family than by scaling the executor alone.
The automatic difficulty partition could be validated or replaced by lightweight human checks on a small sample to strengthen future comparisons.
Similar representation tests could be applied to non-web agent settings where step omission or constraint sensitivity is also common.

Load-bearing premise

Automatic categorization of WebArena tasks into three difficulty levels without human annotation produces a reliable partition that does not bias the comparison of plan representations.

What would settle it

Re-running the same hard tasks after human annotators re-categorize difficulty levels and finding that the ranking of the four plan representations changes substantially.

Figures

Figures reproduced from arXiv: 2605.29927 by Alejandra Zambrano, Imene Kerboua, Leila Kosseim, Sara Vera Marjanovic, Xing Han L\`u.

**Figure 1.** Figure 1: Overview of PLANAHEAD, in comparison to single-agent and proactive planning frameworks 2 Related Works Planning through task decomposition, where complex tasks are broken into simpler subgoals (Huang et al., 2024), has emerged as a key capability for tackling complex, multi-step tasks in dynamic environments. Planning can be implemented within a single agent (Sodhi et al., 2023), but more often through … view at source ↗

**Figure 2.** Figure 2: The four representations used in our evaluation of [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of Easy, Medium, and Hard tasks across WebArena domains. Percentages indicate each domain’s share within a difficulty tier. Config Sequential Checklist Pseudocode Narrative 4.1 mini 3.59–6.59 1.20–4.79 2.40–4.79 5.39–6.59 Qwen 2.5 2.40–2.99 2.40–4.79 3.59–3.59 2.99–4.19 Gemini 3.80-5.70 5.06-7.59 4.43-8.23 6.33-7.59 4.1 mini - Qwen 2.99–5.99 4.79–8.38 2.40–5.39 3.59–5.39 Qwen - 4.1 mini 2.99-5… view at source ↗

**Figure 4.** Figure 4: Examples of single-agent planner-executor behavior across different WebArena tasks. (a-c) show task [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt for sequential plans [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt for checklist plans [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt for algorithmic pseudocode plans. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt for narrative plans [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

read the original abstract

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representation remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation in agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. Then we systematically evaluate 4 different plan representations on the tasks categorized as hard: sequential subgoals, narrative, pseudocode, and checklist; across different families of multimodal LLM powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both, the plan formulation and the underlying LLM generating the plan, significantly influence web-agent robustness and task success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The plan representation comparison adds some empirical data on web agents but the unvalidated auto difficulty split is a real soft spot that could affect the results.

read the letter

The main point is that plan format and the LLM used to create the plan both affect how well these web agents handle hard tasks, at least on the subset they tested. They compare sequential subgoals, narrative, pseudocode, and checklist in a planner-executor setup across a few model families and track results with two new metrics meant to capture consistency under stochastic runs.

What stands out as useful is the side-by-side evaluation on the same tasks rather than isolated prompting experiments. The new metrics try to move beyond single-success rates, which is a reasonable response to the variability problem in these agents. The work stays grounded in WebArena and reports differences that practitioners might actually try.

The clearest weakness is the automatic categorization of tasks into difficulty levels. The paper focuses its claims on the hard subset, yet the method for labeling without human input is not described in enough detail to judge whether it is neutral across plan styles. If the auto labels pick up on task features that favor narrative over pseudocode or vice versa, the reported effects could be partly artifacts of that split. That makes the central claim harder to trust until the procedure and any validation checks are shown.

The rest of the setup looks like standard empirical work in the area—no obvious circularity or invented quantities. Citations follow the usual web-agent and planning references.

This is the kind of paper that matters to people actively building or tuning LLM web agents who need concrete comparisons rather than high-level advice. A reader already working in that sub-area would get practical signals from the numbers, though they would want to re-check the difficulty partition themselves.

I would send it to peer review. The experiment is worth referee time even if the difficulty labeling needs more justification and possibly a human sanity check.

Referee Report

2 major / 2 minor

Summary. The paper claims that both plan formulation (evaluating sequential subgoals, narrative, pseudocode, and checklist representations) and the underlying LLM used to generate the plan significantly influence web-agent robustness and task success on hard WebArena tasks. It introduces the PlanAhead static planner-executor framework, automatically categorizes WebArena tasks into three difficulty levels without human annotation, evaluates the four representations across OpenAI, Alibaba, and Google multimodal LLMs, and proposes two new metrics—Achievement Rate (AR) and Solved-Task Consistency (STC)—to handle stochastic variability.

Significance. If the central empirical findings hold after methodological clarification, the work would offer concrete guidance on selecting plan representations for LLM web agents and demonstrate the value of systematic cross-LLM comparisons. The introduction of AR and STC as variability-aware metrics is a constructive addition to agent evaluation practices.

major comments (2)

[Methods / task categorization paragraph] The automatic categorization of WebArena tasks into three difficulty levels (described immediately after the introduction of PlanAhead): the manuscript states that tasks are 'automatically categorize[d] ... without human annotation' but supplies no description of the procedure, input features, prompts, or validation against human judgments. Because all quantitative results and the central claim rest exclusively on the 'hard' subset, any correlation between the auto-labels and task properties that interact with plan style (e.g., narrative vs. pseudocode) would render the reported differences between representations artifactual rather than substantive.
[Results] Evaluation and results sections reporting the influence of plan formulation and LLM choice: the claim of 'significant' influence is presented without accompanying statistical tests, per-condition sample sizes, confidence intervals, or correction for multiple comparisons across four representations and three LLM families. This omission prevents assessment of whether the observed differences in AR and STC are reliable or could arise from sampling variability.

minor comments (2)

[Abstract] The abstract contains the phrasing 'both, the plan formulation' (comma after 'both'), which should be corrected for grammatical clarity.
[Methods] The four plan representations are introduced without a concise table or figure summarizing their syntactic differences; adding one would improve readability when comparing results across representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each of the major comments below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Methods / task categorization paragraph] The automatic categorization of WebArena tasks into three difficulty levels (described immediately after the introduction of PlanAhead): the manuscript states that tasks are 'automatically categorize[d] ... without human annotation' but supplies no description of the procedure, input features, prompts, or validation against human judgments. Because all quantitative results and the central claim rest exclusively on the 'hard' subset, any correlation between the auto-labels and task properties that interact with plan style (e.g., narrative vs. pseudocode) would render the reported differences between representations artifactual rather than substantive.

Authors: We agree that a detailed description of the automatic categorization procedure is essential for reproducibility and to address potential confounds. In the revised manuscript, we will expand the Methods section to include the specific input features used (task metadata from WebArena such as number of steps, required actions, and domain), the LLM prompts employed for categorization, and any post-hoc checks performed. While the original categorization did not involve human validation to maintain the 'without human annotation' property, we will discuss potential limitations and how the difficulty levels correlate with observed agent performance metrics. revision: yes
Referee: [Results] Evaluation and results sections reporting the influence of plan formulation and LLM choice: the claim of 'significant' influence is presented without accompanying statistical tests, per-condition sample sizes, confidence intervals, or correction for multiple comparisons across four representations and three LLM families. This omission prevents assessment of whether the observed differences in AR and STC are reliable or could arise from sampling variability.

Authors: We acknowledge the importance of rigorous statistical reporting. The manuscript will be revised to include explicit per-condition sample sizes (number of hard tasks and number of evaluation runs per plan-LLM combination), 95% confidence intervals for AR and STC values, results from statistical tests (such as ANOVA or pairwise t-tests with Bonferroni correction for multiple comparisons), and p-values supporting the claims of significant influence. This will allow readers to better assess the reliability of the differences observed. revision: yes

Circularity Check

0 steps flagged

Purely empirical study; no derivations or self-referential reductions

full rationale

The paper performs an experimental comparison of four plan representations (sequential subgoals, narrative, pseudocode, checklist) on automatically categorized 'hard' WebArena tasks, across multiple LLM families, using two new metrics (AR, STC). No equations, parameter fits, uniqueness theorems, or derivations appear in the provided text. Claims rest on observed performance differences rather than any reduction to prior results or self-citations by construction. The automatic categorization step is a methodological choice but does not create a definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical derivations; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5724 in / 1070 out tokens · 22411 ms · 2026-06-29T07:37:35.717893+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 2 canonical work pages · 1 internal anchor

[1]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

WorkArena: How capable are web agents at solving common knowledge work tasks?arXiv preprint arXiv:2403.07718. Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572. Izzeddin Gur...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

unsolved

Towards enterprise-ready computer using gen- eralist agent.arXiv preprint arXiv:2503.01861. Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. 2023. ADAPT: As-needed decompo- sition and planning with language models.arXiv preprint arXiv:2311.05772. Tianlin Shi, Andrej Karpathy, Linxi Fan, Jona...

work page arXiv 2023
[3]

Foot (OSRM)

Change the mode of transportation to "Foot (OSRM)"
[4]

Click the "Go" button to get walking directions
[5]

After reviewing the effect of your previous actions, verify if your plan is still relevant and update it if necessary

Extract the walking time from the results. After reviewing the effect of your previous actions, verify if your plan is still relevant and update it if necessary. As shown in the plan above, the step of ‘Click- ing Go’ is ignored and is assumed completed. The agent proceeds to just "wait" for the results to ap- pear instead of trying to click go again it p...
[7]

Generate a new plan to complete the task by decomposing the task into meaningful subtasks of logically ordered chunks
[8]

Ensure the plan is concise and contains only necessary steps
[9]

Carefully observe and understand the current state of the computer before generating your plan
[10]

Avoid including steps in your plan that the task does not ask for
[11]

Below are important considerations when generating your plan:

Generate a thought process or explanation about the plan generated. Below are important considerations when generating your plan:
[12]

Provide the plan in a step-by-step numbered format with sequential subtasks
[13]

avoid specifying mouse clicks)

Provide high-level subtasks and avoid over-specifying actions—this isn’t a low-level how-to (e.g. avoid specifying mouse clicks). This is a logical task breakdown that captures essential steps and decisions. The executor will identify the fine-grained actions to be taken whilst interacting with the GUI using your subgoals
[14]

Steps that confirm or validate other subtasks should not be included

Do not include verification steps in your planning. Steps that confirm or validate other subtasks should not be included
[15]

Your plan must be as concise as possible

Do not include optional steps in your planning. Your plan must be as concise as possible
[16]

If you are unsure if a step is necessary, do not include it in your plan

Do not include unnecessary steps in your planning. If you are unsure if a step is necessary, do not include it in your plan
[17]

Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest available airline

Your plan **must** be end-to-end meaning it should entail instructions from the initial state till the goal state. Below is an example of a plan created using subtask decomposition, given the request "Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest available airline.":
[18]

Open a flight booking website (e.g., Google Flights, Expedia, etc.)
[19]

Input departure city as New York and destination as San Francisco
[20]

Set departure date to next Friday and return date to next Sunday
[21]

Search for round-trip flights
[22]

Sort results by price
[23]

Select the cheapest available round-trip option that fits the criteria
[24]

Proceed to booking and complete the purchase Important formatting guidelines:
[25]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Plan </plan> <thought> Thought process </thought>
[26]

You are an expert planning agent for solving GUI navigation tasks

Make sure to provide the correct start and end tags Figure 5: Prompt for sequential plans. You are an expert planning agent for solving GUI navigation tasks. Using the plan that you output, another language model will execute the task. You are provided with: 1. The state of the computer screen through a desktop screenshot Your responsibilities:
[27]

Generate a checklist of all necessary requirements that need to completed in order to achieve the goal
[28]

Ensure this checklist is concise and contains only necessary requirements
[29]

Carefully observe and understand the current state of the computer before generating your checklist
[30]

Generate a thought process or explanation about the checklist generated Below are important considerations when generating your checklist:
[31]

This is not necessarily sequential, but outlines requirements that must be met

Provide the plan in a unordered checklist format. This is not necessarily sequential, but outlines requirements that must be met. The executor will identify the order of fine-grained actions to be taken whilst interacting with the GUI based on the provided checklist
[33]

Your checklist must be as concise as possible

Do not include optional requirements. Your checklist must be as concise as possible
[34]

Sign up for the weekly newsletter on the website

Do not include unnecessary requirements. If you are unsure if a requirement is necessary, do not include it in your checklist. Below is a example of a checklist given the request "Sign up for the weekly newsletter on the website.": [ ] The email input field is filled with a valid email address. [ ] The "Subscribe" or "Sign Up" button has been clicked. [ ]...
[35]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Checklist </plan> <thought> Thought process </thought>
[36]

You are an expert agent in web navigation and you are able to foresee multiple possible trajectories to achieve a navigation task

Make sure to provide the correct start and end tags Figure 6: Prompt for checklist plans. You are an expert agent in web navigation and you are able to foresee multiple possible trajectories to achieve a navigation task. Using the plan that you output, another language model will execute the task. You are provided with:
[37]

The state of the computer screen through a desktop screenshot Your responsibilities:
[38]

Generate an algorithm style plan in order to achieve the goal in a pseudocode style with abstract functions depicting high-level actions
[39]

Ensure this pseudocode plan is as concise as possible avoiding complex nested if statements
[40]

Carefully observe and understand the current state of the computer before generating your pseudocode
[41]

Generate a thought process or explanation about the pseudocode plan generated Below are important considerations when generating your pseudocode:
[42]

Make function names intuitive and descriptive

Provide the pseudocode plan in a readable format. Make function names intuitive and descriptive. The executor will then determine the fine-grained actions to take using these function names
[43]

Consider potential obstacles or layout variations the executor may encounter whilst completing the task
[44]

Steps that confirm or validate other subtasks should not be included

Do not include verification sections. Steps that confirm or validate other subtasks should not be included
[45]

Add red coat product to cart and wish list

Do not include unnecessary parts in your pseudocode. If you are unsure if a requirement is necessary, do not include it. Below is an example of the correct formatting and level of abstraction for the request "Add red coat product to cart and wish list": while(red_coat_not_in_cart and red_coat_not_in_wish_list) search_for_red_coat if item found: click add ...
[46]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Pseudocode Plan </plan> <thought> Thought process </thought>
[47]

You are an expert planning agent for solving GUI navigation tasks

Make sure to provide the correct start and end tags Figure 7: Prompt for algorithmic pseudocode plans. You are an expert planning agent for solving GUI navigation tasks. Using the plan that you output, another language model will execute the task. You are provided with: 1. The state of the computer screen through a desktop screenshot Your responsibilities:
[48]

Generate a paragraph describing what the agent should do in order to achieve the given goal
[49]

Ensure this paragraph is concise and contains only necessary information and suggestions to help the agent achieve the goal
[50]

Carefully observe and understand the current state of the computer before generating your paragraph
[51]

Generate a thought process or explanation about the paragraph generated Below are important considerations when generating your paragraph:
[52]

There is no need for stylistic formatting

Provide the paragraph in plain text. There is no need for stylistic formatting
[53]

Steps that confirm or validate other subtasks should not be included

Do not include verification steps. Steps that confirm or validate other subtasks should not be included
[54]

Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest available airline

Do not include unnecessary requirements. If you are unsure if a requirement is necessary, do not include it in your checklist. Below is an example of a paragraph describing how an executor agent should fulfill the request "Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest avail...
[55]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Paragraph </plan> <thought> Thought process </thought>
[56]

Make sure to provide the correct start and end tags Figure 8: Prompt for narrative plans

[1] [1]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

WorkArena: How capable are web agents at solving common knowledge work tasks?arXiv preprint arXiv:2403.07718. Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572. Izzeddin Gur...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

unsolved

Towards enterprise-ready computer using gen- eralist agent.arXiv preprint arXiv:2503.01861. Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. 2023. ADAPT: As-needed decompo- sition and planning with language models.arXiv preprint arXiv:2311.05772. Tianlin Shi, Andrej Karpathy, Linxi Fan, Jona...

work page arXiv 2023

[3] [3]

Foot (OSRM)

Change the mode of transportation to "Foot (OSRM)"

[4] [4]

Click the "Go" button to get walking directions

[5] [5]

After reviewing the effect of your previous actions, verify if your plan is still relevant and update it if necessary

Extract the walking time from the results. After reviewing the effect of your previous actions, verify if your plan is still relevant and update it if necessary. As shown in the plan above, the step of ‘Click- ing Go’ is ignored and is assumed completed. The agent proceeds to just "wait" for the results to ap- pear instead of trying to click go again it p...

[6] [7]

Generate a new plan to complete the task by decomposing the task into meaningful subtasks of logically ordered chunks

[7] [8]

Ensure the plan is concise and contains only necessary steps

[8] [9]

Carefully observe and understand the current state of the computer before generating your plan

[9] [10]

Avoid including steps in your plan that the task does not ask for

[10] [11]

Below are important considerations when generating your plan:

Generate a thought process or explanation about the plan generated. Below are important considerations when generating your plan:

[11] [12]

Provide the plan in a step-by-step numbered format with sequential subtasks

[12] [13]

avoid specifying mouse clicks)

Provide high-level subtasks and avoid over-specifying actions—this isn’t a low-level how-to (e.g. avoid specifying mouse clicks). This is a logical task breakdown that captures essential steps and decisions. The executor will identify the fine-grained actions to be taken whilst interacting with the GUI using your subgoals

[13] [14]

Steps that confirm or validate other subtasks should not be included

Do not include verification steps in your planning. Steps that confirm or validate other subtasks should not be included

[14] [15]

Your plan must be as concise as possible

Do not include optional steps in your planning. Your plan must be as concise as possible

[15] [16]

If you are unsure if a step is necessary, do not include it in your plan

Do not include unnecessary steps in your planning. If you are unsure if a step is necessary, do not include it in your plan

[16] [17]

Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest available airline

Your plan **must** be end-to-end meaning it should entail instructions from the initial state till the goal state. Below is an example of a plan created using subtask decomposition, given the request "Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest available airline.":

[17] [18]

Open a flight booking website (e.g., Google Flights, Expedia, etc.)

[18] [19]

Input departure city as New York and destination as San Francisco

[19] [20]

Set departure date to next Friday and return date to next Sunday

[20] [21]

Search for round-trip flights

[21] [22]

Sort results by price

[22] [23]

Select the cheapest available round-trip option that fits the criteria

[23] [24]

Proceed to booking and complete the purchase Important formatting guidelines:

[24] [25]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Plan </plan> <thought> Thought process </thought>

[25] [26]

You are an expert planning agent for solving GUI navigation tasks

Make sure to provide the correct start and end tags Figure 5: Prompt for sequential plans. You are an expert planning agent for solving GUI navigation tasks. Using the plan that you output, another language model will execute the task. You are provided with: 1. The state of the computer screen through a desktop screenshot Your responsibilities:

[26] [27]

Generate a checklist of all necessary requirements that need to completed in order to achieve the goal

[27] [28]

Ensure this checklist is concise and contains only necessary requirements

[28] [29]

Carefully observe and understand the current state of the computer before generating your checklist

[29] [30]

Generate a thought process or explanation about the checklist generated Below are important considerations when generating your checklist:

[30] [31]

This is not necessarily sequential, but outlines requirements that must be met

Provide the plan in a unordered checklist format. This is not necessarily sequential, but outlines requirements that must be met. The executor will identify the order of fine-grained actions to be taken whilst interacting with the GUI based on the provided checklist

[31] [33]

Your checklist must be as concise as possible

Do not include optional requirements. Your checklist must be as concise as possible

[32] [34]

Sign up for the weekly newsletter on the website

Do not include unnecessary requirements. If you are unsure if a requirement is necessary, do not include it in your checklist. Below is a example of a checklist given the request "Sign up for the weekly newsletter on the website.": [ ] The email input field is filled with a valid email address. [ ] The "Subscribe" or "Sign Up" button has been clicked. [ ]...

[33] [35]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Checklist </plan> <thought> Thought process </thought>

[34] [36]

You are an expert agent in web navigation and you are able to foresee multiple possible trajectories to achieve a navigation task

Make sure to provide the correct start and end tags Figure 6: Prompt for checklist plans. You are an expert agent in web navigation and you are able to foresee multiple possible trajectories to achieve a navigation task. Using the plan that you output, another language model will execute the task. You are provided with:

[35] [37]

The state of the computer screen through a desktop screenshot Your responsibilities:

[36] [38]

Generate an algorithm style plan in order to achieve the goal in a pseudocode style with abstract functions depicting high-level actions

[37] [39]

Ensure this pseudocode plan is as concise as possible avoiding complex nested if statements

[38] [40]

Carefully observe and understand the current state of the computer before generating your pseudocode

[39] [41]

Generate a thought process or explanation about the pseudocode plan generated Below are important considerations when generating your pseudocode:

[40] [42]

Make function names intuitive and descriptive

Provide the pseudocode plan in a readable format. Make function names intuitive and descriptive. The executor will then determine the fine-grained actions to take using these function names

[41] [43]

Consider potential obstacles or layout variations the executor may encounter whilst completing the task

[42] [44]

Steps that confirm or validate other subtasks should not be included

Do not include verification sections. Steps that confirm or validate other subtasks should not be included

[43] [45]

Add red coat product to cart and wish list

Do not include unnecessary parts in your pseudocode. If you are unsure if a requirement is necessary, do not include it. Below is an example of the correct formatting and level of abstraction for the request "Add red coat product to cart and wish list": while(red_coat_not_in_cart and red_coat_not_in_wish_list) search_for_red_coat if item found: click add ...

[44] [46]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Pseudocode Plan </plan> <thought> Thought process </thought>

[45] [47]

You are an expert planning agent for solving GUI navigation tasks

Make sure to provide the correct start and end tags Figure 7: Prompt for algorithmic pseudocode plans. You are an expert planning agent for solving GUI navigation tasks. Using the plan that you output, another language model will execute the task. You are provided with: 1. The state of the computer screen through a desktop screenshot Your responsibilities:

[46] [48]

Generate a paragraph describing what the agent should do in order to achieve the given goal

[47] [49]

Ensure this paragraph is concise and contains only necessary information and suggestions to help the agent achieve the goal

[48] [50]

Carefully observe and understand the current state of the computer before generating your paragraph

[49] [51]

Generate a thought process or explanation about the paragraph generated Below are important considerations when generating your paragraph:

[50] [52]

There is no need for stylistic formatting

Provide the paragraph in plain text. There is no need for stylistic formatting

[51] [53]

Steps that confirm or validate other subtasks should not be included

Do not include verification steps. Steps that confirm or validate other subtasks should not be included

[52] [54]

Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest available airline

Do not include unnecessary requirements. If you are unsure if a requirement is necessary, do not include it in your checklist. Below is an example of a paragraph describing how an executor agent should fulfill the request "Book a round-trip flight from New York to San Francisco for next weekend, departing Friday and returning Sunday, on the cheapest avail...

[53] [55]

Make sure the formatting of your answer is correct. It should follow the following HTML tags format <observation> Description and understanding of current state </observation> <plan> Generated Paragraph </plan> <thought> Thought process </thought>

[54] [56]

Make sure to provide the correct start and end tags Figure 8: Prompt for narrative plans