Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Azalia Mirhoseini; Caleb Winston; Christos Kozyrakis; Ron Yifeng Wang

arxiv: 2605.21470 · v2 · pith:45KMY5IPnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 05:11 UTC · model grok-4.3

keywords: computer-use agents · web automation · just-in-time compilation · latency optimization · tool calling · agent planning · parallel scheduling

The pith

Compiling natural language web tasks into executable code with validation and parallel scheduling delivers major speedups and accuracy gains for computer-use agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current computer-use agents rely on a slow sequential loop of taking a screenshot, calling an LLM for the next action, and executing a tool, which leads to high latency and frequent errors. Agent JIT compilation instead generates full code plans upfront from the task description, allowing parallel tool calls and embedded logic. The planner validates each plan against tool specifications under an invariant-enforcing protocol and selects the lowest-cost candidate, while the scheduler chooses a parallelization strategy by Monte Carlo simulation over learned latency distributions. A sympathetic reader cares because this could make automated web interactions practical for time-sensitive or complex tasks such as online shopping or data entry.

Core claim

The authors present agent just-in-time compilation as a method to compile task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. JIT-Planner creates multiple code plans, validates each against tool specifications, and chooses the minimum-cost candidate. JIT-Scheduler explores parallelization strategies using Monte Carlo cost estimation from learned latency distributions. An invariant-enforcing tool protocol specifies precondition and postcondition state requirements to reduce incorrect tool use. Across five web applications, JIT-Planner achieves 10.4× speedup and +28% accuracy over Browser-Use, and JIT-Scheduler achieves 2.4× speedup and +9% accuracy over OpenAI CUA.
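The generate, validate, and select loop described above can be pictured with a minimal sketch. It is not the authors' implementation: the `Tool` class, the page-state names, the costs, and every tool name except `list_items` are hypothetical, and validity is modeled with the set condition post_i ⊆ pre_{i+1} that appears in the excerpt next to Figure 3. The three candidates loosely mirror the Figure 4 example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool spec: accepted states, resulting states, assumed cost."""
    name: str
    pre: FrozenSet[str]   # page states in which the tool may be called
    post: FrozenSet[str]  # page states the tool may leave the browser in
    cost: float           # assumed expected latency in seconds


def is_valid(plan: Sequence[Tool], start: FrozenSet[str]) -> bool:
    """Accept a plan only if each step's precondition covers the prior states."""
    reachable = start
    for tool in plan:
        if not reachable <= tool.pre:  # subset test: post_i ⊆ pre_{i+1}
            return False
        reachable = tool.post
    return True


def select_plan(candidates: Sequence[List[Tool]],
                start: FrozenSet[str]) -> Optional[List[Tool]]:
    """Return the cheapest valid candidate, or None if all are rejected."""
    valid = [p for p in candidates if is_valid(p, start)]
    return min(valid, key=lambda p: sum(t.cost for t in p)) if valid else None


# Hypothetical states and tools (only list_items is named in the source).
HOME, MENU, CART = frozenset({"home"}), frozenset({"menu"}), frozenset({"cart"})
open_menu = Tool("open_menu", HOME, MENU, 0.5)
list_items = Tool("list_items", MENU, MENU, 0.3)
lm_pick = Tool("lm_pick_cheapest", MENU, MENU, 2.0)       # an LM call
code_pick = Tool("code_pick_cheapest", MENU, MENU, 0.01)  # plain code
order = Tool("order_item", MENU, CART, 0.4)

plan1 = [open_menu, list_items, lm_pick, order]    # valid, pays for an LM call
plan2 = [list_items, code_pick, order]             # list_items called from "home"
plan3 = [open_menu, list_items, code_pick, order]  # valid and cheapest

best = select_plan([plan1, plan2, plan3], HOME)
print([t.name for t in best])
```

Modeling preconditions and postconditions as sets of page states keeps the check a plain subset test; a real protocol would presumably need richer predicates than this.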

What carries the argument

Agent just-in-time compilation, which generates and optimizes executable code plans from natural language tasks using validation and latency-based scheduling.

If this is right

Plans can execute multiple tools in parallel when dependencies allow, reducing total time.
Validation against tool specs and invariants lowers the incidence of incorrect actions.
Monte Carlo estimation enables choosing efficient schedules without exhaustive search.
Embedding LLM calls inside the generated code allows dynamic decision making within the plan.
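To make the first consequence concrete, the sketch below compares serial execution time with the finish time of a dependency-aware schedule. It assumes an acyclic dependency graph, unlimited workers, and fixed latencies; the tool names and numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def serial_time(latency: Dict[str, float]) -> float:
    """Total time when tool calls run one after another."""
    return sum(latency.values())


def parallel_makespan(latency: Dict[str, float],
                      deps: Dict[str, List[str]]) -> float:
    """Finish time when each call starts as soon as its dependencies finish.

    Assumes the dependency graph is acyclic and workers are unlimited.
    """
    finish: Dict[str, float] = {}

    def finish_time(tool: str) -> float:
        if tool not in finish:
            start = max((finish_time(d) for d in deps.get(tool, [])),
                        default=0.0)
            finish[tool] = start + latency[tool]
        return finish[tool]

    return max(finish_time(tool) for tool in latency)


# Hypothetical plan: two independent listings feed a comparison, then an order.
latency = {"list_store_a": 1.0, "list_store_b": 1.5, "compare": 0.25, "order": 0.5}
deps = {"compare": ["list_store_a", "list_store_b"], "order": ["compare"]}

print(serial_time(latency))              # 3.25
print(parallel_makespan(latency, deps))  # 2.25
```

The saving comes entirely from the two listing calls overlapping; the dependent `compare` and `order` steps still run in sequence.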

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This compilation style might apply to agents in other domains such as mobile or API-based tasks.
Future improvements in code generation models would directly enhance plan quality and speed.
Combining this with better error recovery in the protocol could further boost reliability.

Load-bearing premise

The learned latency distributions accurately reflect real execution times and the invariant-enforcing tool protocol sufficiently prevents incorrect tool use in the generated plans.

What would settle it

Measuring the actual latency and accuracy on a held-out set of web applications or tasks where the learned distributions may not match real times.

Figures

Figures reproduced from arXiv: 2605.21470 by Azalia Mirhoseini, Caleb Winston, Christos Kozyrakis, Ron Yifeng Wang.

**Figure 1.** Competing Approaches to Computer-Use Agents. Automation of web-based tasks has relied on static scripts (RPA; Barman et al., 2016) and static tool sets (CUA; Wang et al., 2025). Our work introduces dynamic cost-optimizing planning and scheduling with cached, reusable tools. … or accessibility tree. However, benchmarks such as WebArena (Zhou et al., 2023), WebVoyager (He et al., 2024), and REAL (Garg et al…

**Figure 2.** Agent JIT Architecture. Optimizing scheduling and planning for computer-use agents with caching of code and latency distributions. Non-determinism post-plan. In the standard agent loop, once a plan is generated, execution may be unnecessarily non-deterministic post-planning. For example, a plan may specify that the output of a tool to list items is piped into a tool to order the least expensive item. A non…

**Figure 3.** Example with scheduler. Three strategies are evaluated for each given task: serial execution, task-parallelism, and request hedging. Monte Carlo sampling from learned latency distributions for different elements in the web browser environment yields request hedging as optimal given the high latency variance of interaction with the order button. …able if $\mathrm{post}_i \subseteq \mathrm{pre}_{i+1}$. The optional runtime predicates (pre…

**Figure 4.** Example with planner. Three plans generated in parallel: Plan1 has an unnecessary LM call, Plan2 violates the precondition of list_items, Plan3 is cost-optimal as it replaces an LM call with code.

Algorithm 2: Cost-Aware Scheduling
Input: Task $\tau$, Cache $E$, Dists $D$, Model $M$, Workers $n$
Output: Strategy $\sigma^* \in \{\mathrm{SER}, \mathrm{PAR}, \mathrm{HED}\}$
Constants: Trials $N_{MC}$, Overhead $\delta_p$ (Par), $\delta_h$ (Hedge)
for each strategy $\sigma$…
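Since the body of Algorithm 2 is cut off here, the following is only a sketch of what a Monte Carlo strategy choice of this shape could look like, not a reconstruction of the authors' algorithm. The names `simulate`, `choose_strategy`, `constant`, and `heavy_tail` are hypothetical, the worker limit n is ignored, PAR assumes all steps are independent, and HED is assumed to mean issuing each step twice and keeping the faster copy.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Tuple

Sampler = Callable[[random.Random], float]  # draws one latency in seconds


def simulate(strategy: str, steps: List[Sampler], rng: random.Random,
             delta_p: float, delta_h: float) -> float:
    """Sample the end-to-end latency of one trial under a strategy."""
    if strategy == "SER":  # one step after another
        return sum(step(rng) for step in steps)
    if strategy == "PAR":  # all steps at once (assumes independent steps)
        return max(step(rng) for step in steps) + delta_p
    if strategy == "HED":  # each step issued twice, the faster copy wins
        return sum(min(step(rng), step(rng)) for step in steps) + delta_h
    raise ValueError(f"unknown strategy: {strategy}")


def choose_strategy(steps: List[Sampler], n_mc: int = 1000,
                    delta_p: float = 0.1, delta_h: float = 0.1,
                    seed: int = 0) -> Tuple[str, Dict[str, float]]:
    """Estimate the mean cost of each strategy and return the cheapest."""
    rng = random.Random(seed)
    costs = {
        s: mean([simulate(s, steps, rng, delta_p, delta_h)
                 for _ in range(n_mc)])
        for s in ("SER", "PAR", "HED")
    }
    return min(costs, key=lambda s: costs[s]), costs


def constant(value: float) -> Sampler:
    """Degenerate distribution, handy for checking the logic by hand."""
    return lambda rng: value


def heavy_tail(rng: random.Random) -> float:
    """Hypothetical high-variance element: usually fast, sometimes very slow."""
    return 0.2 if rng.random() < 0.8 else 5.0


best_fixed, costs_fixed = choose_strategy([constant(1.0), constant(1.0)])
best_tail, costs_tail = choose_strategy([heavy_tail])
print(best_fixed, best_tail)
```

With two fixed one-second steps the parallel strategy wins because overlap halves the time, while for a single high-variance step hedging wins because the minimum of two draws rarely hits the slow tail. This matches the intuition given in the Figure 3 caption, though the numbers here are invented.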

**Figure 6.** End-to-end latency breakdown. Comparison of latency across baselines and ablations with breakdown across Planning (LM sampling for plan generation and checking), Inference (LM calls during execution), and Tool Execution (web browser actuation including CDP and DOM operations). … Browser-Use (T-Short: 1–5 steps, T-Medium: 6–8 steps, T-Long: 9+ steps). Thresholds are determined by terciles (33rd and 67th perce…

**Figure 7.** Failure type shift with protocol enforcement. Tool ordering dominates failures in generating tool-use plans. Protocol reduces total failure rate from 80% to 43% while shifting failure distribution. [Plot residue: two panels, Pass@k versus k and Pass@t versus t (seconds); series for GPT-4.1, GPT-5, and Gemini-2.5-Flash, with and without protocol.]

**Figure 8.** Planning efficiency via Pass@k and Pass@t metrics. Protocol enforcement enables higher success rates with fewer candidates (k) and lower time budgets (t) across frontier LMs. [Plot residue: Pass@k versus t (seconds); series for parallel k=8 and serial retry, each with and without protocol.]
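This page does not define Pass@k or Pass@t, so the sketch below rests on two assumptions: that Pass@k follows the unbiased estimator commonly used in code-generation evaluation (the probability that at least one of k candidates drawn from n samples, c of them correct, passes), and that Pass@t is the fraction of tasks for which a valid plan was found within t seconds. The function names are hypothetical.

```python
from math import comb
from typing import Optional, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that k candidates drawn from n (c correct) include a pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:  # too few failures to fill k draws, so a pass is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_t(times: Sequence[Optional[float]], t: float) -> float:
    """Fraction of tasks whose first valid plan arrived within t seconds.

    A value of None marks a task for which no valid plan was found.
    """
    hits = sum(1 for x in times if x is not None and x <= t)
    return hits / len(times)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_t([1.0, 4.0, None, 12.0], t=5.0))
```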

**Figure 9.** Parallel hedging versus serial retry for plan generation. Protocol enforcement combined with parallelism (8 workers) outperforms serial retry (max 3 iterations) across latency budgets for Gemini-2.5-Flash. … filtering, pricing extraction, and order placement. (2) Gomail is a Gmail-inspired email interface evaluating inbox navigation, message filtering, and email composition. (3) Omnizon is an Amazon-inspir…
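A back-of-the-envelope sketch shows why eight parallel candidates can beat three serial retries. It assumes every plan attempt succeeds independently with the same probability p, which is a simplification the paper does not state, and p = 0.5 is an illustrative value rather than a reported one. Only the 8 workers and the cap of 3 iterations come from the caption.

```python
def success_probability(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts


def expected_serial_attempts(p: float, max_iters: int) -> float:
    """Expected number of sequential tries, stopping at the first success."""
    # Attempt i + 1 is only made if the first i attempts all failed.
    return sum((1.0 - p) ** i for i in range(max_iters))


p = 0.5  # illustrative per-attempt success rate, not a figure from the paper
parallel_success = success_probability(p, 8)  # 8 workers, one round
serial_success = success_probability(p, 3)    # up to 3 sequential retries
serial_rounds = expected_serial_attempts(p, 3)

print(parallel_success, serial_success, serial_rounds)
# 0.99609375 0.875 1.75
```

Under these assumptions the parallel variant reaches a higher success probability in a single generation round, while serial retry spends 1.75 rounds on average and still succeeds less often.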

**Figure 11.** Figure 11 illustrates the offline process that populates the planner and scheduler caches from execution traces. The pipeline consists of three stages:
1. Extract. Each trace step is processed by an LLM to extract a page schema and identify the high-level action performed. The page schema defines the actionable elements on each page (Section I, Prompt 3).
2. Label. Each trace step is mapped to a page schema element…

Original abstract

Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, a system that compiles task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition requirements to reduce the rate of incorrect tool use. Across five applications, JIT-Planner achieves 10.4× speedup and 28% higher accuracy over Browser-Use, while JIT-Scheduler achieves 2.4× speedup and 9% higher accuracy over OpenAI CUA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces agent just-in-time (JIT) compilation for computer-use agents (CUAs) that automate natural-language web tasks via tool calls such as click and type. It replaces the standard sequential fetch-screenshot-execute loop with three components: (1) JIT-Planner, which generates multiple executable code plans, validates them against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies using Monte Carlo cost estimation over learned latency distributions; and (3) an invariant-enforcing tool protocol that specifies precondition and postcondition state requirements. Across five web applications the paper reports that JIT-Planner delivers 10.4× speedup and +28% accuracy relative to Browser-Use while JIT-Scheduler delivers 2.4× speedup and +9% accuracy relative to OpenAI CUA.

Significance. If the reported speedups and accuracy gains prove robust, the work could meaningfully advance practical deployment of low-latency web agents by combining code-level planning, learned-cost scheduling, and state-invariant validation. The empirical magnitude of the gains is substantial, and the framing as compilation rather than repeated LLM invocation is conceptually clean; however, the significance is tempered by the dependence on unvalidated latency models and the absence of detailed experimental controls.

major comments (2)

[JIT-Scheduler] JIT-Scheduler description: the 2.4× speedup claim rests on Monte Carlo sampling over learned latency distributions to choose parallelization strategies. Web execution exhibits high stochasticity from network jitter, variable page loads, and server responses; if the distributions are fitted on limited prior traces without explicit modeling of context-dependent variance, the cost estimates will be biased and the selected schedules will not deliver the claimed gains in practice.
[Experimental evaluation] Experimental evaluation section: the abstract states clear quantitative gains across five applications, yet the manuscript supplies no visible details on experimental controls, statistical tests, exact baseline implementations, task-selection criteria, or hardware/environmental variation. These omissions are load-bearing for the central empirical claims of 10.4× and 2.4× speedups.

minor comments (2)

[Abstract] Abstract: the number and identity of the five web applications and the number of tasks per application should be stated explicitly to allow readers to gauge the scope of the evaluation.
[Introduction] Terminology: ensure consistent expansion of the acronym CUA on first use and uniform reference to the three proposed components throughout the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, providing clarifications from the manuscript and indicating planned revisions to improve clarity and completeness.

Referee: [JIT-Scheduler] JIT-Scheduler description: the 2.4× speedup claim rests on Monte Carlo sampling over learned latency distributions to choose parallelization strategies. Web execution exhibits high stochasticity from network jitter, variable page loads, and server responses; if the distributions are fitted on limited prior traces without explicit modeling of context-dependent variance, the cost estimates will be biased and the selected schedules will not deliver the claimed gains in practice.

Authors: The JIT-Scheduler component learns latency distributions from a broad set of prior execution traces collected across repeated runs on the target web applications. These traces are intended to capture variability arising from network conditions, page loads, and server responses. Monte Carlo sampling then draws from the empirical distributions to evaluate parallelization strategies under uncertainty, which underpins the reported 2.4× speedup. We acknowledge that further explicit modeling of context-dependent variance could strengthen the approach and will expand the description of trace collection and variance handling in the revised manuscript.
Revision: partial

Referee: [Experimental evaluation] Experimental evaluation section: the abstract states clear quantitative gains across five applications, yet the manuscript supplies no visible details on experimental controls, statistical tests, exact baseline implementations, task-selection criteria, or hardware/environmental variation. These omissions are load-bearing for the central empirical claims of 10.4× and 2.4× speedups.

Authors: We agree that greater transparency is required to support the quantitative claims. The manuscript describes the five applications, the baselines (Browser-Use and OpenAI CUA), and the evaluation protocol, but we will revise the experimental evaluation section to add explicit details on controls, statistical tests for significance, precise baseline configurations, task selection criteria, and hardware/environmental setup. These additions will improve reproducibility without altering the reported results.
Revision: yes

Circularity Check

0 steps flagged

Empirical runtime measurements with no derivation reducing to fitted inputs or self-citations by construction

The paper presents JIT-Planner and JIT-Scheduler as engineering components whose performance is evaluated via direct runtime experiments across 5 web applications, reporting measured speedups (10.4× and 2.4×) and accuracy gains (+28% and +9%). No equations, Monte Carlo cost estimates, or latency distributions are shown to be fitted to the same test data and then re-presented as predictions; the reported quantities are external benchmarks against Browser-Use and OpenAI CUA. The invariant-enforcing protocol and validation steps are described as design choices that reduce error rates, not as quantities derived tautologically from the evaluation metrics. No self-citation chains or uniqueness theorems are invoked as load-bearing premises in the provided text. The derivation chain is therefore self-contained against external runtime measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on LLM code-generation reliability and the fidelity of learned latency models; these are treated as domain capabilities rather than derived results.

free parameters (1)

learned latency distributions
Used by JIT-Scheduler for Monte Carlo cost estimation of parallelization strategies
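One simple way to picture a learned latency distribution is as an empirical distribution per page element, built from logged traces and sampled by drawing a past observation at random. The sketch below assumes that form; the trace records, element names, and function names are hypothetical, and the paper may fit its distributions differently.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def fit_empirical(traces: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group observed latencies (seconds) by the element that was acted on."""
    dists: Dict[str, List[float]] = defaultdict(list)
    for element, latency in traces:
        dists[element].append(latency)
    return dict(dists)


def sample_latency(dists: Dict[str, List[float]], element: str,
                   rng: random.Random) -> float:
    """Draw one latency for an element by resampling its past observations."""
    return rng.choice(dists[element])


# Hypothetical trace records: (element, observed latency in seconds).
traces = [
    ("search_box", 0.2), ("search_box", 0.3),
    ("order_button", 0.5), ("order_button", 4.0), ("order_button", 0.6),
]
dists = fit_empirical(traces)
rng = random.Random(0)
draw = sample_latency(dists, "order_button", rng)

print({element: mean(values) for element, values in dists.items()})
```

Keeping one distribution per element, rather than pooling all latencies, is what lets a sampler see that one element is much more variable than another; how finely the paper conditions on context is not specified here.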

axioms (1)

domain assumption: LLMs can generate multiple valid code plans from natural language task descriptions that can be validated against tool specifications
Invoked by JIT-Planner to produce and filter candidate plans

pith-pipeline@v0.9.0 · 5758 in / 1419 out tokens · 48173 ms · 2026-05-21T05:11:20.619875+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: invariant-enforcing tool protocol specifying precondition and postcondition state requirements

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.