Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling
Pith reviewed 2026-05-21 05:11 UTC · model grok-4.3
The pith
Compiling natural language web tasks into executable code with validation and parallel scheduling delivers major speedups and accuracy gains for computer-use agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present agent just-in-time compilation as a method to compile task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. JIT-Planner creates multiple code plans, validates each against tool specifications, and chooses the minimum-cost candidate. JIT-Scheduler explores parallelization strategies using Monte Carlo cost estimation from learned latency distributions. An invariant-enforcing tool protocol specifies precondition and postcondition state requirements to reduce incorrect tool use. This approach achieves 10.4× speedup and +28% accuracy over Browser-Use, and 2.4× speedup and +9% accuracy over OpenAI CUA across 5 web applications.
What carries the argument
Agent just-in-time compilation, which generates and optimizes executable code plans from natural language tasks using validation and latency-based scheduling.
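The generate, validate, select loop attributed to JIT-Planner can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `TOOL_SPECS` table, the `EST_LATENCY` numbers, the `Step` type, and the cost function (a plain sum of per-step latency estimates) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical tool specifications: tool name -> required argument names.
TOOL_SPECS = {
    "click": {"selector"},
    "type": {"selector", "text"},
    "scroll": {"pixels"},
    "llm": {"prompt"},
}

# Hypothetical per-tool latency estimates in seconds (illustrative values only).
EST_LATENCY = {"click": 0.4, "type": 0.6, "scroll": 0.2, "llm": 2.0}


@dataclass
class Step:
    tool: str
    args: dict


def validate(plan):
    """Accept a plan only if every step uses a known tool with exactly its required args."""
    for step in plan:
        required = TOOL_SPECS.get(step.tool)
        if required is None or set(step.args) != required:
            return False
    return True


def estimated_cost(plan):
    """Assumed cost model: sum of per-step latency estimates."""
    return sum(EST_LATENCY[step.tool] for step in plan)


def select_plan(candidates):
    """Drop invalid candidates, then return the cheapest valid one (or None)."""
    valid = [plan for plan in candidates if validate(plan)]
    return min(valid, key=estimated_cost) if valid else None


# Three made-up candidate plans for the same task.
plan_a = [
    Step("click", {"selector": "#search"}),
    Step("type", {"selector": "#search", "text": "taco"}),
    Step("llm", {"prompt": "pick the cheapest item"}),
    Step("click", {"selector": "#order"}),
]
plan_b = [
    Step("type", {"selector": "#search", "text": "taco"}),
    Step("click", {"selector": "#sort-price"}),
    Step("click", {"selector": "#first-item"}),
]
plan_c = [Step("hover", {"selector": "#menu"})]  # unknown tool, so it fails validation

best = select_plan([plan_a, plan_b, plan_c])
```

In this toy setup `plan_c` is rejected before costing, so a cheap but invalid plan can never win the selection.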
If this is right
- Plans can execute multiple tools in parallel when dependencies allow, reducing total time.
- Validation against tool specs and invariants lowers the incidence of incorrect actions.
- Monte Carlo estimation enables choosing efficient schedules without exhaustive search.
- Embedding LLM calls inside the generated code allows dynamic decision making within the plan.
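The first consequence, that running independent tools in parallel reduces total time, can be made concrete with a small calculation. The sketch below assumes an acyclic dependency graph, unlimited parallelism, and fixed step durations; the step names and durations are invented for illustration.

```python
def sequential_time(durations):
    """Total time when steps run one after another."""
    return sum(durations.values())


def parallel_makespan(durations, deps):
    """Finish time when each step starts as soon as all of its dependencies finish.

    Assumes `deps` describes an acyclic graph and that any number of steps
    may run at once.
    """
    finish = {}

    def finish_time(step):
        if step not in finish:
            start = max((finish_time(d) for d in deps.get(step, ())), default=0.0)
            finish[step] = start + durations[step]
        return finish[step]

    return max(finish_time(step) for step in durations)


# Hypothetical steps (seconds). The two fetches depend only on open_menu,
# so they can overlap.
durations = {
    "open_menu": 1.0,
    "fetch_prices": 2.0,
    "fetch_reviews": 1.5,
    "choose_item": 0.5,
}
deps = {
    "fetch_prices": ["open_menu"],
    "fetch_reviews": ["open_menu"],
    "choose_item": ["fetch_prices", "fetch_reviews"],
}

seq = sequential_time(durations)          # 5.0
par = parallel_makespan(durations, deps)  # 3.5
```

The saving equals the duration of the shorter of the two overlapping fetches, which is why the benefit depends on how many steps are truly independent.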
Where Pith is reading between the lines
- This compilation style might apply to agents in other domains such as mobile or API-based tasks.
- Future improvements in code generation models would directly enhance plan quality and speed.
- Combining this with better error recovery in the protocol could further boost reliability.
Load-bearing premise
The learned latency distributions accurately reflect real execution times and the invariant-enforcing tool protocol sufficiently prevents incorrect tool use in the generated plans.
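To make the second half of this premise concrete, here is a minimal sketch of what a precondition/postcondition check around a tool call could look like. The `Tool` wrapper, the `InvariantViolation` exception, and the dictionary-based page state are all hypothetical; the paper's actual protocol may differ.

```python
from dataclasses import dataclass
from typing import Callable


class InvariantViolation(Exception):
    """Raised when a tool's precondition or postcondition does not hold."""


@dataclass
class Tool:
    name: str
    pre: Callable[[dict], bool]   # must hold on the state before the call
    run: Callable[[dict], dict]   # returns the new state
    post: Callable[[dict], bool]  # must hold on the state after the call


def call(tool, state):
    """Run a tool only if its precondition holds, then check its postcondition."""
    if not tool.pre(state):
        raise InvariantViolation(f"{tool.name}: precondition failed")
    new_state = tool.run(dict(state))  # copy so the caller's state is not mutated
    if not tool.post(new_state):
        raise InvariantViolation(f"{tool.name}: postcondition failed")
    return new_state


# Hypothetical tool: adding to the cart is only allowed on an item page.
add_to_cart = Tool(
    name="add_to_cart",
    pre=lambda s: s.get("page") == "item",
    run=lambda s: {**s, "cart_count": s.get("cart_count", 0) + 1},
    post=lambda s: s["cart_count"] >= 1,
)

state = call(add_to_cart, {"page": "item"})
```

Calling `add_to_cart` from a state such as `{"page": "home"}` raises `InvariantViolation` instead of silently performing an action on the wrong page. Whether checks of this kind are sufficient in practice is exactly what the premise leaves open.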
What would settle it
Measuring the actual latency and accuracy on a held-out set of web applications or tasks where the learned distributions may not match real times.
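One simple way to run such a check, offered as a sketch rather than as the paper's method, is to ask how often held-out latencies fall inside the central interval of the learned samples. The function names and the latency values below are made up for illustration.

```python
import statistics


def central_interval(samples, lower_pct=5, upper_pct=95):
    """Return the (lower_pct, upper_pct) percentile bounds of the samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def coverage(train, held_out):
    """Fraction of held-out latencies inside the central interval of the training samples."""
    lo, hi = central_interval(train)
    inside = sum(lo <= x <= hi for x in held_out)
    return inside / len(held_out)


# Illustrative latencies in seconds (not measurements from the paper).
train_latencies = [float(i) for i in range(101)]
held_out_latencies = [10.0, 50.0, 99.0, 200.0]

frac = coverage(train_latencies, held_out_latencies)
```

With a 5th to 95th percentile interval, coverage far below 0.9 on held-out applications would suggest that the learned distributions do not transfer.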
Original abstract
Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, a system that compiles task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition requirements to reduce the rate of incorrect tool use. Across five applications, JIT-Planner achieves 10.4× speedup and 28% higher accuracy over Browser-Use, while JIT-Scheduler achieves 2.4× speedup and 9% higher accuracy over OpenAI CUA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces agent just-in-time (JIT) compilation for computer-use agents (CUAs) that automate natural-language web tasks via tool calls such as click and type. It replaces the standard sequential fetch-screenshot-execute loop with three components: (1) JIT-Planner, which generates multiple executable code plans, validates them against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies using Monte Carlo cost estimation over learned latency distributions; and (3) an invariant-enforcing tool protocol that specifies precondition and postcondition state requirements. Across five web applications the paper reports that JIT-Planner delivers 10.4× speedup and +28% accuracy relative to Browser-Use while JIT-Scheduler delivers 2.4× speedup and +9% accuracy relative to OpenAI CUA.
Significance. If the reported speedups and accuracy gains prove robust, the work could meaningfully advance practical deployment of low-latency web agents by combining code-level planning, learned-cost scheduling, and state-invariant validation. The empirical magnitude of the gains is substantial and the framing as a compilation rather than repeated LLM invocation is conceptually clean; however, the significance is tempered by the dependence on unvalidated latency models and the absence of detailed experimental controls.
major comments (2)
- [JIT-Scheduler] JIT-Scheduler description: the 2.4× speedup claim rests on Monte Carlo sampling over learned latency distributions to choose parallelization strategies. Web execution exhibits high stochasticity from network jitter, variable page loads, and server responses; if the distributions are fitted on limited prior traces without explicit modeling of context-dependent variance, the cost estimates will be biased and the selected schedules will not deliver the claimed gains in practice.
- [Experimental evaluation] Experimental evaluation section: the abstract states clear quantitative gains across five applications, yet the manuscript supplies no visible details on experimental controls, statistical tests, exact baseline implementations, task-selection criteria, or hardware/environmental variation. These omissions are load-bearing for the central empirical claims of 10.4× and 2.4× speedups.
minor comments (2)
- [Abstract] Abstract: the number and identity of the five web applications and the number of tasks per application should be stated explicitly to allow readers to gauge the scope of the evaluation.
- [Introduction] Terminology: ensure consistent expansion of the acronym CUA on first use and uniform reference to the three proposed components throughout the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below, providing clarifications from the manuscript and indicating planned revisions to improve clarity and completeness.
Point-by-point responses
- Referee: [JIT-Scheduler] JIT-Scheduler description: the 2.4× speedup claim rests on Monte Carlo sampling over learned latency distributions to choose parallelization strategies. Web execution exhibits high stochasticity from network jitter, variable page loads, and server responses; if the distributions are fitted on limited prior traces without explicit modeling of context-dependent variance, the cost estimates will be biased and the selected schedules will not deliver the claimed gains in practice.
  Authors: The JIT-Scheduler component learns latency distributions from a broad set of prior execution traces collected across repeated runs on the target web applications. These traces are intended to capture variability arising from network conditions, page loads, and server responses. Monte Carlo sampling then draws from the empirical distributions to evaluate parallelization strategies under uncertainty, which underpins the reported 2.4× speedup. We acknowledge that further explicit modeling of context-dependent variance could strengthen the approach and will expand the description of trace collection and variance handling in the revised manuscript.
  Revision: partial
- Referee: [Experimental evaluation] Experimental evaluation section: the abstract states clear quantitative gains across five applications, yet the manuscript supplies no visible details on experimental controls, statistical tests, exact baseline implementations, task-selection criteria, or hardware/environmental variation. These omissions are load-bearing for the central empirical claims of 10.4× and 2.4× speedups.
  Authors: We agree that greater transparency is required to support the quantitative claims. The manuscript describes the five applications, the baselines (Browser-Use and OpenAI CUA), and the evaluation protocol, but we will revise the experimental evaluation section to add explicit details on controls, statistical tests for significance, precise baseline configurations, task selection criteria, and hardware/environmental setup. These additions will improve reproducibility without altering the reported results.
  Revision: yes
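The Monte Carlo step described in the authors' first response, drawing from empirical latency traces to compare parallelization strategies, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the trace values, the step names, and the two candidate schedules are invented, and each step's latency is sampled independently, which is precisely the simplification the referee questions when asking about context-dependent variance.

```python
import random


def sample_makespan(traces, deps, rng):
    """One Monte Carlo draw: sample a latency per step, then compute the finish time."""
    durations = {step: rng.choice(samples) for step, samples in traces.items()}
    finish = {}

    def finish_time(step):
        if step not in finish:
            start = max((finish_time(d) for d in deps.get(step, ())), default=0.0)
            finish[step] = start + durations[step]
        return finish[step]

    return max(finish_time(step) for step in traces)


def expected_makespan(traces, deps, n_samples=2000, seed=0):
    """Average makespan over n_samples independent draws from the empirical traces."""
    rng = random.Random(seed)
    total = sum(sample_makespan(traces, deps, rng) for _ in range(n_samples))
    return total / n_samples


def pick_schedule(traces, schedules, **kwargs):
    """Estimate the cost of every candidate schedule and return the cheapest one."""
    costs = {
        name: expected_makespan(traces, deps, **kwargs)
        for name, deps in schedules.items()
    }
    return min(costs, key=costs.get), costs


# Hypothetical observed latencies per step, in seconds.
traces = {
    "a": [0.8, 1.0, 1.5, 3.0],
    "b": [0.5, 0.6, 0.9],
    "c": [0.2, 0.3],
}
# Two hypothetical schedules, each expressed as a dependency map.
schedules = {
    "sequential": {"b": ["a"], "c": ["b"]},  # a, then b, then c
    "parallel": {"c": ["a", "b"]},           # a and b overlap, then c
}

best, costs = pick_schedule(traces, schedules)
```

Because both schedules are evaluated with the same seed, they see the same sampled latencies, so the comparison is not distorted by sampling noise between candidates.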
Circularity Check
Empirical runtime measurements with no derivation reducing to fitted inputs or self-citations by construction
Full rationale
The paper presents JIT-Planner and JIT-Scheduler as engineering components whose performance is evaluated via direct runtime experiments across 5 web applications, reporting measured speedups (10.4× and 2.4×) and accuracy gains (+28% and +9%). No equations, Monte Carlo cost estimates, or latency distributions are shown to be fitted to the same test data and then re-presented as predictions; the reported quantities are external benchmarks against Browser-Use and OpenAI CUA. The invariant-enforcing protocol and validation steps are described as design choices that reduce error rates, not as quantities derived tautologically from the evaluation metrics. No self-citation chains or uniqueness theorems are invoked as load-bearing premises in the provided text. The derivation chain is therefore self-contained against external runtime measurements.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned latency distributions
axioms (1)
- domain assumption: LLMs can generate multiple valid code plans from natural language task descriptions that can be validated against tool specifications
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "invariant-enforcing tool protocol specifying precondition and postcondition state requirements"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.