
arxiv: 2606.20785 · v1 · pith:AEI6ZON7new · submitted 2026-06-18 · 💻 cs.AI · cs.LG

Fara-1.5: Scalable Learning Environments for Computer Use Agents

Pith reviewed 2026-06-26 17:23 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords computer use agents · data generation pipeline · synthetic environments · browser benchmarks · supervised finetuning · trajectory verification · web navigation agents · scalable agent training

The pith

A scalable pipeline of live and synthetic environments, a solver harness, and three complementary verifiers generates training data that yields Fara1.5 computer-use agents setting new state-of-the-art results for their size classes on browser benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FaraGen1.5, a modular pipeline that generates large volumes of agent trajectories without expensive human demonstrations. It combines live websites with synthetic environments that simulate domains gated by authentication or requiring irreversible actions, runs a solver harness that can be powered by strong frontier models and includes a user simulator for multi-turn tasks, and scores the resulting trajectories with three verifiers covering correctness, efficiency, and critical-point adherence. The resulting data trains Fara1.5 models at 4B, 9B, and 27B scales on Qwen3.5 base models through an iterative supervised fine-tuning recipe that balances broad coverage, specific high-value tasks, and the target model's weaknesses. Fara1.5-9B reaches 63.4% on Online-Mind2Web and 86.6% on WebVoyager, and Fara1.5-27B reaches 72.3% on Online-Mind2Web, exceeding prior results for their size classes. A reader would care because the method shows one concrete route to training capable computer-use agents without relying on slow, expensive human demonstrations.

Core claim

FaraGen1.5 is a data pipeline composed of environments, solvers, and verifiers that produces high-quality demonstration trajectories from both live and synthetic sites; when these trajectories are used in a balanced iterative supervised finetuning recipe, they yield Fara1.5 agents at three scales that set new state-of-the-art scores for their parameter classes on standard browser-use benchmarks.

What carries the argument

FaraGen1.5 pipeline, consisting of live and synthetic environments, a solver harness that supports multiple models and a user simulator, and three verifiers that evaluate task correctness, efficiency, and critical-point adherence.
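The verifier stage described above can be pictured as a filter that keeps a trajectory only when all three checks pass. The following is a minimal sketch, not the paper's implementation: the `Trajectory` fields, the rule-based scoring functions, the step names, and the thresholds are all hypothetical stand-ins for verifiers whose actual design is not given on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    task: str
    steps: List[str]      # hypothetical: one action name per agent step
    final_answer: str


# A verifier maps a trajectory to a score in [0, 1].
Verifier = Callable[[Trajectory], float]


def correctness_score(traj: Trajectory) -> float:
    # Placeholder rule: a non-empty answer counts as correct.
    return 1.0 if traj.final_answer.strip() else 0.0


def efficiency_score(traj: Trajectory, max_steps: int = 20) -> float:
    # Placeholder rule: shorter trajectories score higher.
    return max(0.0, 1.0 - len(traj.steps) / max_steps)


def critical_point_score(traj: Trajectory) -> float:
    # Placeholder rule: "submit" is only allowed after "ask_user_approval".
    approved = False
    for step in traj.steps:
        if step == "ask_user_approval":
            approved = True
        elif step == "submit" and not approved:
            return 0.0
    return 1.0


VERIFIERS: Dict[str, Verifier] = {
    "correctness": correctness_score,
    "efficiency": efficiency_score,
    "critical_point": critical_point_score,
}

# Hypothetical acceptance thresholds, one per verifier.
THRESHOLDS: Dict[str, float] = {
    "correctness": 1.0,
    "efficiency": 0.5,
    "critical_point": 1.0,
}


def accept(traj: Trajectory) -> bool:
    """Keep a trajectory only if every verifier meets its threshold."""
    return all(VERIFIERS[name](traj) >= THRESHOLDS[name] for name in VERIFIERS)
```

Requiring all three checks to pass is one possible aggregation rule; a weighted score or a per-verifier tag on each trajectory would be equally plausible readings of the description.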

If this is right

  • Fara1.5-9B reaches 63.4% on Online-Mind2Web and 86.6% on WebVoyager.
  • Fara1.5-27B reaches 72.3% on Online-Mind2Web, competitive with much larger proprietary systems.
  • The supervised finetuning recipe balances broad coverage, specific high-value tasks, and iterative correction of target model deficiencies.
  • Each Fara1.5 model sets a new state of the art for its size class on browser-use benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pipeline generalizes, it could supply training data for computer-use agents operating outside browsers once additional synthetic environments are built.
  • The inclusion of efficiency and critical-point verifiers may encourage agents to prefer shorter, lower-risk action sequences even when task success is the main objective.
  • Using frontier models inside the solver harness could let smaller target models learn from higher-quality demonstrations than they could generate themselves.
  • Faithful synthetic environments would allow safe rehearsal of actions that cannot be tested directly on live systems.

Load-bearing premise

The synthetic environments must faithfully simulate domains gated by authentication or irreversible actions, and the three verifiers must accurately judge demonstration success without introducing bias or false positives.

What would settle it

Retraining the same base models on an equivalent amount of human-collected trajectories and measuring whether benchmark scores match or exceed those of the Fara1.5 models would directly test whether the pipeline's data is responsible for the gains.

Figures

Figures reproduced from arXiv: 2606.20785 by Ahmed Awadallah, Akshay Nambi, Alexey Taymanov, Andrew Zhao, Aravind Rajeswaran, Corby Rosset, Hussein Mozannar, Luiz do Valle, Sahil Gupta, Spencer Whitehead, Vibhav Vineet, Yadong Lu, Yash Lara, Yash Pandya, Zach Nussbaum.

Figure 1: Task success rate on Online-Mind2Web and WebVoyager for similarly sized CUA models. Fara1.5-9B reaches …
Figure 2: FaraGen1.5 pipeline. Phase 1 instantiates tasks in two kinds of environments: …
Figure 3: Sample FaraEnvs for Calendar and ML Management environments populated with realistic data generated by …
Figure 4: (Left) Agentic training data collected by FaraGen1.5 and its predecessor over time, reported as the number …
Figure 5: One step of Fara1.5’s observe-think-act loop. The model observes the recent browser state (up to three …
Figure 6: Supervised fine-tuning input and loss mask. The model conditions on the full conversation history but consumes …
Figure 7: (Left) WebVoyager and Online-Mind2Web success rate as a function of model size for the Fara1.5 family. Both …
Figure 8: Plots of average number of steps per task and success rate as a function of number of steps.
Figure 9: Distribution of the number of steps split by outcome, resolved by model size (split violins: left/green = correct, …
Figure 10: Number of steps for successful trajectories on the examples solved by all three model sizes. Conditioned on success, the distributions are nearly identical across sizes, indicating smaller models are not inherently less step-efficient.
Figure 11: Success rate as a function of trajectory length, broken out by model size. The mapping is largely independent …
Figure 12: Per-example pass@k (k = 1, 2, 3) with examples sorted from easiest to hardest. The right tail (pass rate → 0) is the set of hardest examples; larger models solve more of it, and additional attempts (pass@2, pass@3) recover more examples …
Figure 13: Median number of steps per example (faint points) and a rolling average (solid lines) along the same easiest-to …
Figure 14: Selected screenshots from hotels_head_orbitz_8. The agent visited the requested site, entered the hotel name and dates, and read the sold-out banner directly off the property card before terminating.
Figure 15: Selected screenshots from jobs_apply_apply_1219. The trajectory follows a clean process: navigate to the filtered LinkedIn search, dismiss the modal, collect candidates via pause_and_memorize_fact while scrolling. The outcome judge nevertheless marks this a failure because one returned listing is located in Verona, WI rather than the Madison, WI specified by the task.
Figure 16: Task: Please generate the next move according to the UI screenshot and instruction. Instruction: Select the handle located at the top of the text box containing the text "Ferry ride." The concentric red and lime green circle denotes where the Jedi dataset incorrectly says the object of interest is located, the blue circle is the correct location of the object of interest. This is an inaccurate error …
Figure 17: Task: The Text field within a table cell.’s intended function: The primary function of this element is to display a specific metric value, likely representing a financial figure or performance metric. Users can view this value to assess or compare it with other metrics in the table. This is an error because the task is not unique for the instruction’s description of the table cell …
Figure 18: Form-filling data pipeline. Tasks are seeded from real Tally forms and critical-point prompts, solved by the …
Figure 19: A form-filling trajectory where the user authorizes submission but omits a required field. The agent fills …
Original abstract

Collecting computer use data from human demonstrations is expensive and slow, motivating the need for scalable generation strategies. This requires two key ingredients: environments in which agents can act and verifiers that can judge whether their demonstrations succeeded. We introduce FaraGen1.5, a scalable data pipeline for computer use agents composed of three modular components: environments, solvers, and verifiers. FaraGen1.5 uses both live websites and synthetic environments that faithfully simulate domains gated by authentication or that require irreversible actions. It employs a solver harness that can be powered by multiple models, including strong frontier models such as GPT-5.4, and also incorporates a user simulator to enable multi-turn rollouts. Finally, FaraGen1.5 scores the resulting trajectories with three complementary verifiers covering task correctness, efficiency, and critical-point adherence. Using data produced by this pipeline, we train Fara1.5, a family of native computer use agents (CUAs) at three scales built on Qwen3.5 (4B, 9B, and 27B). To train these models, we employ a supervised finetuning (SFT) recipe that carefully balances data from FaraGen1.5 for broad coverage, specific high-value tasks, and target model deficiencies in an iterative approach. Each model sets a new state of the art for its size class on browser-use benchmarks: Fara1.5-9B reaches 63.4% on Online-Mind2Web and 86.6% on WebVoyager, while Fara1.5-27B achieves 72.3% on Online-Mind2Web, which is competitive with much larger proprietary systems.
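The abstract describes an SFT recipe that balances three data sources: broad coverage, specific high-value tasks, and examples targeting the current model's deficiencies. A minimal sketch of one way to build such a mixture follows. The bucket names, the integer weights, and the sampling-with-replacement strategy are assumptions made for illustration; the abstract does not state the actual ratios or sampling procedure.

```python
import random
from typing import Dict, List


def build_mixture(
    buckets: Dict[str, List[str]],
    weights: Dict[str, int],
    n: int,
    seed: int = 0,
) -> List[str]:
    """Draw roughly n training examples, split across buckets by weight.

    Each bucket is sampled with replacement, so a small bucket (for
    example, freshly collected failure cases) can be upsampled.
    Rounding means the result may differ from n by a few examples
    when the weights do not divide n evenly.
    """
    rng = random.Random(seed)
    total = sum(weights.values())
    mixture: List[str] = []
    for name, examples in buckets.items():
        quota = round(n * weights[name] / total)
        mixture.extend(rng.choices(examples, k=quota))
    rng.shuffle(mixture)
    return mixture


# Hypothetical buckets; in an iterative recipe the "deficiency" bucket
# would be refreshed each round from tasks the current model fails.
buckets = {
    "coverage": [f"coverage-{i}" for i in range(50)],
    "high_value": [f"high_value-{i}" for i in range(10)],
    "deficiency": [f"deficiency-{i}" for i in range(5)],
}
weights = {"coverage": 60, "high_value": 25, "deficiency": 15}
mixture = build_mixture(buckets, weights, n=100, seed=0)
```

Fixing the seed keeps the mixture reproducible between rounds, which makes it easier to attribute a change in benchmark score to the refreshed deficiency bucket rather than to sampling noise.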

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces FaraGen1.5, a modular pipeline for scalable generation of computer-use agent training data. The pipeline combines live and synthetic environments, a solver harness (supporting frontier models such as GPT-5.4 and a user simulator for multi-turn rollouts), and three verifiers that score trajectories on task correctness, efficiency, and critical-point adherence. Data produced by the pipeline is used to train the Fara1.5 family (4B, 9B, 27B) via iterative, balanced supervised fine-tuning on Qwen3.5 backbones. The resulting models are reported to set new state-of-the-art results for their size classes on Online-Mind2Web (63.4% for 9B, 72.3% for 27B) and WebVoyager (86.6% for 9B).

Significance. If the verifiers and synthetic environments produce high-quality, unbiased trajectories, the work supplies a concrete, scalable alternative to human demonstration collection and demonstrates that the resulting data can yield measurable gains on established browser-use benchmarks. The modular design and the iterative balancing procedure for covering model deficiencies constitute methodological contributions that could be reused by others.

major comments (2)
  1. [Pipeline description / verifiers paragraph] The headline SOTA claims rest on the assumption that the three verifiers (task correctness, efficiency, critical-point adherence) produce low false-positive rates and do not systematically favor particular solution styles. The manuscript provides no quantitative validation of this assumption (human agreement rates, ablation removing one verifier, or error analysis of accepted vs. rejected trajectories).
  2. [Environments component] The assertion that synthetic environments 'faithfully simulate' domains gated by authentication or requiring irreversible actions is stated without supporting evidence such as side-by-side comparisons, failure-mode analysis, or human validation of simulation fidelity.
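The quantitative validation requested in the first comment can be made concrete with a small calculation. The sketch below assumes a sample of trajectories that has both a verifier verdict and a human label, and computes raw agreement, the verifier's false-positive rate, and Cohen's kappa. The function name and the toy labels are illustrative only and are not data from the paper.

```python
from typing import Dict, List


def agreement_stats(verifier: List[bool], human: List[bool]) -> Dict[str, float]:
    """Compare verifier accept/reject verdicts against human labels."""
    if len(verifier) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    tp = sum(v and h for v, h in zip(verifier, human))
    fp = sum(v and not h for v, h in zip(verifier, human))
    fn = sum(not v and h for v, h in zip(verifier, human))
    tn = sum(not v and not h for v, h in zip(verifier, human))

    agreement = (tp + tn) / n
    # Share of human-rejected trajectories that the verifier accepted.
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    # Cohen's kappa: agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (agreement - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {
        "agreement": agreement,
        "false_positive_rate": false_positive_rate,
        "kappa": kappa,
    }


# Toy labels for illustration: True means "trajectory succeeded".
verifier_labels = [True, True, True, False, False, True, False, True]
human_labels = [True, True, False, False, False, True, True, True]
stats = agreement_stats(verifier_labels, human_labels)
```

The false-positive rate matters most for the referee's concern, because accepted-but-wrong trajectories end up in the training set, whereas false negatives only reduce data volume.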

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Pipeline description / verifiers paragraph] The headline SOTA claims rest on the assumption that the three verifiers (task correctness, efficiency, critical-point adherence) produce low false-positive rates and do not systematically favor particular solution styles. The manuscript provides no quantitative validation of this assumption (human agreement rates, ablation removing one verifier, or error analysis of accepted vs. rejected trajectories).

    Authors: We agree that the current manuscript does not include quantitative validation of the verifiers. While the reported SOTA results offer indirect support for trajectory quality, direct evidence is needed. In the revised manuscript we will add (1) human agreement rates on a sampled subset of trajectories, (2) an ablation that removes each verifier in turn and reports the resulting change in accepted data volume and downstream model performance, and (3) an error analysis comparing characteristics of accepted versus rejected trajectories. revision: yes

  2. Referee: [Environments component] The assertion that synthetic environments 'faithfully simulate' domains gated by authentication or requiring irreversible actions is stated without supporting evidence such as side-by-side comparisons, failure-mode analysis, or human validation of simulation fidelity.

    Authors: We acknowledge that the manuscript asserts faithful simulation without accompanying evidence. We will add the requested supporting material: side-by-side comparisons (where live equivalents exist), a failure-mode analysis of the synthetic environments, and human validation results on simulation fidelity for the gated and irreversible-action domains. revision: yes
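The data-volume half of the leave-one-out ablation promised in the first response could be sketched as follows. This assumes each trajectory already carries a pass/fail verdict from every verifier; the verifier names and the toy verdicts are hypothetical. Measuring the effect on downstream model performance would additionally require retraining on each filtered set, which this sketch does not cover.

```python
from typing import Dict, List


def leave_one_out_acceptance(verdicts: List[Dict[str, bool]]) -> Dict[str, int]:
    """Count accepted trajectories with all verifiers, then with each one removed."""
    names = sorted(verdicts[0]) if verdicts else []
    counts = {"all": sum(all(v.values()) for v in verdicts)}
    for dropped in names:
        counts[f"without_{dropped}"] = sum(
            all(ok for name, ok in v.items() if name != dropped)
            for v in verdicts
        )
    return counts


# Toy verdicts for five trajectories (illustrative only).
verdicts = [
    {"correctness": True, "efficiency": True, "critical_point": True},
    {"correctness": True, "efficiency": False, "critical_point": True},
    {"correctness": False, "efficiency": True, "critical_point": True},
    {"correctness": True, "efficiency": True, "critical_point": False},
    {"correctness": True, "efficiency": False, "critical_point": False},
]
counts = leave_one_out_acceptance(verdicts)
```

A verifier whose removal barely changes the accepted count is either redundant with the others or rarely triggered, which is itself useful evidence for the error analysis.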

Circularity Check

0 steps flagged

No circularity: empirical pipeline and benchmark results with no derivation chain

Full rationale

The paper presents a data-generation pipeline (environments, solvers, verifiers) and reports empirical SOTA benchmark scores on independent external tasks (Online-Mind2Web, WebVoyager). No equations, fitted parameters, predictions, or self-referential definitions appear. Central claims rest on measured performance rather than any reduction to inputs by construction, self-citation chains, or ansatz smuggling. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; all details on implementation and validation are absent.

pith-pipeline@v0.9.1-grok · 5904 in / 1226 out tokens · 35802 ms · 2026-06-26T17:23:30.314438+00:00 · methodology

