Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Pith reviewed 2026-05-22 15:34 UTC · model grok-4.3
The pith
Agent S2 shows that delegating tasks across generalist and specialist models with new grounding and planning methods lifts performance on GUI-based computer tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent S2 is a compositional framework that delegates cognitive responsibilities across generalist and specialist models. It pairs a Mixture-of-Grounding technique for precise GUI localization with Proactive Hierarchical Planning, which dynamically refines action plans at multiple temporal scales. The paper reports state-of-the-art performance on the OSWorld, WindowsAgentArena, and AndroidWorld benchmarks, with relative improvements of up to 52.8 percent.
What carries the argument
The Mixture-of-Grounding technique and Proactive Hierarchical Planning. Together they enable accurate localization of on-screen elements and adaptive planning at different time scales inside a generalist-specialist model setup.
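As a rough illustration of planning at two time scales, the sketch below refreshes the list of remaining subgoals after every completed subgoal instead of waiting for a failure. The `plan` and `execute` callables and the toy task are hypothetical stand-ins chosen for this example; they are not the paper's implementation.

```python
from typing import Callable, List

def run_hierarchical(
    task: str,
    plan: Callable[[str, List[str]], List[str]],
    execute: Callable[[str], str],
    max_subgoals: int = 10,
) -> List[str]:
    """Execute a task, re-planning after every finished subgoal."""
    history: List[str] = []  # observations gathered so far
    for _ in range(max_subgoals):
        # Coarse scale: refresh the remaining subgoals from what was observed.
        remaining = plan(task, history)
        if not remaining:
            break
        # Fine scale: carry out only the first remaining subgoal.
        history.append(execute(remaining[0]))
    return history

# Toy stand-ins (assumptions for illustration only).
STEPS = ["open settings", "open power panel", "change the setting"]

def toy_plan(task: str, history: List[str]) -> List[str]:
    # A real planner would inspect the observations; this one just counts them.
    return STEPS[len(history):]

def toy_execute(subgoal: str) -> str:
    return f"done: {subgoal}"

trace = run_hierarchical("toy task", toy_plan, toy_execute)
print(trace)
```

Calling the planner inside the loop is the design choice being illustrated: the plan is treated as provisional and is rebuilt from the latest observations at every subgoal boundary.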
If this is right
- Agents complete longer sequences of actions more reliably on multi-step evaluations.
- The framework transfers to other operating systems and device types with strong results.
- Benchmark scores rise substantially over previous best methods across three different test suites.
- Splitting tasks across models means no single model has to handle every type of decision.
Where Pith is reading between the lines
- The same split between broad and narrow models might improve agents in related settings like web navigation or software testing.
- Running controlled ablations that swap only one component at a time would show the independent contribution of each technique.
- The planning method could be extended to tasks with even more steps to test where observation limits appear.
Load-bearing premise
The performance gains come from the new grounding method and hierarchical planning approach rather than from using different base models or changes in test conditions.
What would settle it
Run the same benchmarks with the Mixture-of-Grounding and Proactive Hierarchical Planning turned off while keeping the base models and setup fixed, then check whether results fall back to earlier agent levels.
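The experiment described above can be sketched as a small configuration grid. This is a minimal sketch under stated assumptions: the model names are placeholders, and the flag names `mixture_of_grounding` and `proactive_planning` are hypothetical labels, not options from the released code.

```python
from itertools import product

# Placeholder model assignment, held fixed across every configuration.
BASE_MODELS = {"generalist": "fixed-generalist", "specialist": "fixed-specialist"}

def ablation_grid():
    """Return all four on/off combinations of the two components."""
    configs = []
    for use_mog, use_php in product([True, False], repeat=2):
        configs.append({
            "models": dict(BASE_MODELS),       # identical in every run
            "mixture_of_grounding": use_mog,   # hypothetical flag name
            "proactive_planning": use_php,     # hypothetical flag name
        })
    return configs

grid = ablation_grid()
for cfg in grid:
    print(cfg["mixture_of_grounding"], cfg["proactive_planning"])
```

Because only the two flags vary, any score difference between configurations can be attributed to the components rather than to the choice of base models.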
Original abstract
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S.
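Since the abstract reports relative rather than absolute improvements, a short worked example may help keep the two apart. The success rates below are made-up numbers chosen for easy arithmetic; they are not results from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0

# Hypothetical success rates in percent (not taken from the paper).
baseline_rate = 20.0
new_rate = 25.0

absolute_gain = new_rate - baseline_rate                       # percentage points
relative_gain = relative_improvement(new_rate, baseline_rate)  # percent of baseline

print(absolute_gain, relative_gain)
```

The same absolute gain produces a larger relative figure when the baseline is low, which is worth keeping in mind when comparing relative improvements across benchmarks with different baseline scores.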
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent S2, a compositional generalist-specialist framework for computer use agents that delegates cognitive responsibilities across various generalist and specialist models. It proposes a Mixture-of-Grounding technique to achieve precise GUI localization and Proactive Hierarchical Planning to dynamically refine action plans at multiple temporal scales. Evaluations on three benchmarks demonstrate new state-of-the-art performance, with relative improvements of 18.9% and 32.7% over baselines like Claude Computer Use and UI-TARS on OSWorld 15-step and 50-step tasks, 52.8% on WindowsAgentArena, and 16.52% on AndroidWorld.
Significance. If the performance gains can be robustly attributed to the compositional delegation, Mixture-of-Grounding, and Proactive Hierarchical Planning, the work would advance the development of more capable GUI agents by addressing imprecise grounding and long-horizon planning challenges through specialization rather than monolithic models. The open-sourced code at the provided GitHub link supports reproducibility and community follow-up.
major comments (2)
- [Experimental Results] The central attribution of the reported SOTA gains (18.9%/32.7% on OSWorld, 52.8% on WindowsAgentArena, 16.52% on AndroidWorld) to Mixture-of-Grounding and Proactive Hierarchical Planning is not supported by ablations that remove these components while holding the set of delegated generalist and specialist models fixed. Without such controls, the improvements cannot be distinguished from potential differences in base model strength or prompt engineering.
- [Benchmark Evaluation] The manuscript provides no error bars, number of evaluation runs, or statistical tests for the benchmark scores, and does not confirm that baselines (Claude Computer Use, UI-TARS) were re-run under identical observation formats, action spaces, or step-counting protocols. This undermines the reliability of the relative improvement claims.
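One common way to attach error bars to a benchmark success rate is a percentile bootstrap over per-task outcomes. The sketch below is illustrative only: the outcome vector is fabricated for the example, and the paper does not state which statistical procedure, if any, was used.

```python
import random
from typing import List, Tuple

def bootstrap_ci(
    outcomes: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (success rate, lower bound, upper bound) via percentile bootstrap."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the resampled success rate.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical per-task outcomes: 1 = success, 0 = failure.
outcomes = [1] * 30 + [0] * 70
rate, low, high = bootstrap_ci(outcomes)
print(rate, low, high)
```

Reporting the interval alongside the point estimate makes it possible to judge whether a relative improvement exceeds the run-to-run and task-sampling noise.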
minor comments (2)
- [Method] The description of how the Mixture-of-Grounding technique combines outputs from multiple grounding models could be clarified with a pseudocode listing or explicit equation in the methods section.
- [Implementation Details] A table summarizing the exact generalist and specialist models used in the delegation framework would improve readability and reproducibility.
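To make the first minor comment concrete, here is one hypothetical form such a listing might take, in which a router selects a single grounding expert per action. The expert names, the routing rule, and the placeholder coordinates are all assumptions for illustration; none of them are taken from the paper or its code.

```python
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]

# Placeholder experts: each maps an element description to screen coordinates.
def visual_expert(description: str) -> Point:
    return (100, 200)  # stand-in for a screenshot-based grounding model

def textual_expert(description: str) -> Point:
    return (300, 400)  # stand-in for a text-based locator

def structural_expert(description: str) -> Point:
    return (500, 600)  # stand-in for a structure-aware locator

EXPERTS: Dict[str, Callable[[str], Point]] = {
    "visual": visual_expert,
    "textual": textual_expert,
    "structural": structural_expert,
}

def route(action_kind: str) -> str:
    """Pick an expert name from the kind of action (hypothetical rule)."""
    if action_kind == "select_text":
        return "textual"
    if action_kind == "edit_cell":
        return "structural"
    return "visual"  # default for generic clicks

def ground(action_kind: str, description: str) -> Tuple[str, Point]:
    """Return the chosen expert's name together with its predicted location."""
    name = route(action_kind)
    return name, EXPERTS[name](description)

print(ground("click", "OK button"))
```

A listing at roughly this level of detail in the methods section would let readers see whether the experts' outputs are selected, as here, or merged in some other way.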
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which identifies important areas for strengthening the experimental rigor of our manuscript. We address each major comment below and commit to revisions that directly respond to the concerns while preserving the core contributions of the work.
- Referee: [Experimental Results] The central attribution of the reported SOTA gains (18.9%/32.7% on OSWorld, 52.8% on WindowsAgentArena, 16.52% on AndroidWorld) to Mixture-of-Grounding and Proactive Hierarchical Planning is not supported by ablations that remove these components while holding the set of delegated generalist and specialist models fixed. Without such controls, the improvements cannot be distinguished from potential differences in base model strength or prompt engineering.
  Authors: We acknowledge that the current ablation studies in the manuscript compare the full Agent S2 system against baselines but do not include experiments that isolate Mixture-of-Grounding and Proactive Hierarchical Planning while strictly holding the underlying generalist and specialist models fixed. To address this directly, we will add new controlled ablation results in the revised manuscript that vary only these two components under a fixed model delegation setup. This will provide clearer evidence for attributing performance gains to the proposed techniques.
  Revision: yes
- Referee: [Benchmark Evaluation] The manuscript provides no error bars, number of evaluation runs, or statistical tests for the benchmark scores, and does not confirm that baselines (Claude Computer Use, UI-TARS) were re-run under identical observation formats, action spaces, or step-counting protocols. This undermines the reliability of the relative improvement claims.
  Authors: We agree that additional statistical details and protocol clarifications are necessary to support the reliability of the reported results. In the revised version, we will include the number of evaluation runs performed, report error bars or standard deviations, and add statistical significance tests for the key comparisons. We will also expand the experimental setup section to explicitly confirm that baselines were evaluated using identical observation formats, action spaces, and step-counting protocols as defined by each benchmark. Any practical constraints on re-running proprietary APIs will be noted transparently.
  Revision: yes
Circularity Check
No circularity: empirical framework with independent benchmark results
The paper presents a descriptive compositional framework (delegating tasks across generalist and specialist models, with Mixture-of-Grounding for localization and Proactive Hierarchical Planning for multi-scale refinement) followed by direct empirical evaluation on OSWorld, WindowsAgentArena, and AndroidWorld. No equations, fitted parameters, or self-referential definitions appear in the provided text; performance deltas are reported as measured outcomes rather than quantities derived by construction from prior self-citations or ansatzes. The central claims rest on benchmark comparisons that remain externally falsifiable and do not reduce to inputs defined within the paper itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Specialist models can be effectively delegated for specific cognitive tasks such as precise GUI localization and hierarchical planning.
invented entities (2)
- Mixture-of-Grounding: no independent evidence
- Proactive Hierarchical Planning: no independent evidence
Forward citations
Cited by 22 Pith papers
- MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
  MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
  OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
  AI agents on OSWorld take 2.7-4.3 times more steps than human trajectories, with latency rising sharply due to repeated large model calls for planning and reflection.
- OpenComputer: Verifiable Software Worlds for Computer-Use Agents
  OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation for 33 desktop apps and 1000 tasks to support computer-use AI agents.
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
  MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
  ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
  LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
  MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
  UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
  IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.
- AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
  AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
- MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
  MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
- VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
  VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
- GTA1: GUI Test-time Scaling Agent
  GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
- DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking
  DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for re...
- InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
  InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
  The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
  Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
  The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.
- InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
  InfantAgent-Next integrates tool-based and vision agents in a modular architecture and reports 7.27% accuracy on OSWorld, exceeding Claude-Computer-Use while also testing on GAIA and SWE-Bench.