IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Rongqian Chen; Sizhe Tang; Tian Lan; Weidong Cao; Yu Li; Zeyu Fang

arxiv: 2604.05157 · v2 · pith:Q4HAQCFSnew · submitted 2026-04-06 · 💻 cs.AI

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Rongqian Chen , Yu Li , Zeyu Fang , Sizhe Tang , Weidong Cao , Tian Lan This is my paper

Pith reviewed 2026-05-25 06:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords computer-use agentsreward modelGUI actionsintent-conditioned evaluationaction scoringOSWorld benchmarkoffline trajectoriesplan-aware ranking

0 comments

The pith

A reward model that embeds planning intent into action scoring can improve computer-use agent success by 6.9 points on environments never seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Computer-use agents generate GUI actions without checking their quality, which often leads to irreversible mistakes that cascade through later steps. IntentScore addresses this by learning to score candidate actions according to how well they match an underlying plan. The model is trained on 398K offline interaction steps from three operating systems with two objectives: contrastive alignment of states and actions, and margin-based ranking of correct versus incorrect actions. By conditioning the action encoder on planning intent, it can tell apart actions that look similar but serve different purposes. When used to re-rank actions for an agent in the unseen OSWorld environment, it raises task success by 6.9 points.

Core claim

IntentScore is a plan-aware reward model that embeds each candidate action's planning intent in the action encoder. It is trained with contrastive alignment for state-action relevance and margin ranking for action correctness on 398K offline GUI interaction steps spanning three operating systems. The model achieves 97.5% pairwise discrimination accuracy on held-out data and, when deployed as a re-ranker, improves Agent S3 task success rate by 6.9 points on the unseen OSWorld environment.

What carries the argument

The intent-conditioned action encoder that incorporates planning intent to distinguish actions with similar surface forms but different rationales.

If this is right

Reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
Embedding planning intent allows discrimination between candidates with similar actions but different rationales.
Deploying the model as a re-ranker improves task success rate without retraining the base agent.
High pairwise discrimination accuracy of 97.5% supports reliable selection of higher-quality actions at each step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same intent-conditioned scoring could be tested on non-GUI sequential tasks such as web navigation or command-line agents.
Collecting larger and more diverse offline trajectories might increase the observed generalization gap closure.
Using the scorer inside the agent's training loop instead of only at inference time could produce additional gains.

Load-bearing premise

A reward model trained on offline trajectories from three operating systems will generalize to entirely unseen agents and task distributions in OSWorld.

What would settle it

Measuring task success rate on OSWorld with and without the IntentScore re-ranker to check whether the reported 6.9-point gain disappears.

Figures

Figures reproduced from arXiv: 2604.05157 by Rongqian Chen, Sizhe Tang, Tian Lan, Weidong Cao, Yu Li, Zeyu Fang.

**Figure 1.** Figure 1: Architecture of IntentScore. The state encoder is computed once per step; the intention-aware action encoder is computed per candidate. Reward estimation is temperaturescaled cosine similarity. Training uses a dual objective: state-action alignment (InfoNCE) plus reward learning (margin ranking on hard negatives). where diagonal entries are positives and all off-diagonal entries serve as in-batch negative… view at source ↗

**Figure 2.** Figure 2: Deployment inference pipeline. The CUA generates multiple candidate actions for the current state. The state and action encoders map inputs into a shared latent space, where cosine similarity determines action quality. We deploy IntentScore as a rewardguided re-ranker within Agent S3 on OSWorld, an environment entirely unseen during training. Agent S3 uses GPT-5-mini for planning and UI-TARS-1.5-7B (Q… view at source ↗

**Figure 3.** Figure 3: Decision timeline for a complete OSWorld trajectory (27 steps, task: “write gram [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Case study 1 (step 7): IntentScore overrides Alt+Tab in favor of the “Window” menu for reliable document switching. The score gap of 0.236 reflects the intent-aware encoder’s ability to distinguish navigation strategies. Case 2: Consistent preference for deterministic navigation (step 10). Three steps later, the agent is back in “Answer.docx” and needs to switch to “Grammer test 2.docx” to read its questio… view at source ↗

**Figure 5.** Figure 5: Case study 2 (step 10): IntentScore again overrides a navigation hotkey (Ctrl+F6) in favor of the “Window” menu. The screenshot shows Answer.docx with the Window menu open, listing both documents. demonstrates that the intent-aware encoder distinguishes candidates with nearly identical coordinates but different spatial reasoning. # Action (intent summary) Score 1 Click below “Grammar test 2:” (“blank area … view at source ↗

**Figure 6.** Figure 6: Case study 3 (step 16): Three click candidates target the same line at slightly [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Case study 4 (step 17): Three type candidates with identical content but different [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

read the original abstract

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IntentScore brings intent-conditioned scoring to agent action evaluation with transfer gains on OSWorld, though generalization needs more verification on data overlap.

read the letter

The key thing to know is that IntentScore is a new reward model for computer-use agents that embeds planning intent into the action scoring process and uses both contrastive and margin-ranking objectives. It reports strong numbers on held-out data and a meaningful improvement when plugged into an existing agent on a new environment. The novelty here is the specific combination of intent embedding in the encoder with those two training objectives. Prior work on reward models for agents exists, but this one ties the intent directly into the action representation in a way that lets it handle candidates with similar actions but different plans. Training on nearly 400K offline steps from three OSes gives it a broad base, and the transfer to OSWorld is the main empirical win. The 97.5% pairwise discrimination accuracy and the 6.9 point success rate gain are concrete results that show the approach has practical value. Using it as a re-ranker for Agent S3 demonstrates real-world utility. The main soft spot is the lack of detail on evaluation protocol and the unverified assumption of no distributional overlap with OSWorld. The abstract asserts the environment is entirely unseen, but without checks like task category overlap or sequence similarity, the lift could partly reflect shared patterns rather than robust out-of-distribution performance. The stress-test concern holds based on what's provided. This work is for people in the LLM agent community focused on GUI agents and reward modeling. A reader interested in improving reliability of computer-use agents would find the method and results useful. It has enough empirical grounding and a clear contribution to deserve peer review, though revisions on the generalization analysis would likely be requested.

Referee Report

3 major / 2 minor

Summary. The paper proposes IntentScore, a plan-aware reward model for computer-use agents (CUAs) that embeds planning intent into action scoring. It is trained on 398K offline GUI interaction steps across three operating systems using contrastive alignment for state-action relevance and margin ranking for action correctness. The model reports 97.5% pairwise discrimination accuracy on held-out evaluation and, when deployed as a re-ranker for Agent S3 on the unseen OSWorld benchmark, yields a 6.9-point gain in task success rate.

Significance. If the out-of-distribution generalization claim holds, the work offers a concrete, offline-trainable mechanism to mitigate cascading errors in GUI agents without requiring online interaction or environment-specific fine-tuning. The intent-conditioned architecture and dual-objective training provide a reusable component for CUA pipelines.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: The assertion that OSWorld is 'entirely unseen' during training is load-bearing for the 6.9-point gain claim, yet the manuscript supplies no quantitative check (task-category overlap, application-domain similarity, or action-sequence distribution distance) between the 398K training trajectories and OSWorld tasks. Without such evidence, the reported lift cannot be distinguished from possible memorization of similar patterns.
[Evaluation protocol] Evaluation protocol (presumably §4 or §5): The 97.5% pairwise discrimination accuracy is presented without details on data splits, number of evaluation pairs, sampling procedure, or statistical significance testing. This absence prevents assessment of whether the held-out result is robust or sensitive to unstated choices in pair construction.
[OSWorld experiments] OSWorld re-ranking results: The 6.9-point improvement is reported without the baseline Agent S3 success rate, number of evaluated tasks, run-to-run variance, or comparison against alternative re-ranking or reward-model baselines, making the magnitude and reliability of the gain difficult to interpret.

minor comments (2)

[Method] Notation for the two training objectives (contrastive and margin-ranking) should be defined with explicit loss equations rather than prose descriptions to allow exact reproduction.
[Figures] Figure captions for any architecture or trajectory diagrams should explicitly state the input dimensions and embedding sizes used in the intent-conditioned encoder.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and commit to revisions that improve clarity and completeness without altering the core claims.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: The assertion that OSWorld is 'entirely unseen' during training is load-bearing for the 6.9-point gain claim, yet the manuscript supplies no quantitative check (task-category overlap, application-domain similarity, or action-sequence distribution distance) between the 398K training trajectories and OSWorld tasks. Without such evidence, the reported lift cannot be distinguished from possible memorization of similar patterns.

Authors: We agree that a quantitative characterization of distribution shift would make the generalization claim more robust. The 398K trajectories were collected from open-ended GUI interactions across three operating systems using multiple agents, whereas OSWorld consists of curated, goal-directed tasks in a standardized benchmark setting with different application distributions. In the revision we will add a new subsection that tabulates task-category overlap (e.g., browser vs. file-manager actions) and qualitatively contrasts action-sequence statistics; a full distributional-distance computation is not feasible with the current offline dataset but the added analysis will clarify the degree of novelty. revision: partial
Referee: [Evaluation protocol] Evaluation protocol (presumably §4 or §5): The 97.5% pairwise discrimination accuracy is presented without details on data splits, number of evaluation pairs, sampling procedure, or statistical significance testing. This absence prevents assessment of whether the held-out result is robust or sensitive to unstated choices in pair construction.

Authors: We apologize for the missing protocol details. The revised manuscript will include an expanded evaluation subsection that specifies the train/validation/test split ratios, the exact number of held-out pairs, the procedure used to sample positive and negative pairs (including how negatives were drawn from the same state), and the results of statistical significance tests (paired t-test and bootstrap confidence intervals) confirming that accuracy is reliably above chance. revision: yes
Referee: [OSWorld experiments] OSWorld re-ranking results: The 6.9-point improvement is reported without the baseline Agent S3 success rate, number of evaluated tasks, run-to-run variance, or comparison against alternative re-ranking or reward-model baselines, making the magnitude and reliability of the gain difficult to interpret.

Authors: We will revise both the abstract and the experiments section to explicitly report the baseline Agent S3 success rate, the precise number of OSWorld tasks evaluated, standard deviation across repeated runs, and additional comparisons against simple heuristic re-rankers and an ablated reward model. These numbers and controls are already present in our internal experimental logs and will be added to the main text and a new supplementary table. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical training and held-out evaluation

full rationale

The paper describes training a reward model (IntentScore) on 398K offline GUI trajectories using standard contrastive alignment and margin-ranking objectives, followed by held-out accuracy measurement (97.5%) and deployment as a re-ranker on the unseen OSWorld benchmark. No equations, derivations, or first-principles claims are presented that reduce to fitted quantities by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central results rest on conventional supervised learning with train/test separation and external environment testing, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the implicit domain assumption that offline GUI trajectories are sufficient to learn generalizable action quality.

axioms (1)

domain assumption Offline GUI interaction steps collected from three operating systems are representative enough to train a model that generalizes to unseen agents and tasks.
The generalization claim to OSWorld rests on this premise.

pith-pipeline@v0.9.0 · 5698 in / 1180 out tokens · 38832 ms · 2026-05-25T06:54:16.041032+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 8 internal anchors

[1]

Agent s: An open agentic framework that uses computers like a human,

Saaket Agashe, Jiuzhou Han, Shuyu Zhu, and Diyi Yang. Agent s: An open agentic framework that uses computers like a human.arXiv preprint arXiv:2410.08164,

work page arXiv
[2]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025.URL https://arxiv. org/abs/2504.00906, 2:10–16,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,

Fabio Bellavia, Zhenjun Zhao, Luca Morelli, and Fabio Remondino. Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,

work page arXiv
[4]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint arXiv:1412.3555,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,

Zeyu Fang and Tian Lan. Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,

work page arXiv
[6]

Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a

Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, and Tian Lan. Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a. Zeyu Fang, Zuyuan Zhang, Mahdi Imani, and Tian Lan. Manifold-constrained energy-based transition models for o...

work page arXiv
[7]

Computer-using world model

Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. InICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling,

work page 2026
[8]

AgentNet: A scalable framework for multi-step agent trajectory generation.arXiv preprint arXiv:2501.00000,

Woojin Kim, Sangwon Lee, and Joonhyung Park. AgentNet: A scalable framework for multi-step agent trajectory generation.arXiv preprint arXiv:2501.00000,

work page arXiv
[9]

Agentic test-time scaling for webagents

10 Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. arXiv preprint arXiv:2602.12276,

work page arXiv
[10]

When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026a

Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026a. Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning.arXiv preprint arXiv:2603.16060, 2026b. Yu Li,...

work page arXiv
[11]

SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a

Shuai Liu, Peng Zhang, and Xi Chen. SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a. Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platf...

work page arXiv 2023
[12]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, and Vladislav Kurenkov. Vintix: Action model via in-context rein- forcement learning. InForty-second International Conference on Machine Learning. Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. UI-TARS: Pioneering automated GUI i...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

10852426

doi: 10.1109/FLLM63129.2024. 10852426. Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadel- mann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions.arXiv preprint arXiv:2501.16150,

work page doi:10.1109/fllm63129.2024 2024
[15]

Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

Sizhe Tang, Rongqian Chen, and Tian Lan. Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

work page arXiv
[16]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Y...

work page arXiv
[17]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, Joel Tao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

GTA1: GUI Test-time Scaling Agent

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent.arXiv preprint arXiv:2507.05791,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.arXiv preprint arXiv:2303.15343,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026a

Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026a. Zuyuan Zhang, Zeyu Fang, and Tian Lan. Structuring value representations via geometric coherence in markov decision processes.arXiv preprint arXiv:2602.02978, 2026b. Zhenjun Zhao. Balf:...

work page arXiv
[21]

Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,

Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, and Javier Civera. Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,

work page arXiv
[22]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,

Zhenwen Zhu, Yutong Wan, Kevin Zhang, Jing Shao, and Bin Ye. A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,

work page arXiv
[24]

MPNet + SigLIP2 + larger model

Total∼13M trainable parameters B Data statistics 45.9% of Ubuntu tasks contain at least one incorrect step, providing hard negatives for the margin ranking loss. All evaluation uses atask-levelsplit of the Ubuntu subset: 85% train / 10% validation / 5% test, ensuring no step from a test task appears during training. The cross-OS data (Windows + Mac) is us...

work page 1920
[25]

Adding incorrect-step negatives (labeled rt =

Negative type matters more than quantity.Adjacent-step negatives ( t±1) are the most effective training signal for Hard test performance, as they require distinguishing temporally close actions that share nearly identical UI context—a challenge shared by offline RL methods that must learn from suboptimal demonstrations without environment interaction (Fan...

work page 2024

[1] [1]

Agent s: An open agentic framework that uses computers like a human,

Saaket Agashe, Jiuzhou Han, Shuyu Zhu, and Diyi Yang. Agent s: An open agentic framework that uses computers like a human.arXiv preprint arXiv:2410.08164,

work page arXiv

[2] [2]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025.URL https://arxiv. org/abs/2504.00906, 2:10–16,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,

Fabio Bellavia, Zhenjun Zhao, Luca Morelli, and Fabio Remondino. Image matching filtering and refinement by planes and beyond.arXiv preprint arXiv:2411.09484,

work page arXiv

[4] [4]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv preprint arXiv:1412.3555,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,

Zeyu Fang and Tian Lan. Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models.arXiv preprint arXiv:2405.19878,

work page arXiv

[6] [6]

Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a

Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, and Tian Lan. Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system.arXiv preprint arXiv:2603.07822, 2026a. Zeyu Fang, Zuyuan Zhang, Mahdi Imani, and Tian Lan. Manifold-constrained energy-based transition models for o...

work page arXiv

[7] [7]

Computer-using world model

Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. InICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling,

work page 2026

[8] [8]

AgentNet: A scalable framework for multi-step agent trajectory generation.arXiv preprint arXiv:2501.00000,

Woojin Kim, Sangwon Lee, and Joonhyung Park. AgentNet: A scalable framework for multi-step agent trajectory generation.arXiv preprint arXiv:2501.00000,

work page arXiv

[9] [9]

Agentic test-time scaling for webagents

10 Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. arXiv preprint arXiv:2602.12276,

work page arXiv

[10] [10]

When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026a

Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026a. Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning.arXiv preprint arXiv:2603.16060, 2026b. Yu Li,...

work page arXiv

[11] [11]

SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a

Shuai Liu, Peng Zhang, and Xi Chen. SEAgent: Bridging semantic understanding and action generation for computer-use agents.arXiv preprint arXiv:2503.00208, 2025a. Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platf...

work page arXiv 2023

[12] [12]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, and Vladislav Kurenkov. Vintix: Action model via in-context rein- forcement learning. InForty-second International Conference on Machine Learning. Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. UI-TARS: Pioneering automated GUI i...

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

10852426

doi: 10.1109/FLLM63129.2024. 10852426. Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadel- mann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions.arXiv preprint arXiv:2501.16150,

work page doi:10.1109/fllm63129.2024 2024

[15] [15]

Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

Sizhe Tang, Rongqian Chen, and Tian Lan. Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

work page arXiv

[16] [16]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Y...

work page arXiv

[17] [17]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, Joel Tao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

GTA1: GUI Test-time Scaling Agent

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent.arXiv preprint arXiv:2507.05791,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.arXiv preprint arXiv:2303.15343,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026a

Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts.arXiv preprint arXiv:2602.14360, 2026a. Zuyuan Zhang, Zeyu Fang, and Tian Lan. Structuring value representations via geometric coherence in markov decision processes.arXiv preprint arXiv:2602.02978, 2026b. Zhenjun Zhao. Balf:...

work page arXiv

[21] [21]

Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,

Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, and Javier Civera. Advances in global solvers for 3d vision.arXiv preprint arXiv:2602.14662,

work page arXiv

[22] [22]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,

Zhenwen Zhu, Yutong Wan, Kevin Zhang, Jing Shao, and Bin Ye. A survey on in-context reinforcement learning.arXiv preprint arXiv:2408.10706,

work page arXiv

[24] [24]

MPNet + SigLIP2 + larger model

Total∼13M trainable parameters B Data statistics 45.9% of Ubuntu tasks contain at least one incorrect step, providing hard negatives for the margin ranking loss. All evaluation uses atask-levelsplit of the Ubuntu subset: 85% train / 10% validation / 5% test, ensuring no step from a test task appears during training. The cross-OS data (Windows + Mac) is us...

work page 1920

[25] [25]

Adding incorrect-step negatives (labeled rt =

Negative type matters more than quantity.Adjacent-step negatives ( t±1) are the most effective training signal for Hard test performance, as they require distinguishing temporally close actions that share nearly identical UI context—a challenge shared by offline RL methods that must learn from suboptimal demonstrations without environment interaction (Fan...

work page 2024