IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
Pith reviewed 2026-05-25 06:54 UTC · model grok-4.3
The pith
A reward model that embeds planning intent into action scoring can improve computer-use agent success by 6.9 points on environments never seen in training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IntentScore is a plan-aware reward model that embeds each candidate action's planning intent in the action encoder. It is trained with contrastive alignment for state-action relevance and margin ranking for action correctness on 398K offline GUI interaction steps spanning three operating systems. The model achieves 97.5% pairwise discrimination accuracy on held-out data and, when deployed as a re-ranker, improves Agent S3 task success rate by 6.9 points on the unseen OSWorld environment.
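The two training objectives are described here only in prose. As a rough illustration, the sketch below pairs an InfoNCE-style contrastive term with a hinge-style margin-ranking term in plain Python. The function names, the temperature of 0.07, the margin of 0.2, and the weighting are all assumptions made for illustration; this summary does not give the paper's loss equations or hyperparameters, and the actual forms may differ.

```python
import math

def info_nce(sim_row, pos_index, temperature=0.07):
    """Contrastive term for one state: -log softmax(sim / T)[pos_index].

    sim_row holds similarities between one state and several candidate
    actions; pos_index marks the action that actually belongs to the state.
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[pos_index]

def margin_ranking(score_correct, score_incorrect, margin=0.2):
    """Hinge term: zero once the correct action wins by at least `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))

def combined_loss(sim_row, pos_index, score_correct, score_incorrect, weight=1.0):
    """Hypothetical weighted sum of the two terms (weighting is an assumption)."""
    return info_nce(sim_row, pos_index) + weight * margin_ranking(
        score_correct, score_incorrect
    )
```

The contrastive term addresses state-action relevance (does this action belong to this screen at all), while the ranking term addresses correctness (is this action better than a plausible but wrong alternative).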
What carries the argument
The argument rests on the intent-conditioned action encoder, which incorporates each candidate's planning intent so the model can distinguish actions with similar surface forms but different rationales.
If this is right
- Reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
- Embedding planning intent allows discrimination between candidates with similar actions but different rationales.
- Deploying the model as a re-ranker improves task success rate without retraining the base agent.
- High pairwise discrimination accuracy of 97.5% supports reliable selection of higher-quality actions at each step.
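Re-ranking without retraining the base agent can be pictured as follows: the agent proposes several candidates, a scorer rates each one given the current state, and the top-scoring candidate is executed. This is a minimal sketch, not the paper's implementation. `Candidate`, `rerank`, and `toy_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a learned reward model.

```python
from typing import Callable, NamedTuple, Sequence

class Candidate(NamedTuple):
    intent: str  # the planning rationale attached to the action
    action: str  # the concrete GUI operation, e.g. "click(120, 340)"

def rerank(state: str,
           candidates: Sequence[Candidate],
           score_fn: Callable[[str, Candidate], float]) -> Candidate:
    """Return the candidate with the highest score for this state."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score_fn(state, c))

def toy_score(state: str, cand: Candidate) -> float:
    """Stand-in scorer (an assumption): word overlap between state and intent."""
    return float(len(set(state.lower().split()) & set(cand.intent.lower().split())))

# Two candidates share the same surface action but carry different intents.
candidates = [
    Candidate("close the browser tab", "click(120, 340)"),
    Candidate("confirm the save dialog", "click(120, 340)"),
]
best = rerank("save dialog open", candidates, toy_score)
```

Because both candidates have an identical `action` string, only the intent field lets the scorer separate them, which is the situation the intent-conditioned encoder is meant to handle.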
Where Pith is reading between the lines
- The same intent-conditioned scoring could be tested on non-GUI sequential tasks such as web navigation or command-line agents.
- Collecting larger and more diverse offline trajectories might further narrow the generalization gap.
- Using the scorer inside the agent's training loop instead of only at inference time could produce additional gains.
Load-bearing premise
A reward model trained on offline trajectories from three operating systems will generalize to entirely unseen agents and task distributions in OSWorld.
What would settle it
A controlled comparison of task success rate on OSWorld with and without the IntentScore re-ranker, run on the same tasks, would show whether the reported 6.9-point gain holds or disappears.
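A with/without comparison of this kind reduces to paired per-task outcomes. The sketch below, with hypothetical function names and toy data, computes the gain in percentage points along with how many tasks the re-ranker fixed or broke. It assumes both runs cover the same tasks in the same order.

```python
def success_rate(outcomes):
    """Success rate in percent over a list of per-task booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

def paired_gain(baseline, reranked):
    """Compare two runs over the same tasks, aligned by position."""
    if not baseline or len(baseline) != len(reranked):
        raise ValueError("runs must be non-empty and cover the same tasks")
    fixed = sum(1 for b, r in zip(baseline, reranked) if not b and r)
    broken = sum(1 for b, r in zip(baseline, reranked) if b and not r)
    return {
        "gain_points": success_rate(reranked) - success_rate(baseline),
        "fixed": fixed,    # failed without the re-ranker, succeeded with it
        "broken": broken,  # succeeded without the re-ranker, failed with it
    }

# Toy outcomes only; these are not results from the paper.
baseline_run = [True, False, False, True]
reranked_run = [True, True, False, True]
report = paired_gain(baseline_run, reranked_run)
```

Reporting the fixed and broken counts alongside the net gain helps separate a consistent improvement from one that trades wins on some tasks for losses on others.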
Original abstract
Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes IntentScore, a plan-aware reward model for computer-use agents (CUAs) that embeds planning intent into action scoring. It is trained on 398K offline GUI interaction steps across three operating systems using contrastive alignment for state-action relevance and margin ranking for action correctness. The model reports 97.5% pairwise discrimination accuracy on held-out evaluation and, when deployed as a re-ranker for Agent S3 on the unseen OSWorld benchmark, yields a 6.9-point gain in task success rate.
Significance. If the out-of-distribution generalization claim holds, the work offers a concrete, offline-trainable mechanism to mitigate cascading errors in GUI agents without requiring online interaction or environment-specific fine-tuning. The intent-conditioned architecture and dual-objective training provide a reusable component for CUA pipelines.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: The assertion that OSWorld is 'entirely unseen' during training is load-bearing for the 6.9-point gain claim, yet the manuscript supplies no quantitative check (task-category overlap, application-domain similarity, or action-sequence distribution distance) between the 398K training trajectories and OSWorld tasks. Without such evidence, the reported lift cannot be distinguished from possible memorization of similar patterns.
- [Evaluation protocol] Evaluation protocol (presumably §4 or §5): The 97.5% pairwise discrimination accuracy is presented without details on data splits, number of evaluation pairs, sampling procedure, or statistical significance testing. This absence prevents assessment of whether the held-out result is robust or sensitive to unstated choices in pair construction.
- [OSWorld experiments] OSWorld re-ranking results: The 6.9-point improvement is reported without the baseline Agent S3 success rate, number of evaluated tasks, run-to-run variance, or comparison against alternative re-ranking or reward-model baselines, making the magnitude and reliability of the gain difficult to interpret.
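The protocol details requested for the pairwise accuracy figure could include an uncertainty interval. The sketch below shows one common way to get one, a percentile bootstrap over scored pairs. The function names, the resample count, and the toy scores are assumptions for illustration and do not reflect the paper's evaluation code or data.

```python
import random

def pairwise_accuracy(pairs):
    """Fraction of (correct, incorrect) score pairs ranked in the right order."""
    # Ties count as errors: the correct action must score strictly higher.
    return sum(1 for pos, neg in pairs if pos > neg) / len(pairs)

def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for pairwise accuracy."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        pairwise_accuracy([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]

# Toy scores only: 39 pairs ranked correctly, 1 ranked incorrectly.
toy_pairs = [(0.9, 0.1)] * 39 + [(0.2, 0.6)]
accuracy = pairwise_accuracy(toy_pairs)
low, high = bootstrap_ci(toy_pairs)
```

Resampling individual pairs treats them as independent. If many pairs come from the same task, resampling whole tasks instead would likely give a more honest interval.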
minor comments (2)
- [Method] Notation for the two training objectives (contrastive and margin-ranking) should be defined with explicit loss equations rather than prose descriptions to allow exact reproduction.
- [Figures] Figure captions for any architecture or trajectory diagrams should explicitly state the input dimensions and embedding sizes used in the intent-conditioned encoder.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and commit to revisions that improve clarity and completeness without altering the core claims.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: The assertion that OSWorld is 'entirely unseen' during training is load-bearing for the 6.9-point gain claim, yet the manuscript supplies no quantitative check (task-category overlap, application-domain similarity, or action-sequence distribution distance) between the 398K training trajectories and OSWorld tasks. Without such evidence, the reported lift cannot be distinguished from possible memorization of similar patterns.
Authors: We agree that a quantitative characterization of distribution shift would make the generalization claim more robust. The 398K trajectories were collected from open-ended GUI interactions across three operating systems using multiple agents, whereas OSWorld consists of curated, goal-directed tasks in a standardized benchmark setting with different application distributions. In the revision we will add a new subsection that tabulates task-category overlap (e.g., browser vs. file-manager actions) and qualitatively contrasts action-sequence statistics. A full distributional-distance computation is not feasible with the current offline dataset, but the added analysis will clarify the degree of novelty.
Revision: partial
-
Referee: [Evaluation protocol] Evaluation protocol (presumably §4 or §5): The 97.5% pairwise discrimination accuracy is presented without details on data splits, number of evaluation pairs, sampling procedure, or statistical significance testing. This absence prevents assessment of whether the held-out result is robust or sensitive to unstated choices in pair construction.
Authors: We apologize for the missing protocol details. The revised manuscript will include an expanded evaluation subsection that specifies the train/validation/test split ratios, the exact number of held-out pairs, the procedure used to sample positive and negative pairs (including how negatives were drawn from the same state), and the results of statistical significance tests (paired t-test and bootstrap confidence intervals) confirming that accuracy is reliably above chance.
Revision: yes
-
Referee: [OSWorld experiments] OSWorld re-ranking results: The 6.9-point improvement is reported without the baseline Agent S3 success rate, number of evaluated tasks, run-to-run variance, or comparison against alternative re-ranking or reward-model baselines, making the magnitude and reliability of the gain difficult to interpret.
Authors: We will revise both the abstract and the experiments section to explicitly report the baseline Agent S3 success rate, the precise number of OSWorld tasks evaluated, standard deviation across repeated runs, and additional comparisons against simple heuristic re-rankers and an ablated reward model. These numbers and controls are already present in our internal experimental logs and will be added to the main text and a new supplementary table.
Revision: yes
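The task-category overlap tabulation mentioned in the first response could be computed along the following lines. This is a sketch under stated assumptions: the category labels are hypothetical, and the two statistics shown (Jaccard overlap of category sets, and the share of evaluation tasks whose category appears in training) are one possible choice rather than the authors' planned analysis.

```python
from collections import Counter

def category_overlap(train_categories, eval_categories):
    """Return (jaccard, covered_share) for two lists of per-task category labels."""
    train = Counter(train_categories)
    evaluation = Counter(eval_categories)
    shared = set(train) & set(evaluation)
    union = set(train) | set(evaluation)
    jaccard = len(shared) / len(union)
    # Share of evaluation tasks whose category also occurs in training.
    covered_share = sum(evaluation[c] for c in shared) / sum(evaluation.values())
    return jaccard, covered_share

# Hypothetical labels, for illustration only.
train_labels = ["browser", "browser", "file-manager"]
eval_labels = ["browser", "office", "office", "file-manager"]
jaccard, covered_share = category_overlap(train_labels, eval_labels)
```

A high covered share would suggest the benchmark is closer to the training distribution than "entirely unseen" implies, while a low one would support the out-of-distribution reading.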
Circularity Check
No circularity; empirical training and held-out evaluation
The paper describes training a reward model (IntentScore) on 398K offline GUI trajectories using standard contrastive alignment and margin-ranking objectives, followed by held-out accuracy measurement (97.5%) and deployment as a re-ranker on the unseen OSWorld benchmark. No equations, derivations, or first-principles claims are presented that reduce to fitted quantities by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central results rest on conventional supervised learning with train/test separation and external environment testing, making the derivation chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Offline GUI interaction steps collected from three operating systems are representative enough to train a model that generalizes to unseen agents and tasks.
Reference graph
Works this paper leans on
-
[1]
Agent s: An open agentic framework that uses computers like a human,
Saaket Agashe, Jiuzhou Han, Shuyu Zhu, and Diyi Yang. Agent s: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164,
-
[2]
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025. URL https://arxiv.org/abs/2504.00906, 2:10–16,
-
[3]
Image matching filtering and refinement by planes and beyond
Fabio Bellavia, Zhenjun Zhao, Luca Morelli, and Fabio Remondino. Image matching filtering and refinement by planes and beyond. arXiv preprint arXiv:2411.09484,
-
[4]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555,
-
[5]
Zeyu Fang and Tian Lan. Learning from random demonstrations: Offline reinforcement learning with importance-sampled diffusion models. arXiv preprint arXiv:2405.19878,
-
[6]
Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, and Tian Lan. Uncertainty mitigation and intent inference: A dual-mode human-machine joint planning system. arXiv preprint arXiv:2603.07822, 2026a.
Zeyu Fang, Zuyuan Zhang, Mahdi Imani, and Tian Lan. Manifold-constrained energy-based transition models for o...
-
[7]
Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model. In ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling,
-
[8]
Woojin Kim, Sangwon Lee, and Joonhyung Park. AgentNet: A scalable framework for multi-step agent trajectory generation. arXiv preprint arXiv:2501.00000,
-
[9]
Agentic test-time scaling for webagents
Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. arXiv preprint arXiv:2602.12276,
-
[10]
Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo. arXiv preprint arXiv:2603.13134, 2026a.
Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning. arXiv preprint arXiv:2603.16060, 2026b.
Yu Li,...
-
[11]
Shuai Liu, Peng Zhang, and Xi Chen. SEAgent: Bridging semantic understanding and action generation for computer-use agents. arXiv preprint arXiv:2503.00208, 2025a.
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platf...
-
[12]
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312,
-
[13]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, and Vladislav Kurenkov. Vintix: Action model via in-context reinforcement learning. In Forty-second International Conference on Machine Learning.
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, et al. UI-TARS: Pioneering automated GUI i...
-
[14]
doi: 10.1109/FLLM63129.2024.10852426.
Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadelmann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions. arXiv preprint arXiv:2501.16150,
-
[15]
Sizhe Tang, Rongqian Chen, and Tian Lan. Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents. arXiv preprint arXiv:2602.02995,
-
[16]
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024a.
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Y...
-
[17]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, Joel Tao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972,
-
[18]
GTA1: GUI Test-time Scaling Agent
Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent. arXiv preprint arXiv:2507.05791,
-
[19]
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343,
-
[20]
Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Lisfc-search: Lifelong search for network sfc optimization under non-stationary drifts. arXiv preprint arXiv:2602.14360, 2026a.
Zuyuan Zhang, Zeyu Fang, and Tian Lan. Structuring value representations via geometric coherence in markov decision processes. arXiv preprint arXiv:2602.02978, 2026b.
Zhenjun Zhao. Balf:...
-
[21]
Advances in global solvers for 3d vision
Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, and Javier Civera. Advances in global solvers for 3d vision. arXiv preprint arXiv:2602.14662,
-
[22]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854,
-
[23]
A survey on in-context reinforcement learning
Zhenwen Zhu, Yutong Wan, Kevin Zhang, Jing Shao, and Bin Ye. A survey on in-context reinforcement learning. arXiv preprint arXiv:2408.10706,
-
[24]
MPNet + SigLIP2 + larger model
Total ∼13M trainable parameters. B Data statistics. 45.9% of Ubuntu tasks contain at least one incorrect step, providing hard negatives for the margin ranking loss. All evaluation uses a task-level split of the Ubuntu subset: 85% train / 10% validation / 5% test, ensuring no step from a test task appears during training. The cross-OS data (Windows + Mac) is us...
-
[25]
Adding incorrect-step negatives (labeled rt =
Negative type matters more than quantity. Adjacent-step negatives (t±1) are the most effective training signal for Hard test performance, as they require distinguishing temporally close actions that share nearly identical UI context, a challenge shared by offline RL methods that must learn from suboptimal demonstrations without environment interaction (Fan...