Beyond Syntax: Action Semantics Learning for App Agents
Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3
The pith
App agents learn the meaning of actions as UI state changes rather than reproducing exact command strings, yielding better accuracy and robustness on unfamiliar apps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Action Semantics Learning trains agents to generate actions whose induced user-interface state transitions match those of the ground-truth actions, using a SEmantic Estimator to measure semantic similarity even when syntactic forms differ; the approach is shown to deliver higher accuracy, better generalization, and theoretically stronger robustness to out-of-distribution cases than the conventional syntax-learning paradigm.
What carries the argument
The SEmantic Estimator (SEE) module, which computes semantic similarity between candidate and ground-truth actions according to the UI state transitions each produces.
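To make the contrast with string matching concrete, here is a minimal sketch under stated assumptions. It is not the paper's SEE, which is a learned module; the toy `apply_action` environment, the action strings, and the Jaccard score over changed state fields are all hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, Set, Tuple

State = Dict[str, str]

def apply_action(state: State, action: str) -> State:
    """Hypothetical toy environment: returns the UI state after an action."""
    nxt = dict(state)
    if action in ("click(id=submit)", "tap(x=120, y=340)"):
        nxt["screen"] = "confirmation"
    elif action == "scroll(down)":
        nxt["scroll"] = "bottom"
    return nxt

def transition(state: State, action: str) -> Set[Tuple[str, str]]:
    """The (field, new value) pairs that the action changed."""
    nxt = apply_action(state, action)
    return {(k, v) for k, v in nxt.items() if state.get(k) != v}

def syntactic_match(pred: str, gold: str) -> float:
    """Syntax-learning view: credit only for the exact action string."""
    return 1.0 if pred == gold else 0.0

def semantic_similarity(state: State, pred: str, gold: str) -> float:
    """Semantics view: Jaccard overlap of the induced state changes."""
    a, b = transition(state, pred), transition(state, gold)
    if not a and not b:
        return 1.0  # both actions leave the state unchanged
    return len(a & b) / len(a | b)

state = {"screen": "form", "scroll": "top"}
print(syntactic_match("tap(x=120, y=340)", "click(id=submit)"))
print(semantic_similarity(state, "tap(x=120, y=340)", "click(id=submit)"))
```

In this toy setting the two differently written actions get no credit under string matching but full credit under transition matching, which is the distinction the review attributes to SEE.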
If this is right
- Agents can succeed on app versions or devices whose exact action syntax differs from training data as long as the resulting screen states remain similar.
- The same SEE-based objective works for both supervised fine-tuning and reinforcement learning of the agent policy.
- Theoretical robustness guarantees for out-of-distribution inputs follow directly from replacing string matching with state-transition matching.
- Empirical gains appear consistently across multiple offline and online app-agent benchmarks.
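The claim that one similarity signal can serve both supervised and reinforcement fine-tuning can be illustrated with a small sketch. The two loss functions below are assumptions made for illustration, not the paper's objectives: `soft_target_loss` is a cross-entropy against similarity-weighted targets, and `reinforce_loss` is a standard REINFORCE-style surrogate that uses the similarity score as the reward.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over candidate-action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_target_loss(logits: Sequence[float],
                     similarities: Sequence[float]) -> float:
    """Supervised sketch: cross-entropy against targets proportional to
    each candidate's similarity to the ground-truth action (assumed form)."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("at least one candidate needs positive similarity")
    targets = [s / total for s in similarities]
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)

def reinforce_loss(logits: Sequence[float], sampled_index: int,
                   similarity: float, baseline: float = 0.0) -> float:
    """RL sketch: REINFORCE surrogate with the similarity score as reward."""
    log_prob = math.log(softmax(logits)[sampled_index])
    return -(similarity - baseline) * log_prob

# Two candidates that are equally good semantically share the target mass.
print(soft_target_loss([0.0, 0.0], [1.0, 1.0]))
print(reinforce_loss([0.0, 0.0], sampled_index=1, similarity=1.0))
```

With one-hot similarities the supervised sketch reduces to ordinary cross-entropy on the ground-truth string, so syntax learning appears as a special case of this illustrative form.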
Where Pith is reading between the lines
- The semantic framing may transfer to other agent domains where exact command formats vary but effects are stable, such as web or desktop automation.
- State-transition similarity could serve as a natural reward signal for multi-step planning without requiring full environment models.
- If SEE can be made lightweight, the method reduces reliance on exact demonstration data when deploying agents across many similar applications.
Load-bearing premise
The semantic similarity scores produced by the SEE module faithfully represent the intended state-transition effects without injecting their own biases or needing extra supervision at inference time.
What would settle it
A test set of actions that produce identical UI state changes but require syntactically different command strings; if agents trained with ASL maintain high success while syntax-trained agents drop sharply, the central claim is supported.
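One way such a test set could be assembled is sketched below, assuming logged triples of (screen before, action string, screen after) are available. The `LOG` data, the screen identifiers, and the helper names are hypothetical. Actions are grouped by the transition they induce, and only groups with at least two distinct strings are kept.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical logged interactions: (screen before, action, screen after).
LOG = [
    ("form", "click(id=submit)", "confirmation"),
    ("form", "tap(x=120, y=340)", "confirmation"),
    ("form", "scroll(down)", "form_bottom"),
    ("home", "open_app(name=Mail)", "inbox"),
    ("home", "click(id=mail_icon)", "inbox"),
]

def equivalence_classes(
    log: List[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group action strings by the (before, after) transition they induce,
    keeping only transitions reachable by two or more distinct strings."""
    groups = defaultdict(set)
    for before, action, after in log:
        groups[(before, after)].add(action)
    return {k: sorted(v) for k, v in groups.items() if len(v) >= 2}

def build_test_pairs(
    log: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str]]:
    """Emit (screen, gold action, alternative action) triples where the
    alternative differs syntactically but induces the same transition."""
    pairs = []
    for (before, _after), actions in equivalence_classes(log).items():
        gold = actions[0]
        for alt in actions[1:]:
            pairs.append((before, gold, alt))
    return pairs

for pair in build_test_pairs(LOG):
    print(pair)
```

Each emitted triple gives a screen, a reference action, and a syntactically different action with the same effect, so exact-match and transition-match scoring can be compared on the same items.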
Original abstract
The recent development of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with proprietary LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity to train the App agents in generating actions aligned with the semantics of ground truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Action Semantics Learning (ASL) for fine-tuning open-source LLMs as smartphone App agents. It contrasts a syntax learning paradigm (exact reproduction of ground-truth action strings) with a semantics-based approach that defines action semantics as the UI state transitions induced by each action, drawing on programming-language theory. A novel SEmantic Estimator (SEE) module computes semantic similarity to train agents (via SFT or RL) to produce actions aligned with ground-truth semantics even when syntactic forms differ. The paper asserts a theoretical demonstration of superior OOD robustness for ASL versus syntax learning and reports accuracy and generalization gains on multiple offline and online benchmarks.
Significance. If the SEE module reliably isolates intended state-transition semantics, ASL could provide a more robust training paradigm for mobile agents, mitigating OOD failures that arise from syntactic mismatch. The plug-in design for both supervised and reinforcement fine-tuning and the explicit theoretical contrast to syntax learning are constructive contributions to the agent-learning literature.
major comments (1)
- The central OOD-robustness claim rests on the premise that SEE similarity faithfully reflects UI state-transition semantics without introducing its own biases or depending on supervision unavailable at inference. The abstract and SEE description introduce this module as a flexible component but supply no independent validation, ablation, or analysis of its training objective and correlation with spurious UI features; this assumption is load-bearing for transferring the theoretical robustness argument to the practical agent.
minor comments (1)
- The acronym expansion 'SEmantic Estimator' uses inconsistent capitalization; standardize to 'Semantic Estimator' for readability.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback on our manuscript. We address the major comment below, clarifying the role of the SEE module and indicating revisions that will strengthen the presentation of our claims.
- Referee: The central OOD-robustness claim rests on the premise that SEE similarity faithfully reflects UI state-transition semantics without introducing its own biases or depending on supervision unavailable at inference. The abstract and SEE description introduce this module as a flexible component but supply no independent validation, ablation, or analysis of its training objective and correlation with spurious UI features; this assumption is load-bearing for transferring the theoretical robustness argument to the practical agent.
  Authors: We thank the referee for this important observation. The SEE module is trained to estimate semantic similarity by comparing the UI state transitions induced by predicted versus ground-truth actions, following the programming-language-inspired definition in Section 3. The theoretical analysis in Section 4 establishes superior OOD robustness for semantic alignment relative to syntax matching, conditional on SEE providing a faithful semantic signal; the empirical gains on offline and online benchmarks provide supporting evidence that this signal is effective in practice. We agree, however, that the manuscript would benefit from more explicit validation of SEE itself. In the revised version we will add (i) an ablation isolating the SEE training objective and (ii) an analysis of its correlation with potential spurious UI features (e.g., visual layout elements unrelated to state transitions). Regarding inference-time supervision, SEE is used exclusively during training (both SFT and RL stages) to shape the agent’s policy; the deployed agent generates actions directly from the fine-tuned model without invoking SEE.
  Revision: yes
Circularity Check
No significant circularity; derivation grounded externally and self-contained
The paper's core chain begins with an external definition of action semantics drawn from programming language theory (state transitions induced by actions), introduces the SEE module as a distinct learned component for semantic similarity estimation, and then claims a theoretical robustness comparison between ASL and syntax-based learning. No equation or step reduces the OOD robustness result to a fitted parameter, a self-referential definition, or a load-bearing self-citation. The theoretical demonstration operates under the stated assumption that SEE aligns with the external semantics definition, while experiments supply separate empirical validation. This structure keeps the derivation independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Action semantics are defined as the state transition induced by the action in the user interface.
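The assumption above can be written out as a short formalization. The notation is an assumption introduced here for illustration, not taken from the paper: $S$ is the set of UI states, $A$ the set of action strings, and $T$ a hypothetical transition function.

```latex
% Assumed notation: T maps a UI state and an action string to the next UI state.
T : S \times A \to S

% Semantics of action a at state s: the transition it induces.
[\![ a ]\!]_s = \bigl(s,\; T(s, a)\bigr)

% Two actions are semantically equivalent at s when they induce the same
% transition, even if they differ as strings.
a \equiv_s a' \iff T(s, a) = T(s, a')
```

Under this reading, syntax learning rewards only $a = a'$ as strings, while a semantic objective rewards any $a'$ with $a' \equiv_s a$.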
invented entities (1)
- SEmantic Estimator (SEE): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: we define the action semantics for App agents as the state transition induced by the action in the user interface... SEE... computes a semantic similarity to train the App agents
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Theorem 3.2... P^S(success|δ) − P^SFT(success|δ) > 0
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.