arxiv: 2506.17697 · v3 · submitted 2025-06-21 · 💻 cs.AI

Beyond Syntax: Action Semantics Learning for App Agents

Bohan Tang, Dezhao Luo, Jianheng Liu, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3

keywords: Action Semantics Learning · App Agents · LLM Fine-Tuning · Out-of-Distribution Generalization · UI State Transitions · Semantic Similarity · Mobile Automation

The pith

App agents learn the meaning of actions as UI state changes rather than reproducing exact command strings, yielding better accuracy and robustness on unfamiliar apps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training app agents by forcing exact reproduction of ground-truth action strings leaves them brittle when interfaces vary even slightly. It replaces this with Action Semantics Learning, which defines an action's meaning as the change it produces in the on-screen state and trains the agent to match that meaning instead. A dedicated SEmantic Estimator module scores how closely a generated action's effect matches the intended one, and this score serves as the training signal in both supervised and reinforcement settings. A theoretical argument establishes that this semantic objective is more robust to out-of-distribution inputs than pure syntax matching. Experiments on offline and online benchmarks confirm higher accuracy and generalization.

Core claim

Action Semantics Learning trains agents to generate actions whose induced user-interface state transitions match those of the ground-truth actions, using a SEmantic Estimator to measure semantic similarity even when syntactic forms differ; the approach is shown to deliver higher accuracy, better generalization, and theoretically stronger robustness to out-of-distribution cases than the conventional syntax-learning paradigm.

What carries the argument

The SEmantic Estimator (SEE) module, which computes semantic similarity between candidate and ground-truth actions according to the UI state transitions each produces.
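
To make the idea of scoring actions by the transitions they produce concrete, here is a minimal sketch. It is not the paper's estimator: the set-of-labels state model, the three toy actions, and the Jaccard overlap used in `transition_similarity` are all assumptions chosen for illustration.

```python
from typing import Callable, FrozenSet

# Hypothetical toy model: a UI state is the set of visible element labels.
State = FrozenSet[str]
Action = Callable[[State], State]


def tap_settings_by_id(state: State) -> State:
    """Open Settings by tapping the element with a known id."""
    return frozenset({"settings_page", "back_button"})


def tap_settings_by_xy(state: State) -> State:
    """Open Settings by tapping the coordinates where its icon sits."""
    return frozenset({"settings_page", "back_button"})


def scroll_down(state: State) -> State:
    """Scroll the home screen; Settings stays closed."""
    return (state - {"clock_widget"}) | {"app_drawer"}


def transition_similarity(state: State, candidate: Action, truth: Action) -> float:
    """Jaccard overlap between the next states the two actions induce."""
    a, b = candidate(state), truth(state)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


home: State = frozenset({"settings_icon", "clock_widget"})
same = transition_similarity(home, tap_settings_by_xy, tap_settings_by_id)
diff = transition_similarity(home, scroll_down, tap_settings_by_id)
print(same, diff)  # prints: 1.0 0.0
```

The two tap actions are written differently but land on the same screen, so they score 1.0 against each other; the scroll lands elsewhere and scores 0.0. A string comparison would have treated both alternatives as equally wrong.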

If this is right

Agents can succeed on app versions or devices whose exact action syntax differs from training data as long as the resulting screen states remain similar.
The same SEE-based objective works for both supervised fine-tuning and reinforcement learning of the agent policy.
Theoretical robustness guarantees for out-of-distribution inputs follow directly from replacing string matching with state-transition matching.
Empirical gains appear consistently across multiple offline and online app-agent benchmarks.
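
The difference between the two training signals can be sketched on a three-action toy policy. Everything here is an assumption made for illustration: the candidate strings, the reward values standing in for estimator scores, and the use of an exact-expectation REINFORCE update on a softmax policy. It is not the paper's training procedure.

```python
import math
from typing import Dict, List

# Hypothetical candidates; the first is the ground-truth string.
CANDIDATES: List[str] = ["click(id=settings)", "click(x=540, y=1200)", "scroll(down)"]

# Assumed scores a semantic estimator might assign (illustrative values only).
SEMANTIC_REWARD: Dict[str, float] = {
    "click(id=settings)": 1.0,
    "click(x=540, y=1200)": 1.0,  # different string, same resulting screen
    "scroll(down)": 0.0,
}
# Exact-string reward: only the ground-truth string scores.
SYNTAX_REWARD: Dict[str, float] = {a: float(a == CANDIDATES[0]) for a in CANDIDATES}


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits: List[float], reward: Dict[str, float], lr: float = 1.0) -> List[float]:
    """One REINFORCE step using the exact expectation instead of samples."""
    probs = softmax(logits)
    expected = sum(p * reward[a] for p, a in zip(probs, CANDIDATES))
    # Gradient of E[r] w.r.t. logit i for a softmax policy: p_i * (r_i - E[r]).
    return [x + lr * p * (reward[a] - expected) for x, p, a in zip(logits, probs, CANDIDATES)]


def train(reward: Dict[str, float], steps: int = 200) -> List[float]:
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = policy_gradient_step(logits, reward)
    return softmax(logits)


syntax_policy = train(SYNTAX_REWARD)
semantic_policy = train(SEMANTIC_REWARD)
print([round(p, 3) for p in syntax_policy])
print([round(p, 3) for p in semantic_policy])
```

Under the exact-string reward the policy concentrates on the single ground-truth string and suppresses the coordinate-based click, even though it opens the same screen. Under the semantic reward both equivalent actions keep their probability mass and only the scroll is driven down.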

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The semantic framing may transfer to other agent domains where exact command formats vary but effects are stable, such as web or desktop automation.
State-transition similarity could serve as a natural reward signal for multi-step planning without requiring full environment models.
If SEE can be made lightweight, the method reduces reliance on exact demonstration data when deploying agents across many similar applications.

Load-bearing premise

The semantic similarity scores produced by the SEE module faithfully represent the intended state-transition effects without injecting their own biases or needing extra supervision at inference time.

What would settle it

A test set of actions that produce identical UI state changes but require syntactically different command strings; if agents trained with ASL maintain high success while syntax-trained agents drop sharply, the central claim is supported.
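
A scoring harness for such a test set might look like the sketch below. The transition table, the action strings, and the `Case` record are hypothetical; the point is only that the same predictions are scored twice, once by string equality and once by whether they reach the same next screen as the ground truth.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy environment: maps (screen, action string) to the next screen.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("home", "click(id=settings)"): "settings",
    ("home", "click(x=540, y=1200)"): "settings",
    ("home", "open_app(Settings)"): "settings",
    ("home", "scroll(down)"): "app_drawer",
}


def step(screen: str, action: str) -> str:
    """Return the next screen; unknown actions leave the screen unchanged."""
    return TRANSITIONS.get((screen, action), screen)


@dataclass(frozen=True)
class Case:
    screen: str
    truth: str      # ground-truth action string
    predicted: str  # action string produced by the agent under test


def exact_match_rate(cases: List[Case]) -> float:
    """Fraction of predictions identical to the ground-truth string."""
    return sum(c.predicted == c.truth for c in cases) / len(cases)


def transition_match_rate(cases: List[Case]) -> float:
    """Fraction of predictions that reach the same next screen as the ground truth."""
    return sum(step(c.screen, c.predicted) == step(c.screen, c.truth) for c in cases) / len(cases)


cases = [
    Case("home", "click(id=settings)", "click(id=settings)"),
    Case("home", "click(id=settings)", "click(x=540, y=1200)"),
    Case("home", "click(id=settings)", "open_app(Settings)"),
    Case("home", "click(id=settings)", "scroll(down)"),
]
print(exact_match_rate(cases), transition_match_rate(cases))  # prints: 0.25 0.75
```

Running both agents through `transition_match_rate` on such a set, rather than through `exact_match_rate`, is what would separate an agent that has learned the effect of an action from one that has memorized its spelling.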

Figures

Figures reproduced from arXiv: 2506.17697 by Bohan Tang, Dezhao Luo, Jianheng Liu, Jianye Hao, Jingxuan Chen, Jun Wang, Kun Shao, Shaogang Gong.

**Figure 2.** Examples of semantically equivalent actions: (a) and (b) lead to the same GUI state, while … [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

Original abstract

The recent development of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with proprietary LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity to train the App agents in generating actions aligned with the semantics of ground truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ASL shifts app agent fine-tuning to semantic state-transition similarity via the SEE module, which is the real novelty but also the least secured part of the robustness claim.

The main thing to know is that this paper replaces exact string reproduction in fine-tuning with a semantic objective based on UI state changes. That is the actual new piece, and it directly targets the out-of-distribution weakness in current syntax-based methods for mobile agents. They define action semantics as the state transition an action produces, then train with a SEmantic Estimator (SEE) that scores similarity between generated and ground-truth actions on that basis. SEE plugs into both supervised and reinforcement fine-tuning, and they claim a theoretical robustness advantage plus gains on offline and online benchmarks.

The experiments are the part that feels most concrete; running both settings gives a bit more weight to the generalization results than abstract-only claims usually carry. The citation pattern is straightforward and pulls in the right prior agent and LLM work without obvious omissions. The theoretical argument draws from programming-language semantics in a way that is at least internally consistent with their own definitions.

The soft spot is exactly where the stress test points: SEE has to track the intended state transitions faithfully for the OOD story to transfer to practice. The paper presents it as a flexible module, but the write-up gives limited independent checks on its training, possible biases, or whether it needs signals unavailable at inference. If SEE correlates with spurious features, the robustness edge shrinks. The theoretical comparison also assumes accurate semantic alignment, which is a strong premise.

This is for people building or fine-tuning open models for real GUI automation tasks. A reader working on practical agent generalization will get usable ideas from the experiments even if they want tighter validation on SEE. I would send it for peer review so referees can examine the SEE implementation details and the experimental controls.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Action Semantics Learning (ASL) for fine-tuning open-source LLMs as smartphone App agents. It contrasts a syntax learning paradigm (exact reproduction of ground-truth action strings) with a semantics-based approach that defines action semantics as the UI state transitions induced by each action, drawing on programming-language theory. A novel SEmantic Estimator (SEE) module computes semantic similarity to train agents (via SFT or RL) to produce actions aligned with ground-truth semantics even when syntactic forms differ. The paper asserts a theoretical demonstration of superior OOD robustness for ASL versus syntax learning and reports accuracy and generalization gains on multiple offline and online benchmarks.

Significance. If the SEE module reliably isolates intended state-transition semantics, ASL could provide a more robust training paradigm for mobile agents, mitigating OOD failures that arise from syntactic mismatch. The plug-in design for both supervised and reinforcement fine-tuning and the explicit theoretical contrast to syntax learning are constructive contributions to the agent-learning literature.

major comments (1)

The central OOD-robustness claim rests on the premise that SEE similarity faithfully reflects UI state-transition semantics without introducing its own biases or depending on supervision unavailable at inference. The abstract and SEE description introduce this module as a flexible component but supply no independent validation, ablation, or analysis of its training objective and correlation with spurious UI features; this assumption is load-bearing for transferring the theoretical robustness argument to the practical agent.

minor comments (1)

The acronym expansion 'SEmantic Estimator' uses inconsistent capitalization; standardize to 'Semantic Estimator' for readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address the major comment below, clarifying the role of the SEE module and indicating revisions that will strengthen the presentation of our claims.

Referee: The central OOD-robustness claim rests on the premise that SEE similarity faithfully reflects UI state-transition semantics without introducing its own biases or depending on supervision unavailable at inference. The abstract and SEE description introduce this module as a flexible component but supply no independent validation, ablation, or analysis of its training objective and correlation with spurious UI features; this assumption is load-bearing for transferring the theoretical robustness argument to the practical agent.

Authors: We thank the referee for this important observation. The SEE module is trained to estimate semantic similarity by comparing the UI state transitions induced by predicted versus ground-truth actions, following the programming-language-inspired definition in Section 3. The theoretical analysis in Section 4 establishes superior OOD robustness for semantic alignment relative to syntax matching, conditional on SEE providing a faithful semantic signal; the empirical gains on offline and online benchmarks provide supporting evidence that this signal is effective in practice. We agree, however, that the manuscript would benefit from more explicit validation of SEE itself. In the revised version we will add (i) an ablation isolating the SEE training objective and (ii) an analysis of its correlation with potential spurious UI features (e.g., visual layout elements unrelated to state transitions). Regarding inference-time supervision, SEE is used exclusively during training (both SFT and RL stages) to shape the agent’s policy; the deployed agent generates actions directly from the fine-tuned model without invoking SEE.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded externally and self-contained

The paper's core chain begins with an external definition of action semantics drawn from programming language theory (state transitions induced by actions), introduces the SEE module as a distinct learned component for semantic similarity estimation, and then claims a theoretical robustness comparison between ASL and syntax-based learning. No equation or step reduces the OOD robustness result to a fitted parameter, a self-referential definition, or a load-bearing self-citation. The theoretical demonstration operates under the stated assumption that SEE aligns with the external semantics definition, while experiments supply separate empirical validation. This structure keeps the derivation independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption imported from programming language theory and on the introduction of a new estimator module whose accuracy is not independently verified outside the training loop.

axioms (1)

domain assumption: Action semantics are defined as the state transition induced by the action in the user interface
Invoked to replace syntax matching with semantic alignment; directly shapes the learning objective.
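
One way to write this assumption down, in notation that is an assumption of this note rather than the paper's own: let $T$ be the UI transition function, $s$ the current screen state, $a$ an action string, and $a^\star$ the ground-truth action.

```latex
\[
  [\![ a ]\!](s) \;=\; T(s, a),
  \qquad
  a \equiv_s a' \;\iff\; T(s, a) = T(s, a').
\]
```

In these terms, syntax learning rewards only $a = a^\star$ as strings, while semantic alignment accepts any $a$ with $a \equiv_s a^\star$.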

invented entities (1)

SEmantic Estimator (SEE) · no independent evidence
purpose: Compute semantic similarity between generated actions and ground-truth actions to enable training on meaning rather than syntax
New module introduced to operationalize the semantic objective; no external falsifiable evidence provided beyond the paper's own experiments.

pith-pipeline@v0.9.0 · 5799 in / 1293 out tokens · 45371 ms · 2026-05-19T07:28:22.743338+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: we define the action semantics for App agents as the state transition induced by the action in the user interface... SEE... computes a semantic similarity to train the App agents

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Theorem 3.2... P^S(success|δ) − P^SFT(success|δ) > 0

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
