
arxiv: 2506.17697 · v3 · submitted 2025-06-21 · 💻 cs.AI

Beyond Syntax: Action Semantics Learning for App Agents

Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3

classification 💻 cs.AI
keywords Action Semantics Learning · App Agents · LLM Fine-Tuning · Out-of-Distribution Generalization · UI State Transitions · Semantic Similarity · Mobile Automation

The pith

App agents learn the meaning of actions as UI state changes rather than reproducing exact command strings, yielding better accuracy and robustness on unfamiliar apps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training app agents by forcing exact reproduction of ground-truth action strings leaves them brittle when interfaces vary even slightly. It replaces this with Action Semantics Learning, which defines an action's meaning as the change it produces in the on-screen state and trains the agent to match that meaning instead. A dedicated SEmantic Estimator module scores how closely a generated action's effect matches the intended one, and this score serves as the training signal in both supervised and reinforcement settings. A theoretical argument establishes that this semantic objective is more robust to out-of-distribution inputs than pure syntax matching. Experiments on offline and online benchmarks confirm higher accuracy and generalization.

Core claim

Action Semantics Learning trains agents to generate actions whose induced user-interface state transitions match those of the ground-truth actions, using a SEmantic Estimator to measure semantic similarity even when syntactic forms differ; the approach is shown to deliver higher accuracy, better generalization, and theoretically stronger robustness to out-of-distribution cases than the conventional syntax-learning paradigm.

What carries the argument

The SEmantic Estimator (SEE) module, which computes semantic similarity between candidate and ground-truth actions according to the UI state transitions each produces.
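
The contrast between string matching and state-transition matching can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `UIState` fields, the `step` transition function, and the action strings are hypothetical, and the hand-written field comparison merely stands in for SEE, which in the paper is a learned module whose architecture is not described on this page.

```python
from dataclasses import dataclass

# Hypothetical toy UI: a screen name plus the text in a search box.
@dataclass(frozen=True)
class UIState:
    screen: str
    search_text: str = ""

def step(state: UIState, action: str) -> UIState:
    """Toy transition function (an assumption, not the paper's environment)."""
    kind, _, arg = action.partition(":")
    if kind == "click" and arg in ("search_icon", "search_bar"):
        # Two different widgets that both open the search screen.
        return UIState(screen="search", search_text=state.search_text)
    if kind == "type" and state.screen == "search":
        return UIState(screen="search", search_text=arg)
    if kind == "back":
        return UIState(screen="home")
    return state  # unknown actions leave the state unchanged

def syntax_match(pred: str, gold: str) -> float:
    """Score used by exact string matching: 1.0 only for identical strings."""
    return 1.0 if pred == gold else 0.0

def semantic_similarity(state: UIState, pred: str, gold: str) -> float:
    """Fraction of state fields that agree after applying each action."""
    a, b = step(state, pred), step(state, gold)
    fields = ("screen", "search_text")
    return sum(getattr(a, f) == getattr(b, f) for f in fields) / len(fields)

home = UIState(screen="home")
print(syntax_match("click:search_bar", "click:search_icon"))               # 0.0
print(semantic_similarity(home, "click:search_bar", "click:search_icon"))  # 1.0
print(semantic_similarity(home, "back", "click:search_icon"))              # 0.5
```

In this toy setting the two click actions differ as strings yet reach the same state, so only the state-based score treats them as equivalent.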

If this is right

  • Agents can succeed on app versions or devices whose exact action syntax differs from training data as long as the resulting screen states remain similar.
  • The same SEE-based objective works for both supervised fine-tuning and reinforcement learning of the agent policy.
  • Theoretical robustness guarantees for out-of-distribution inputs follow directly from replacing string matching with state-transition matching.
  • Empirical gains appear consistently across multiple offline and online app-agent benchmarks.
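
How a similarity score might act as a training signal can be sketched with a tiny softmax policy over three candidate actions. This is a minimal illustration under stated assumptions, not the paper's objective: the candidate strings and reward values are hypothetical, and the update uses the exact gradient of expected reward for a softmax policy rather than any sampling-based or SFT loss the authors may use.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates; in the paper the semantic score would come from SEE.
candidates = ["click:search_icon", "click:search_bar", "back"]
semantic_reward = [1.0, 1.0, 0.0]  # both clicks reach the same screen
syntax_reward = [1.0, 0.0, 0.0]    # only the exact gold string counts

def train(rewards, steps=200, lr=0.5):
    """Gradient ascent on expected reward for a softmax policy."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d E[r] / d logit_i = p_i * (r_i - E[r]) for a softmax policy.
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

sem = train(semantic_reward)
syn = train(syntax_reward)
for name, p_sem, p_syn in zip(candidates, sem, syn):
    print(f"{name:20s} semantic={p_sem:.3f} syntax={p_syn:.3f}")
```

Under the semantic reward the policy keeps probability mass on both equivalent clicks, whereas the syntax reward drives the equivalent alternative toward zero.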

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The semantic framing may transfer to other agent domains where exact command formats vary but effects are stable, such as web or desktop automation.
  • State-transition similarity could serve as a natural reward signal for multi-step planning without requiring full environment models.
  • If SEE can be made lightweight, the method reduces reliance on exact demonstration data when deploying agents across many similar applications.

Load-bearing premise

The semantic similarity scores produced by the SEE module faithfully represent the intended state-transition effects without injecting their own biases or needing extra supervision at inference time.

What would settle it

A test set of actions that produce identical UI state changes but require syntactically different command strings; if agents trained with ASL maintain high success while syntax-trained agents drop sharply, the central claim is supported.
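
A minimal sketch of such a check, assuming a hypothetical transition table and hand-picked predictions, compares exact-match accuracy with state-match accuracy on the same cases. The screen names, action strings, and the no-op treatment of unknown actions are illustrative choices, not taken from the paper's benchmarks.

```python
# Hypothetical transition table: (screen, action) -> next screen.
TRANSITIONS = {
    ("home", "click:search_icon"): "search",
    ("home", "click:search_bar"): "search",
    ("home", "click:settings_gear"): "settings",
    ("home", "open_menu>settings"): "settings",
    ("search", "back"): "home",
    ("search", "click:close_x"): "home",
}

def next_screen(screen, action):
    # Unknown actions are treated as no-ops in this toy model.
    return TRANSITIONS.get((screen, action), screen)

# Each case: (start screen, gold action, prediction from some agent).
cases = [
    ("home", "click:search_icon", "click:search_bar"),      # paraphrase
    ("home", "click:settings_gear", "open_menu>settings"),  # paraphrase
    ("search", "back", "back"),                             # exact match
    ("search", "back", "click:search_icon"),                # wrong effect
]

def exact_match_accuracy(cases):
    """Share of predictions whose string equals the gold string."""
    return sum(pred == gold for _, gold, pred in cases) / len(cases)

def state_match_accuracy(cases):
    """Share of predictions that reach the same next screen as gold."""
    hits = sum(next_screen(s, pred) == next_screen(s, gold)
               for s, gold, pred in cases)
    return hits / len(cases)

print(exact_match_accuracy(cases))  # 0.25
print(state_match_accuracy(cases))  # 0.75
```

The gap between the two numbers is what the proposed test would measure: paraphrased actions count as failures under string matching but as successes under state matching.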

Figures

Figures reproduced from arXiv: 2506.17697 by Bohan Tang, Dezhao Luo, Jianheng Liu, Jianye Hao, Jingxuan Chen, Jun Wang, Kun Shao, Shaogang Gong.

Figure 1: Our action semantics learning framework.
Figure 2: Examples of semantically equivalent actions: (a) and (b) lead to the same GUI state, while ...

The recent development of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with proprietary LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity to train the App agents in generating actions aligned with the semantics of ground truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Action Semantics Learning (ASL) for fine-tuning open-source LLMs as smartphone App agents. It contrasts a syntax learning paradigm (exact reproduction of ground-truth action strings) with a semantics-based approach that defines action semantics as the UI state transitions induced by each action, drawing on programming-language theory. A novel SEmantic Estimator (SEE) module computes semantic similarity to train agents (via SFT or RL) to produce actions aligned with ground-truth semantics even when syntactic forms differ. The paper asserts a theoretical demonstration of superior OOD robustness for ASL versus syntax learning and reports accuracy and generalization gains on multiple offline and online benchmarks.

Significance. If the SEE module reliably isolates intended state-transition semantics, ASL could provide a more robust training paradigm for mobile agents, mitigating OOD failures that arise from syntactic mismatch. The plug-in design for both supervised and reinforcement fine-tuning and the explicit theoretical contrast to syntax learning are constructive contributions to the agent-learning literature.

major comments (1)
  1. The central OOD-robustness claim rests on the premise that SEE similarity faithfully reflects UI state-transition semantics without introducing its own biases or depending on supervision unavailable at inference. The abstract and SEE description introduce this module as a flexible component but supply no independent validation, ablation, or analysis of its training objective and correlation with spurious UI features; this assumption is load-bearing for transferring the theoretical robustness argument to the practical agent.
minor comments (1)
  1. The acronym expansion 'SEmantic Estimator' uses inconsistent capitalization; standardize to 'Semantic Estimator' for readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address the major comment below, clarifying the role of the SEE module and indicating revisions that will strengthen the presentation of our claims.

  1. Referee: The central OOD-robustness claim rests on the premise that SEE similarity faithfully reflects UI state-transition semantics without introducing its own biases or depending on supervision unavailable at inference. The abstract and SEE description introduce this module as a flexible component but supply no independent validation, ablation, or analysis of its training objective and correlation with spurious UI features; this assumption is load-bearing for transferring the theoretical robustness argument to the practical agent.

    Authors: We thank the referee for this important observation. The SEE module is trained to estimate semantic similarity by comparing the UI state transitions induced by predicted versus ground-truth actions, following the programming-language-inspired definition in Section 3. The theoretical analysis in Section 4 establishes superior OOD robustness for semantic alignment relative to syntax matching, conditional on SEE providing a faithful semantic signal; the empirical gains on offline and online benchmarks provide supporting evidence that this signal is effective in practice. We agree, however, that the manuscript would benefit from more explicit validation of SEE itself. In the revised version we will add (i) an ablation isolating the SEE training objective and (ii) an analysis of its correlation with potential spurious UI features (e.g., visual layout elements unrelated to state transitions). Regarding inference-time supervision, SEE is used exclusively during training (both SFT and RL stages) to shape the agent’s policy; the deployed agent generates actions directly from the fine-tuned model without invoking SEE.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded externally and self-contained


The paper's core chain begins with an external definition of action semantics drawn from programming language theory (state transitions induced by actions), introduces the SEE module as a distinct learned component for semantic similarity estimation, and then claims a theoretical robustness comparison between ASL and syntax-based learning. No equation or step reduces the OOD robustness result to a fitted parameter, a self-referential definition, or a load-bearing self-citation. The theoretical demonstration operates under the stated assumption that SEE aligns with the external semantics definition, while experiments supply separate empirical validation. This structure keeps the derivation independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption imported from programming language theory and on the introduction of a new estimator module whose accuracy is not independently verified outside the training loop.

axioms (1)
  • domain assumption: Action semantics are defined as the state transition induced by the action in the user interface
    Invoked to replace syntax matching with semantic alignment; directly shapes the learning objective.
invented entities (1)
  • SEmantic Estimator (SEE) · no independent evidence
    purpose: Compute semantic similarity between generated actions and ground-truth actions to enable training on meaning rather than syntax
    New module introduced to operationalize the semantic objective; no external falsifiable evidence provided beyond the paper's own experiments.


