Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Huan Liu; Li Gu; Linqiang Guo; Tse-Hsun Chen; Yang Wang; Yuanhao Yu; Zhixiang Chi; Zihuan Jiang; Ziqiang Wang

arXiv: 2606.19388 · v1 · pith:7GQNCE6Znew · submitted 2026-06-16 · 💻 cs.SE · cs.CL · cs.HC

Pith reviewed 2026-06-27 00:20 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.HC

keywords mobile agents · command-line interface · GUI agents · AndroidWorld · MobileWorld · CLI-Advantage Task Suite · coding agents · device access

The pith

Command-line access lets mobile agents outperform screen-based GUI approaches on device tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mobile agents have relied on graphical user interfaces: they read the phone screen and simulate taps or gestures. This paper shows that agents given direct command-line access to device services and data can complete more tasks than GUI-based methods on the same benchmarks. A coding agent reaches 71.8 percent success on AndroidWorld and 51.9 percent on MobileWorld, beating the best reproducible GUI baselines. The authors also introduce tasks involving bulk operations, filtering, and cross-app work that are difficult through the screen, where CLI agents succeed with about 42 percent fewer steps (10.7 versus 18.6 on average). Oracle CLI solutions indicate that most tasks in these environments can be solved without any visual interface.

Core claim

Coding agents equipped with command-line interface access, without any mobile-specific post-training, achieve higher task completion rates than GUI baselines on AndroidWorld and MobileWorld. The strongest configuration reaches 71.8 percent and 51.9 percent success, exceeding the best GUI results of 69.3 percent and 43.2 percent. Oracle CLI solutions solve 103 of 116 AndroidWorld tasks and 101 of 117 MobileWorld tasks. On the new CLI-Advantage Task Suite, CLI agents beat GUI agents in every category (bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state), averaging 10.7 steps per task versus 18.6 for GUI agents.
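The oracle counts and step averages above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives percentages from figures already quoted in this summary; the helper names `pct` and `step_reduction` are illustrative and do not come from the paper.

```python
# Figures quoted in the summary above; nothing here is newly measured.
ORACLE_SOLVED = {"AndroidWorld": (103, 116), "MobileWorld": (101, 117)}
CLI_STEPS, GUI_STEPS = 10.7, 18.6


def pct(solved: int, total: int) -> float:
    """Return solved/total as a percentage rounded to one decimal place."""
    return round(100.0 * solved / total, 1)


def step_reduction(cli: float, gui: float) -> float:
    """Return how many percent fewer steps CLI uses relative to GUI."""
    return round(100.0 * (1.0 - cli / gui), 1)


if __name__ == "__main__":
    for name, (solved, total) in ORACLE_SOLVED.items():
        print(f"{name}: {solved}/{total} = {pct(solved, total)}%")
    print(f"CLI step reduction: {step_reduction(CLI_STEPS, GUI_STEPS)}%")
```

The two oracle ratios reproduce the 88.8 and 86.3 percent ceilings quoted in the abstract, and the step averages correspond to a reduction of a little over 42 percent.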

What carries the argument

Command-line interface (CLI) access that grants direct reach to device services and data, allowing coding agents to execute tasks without screen perception or simulated touch input.
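To make the idea concrete, here is a minimal sketch of how the file-deletion task from Figure 1 (find the target, delete it, verify deletion) might be expressed as shell commands sent over `adb shell`. It is not the authors' trace, which is truncated in the caption. The directory `/sdcard/Movies` and the helper name `build_delete_plan` are assumptions made for illustration, and the function only builds the command lists without running them.

```python
import shlex


def build_delete_plan(directory: str, filename: str) -> list[list[str]]:
    """Return adb argv lists for the find, delete, and verify phases.

    The device path is quoted for the on-device shell so that names
    containing spaces or parentheses survive intact.
    """
    target = shlex.quote(f"{directory.rstrip('/')}/{filename}")
    return [
        ["adb", "shell", f"ls {target}"],                         # find
        ["adb", "shell", f"rm {target}"],                         # delete
        ["adb", "shell", f"test ! -e {target} && echo deleted"],  # verify
    ]


if __name__ == "__main__":
    # "/sdcard/Movies" is an assumed location, not taken from the paper.
    for argv in build_delete_plan("/sdcard/Movies", "backup_funny_zebra.mp4"):
        print(" ".join(argv))
```

Each list could be passed to `subprocess.run` on a host with a connected device. Three commands for three phases contrasts with the seven UI steps the caption reports for the GUI agent.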

If this is right

Every CLI agent outperforms every GUI baseline across all five categories in the CLI-Advantage Task Suite.
CLI agents complete tasks in 10.7 steps on average compared with 18.6 steps for GUI agents.
103 of 116 AndroidWorld tasks and 101 of 117 MobileWorld tasks are solvable through CLI commands.
All tested CLI configurations remain competitive with or exceed GUI performance without requiring mobile-specific training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents could use CLI for data-heavy tasks and fall back to GUI only when visual details are required.
Many everyday phone operations such as filtering records or cross-app data movement do not need a visual interface.
Future agent development may prioritize language-model tool use over vision models for mobile environments.
The open-sourced oracle solutions and task suite enable direct tests of whether CLI advantages persist across different agent designs.

Load-bearing premise

The command-line access given to the CLI agents reaches the same device services and data that the GUI baselines can reach through screen perception and UI interaction, so the comparison measures the interface rather than a difference in access.

What would settle it

A reproducible GUI agent reaching success rates above the reported oracle CLI levels of 88.8 percent on AndroidWorld or 86.3 percent on MobileWorld under matched evaluation conditions.

Figures

Figures reproduced from arXiv: 2606.19388 by Huan Liu, Li Gu, Linqiang Guo, Tse-Hsun Chen, Yang Wang, Yuanhao Yu, Zhixiang Chi, Zihuan Jiang, Ziqiang Wang.

**Figure 1.** GUI vs. CLI execution traces on a representative mobile task. Both agents are asked to delete backup_funny_zebra.mp4 from the Movies folder and proceed through three phases: find the target, delete the target, verify deletion. The GUI agent (top) consumes screenshots and emits taps, unfolding the phases over 7 UI steps (open Files → navigate menu → enter Movies → search → confirm delete). The CLI agent (bot…

**Figure 2.** Trajectory-level failure modes. Each bar denotes the fraction of failed trajectories exhibiting the corresponding taxonomy category (a trajectory may exhibit multiple). (a) Cross-paradigm: CLI vs. GUI on the same 103 CLI-solvable tasks of AndroidWorld. (b) Within-CLI tool effect: GPT-5.3 Codex is evaluated under bash-only vs. +tools (four generic, structured tool calls). Color encodes the top-level categ…

**Figure 3.** Semantic similarity between CLI-Advantage …

**Figure 4.** Verbatim Opus 4.7 prompt for the cluster→leaf mapping step (Appendix D.2). rubric_v2.md and rubric_v2_clarifications.md are passed in the system prompt. The judge runs in JSON mode with temperature 0, reasoning effort high, and a maximum of 4,096 output tokens. … whether the Opus 4.7 judge agreed with both human annotators, is too large to reproduce inline; it is released as cluster_to_leaf_mapping.csv in …

**Figure 5.** Failure-mode prevalence by model and harness in the CLI paradigm. Each column corresponds to one (model × harness) cell on the fair-comparison subset (1,508 failed trajectories across the 103 CLI-solvable AndroidWorld tasks). Bars give the prevalence of each sub-level category. The empty Codex / MiniMax slots under ClaudeCodeCLI reflect that ClaudeCodeCLI is a Claude-only harness. Disobey Specification sub…

**Figure 6.** Disobey Specification sub-cluster shifts under the four-tool wrapper (GPT-5.3 Codex). Each row is one fine-grained sub-cluster within Disobey Specification; bar length is the change in prevalence from bash-only to +tools on the fair-comparison subset. Blue bars (top) shrink under the wrapper; orange bars (bottom) grow. The italic column reports the absolute prevalence pair (bash-only → +tools).

Original abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLI agents beat the GUI baselines on these mobile tasks, but the win may trace to access differences rather than the interface itself.

The paper's core finding is that three off-the-shelf coding agents working through a CLI reach higher success rates than the listed GUI baselines on AndroidWorld and MobileWorld, with Claude Code at 71.8% and 51.9%. The authors also release a new CLI-Advantage Task Suite of 45 templates covering bulk operations, filtering, aggregation, cross-app flows, and hidden state, where every CLI run beats every GUI run and uses fewer steps.

What is actually new is the direct comparison of unmodified coding agents against reproducible GUI systems plus the oracle ceilings (88.8% and 86.3%) and the task suite itself. The decision to skip mobile-specific fine-tuning keeps the test clean and shows the paradigm difference without extra tuning.

The numbers are concrete and the open-source plan for agents, oracles, and infrastructure is useful. The category-wise results on the new suite give a clearer picture of where CLI helps than aggregate scores alone.

The soft spot is the access question. The abstract states CLI gives direct service and data access, but the write-up does not confirm that the GUI baselines operate under the same restrictions in the shared environments. If the CLI agents can read files or call system services that the GUI agents cannot reach through screen actions, the gap is partly an artifact of the setup rather than proof that CLI is inherently better. Missing error bars and task-selection details add to the uncertainty, though these are fixable.

This is for people working on device agents who want to test non-GUI routes. The empirical head-to-head and the new suite are concrete enough that a serious editor should send it to referees, with the expectation that the access parity and baseline details get tightened in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the command-line interface (CLI) paradigm deserves first-class status for mobile agents alongside the dominant GUI paradigm. It evaluates three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without mobile-specific post-training, reporting that Claude Code (Opus 4.7) achieves 71.8% and 51.9% success rates, outperforming reproducible GUI baselines (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld). Oracle CLI solutions reach 88.8% (103/116 tasks) and 86.3% (101/117 tasks). It introduces the CLI-Advantage Task Suite (45 templates in five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, hidden device state) where all CLI agents outperform all GUI baselines with fewer steps (10.7 vs. 18.6), and commits to open-sourcing implementations, oracles, the suite, and evaluation infrastructure.

Significance. If the results hold under clarified conditions, the work provides a substantive empirical challenge to GUI-centric mobile agent research by demonstrating concrete outperformance and efficiency gains on both standard benchmarks and a new task suite focused on everyday intents outside typical GUI scope. Credit is due for the oracle solutions establishing performance ceilings, the category-wise breakdowns, the step-count comparison, and the open-sourcing commitment, all of which directly support reproducibility and future work.

major comments (2)

[Abstract and evaluation setup] The central claim that CLI 'deserves first-class consideration' rests on outperformance, yet the manuscript notes CLI provides 'direct access to device services and data' without explicitly confirming whether the three GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B) were granted equivalent direct service/data access or were restricted solely to screen perception and UI actions (taps/swipes). This risks confounding paradigm differences with action-space differences; a table or subsection detailing per-agent available actions and data reach in the shared environments is needed to substantiate the comparison.
[Results reporting] AndroidWorld/MobileWorld tables: success rates are given as point estimates (e.g., 71.8% vs. 69.3%) with no error bars, standard deviations, or number of runs, weakening support for the claim of consistent outperformance, especially given the modest margins and the reader's note on moderate soundness.
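The second comment can be illustrated with a rough uncertainty estimate. The sketch below computes a Wilson score interval for a reported success rate, under the assumption (not stated in the paper) that each of the 116 AndroidWorld tasks behaves as an independent pass/fail trial. The function name `wilson_interval` is illustrative, and the rate is taken as a fraction because this summary reports only percentages.

```python
import math


def wilson_interval(rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a success rate over n trials.

    rate is a fraction in [0, 1]; z = 1.96 corresponds to roughly 95%
    coverage. Treating benchmark tasks as independent Bernoulli trials
    is a simplifying assumption.
    """
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (rate + z2 / (2.0 * n)) / denom
    half = z * math.sqrt(rate * (1.0 - rate) / n + z2 / (4.0 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(0.718, 116)  # best CLI result on AndroidWorld
    print(f"71.8% on 116 tasks: [{100 * low:.1f}%, {100 * high:.1f}%]")
    print("contains best GUI result (69.3%):", low < 0.693 < high)
```

Under these assumptions the interval spans roughly 63 to 79 percent, which comfortably includes the best GUI baseline at 69.3 percent. That is the sense in which a 2.5-point margin from single runs is hard to read as consistent outperformance.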

minor comments (2)

The manuscript would benefit from a brief limitations subsection addressing potential drawbacks of CLI access, such as security or permission model differences across devices.
Figure or table captions for the CLI-Advantage results could more explicitly note the step-count metric definition and how tasks were sampled from the 45 templates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important points for clarifying the evaluation setup and strengthening results reporting. We address each major comment below and commit to revisions that improve the paper without altering its core claims.

Referee: [Abstract and evaluation setup] The central claim that CLI 'deserves first-class consideration' rests on outperformance, yet the manuscript notes CLI provides 'direct access to device services and data' without explicitly confirming whether the three GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B) were granted equivalent direct service/data access or were restricted solely to screen perception and UI actions (taps/swipes). This risks confounding paradigm differences with action-space differences; a table or subsection detailing per-agent available actions and data reach in the shared environments is needed to substantiate the comparison.

Authors: We agree this clarification is necessary to avoid any perception of confounding. The GUI baselines operate exclusively under the standard GUI paradigm: they receive only screen-based perception (screenshots or UI hierarchies) and emit only UI actions (taps, swipes, text entry). In contrast, CLI agents receive shell access granting direct manipulation of device services, files, databases, and system APIs. To make this explicit, we will add a new subsection (Section 3.2) containing a comparison table that lists perception modalities, action spaces, and data reach for each agent category. This revision directly substantiates that observed differences arise from paradigm-level access rather than unequal capabilities within the same action space.

Revision: yes

Referee: [Results reporting] AndroidWorld/MobileWorld tables: success rates are given as point estimates (e.g., 71.8% vs. 69.3%) with no error bars, standard deviations, or number of runs, weakening support for the claim of consistent outperformance, especially given the modest margins and the reader's note on moderate soundness.

Authors: We acknowledge that single-run point estimates limit assessment of variability. All reported success rates reflect single deterministic evaluations over the complete task sets (116 tasks on AndroidWorld; 117 on MobileWorld). Multiple independent runs were not performed owing to the substantial compute cost of full agent trajectories. In revision we will (1) explicitly state the task counts and single-run nature in the tables and text, and (2) highlight the oracle ceilings (103/116 and 101/117) and the uniform CLI superiority across all five categories of the CLI-Advantage suite (with step-count comparison) as convergent evidence beyond the main-benchmark margins. We cannot retroactively add error bars without new experiments, but the added transparency and supporting analyses address the concern.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison

The paper reports direct performance measurements of coding agents versus GUI baselines on AndroidWorld and MobileWorld, plus a new CLI-Advantage Task Suite. No derivations, equations, fitted parameters, predictions, or self-referential definitions appear in the abstract or described content. All central claims rest on observed task success rates and step counts rather than any reduction to prior fitted quantities or self-citation chains. The comparison is self-contained against external benchmarks with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of AndroidWorld and MobileWorld as evaluation platforms and the fairness of CLI versus GUI access in those environments.

axioms (1)

Domain assumption: AndroidWorld and MobileWorld benchmarks are representative of real mobile agent tasks and user intents. The evaluation and claims rely on these benchmarks without additional justification or external validation in the abstract.

Reference graph

Works this paper leans on

70 extracted references · 2 linked inside Pith

[1]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631. Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2026. Why do multi-agent LLM sys- tems fail? InThe Thirty-ninth Annual Conference on Neural ...

Pith/arXiv arXiv 2026
[2]

9 Ryan Lopopolo

Llm-powered gui agents in phone automation: Surveying progress and prospects.arXiv preprint arXiv:2504.19838. 9 Ryan Lopopolo. 2026. Harness engineering: leveraging codex in an agent-first world.OpenAI engineering note. Renze Lou, Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Suman Nath, Wenpeng Yin, and Jianfeng Gao. 2026. The tool illusion: Rethinking...

arXiv 2026
[3]

Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You

Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082. Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. 2026. Swe-bench mobile: Can large language model agents develop industry-level mobile applications?arXiv preprint arXiv:2602.09540. Harsh Trivedi, Tushar Khot, Mare...

arXiv 2026
[4]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16022–16076

Appworld: A controllable world of apps and people for benchmarking interactive coding agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16022–16076. Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Hu...

Pith/arXiv arXiv 2024
[5]

InInternational Conference on Learning Representations, volume 2025, pages 5090– 5108

Os-atlas: Foundation action model for gen- eralist gui agents. InInternational Conference on Learning Representations, volume 2025, pages 5090– 5108. Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, and 1 others

2025
[6]

Advances in Neural Information Processing Systems, 37:52040–52094

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094. Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, and 1 others. 2026. Mobile- agent-v3. 5: Multi-platform fundamental gui agent...

arXiv 2026
[7]

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu

Springer. Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu
[8]

<SQL>" read-file <device_path> write-file <device_path>'<content>' find-files <directory>

Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20. Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and 1 oth- ers. 2025. Mai-ui technical report: Real-world centric foundation gui agents.ar...

arXiv 2025
[9]

When checking multiple plausible locations, batch them into one shell call (e.g.,`ls dirA; ls dirB; cat config`) instead of one probe per turn

**Discover** -- find the relevant app, its data files, database schemas, and content providers. When checking multiple plausible locations, batch them into one shell call (e.g.,`ls dirA; ls dirB; cat config`) instead of one probe per turn. ,→ ,→
[11]

Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see stale state

**Act** -- Android exposes three writing surfaces; pick by what the change has to drive, in this order:,→ (a) **Content providers** (`content insert/update/delete --uri ...`) when the data is exposed by a provider. Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see s...
[17]

,→ ,→ ,→ ,→ ,→

**Find the destination, don't invent it** -- when writing a new file, follow this precedence: (a) find an existing example of the same file type owned by the target app and reuse its directory and naming convention exactly (case, extension, subfolder depth); (b) if no example exists, infer the path from the app's storage (shared_prefs, content provider UR...
[21]

Unknown command

**Discover before guessing a verb** -- when you reach for a`cmd <service>`invocation, run`cmd <service> help`first instead of guessing the action name. The wrong verb returns "Unknown command" and looks like the service is unsupported, when in fact the right verb was one line away. ,→ ,→ ,→ ## Shell Escaping The`sql`,`read-file`, and`write-file`tools hand...
[23]

<SQL>" # run SQL on a device DB write-file <device_path>

ALWAYS call`finish`when done with a meaningful`--description`. AndroidWorld — mini-SWE-agent harness The mini-SWE-agent prompt is split into asystemtemplate and aninstancetemplate, rendered per turn with{{ task }}filled in by the harness. System template. You are an Android automation agent. You control an Android device by issuing commands through a CLI ...
[26]

Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see stale state

**Act** -- Android exposes three writing surfaces; pick by what the change has to drive, in this order:,→ 14 (a) **Content providers** (`content insert/update/delete --uri ...`) when the data is exposed by a provider. Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may se...
[32]

**Find the destination, don't invent it** -- when writing a new file, follow this precedence: (a) find an existing example of the same file type owned by the target app and reuse its directory and naming convention exactly (case, extension, subfolder depth);,→ (b) if no example exists, infer the path from the app's storage (shared_prefs, content provider ...
[36]

Unknown command

**Discover before guessing a verb** -- when you reach for a`cmd <service>`invocation, run`cmd <service> help`first instead of guessing the action name. The wrong verb returns "Unknown command" and looks like the service is unsupported, when in fact the right verb was one line away. ,→ ,→ ,→ ## Shell Escaping When a command contains quotes, parentheses, or...
[39]

<SQL>" # run SQL on a device DB write-file <device_path>

Respond with required format. </instructions> AndroidWorld — Terminus-2 harness The Terminus-2 harness uses a single template that interleaves task instruction and prior command output via the placeholders%INSTRUCTION%and%COMMAND_OUTPUT%. You are an Android automation agent. You control an Android device by issuing commands through a CLI wrapper.,→ ## Com...
[40]

When checking multiple plausible locations, batch them into one command (e.g.,`ls dirA; ls dirB; cat config`) instead of one probe per turn

**Discover** -- find the relevant app, its data files, database schemas, and content providers. When checking multiple plausible locations, batch them into one command (e.g.,`ls dirA; ls dirB; cat config`) instead of one probe per turn. ,→ ,→
[42]

Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see stale state

**Act** -- Android exposes three writing surfaces; pick by what the change has to drive, in this order:,→ (a) **Content providers** (`content insert/update/delete --uri ...`) when the data is exposed by a provider. Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see s...
[43]

Once confirmed, do not re-verify

**Verify** -- query back through the same surface a consumer would read from (a content URI, a service`dumpsys`, or the file the app reads), not the underlying row. Once confirmed, do not re-verify. ,→ ,→
[44]

## Principles

**Sync** -- force-stop the app so it re-reads from disk on next launch. ## Principles
[45]

**Never assume** -- discover paths, package names, schemas, column values, content provider URIs, filename conventions, and extensions from the device. For text->integer mappings inside a database (codes, types, statuses, priorities), recover the mapping by`SELECT DISTINCT <text_field>, <id_field>`from existing rows before inserting new ones. ,→ ,→ ,→
[46]

Before any temporal reasoning, get the current date, time, and timezone from the device

**Ground in reality** -- base decisions on observed device state. Before any temporal reasoning, get the current date, time, and timezone from the device. Timestamps in databases are often UTC -- convert to the device's timezone before comparing. Some apps store timestamps in milliseconds, others in seconds; check existing rows to confirm the unit. ,→ ,→ ,→
[47]

Do not paraphrase, regenerate, or reformat

**Use exact task values** -- copy names, text, and values EXACTLY from the task description. Do not paraphrase, regenerate, or reformat. For numeric values, preserve full precision from your source -- do not truncate. ,→ ,→
[48]

**Find the destination, don't invent it** -- when writing a new file, follow this precedence: (a) find an existing example of the same file type owned by the target app and reuse its directory and naming convention exactly (case, extension, subfolder depth);,→ (b) if no example exists, infer the path from the app's storage (shared_prefs, content provider ...
[49]

Forbidden time-sinks: extracting APKs (`unzip`/`xxd`/`strings`on `base.apk`or`classes.dex`), full`dumpsys package`/`pm dump`, recursive`find /`over the whole filesystem

**Probe budget** -- if 2-3 probes have not surfaced the answer, the next probe is unlikely to either; switch tactic. Forbidden time-sinks: extracting APKs (`unzip`/`xxd`/`strings`on `base.apk`or`classes.dex`), full`dumpsys package`/`pm dump`, recursive`find /`over the whole filesystem. If a text->integer mapping isn't recoverable from existing rows or fro...
[50]

Do not re-verify, sanity-check, or explore further.,→

**Stop when done** -- once you have a verified answer or a successful write+sync, call`finish` immediately. Do not re-verify, sanity-check, or explore further.,→
[51]

After writing, verify through the same path the consumer would use (a content URI, a service's`dumpsys`/`service call`output, or the file the app reads), not the underlying row

**Verify through the surface a consumer reads from** -- the row a setting was written to is not always the surface a consumer queries. After writing, verify through the same path the consumer would use (a content URI, a service's`dumpsys`/`service call`output, or the file the app reads), not the underlying row. If a service-level read disagrees with the s...
[52]

Unknown command

**Discover before guessing a verb** -- when you reach for a`cmd <service>`invocation, run`cmd <service> help`first instead of guessing the action name. The wrong verb returns "Unknown command" and looks like the service is unsupported, when in fact the right verb was one line away. ,→ ,→ ,→ 17 ## Android-Specific Patterns These behaviors are non-obvious: ...
[53]

Give ONLY what was asked -- no extra commentary.,→

For information-retrieval tasks,`--description`in`finish`IS your answer. Give ONLY what was asked -- no extra commentary.,→
[54]

ALWAYS call`finish`when done with a meaningful`--description`
[55]

<pattern>

Respond ONLY with valid JSON, no extra text before or after. ## Task %INSTRUCTION% ## Last command output %COMMAND_OUTPUT% MobileWorld — Claude Code SDK harness You are an Android automation agent. You control an Android device through a typed CLI tool suite. You have no access to the screen - no screenshots, no UI hierarchy dumps, no tap/swipe/keyevent i...
[58]

during",

**Network backend** -- the app makes outbound HTTP where the endpoint is discoverable from on-device artifacts only. If you cannot observe the endpoint and schema, the backend is not your writing surface -- fall back to on-device state. ,→ ,→ 18 Termination: - Call`finish`when the task is done. For information tasks (find a value, answer a question), the ...
[69]

echo'foo (bar) baz'| wc -w

**Stop when done** -- once you have a verified answer or a successful write + sync, call `finish`immediately. Do not re-verify, sanity-check, or explore further.,→ ## Shell Escaping The typed tools (`sql`,`pg`,`read-file`,`write-file`,`json-read`,`json-write`) handle escaping internally -- prefer them. When you must compose raw shell with quotes, parens, ...
[71]

<pattern>

ALWAYS call`finish`when done, with a meaningful`--description`. MobileWorld — mini-SWE-agent harness System template. You are an Android automation agent. You control an Android device through a typed CLI tool suite as defined in <command_space>. Format your response as shown in <format_example>.,→ <command_space> # --- on-device tools --- find-files <dir...
[74]

during",

**Network backend** -- the app makes outbound HTTP where the endpoint is discoverable from on-device artifacts only. If you cannot observe the endpoint and schema, the backend is not your writing surface -- fall back to on-device state. ,→ ,→ Termination: - Call`finish`when the task is done. For information tasks (find a value, answer a question), the `--...
[85]

echo'foo (bar) baz'| wc -w

**Stop when done** -- once you have a verified answer or a successful write + sync, call `finish`immediately. Do not re-verify, sanity-check, or explore further.,→ ## Shell Escaping The typed tools (`sql`,`pg`,`read-file`,`write-file`,`json-read`,`json-write`) handle escaping internally -- prefer them. When you must compose raw shell with quotes, parens, ...

ALWAYS call `finish` when done, with a meaningful `--description`.

Respond with required format (THOUGHT line + a single ```bash command fence).
</instructions>

MobileWorld — Terminus-2 harness

You are an Android automation agent. You control an Android device through a typed CLI tool suite. You have no access to the screen - no screenshots, no UI hierarchy dumps, no tap/swipe/keyevent input. All interaction is through sh...

**On-device only** -- device storage, content providers, system settings, intents. Use the typed device tools first; `adb shell` is the escape hatch for `pm`, `am`, `dumpsys`, `settings`, `getprop`, ad-hoc composition.

**Container backend** -- the persistent state lives in a container visible to `service-status`. Use `pg`/`backend-exec` with a `<backend_grep>` that matches an actual row in `service-status` output. Do NOT guess names.

**Network backend** -- the app makes outbound HTTP where the endpoint is discoverable from on-device artifacts only. If you cannot observe the endpoint and schema, the backend is not your writing surface -- fall back to on-device state.

Termination:
- Call `finish` when the task is done. For information tasks (find a value, answer a question), the `--...

**Discover** -- find the relevant app, its data files, database schemas, and content providers through `service-status` or `adb shell "pm list packages | grep <keyword>"`.

**Inspect** -- read existing data before modifying. Understand formats, ID mappings, timestamp units, and naming conventions. Match observed patterns exactly when creating new entries.

**Act** -- pick the writing surface by what the change has to drive

**Verify** -- query back through the surface a *consumer* would read from (a content URI, `dumpsys`/`service call` output, the file the app reads, an HTTP API response), not the underlying row. If a service-level read disagrees with the row, the service-level value is what matters.

**Sync** -- force-stop the app so it re-reads on next launch.

## Principles

**Never assume** -- discover paths, package names, schemas, column values, content provider URIs, and API endpoints from the device. Do not rely on prior knowledge.

**Ground in reality** -- `adb shell date` before any temporal reasoning. Timestamps in DBs are often UTC; some apps use seconds, others milliseconds -- check existing rows to confirm the unit.

**Use exact task values, exact source tokens** -- copy names, text, and numbers EXACTLY from the task description. When the answer is extracted from a document/app/DB, copy the literal token verbatim -- do not translate, expand abbreviations, reformat dates, or round numbers. Prefer raw units over human-readable ones (`stat -c %s` over `du -sh`).

**Find the destination, don't invent it** -- for new files, follow this precedence: (a) a path explicitly named in app source / strings / bundle / manifest / shared_prefs; (b) reuse the directory and naming convention of an existing example of the same file type owned by the target app; (c) infer from content URIs or manifest data dirs; (d) fall back to a...

**Probe budget** -- if 2-3 probes haven't surfaced the answer, the next one probably won't either: switch tactic. Forbidden time-sinks: extracting APKs (`unzip`/`xxd`/`strings` on `base.apk` or `classes.dex`), full `dumpsys package`/`pm dump`, recursive `find /` over the whole filesystem.

**Stop when done** -- once you have a verified answer or a successful write + sync, call `finish` immediately. Do not re-verify, sanity-check, or explore further.

## Android-Specific Patterns
- **Provider notifications**: after `content insert/update/delete` (or a write through a content provider), the provider emits change notifications itself -- manua...

For information tasks, `--description` in `finish` IS your answer. Give ONLY what was asked -- no commentary, no prefixes, exact format.

ALWAYS call `finish` when done, with a meaningful `--description`.

## Task
%INSTRUCTION%

## Last command output
%COMMAND_OUTPUT%

Anatomy of a prompt

All six prompts share the same skeleton, organised around the four categories of guidance described in Section 3:

Four-phase interaction cycle: discover relevant data, inspect existing state, act through the terminal interface, and verify the result. The cycle appears verbatim under the Approach heading in each prompt and is preceded by three preflight questions (“what has been observed,” “what is still an assumption,” “what is the most likely failure mode”). Wher...

Prioritised hierarchy of mechanisms for modifying device state. A ranked list of write paths to try in order: content providers first (so derived columns and change notifications fire), then system service commands (`cmd <service>`, `service call`, `am broadcast`) for live-service effects, then direct SQLite or file writes under `/data/data/<pkg>/` for app-privat... On MobileWorld the same hierarchy is restated as three writing surfaces (on-device, container backend, network backend) so the agent picks the layer by where the consumer reads.
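
The ranked ordering described above can be illustrated with a small selector. This is only a sketch: the labels `content_provider`, `system_service`, and `direct_write` and the function `pick_write_path` are hypothetical names chosen for illustration, not identifiers from the prompts or the harness.

```python
from typing import Optional, Set

# Ranked write paths, highest priority first, following the order in the text.
# The string labels are illustrative assumptions.
WRITE_PATHS = ["content_provider", "system_service", "direct_write"]


def pick_write_path(available: Set[str]) -> Optional[str]:
    """Return the highest-priority write path present in `available`, or None."""
    for path in WRITE_PATHS:
        if path in available:
            return path
    return None


# No provider exposes the data here, so the service command outranks a raw write.
print(pick_write_path({"direct_write", "system_service"}))  # system_service
```

In practice the choice also depends on what the change has to drive, so a real agent would discover which surfaces exist on the device before ranking them.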

Efficiency strategies. Two rules: batch related probes into a single shell call rather than one probe per turn, and respect a small probe budget: if 2–3 probes have not surfaced the answer, switch tactic rather than continuing with forbidden time-sinks (extracting APKs, full `dumpsys package`, recursive `find /`). Together they cap exploration cost while keeping the agent deliberate per call.
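
The two efficiency rules can be sketched in a few lines. The helper `batch_probes` and the class `ProbeBudget` are hypothetical names introduced here for illustration; the prompts state the rules in prose and do not prescribe an implementation.

```python
def batch_probes(commands):
    """Join several read-only probes into one shell call, separated by ';'."""
    return "; ".join(cmd.strip() for cmd in commands if cmd.strip())


class ProbeBudget:
    """Counts consecutive unsuccessful probes and signals when to switch tactic."""

    def __init__(self, limit=3):
        self.limit = limit  # assumed budget, matching the "2-3 probes" guidance
        self.misses = 0

    def record(self, found_answer):
        """Record one probe outcome; return True if the agent should switch tactic."""
        if found_answer:
            self.misses = 0
            return False
        self.misses += 1
        return self.misses >= self.limit


print(batch_probes(["ls dirA", "ls dirB", "cat config"]))
# ls dirA; ls dirB; cat config
```

Batching trades one longer command for fewer turns, while the budget turns "switch tactic" into an explicit signal rather than a judgment made afresh on every probe.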

Platform-specific patterns for file synchronisation and data persistence. A short list of non-obvious Android behaviours: media-scanner broadcast after writes to shared storage, the content-provider self-notify semantics that make manual broadcasts unnecessary for canonical providers, and the recovery procedure for a corrupt SQLite file (remove th... On MobileWorld this category also covers the device file-system layout, database discovery patterns, and the half-open range-query convention for UTC timestamps.
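
The half-open range-query convention mentioned above can be shown with a short sketch. The table `events`, its column `ts_ms`, the millisecond unit, and the helper `local_day_bounds_ms` are all assumptions made for illustration; a real app may store seconds instead, which is why the prompts ask the agent to check existing rows first.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def local_day_bounds_ms(year, month, day, utc_offset_hours):
    """Return [start, end) of a local calendar day as UTC epoch milliseconds."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


# Hypothetical table holding millisecond UTC timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, ts_ms INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("before", 1704067199999), ("start", 1704067200000), ("next day", 1704153600000)],
)

start_ms, end_ms = local_day_bounds_ms(2024, 1, 1, utc_offset_hours=0)
# Half-open range: include the lower bound, exclude the upper bound.
rows = conn.execute(
    "SELECT title FROM events WHERE ts_ms >= ? AND ts_ms < ? ORDER BY ts_ms",
    (start_ms, end_ms),
).fetchall()
print(rows)  # [('start',)]
```

Using `>=` on the lower bound and `<` on the upper bound means adjacent days never overlap, so a row that falls exactly on midnight belongs to exactly one day.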

Output structure. Each harness imposes a different response shape. The Claude Code SDK harness uses native tool calls and therefore needs no response-format block: the agent invokes the shim binaries directly via the SDK’s Bash tool. The mini-SWE-agent harness asks for a free-form THOUGHT: line followed by exactly one fenced ```bash command. The Termin...
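
A parser for the mini-SWE-agent response shape described above might look like the following. This is a sketch under stated assumptions: `parse_response` is a hypothetical helper, not the harness's actual parser, and the argument passed to `find-files` in the sample reply is invented for the example.

```python
import re

FENCE = "`" * 3  # built programmatically so the example can sit inside a fence

# One THOUGHT: line, then exactly one fenced bash block.
_BLOCK = re.compile(FENCE + r"bash\n(.*?)\n" + FENCE, re.DOTALL)


def parse_response(text):
    """Return (thought, command) or raise ValueError if the shape is wrong."""
    thought_match = re.search(r"^THOUGHT:\s*(.*)$", text, re.MULTILINE)
    blocks = _BLOCK.findall(text)
    if thought_match is None:
        raise ValueError("missing THOUGHT: line")
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one bash block, found {len(blocks)}")
    return thought_match.group(1).strip(), blocks[0].strip()


reply = (
    "THOUGHT: list the app databases first\n"
    + FENCE + "bash\nfind-files /data/data\n" + FENCE
)
print(parse_response(reply))
# ('list the app databases first', 'find-files /data/data')
```

Rejecting replies with zero or several command blocks keeps each turn to a single action, which is what the "exactly one" wording asks for.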

Section ordering and partitioning. The Claude SDK prompt is a single block. The mini-SWE-agent prompt is split into a system template (role + command space + format example) and an instance template (constraint + approach + principles + rules, rendered per turn with the task interpolated) following its original design. The Terminus-2 prompt is a single t...

No verifier-internals leakage. The trajectory must be producible from device state and the task goal alone, without reading the verifier source, its private fixtures (e.g. parameter-bearing attributes on the task class), or any gold-answer file.

No hardcoded answers. Numeric and string answers for information-retrieval tasks must be computed from data the verifier itself reads, not hardcoded as constants in the trajectory.

(2) The oracle-agent loop

We treat oracle construction itself as a human–LLM collaboration: a separate oracle-agent (Claude Code, running on the host) is given a far richer context ...

State-check. The grader re-reads device state after the agent finishes and compares it to the post-action expectation. The agent’s natural-language output is ignored.

    def is_successful(self, env) -> float:
        rows = adb_utils.issue_generic_request(
            "shell", "sqlite3 ... 'SELECT ... FROM ... WHERE ...'", env)
        return 1.0 if matches_expected(rows) else 0.0
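
To make the state-check idea concrete without a device, the following sketch runs the same pattern against an in-memory SQLite database. Everything here is an assumption for illustration: the `expense` table, its columns, and the helpers `matches_expected` and `state_check` stand in for the elided query and comparison in the listing above.

```python
import sqlite3


def matches_expected(rows, expected):
    """Order-insensitive comparison of query results with the expected rows."""
    return sorted(rows) == sorted(expected)


def state_check(conn, query, expected):
    """Re-read state with `query` and score 1.0 on an exact match, else 0.0."""
    rows = conn.execute(query).fetchall()
    return 1.0 if matches_expected(rows, expected) else 0.0


# Hypothetical expense table; the illustrative task deletes rows under 100 cents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expense (name TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO expense VALUES (?, ?)",
    [("gum", 50), ("lunch", 1200), ("taxi", 3000)],
)

query = "SELECT name FROM expense ORDER BY name"
expected_after = [("lunch",), ("taxi",)]

score_before = state_check(conn, query, expected_after)
conn.execute("DELETE FROM expense WHERE amount_cents < 100")  # the agent's action
score_after = state_check(conn, query, expected_after)
print(score_before, score_after)  # 0.0 1.0
```

Because the score depends only on the re-read rows, anything the agent says about its own success has no effect on the result.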

Cache-match. The grader inspects the agent’s FINISH(content=...) payload. Every expected substring (computed in initialize_task from the seed) must appear; substring containment, case-sensitive, no whitespace normalization.

    def is_successful(self, env) -> float:
        return 1.0 if all(
            s in env.interaction_cache for s in self._expected
        ) else 0.0

Example 1 Goa...

Delete all expenses in Pro Expense that are less than $1.00 (less than 100 cents)

Read the 8 exemplars. State the dominant Android failure mechanism in one sentence, in your own words (no taxonomy jargon).

Walk the tree top-down. Which top-level class (Execution, Coherence, or Verification) best describes the failure? Justify in one sentence.

Within the chosen class, pick the single best-fitting leaf.

If two leaves are equally plausible, consult tie_breakers and cite the rule number that decides (e.g. "TB7: timezone misinterpretation -> Disobey Specification, wrong output protocol").

Optionally record one secondary leaf, only if at least 3 of the 8 exemplars also support it. Never list more than one secondary.

OUTPUT (strict JSON, no prose outside the object)
{ "cluster_id": "...", "mechanism": "one sentence in plain English", "primary_class": "Execution | Coherence | Verification", "primary_leaf": "...", "tie_breaker": "TB# or null",...
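
A consumer of these labels would likely validate each reply before aggregating it. The sketch below checks only the fields visible in the schema above; the remaining fields are elided in the source and are deliberately left unchecked. The function `validate_label`, the reading of "TB# or null" as either JSON null or a string such as "TB7", and the sample values are assumptions for illustration.

```python
import json
import re

PRIMARY_CLASSES = {"Execution", "Coherence", "Verification"}
# Only the fields visible in the schema; any further fields are not checked.
REQUIRED = ("cluster_id", "mechanism", "primary_class", "primary_leaf", "tie_breaker")


def validate_label(raw):
    """Parse a strict-JSON label and return a list of problems (empty if none)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value must be an object"]
    problems = [f"missing field: {key}" for key in REQUIRED if key not in obj]
    if obj.get("primary_class") not in PRIMARY_CLASSES:
        problems.append("primary_class must be Execution, Coherence, or Verification")
    tb = obj.get("tie_breaker")
    if tb is not None and not (isinstance(tb, str) and re.fullmatch(r"TB\d+", tb)):
        problems.append("tie_breaker must be null or of the form TB<number>")
    return problems


sample = json.dumps({
    "cluster_id": "c-01",            # hypothetical values throughout
    "mechanism": "The agent compared local time against UTC rows.",
    "primary_class": "Execution",
    "primary_leaf": "example-leaf",
    "tie_breaker": "TB7",
})
print(validate_label(sample))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to log every defect in a malformed reply at once.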


**Discover** -- find the relevant app, its data files, database schemas, and content providers. When checking multiple plausible locations, batch them into one shell call (e.g., `ls dirA; ls dirB; cat config`) instead of one probe per turn.

**Act** -- Android exposes three writing surfaces; pick by what the change has to drive, in this order:
(a) **Content providers** (`content insert/update/delete --uri ...`) when the data is exposed by a provider. Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see s...

**Find the destination, don't invent it** -- when writing a new file, follow this precedence: (a) find an existing example of the same file type owned by the target app and reuse its directory and naming convention exactly (case, extension, subfolder depth); (b) if no example exists, infer the path from the app's storage (shared_prefs, content provider UR...

**Discover before guessing a verb** -- when you reach for a `cmd <service>` invocation, run `cmd <service> help` first instead of guessing the action name. The wrong verb returns "Unknown command" and looks like the service is unsupported, when in fact the right verb was one line away.

## Shell Escaping
The `sql`, `read-file`, and `write-file` tools hand...

ALWAYS call `finish` when done with a meaningful `--description`.

AndroidWorld — mini-SWE-agent harness

The mini-SWE-agent prompt is split into a system template and an instance template, rendered per turn with {{ task }} filled in by the harness.

System template.

You are an Android automation agent. You control an Android device by issuing commands through a CLI ...

**Act** -- Android exposes three writing surfaces; pick by what the change has to drive, in this order:
(a) **Content providers** (`content insert/update/delete --uri ...`) when the data is exposed by a provider. Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may se...

**Find the destination, don't invent it** -- when writing a new file, follow this precedence: (a) find an existing example of the same file type owned by the target app and reuse its directory and naming convention exactly (case, extension, subfolder depth); (b) if no example exists, infer the path from the app's storage (shared_prefs, content provider ...

**Discover before guessing a verb** -- when you reach for a `cmd <service>` invocation, run `cmd <service> help` first instead of guessing the action name. The wrong verb returns "Unknown command" and looks like the service is unsupported, when in fact the right verb was one line away.

## Shell Escaping
When a command contains quotes, parentheses, or...

Respond with required format.
</instructions>

AndroidWorld — Terminus-2 harness

The Terminus-2 harness uses a single template that interleaves task instruction and prior command output via the placeholders %INSTRUCTION% and %COMMAND_OUTPUT%.

You are an Android automation agent. You control an Android device by issuing commands through a CLI wrapper.

## Com...

**Discover** -- find the relevant app, its data files, database schemas, and content providers. When checking multiple plausible locations, batch them into one command (e.g., `ls dirA; ls dirB; cat config`) instead of one probe per turn.

**Act** -- Android exposes three writing surfaces; pick by what the change has to drive, in this order:
(a) **Content providers** (`content insert/update/delete --uri ...`) when the data is exposed by a provider. Direct sqlite to a provider-backed database can skip derived/indexed columns and change notifications, so a ContentResolver consumer may see s...

**Verify** -- query back through the same surface a consumer would read from (a content URI, a service `dumpsys`, or the file the app reads), not the underlying row. Once confirmed, do not re-verify.

**Sync** -- force-stop the app so it re-reads from disk on next launch.

## Principles

**Never assume** -- discover paths, package names, schemas, column values, content provider URIs, filename conventions, and extensions from the device. For text->integer mappings inside a database (codes, types, statuses, priorities), recover the mapping by `SELECT DISTINCT <text_field>, <id_field>` from existing rows before inserting new ones.

**Ground in reality** -- base decisions on observed device state. Before any temporal reasoning, get the current date, time, and timezone from the device. Timestamps in databases are often UTC -- convert to the device's timezone before comparing. Some apps store timestamps in milliseconds, others in seconds; check existing rows to confirm the unit.

**Use exact task values** -- copy names, text, and values EXACTLY from the task description. Do not paraphrase, regenerate, or reformat. For numeric values, preserve full precision from your source -- do not truncate.

**Find the destination, don't invent it** -- when writing a new file, follow this precedence: (a) find an existing example of the same file type owned by the target app and reuse its directory and naming convention exactly (case, extension, subfolder depth); (b) if no example exists, infer the path from the app's storage (shared_prefs, content provider ...

**Probe budget** -- if 2-3 probes have not surfaced the answer, the next probe is unlikely to either; switch tactic. Forbidden time-sinks: extracting APKs (`unzip`/`xxd`/`strings` on `base.apk` or `classes.dex`), full `dumpsys package`/`pm dump`, recursive `find /` over the whole filesystem. If a text->integer mapping isn't recoverable from existing rows or fro...

**Stop when done** -- once you have a verified answer or a successful write+sync, call `finish` immediately. Do not re-verify, sanity-check, or explore further.

**Verify through the surface a consumer reads from** -- the row a setting was written to is not always the surface a consumer queries. After writing, verify through the same path the consumer would use (a content URI, a service's `dumpsys`/`service call` output, or the file the app reads), not the underlying row. If a service-level read disagrees with the s...

**Discover before guessing a verb** -- when you reach for a `cmd <service>` invocation, run `cmd <service> help` first instead of guessing the action name. The wrong verb returns "Unknown command" and looks like the service is unsupported, when in fact the right verb was one line away.

## Android-Specific Patterns
These behaviors are non-obvious: ...

For information-retrieval tasks, `--description` in `finish` IS your answer. Give ONLY what was asked -- no extra commentary.

ALWAYS call `finish` when done with a meaningful `--description`.

[32] [55]

<pattern>

Respond ONLY with valid JSON, no extra text before or after. ## Task %INSTRUCTION% ## Last command output %COMMAND_OUTPUT% MobileWorld — Claude Code SDK harness You are an Android automation agent. You control an Android device through a typed CLI tool suite. You have no access to the screen - no screenshots, no UI hierarchy dumps, no tap/swipe/keyevent i...

[33] [58]

during",

**Network backend** -- the app makes outbound HTTP where the endpoint is discoverable from on-device artifacts only. If you cannot observe the endpoint and schema, the backend is not your writing surface -- fall back to on-device state. ,→ ,→ 18 Termination: - Call`finish`when the task is done. For information tasks (find a value, answer a question), the ...

[34] [69]

echo'foo (bar) baz'| wc -w

**Stop when done** -- once you have a verified answer or a successful write + sync, call `finish`immediately. Do not re-verify, sanity-check, or explore further.,→ ## Shell Escaping The typed tools (`sql`,`pg`,`read-file`,`write-file`,`json-read`,`json-write`) handle escaping internally -- prefer them. When you must compose raw shell with quotes, parens, ...

[35] [71]

<pattern>

ALWAYS call`finish`when done, with a meaningful`--description`. MobileWorld — mini-SWE-agent harness System template. You are an Android automation agent. You control an Android device through a typed CLI tool suite as defined in <command_space>. Format your response as shown in <format_example>.,→ <command_space> # --- on-device tools --- find-files <dir...

[36] [74]

during",

**Network backend** -- the app makes outbound HTTP where the endpoint is discoverable from on-device artifacts only. If you cannot observe the endpoint and schema, the backend is not your writing surface -- fall back to on-device state. ,→ ,→ Termination: - Call`finish`when the task is done. For information tasks (find a value, answer a question), the `--...

[37] [85]

echo'foo (bar) baz'| wc -w

**Stop when done** -- once you have a verified answer or a successful write + sync, call `finish`immediately. Do not re-verify, sanity-check, or explore further.,→ ## Shell Escaping The typed tools (`sql`,`pg`,`read-file`,`write-file`,`json-read`,`json-write`) handle escaping internally -- prefer them. When you must compose raw shell with quotes, parens, ...

[38] [87]

ALWAYS call`finish`when done, with a meaningful`--description`

[39] [88]

<pattern>

Respond with required format (THOUGHT line + a single```bash command fence). </instructions> MobileWorld — Terminus-2 harness You are an Android automation agent. You control an Android device through a typed CLI tool suite. You have no access to the screen - no screenshots, no UI hierarchy dumps, no tap/swipe/keyevent input. All interaction is through sh...

[40] [89]

Use the typed device tools first;`adb shell`is the escape hatch for`pm`,`am`,`dumpsys`,`settings`, `getprop`, ad-hoc composition

**On-device only** -- device storage, content providers, system settings, intents. Use the typed device tools first;`adb shell`is the escape hatch for`pm`,`am`,`dumpsys`,`settings`, `getprop`, ad-hoc composition. ,→ ,→

[41] [90]

Use`pg`/`backend-exec`with a`<backend_grep>`that matches an actual row in `service-status`output

**Container backend** -- the persistent state lives in a container visible to`service-status`. Use`pg`/`backend-exec`with a`<backend_grep>`that matches an actual row in `service-status`output. Do NOT guess names. ,→ ,→

[42] [91]

during",

**Network backend** -- the app makes outbound HTTP where the endpoint is discoverable from on-device artifacts only. If you cannot observe the endpoint and schema, the backend is not your writing surface -- fall back to on-device state. ,→ ,→ Termination: - Call`finish`when the task is done. For information tasks (find a value, answer a question), the `--...

[43] [92]

**Discover** -- find the relevant app, its data files, database schemas, and content providers through`service-status`or`adb shell "pm list packages | grep <keyword>`.,→

[44] [93]

Understand formats, ID mappings, timestamp units, and naming conventions

**Inspect** -- read existing data before modifying. Understand formats, ID mappings, timestamp units, and naming conventions. Match observed patterns exactly when creating new entries.,→

[45] [94]

**Act** -- pick the writing surface by what the change has to drive

[46] [95]

If a service-level read disagrees with the row, the service-level value is what matters

**Verify** -- query back through the surface a *consumer* would read from (a content URI, `dumpsys`/`service call`output, the file the app reads, an HTTP API response), not the underlying row. If a service-level read disagrees with the row, the service-level value is what matters. ,→ ,→ ,→

[47] [96]

## Principles

**Sync** -- force-stop the app so it re-reads on next launch. ## Principles

[48] [97]

Do not rely on prior knowledge.,→

**Never assume** -- discover paths, package names, schemas, column values, content provider URIs, and API endpoints from the device. Do not rely on prior knowledge.,→

[49] [98]

Timestamps in DBs are often UTC; some apps use seconds, others milliseconds -- check existing rows to confirm the unit

**Ground in reality** --`adb shell date`before any temporal reasoning. Timestamps in DBs are often UTC; some apps use seconds, others milliseconds -- check existing rows to confirm the unit. ,→ ,→

[50] [99]

When the answer is extracted from a document/app/DB, copy the literal token verbatim -- do not translate, expand abbreviations, reformat dates, or round numbers

**Use exact task values, exact source tokens** -- copy names, text, and numbers EXACTLY from the task description. When the answer is extracted from a document/app/DB, copy the literal token verbatim -- do not translate, expand abbreviations, reformat dates, or round numbers. Prefer raw units over human-readable ones (`stat -c %s`over`du -sh`). ,→ ,→ ,→

[51] [100]

,→ ,→ ,→ ,→

**Find the destination, don't invent it** -- for new files, follow this precedence: (a) a path explicitly named in app source / strings / bundle / manifest / shared_prefs; (b) reuse the directory and naming convention of an existing example of the same file type owned by the target app; (c) infer from content URIs or manifest data dirs; (d) fall back to a...

[52] [101]

Forbidden time-sinks: extracting APKs (`unzip`/`xxd`/`strings`on `base.apk`or`classes.dex`), full`dumpsys package`/`pm dump`, recursive`find /`over the whole filesystem

**Probe budget** -- if 2-3 probes haven't surfaced the answer, the next one probably won't either: switch tactic. Forbidden time-sinks: extracting APKs (`unzip`/`xxd`/`strings`on `base.apk`or`classes.dex`), full`dumpsys package`/`pm dump`, recursive`find /`over the whole filesystem. ,→ ,→ ,→

[53] [102]

**Stop when done** -- once you have a verified answer or a successful write + sync, call `finish`immediately. Do not re-verify, sanity-check, or explore further.,→ ## Android-Specific Patterns 26 - **Provider notifications**: after`content insert/update/delete`(or a write through a content provider), the provider emits change notifications itself -- manua...

[54] [103]

Give ONLY what was asked -- no commentary, no prefixes, exact format.,→

For information tasks,`--description`in`finish`IS your answer. Give ONLY what was asked -- no commentary, no prefixes, exact format.,→

[55] [104]

ALWAYS call`finish`when done, with a meaningful`--description`. ## Task %INSTRUCTION% ## Last command output %COMMAND_OUTPUT% 27 Anatomy of a prompt All six prompts share the same skeleton, organised around the four categories of guidance described in Section 3:

[56] [105]

what has been observed,

Four-phase interaction cycle:discover rel- evant data, inspect existing state, act through the terminal interface, and verify the result. The cycle appears verbatim under theAp- proachheading in each prompt and is pre- ceded by three preflight questions (“what has been observed,” “what is still an assumption,” “what is the most likely failure mode”). Wher...

[57] [106]

On Mo- bileWorld the same hierarchy is restated as three writing surfaces (on-device, container backend, network backend) so the agent picks the layer by where the consumer reads

Prioritised hierarchy of mechanisms for modifying device state.A ranked list of write paths to try in order: content providers first (so derived columns and change notifications fire), then system service commands ( cmd <service>, service call, am broadcast) for live-service effects, then direct SQLite or file writes under /data/data/<pkg>/ for app-privat...

[58] [107]

To- gether they cap exploration cost while keeping the agent deliberate per call

Efficiency strategies.Two rules: batch re- lated probes into a single shell call rather than one probe per turn, and respect a small probe budget — if 2–3 probes have not surfaced the answer, switch tactic rather than continuing with forbidden time-sinks (extracting APKs, fulldumpsys package, recursivefind /). To- gether they cap exploration cost while ke...

[59] [108]

On MobileWorld this cate- gory also covers the device file-system layout, database discovery patterns, and the half-open range-query convention for UTC timestamps

Platform-specific patterns for file synchronisation and data persistence. A short list of non-obvious Android behaviours: media-scanner broadcast after writes to shared storage, the content-provider self-notify semantics that make manual broadcasts unnecessary for canonical providers, and the recovery procedure for a corrupt SQLite file (remove th...
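A half-open range query for UTC timestamps can be sketched as follows. The text does not spell the convention out, so this assumes it means `start <= ts < end`, with timestamps stored as epoch milliseconds. The `events` table, the `ts_ms` column, and both helper functions are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def utc_day_bounds_ms(year, month, day):
    """Return the [start, end) bounds of a UTC calendar day in epoch milliseconds."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def rows_on_day(conn, year, month, day):
    start_ms, end_ms = utc_day_bounds_ms(year, month, day)
    # Half-open: include start, exclude end, so adjacent days never overlap.
    cur = conn.execute(
        "SELECT id FROM events WHERE ts_ms >= ? AND ts_ms < ? ORDER BY id",
        (start_ms, end_ms),
    )
    return [row[0] for row in cur.fetchall()]


# Hypothetical data: rows at the first ms, the last ms, and the next day's first ms.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts_ms INTEGER)")
day_start, day_end = utc_day_bounds_ms(2024, 1, 15)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, day_start), (2, day_end - 1), (3, day_end)],
)
```

With an exclusive upper bound, the row stamped exactly at midnight of the following day belongs to that day only, which avoids double counting at the boundary.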

[60] [109]

analysis

Output structure. Each harness imposes a different response shape. The Claude Code SDK harness uses native tool calls and therefore needs no response-format block; the agent invokes the shim binaries directly via the SDK’s Bash tool. The mini-SWE-agent harness asks for a free-form THOUGHT: line followed by exactly one fenced ```bash command. The Termin...
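A harness that expects the THOUGHT-plus-one-command shape needs to check it before executing anything. The following is a sketch of one way to do that, not the mini-SWE-agent source: `parse_response`, `FENCE`, and the regular expression are assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this listing
BASH_BLOCK = re.compile(FENCE + r"bash\n(.*?)\n?" + FENCE, re.DOTALL)


def parse_response(text):
    """Return (thought, command); raise ValueError if the shape is wrong."""
    blocks = BASH_BLOCK.findall(text)
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one bash block, found {len(blocks)}")
    head = text.split(FENCE, 1)[0].strip()
    if not head.startswith("THOUGHT:"):
        raise ValueError("response must open with a THOUGHT: line")
    return head[len("THOUGHT:"):].strip(), blocks[0].strip()
```

Rejecting replies with zero or several command blocks keeps each turn to a single shell call, which matches the "exactly one" requirement in the text.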

[61] [110]

<answer>

Section ordering and partitioning. The Claude SDK prompt is a single block. The mini-SWE-agent prompt is split into a system template (role + command space + format example) and an instance template (constraint + approach + principles + rules, rendered per turn with the task interpolated) following its original design. The Terminus-2 prompt is a single t...

[62] [111]

No verifier-internals leakage. The trajectory must be producible from device state and the task goal alone, without reading the verifier source, its private fixtures (e.g. parameter-bearing attributes on the task class), or any gold-answer file.

[63] [112]

you skipped the force-stop

No hardcoded answers. Numeric and string answers for information-retrieval tasks must be computed from data the verifier itself reads, not hardcoded as constants in the trajectory.

(2) The oracle-agent loop

We treat oracle construction itself as a human–LLM collaboration: a separate oracle-agent (Claude Code, running on the host) is given a far richer context ...

[64] [113]

State-check. The grader re-reads device state after the agent finishes and compares it to the post-action expectation. The agent’s natural-language output is ignored.

    def is_successful(self, env) -> float:
        rows = adb_utils.issue_generic_request(
            "shell", "sqlite3 ... 'SELECT ... FROM ... WHERE ...'", env)
        return 1.0 if matches_expected(rows) else 0.0
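The state-check idea can be tried without a device. This sketch substitutes an in-memory SQLite database for device state. `state_check`, the `notes` table, and its rows are hypothetical, and comparing sorted row lists is an assumption standing in for the `matches_expected` helper, whose definition is not shown.

```python
import sqlite3


def state_check(conn, query, expected_rows):
    """Return 1.0 if the post-action rows match the expectation, else 0.0.

    Only stored state is read; nothing the agent said is consulted.
    """
    rows = conn.execute(query).fetchall()
    return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0


# Hypothetical seeded state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (title TEXT, archived INTEGER)")
conn.executemany("INSERT INTO notes VALUES (?, ?)", [("todo", 0), ("old", 1)])

# Simulated agent action: delete archived notes.
conn.execute("DELETE FROM notes WHERE archived = 1")
```

Because the score depends only on the rows left in the table, an agent that reports success without performing the delete would still score 0.0.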

[65] [114]

Delete all expenses in Pro Expense that are less than $1.00 (less than 100 cents)

Cache-match. The grader inspects the agent’s FINISH(content=...) payload. Every expected substring (computed in initialize_task from the seed) must appear; substring containment, case-sensitive, no whitespace normalization.

    def is_successful(self, env) -> float:
        return 1.0 if all(
            s in env.interaction_cache for s in self._expected
        ) else 0.0

Example 1 Goa...
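The consequences of plain substring containment are easy to miss, so a standalone sketch may help. `cache_match` is a hypothetical free function that mirrors the `all(... in ...)` check in the passage above, and the payload strings are invented for illustration.

```python
def cache_match(payload, expected):
    """Return 1.0 only if every expected substring occurs in the payload verbatim."""
    return 1.0 if all(s in payload for s in expected) else 0.0


# Passes: both substrings occur exactly as written.
ok = cache_match("Total: 42 items", ["Total", "42"])

# Fails: the check is case-sensitive.
wrong_case = cache_match("total: 42 items", ["Total"])

# Fails: whitespace is not normalised, so a doubled space breaks the match.
extra_space = cache_match("3  apples", ["3 apples"])
```

This is why the prompts insist on exact output format: an answer that is right in substance can still fail on casing or spacing.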

2026

[66] [115]

Read the 8 exemplars. State the dominant Android failure mechanism in one sentence, in your own words (no taxonomy jargon)

[67] [116]

Walk the tree top-down. Which top-level class (Execution, Coherence, or Verification) best describes the failure? Justify in one sentence

[68] [117]

Within the chosen class, pick the single best-fitting leaf

[69] [118]

If two leaves are equally plausible, consult tie_breakers and cite the rule number that decides (e.g. "TB7: timezone misinterpretation -> Disobey Specification, wrong output protocol")

[70] [119]

Optionally record one secondary leaf, only if at least 3 of the 8 exemplars also support it. Never list more than one secondary.

OUTPUT (strict JSON, no prose outside the object)

    {
      "cluster_id": "...",
      "mechanism": "one sentence in plain English",
      "primary_class": "Execution | Coherence | Verification",
      "primary_leaf": "...",
      "tie_breaker": "TB# or null",...
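A strict-JSON contract like this one is usually enforced by a validator on the receiving side. The sketch below covers only the five fields visible in the schema above, since the schema is truncated here and any further fields are unknown. `validate_label` and the `TB\d+` pattern are assumptions, the latter inferred from the "TB7" example in the tie-breaker instruction.

```python
import json
import re

CLASSES = {"Execution", "Coherence", "Verification"}
REQUIRED = ("cluster_id", "mechanism", "primary_class", "primary_leaf", "tie_breaker")


def validate_label(raw):
    """Parse and check a labelling reply; raise ValueError on any violation."""
    obj = json.loads(raw)  # raises ValueError if prose surrounds the object
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be a JSON object")
    missing = [key for key in REQUIRED if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if obj["primary_class"] not in CLASSES:
        raise ValueError("primary_class must be one of the three top-level classes")
    tie = obj["tie_breaker"]
    if tie is not None and not (isinstance(tie, str) and re.fullmatch(r"TB\d+", tie)):
        raise ValueError("tie_breaker must be null or of the form TB<number>")
    return obj
```

`json.loads` already rejects leading or trailing prose, so the "no prose outside the object" rule needs no extra code.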