{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:Y4CLHGPBMYXO2I43ZRXH3JHZIW","short_pith_number":"pith:Y4CLHGPB","schema_version":"1.0","canonical_sha256":"c704b399e1662eed239bcc6e7da4f945b172336b19d0aba74b44dc1737aaad43","source":{"kind":"arxiv","id":"2401.16158","version":2},"attestation_state":"computed","paper":{"title":"Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Mobile-Agent operates mobile apps by visually identifying screen elements instead of using system metadata.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Fei Huang, Haiyang Xu, Jiabo Ye, Jitao Sang, Ji Zhang, Junyang Wang, Ming Yan, Weizhou Shen","submitted_at":"2024-01-29T13:46:37Z","abstract_excerpt":"Mobile device agent based on Multimodal Large Language Models (MLLM) is becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step. Different from previous solutions that rely on XML files of Apps or mobil"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2401.16158","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-01-29T13:46:37Z","cross_cats_sorted":["cs.CV"],"title_canon_sha256":"5407557842b8ba6a148ad7928ff7cadb44b03a92568e5c205a0cb85d50c4bb59","abstract_canon_sha256":"5504b8dbd1e360bf1b287603693ad18ad6df3cca39f17125188ae2229f9375cc"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.128501Z","signature_b64":"wpo+yrtGd/rjs2Dpdrkd+mpSj9oSPhwVCoIOc17HRSov0I0H+qnHDXsxGIYj0C9OelZO5y6WV/puxZx0JJwyBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c704b399e1662eed239bcc6e7da4f945b172336b19d0aba74b44dc1737aaad43","last_reissued_at":"2026-05-17T23:38:46.127931Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.127931Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Mobile-Agent operates mobile apps by visually identifying screen elements instead of using system metadata.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Fei Huang, Haiyang Xu, Jiabo Ye, Jitao Sang, Ji Zhang, Junyang Wang, Ming Yan, Weizhou Shen","submitted_at":"2024-01-29T13:46:37Z","abstract_excerpt":"Mobile device agent based on Multimodal Large Language Models (MLLM) is becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step. Different from previous solutions that rely on XML files of Apps or mobil"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That visual perception tools can accurately and reliably identify and locate both visual and textual elements within diverse app front-end interfaces across different mobile operating environments without significant errors or the need for system-specific adjustments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on the introduced Mobile-Eval benchmark.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Mobile-Agent operates mobile apps by visually identifying screen elements instead of using system metadata.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"413527ad113fac4d1f27850618b74ece8b22afa696c8b61534091b8a82977398"},"source":{"id":"2401.16158","kind":"arxiv","version":2},"verdict":{"id":"db266f45-9fc6-4eec-b69b-9cfffcd7a501","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T00:14:45.002300Z","strongest_claim":"Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements.","one_line_summary":"Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on the introduced Mobile-Eval benchmark.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That visual perception tools can accurately and reliably identify and locate both visual and textual elements within diverse app front-end interfaces across different mobile operating environments without significant errors or the need for system-specific adjustments.","pith_extraction_headline":"Mobile-Agent operates mobile apps by visually identifying screen elements instead of using system metadata."},"references":{"count":13,"sample":[{"doi":"","year":null,"title":"Modelscope-agent: Building your customizable agent system with open-source large language models","work_id":"2327fb2d-530b-44f6-972c-1ea9cb6b8c3d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Controlllm: Augment language models with tools by searching on graphs","work_id":"5a9fd1ba-c4e9-4185-ad1e-5e587998b78a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models","work_id":"b06ebfb8-5543-4f2c-af49-c05c4e63fc45","ref_index":3,"cited_arxiv_id":"2303.04671","is_internal_anchor":true},{"doi":"","year":null,"title":"Gpt4tools: Teaching large lan- guage model to use tools via self-instruction","work_id":"260a71e5-f66b-4679-a4f4-2d3778841d09","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action","work_id":"6dc43db8-227d-438e-8658-0c8acecba08a","ref_index":5,"cited_arxiv_id":"2303.11381","is_internal_anchor":true}],"resolved_work":13,"snapshot_sha256":"dcf915dec29c09442bb834231a4fec65806641cbb60eaec2f24a6467794ec2d3","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b0eccfe4921bae5c5aa763af5155b57745223683a218ecfd6b59535b936a29ea"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2401.16158","created_at":"2026-05-17T23:38:46.128031+00:00"},{"alias_kind":"arxiv_version","alias_value":"2401.16158v2","created_at":"2026-05-17T23:38:46.128031+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2401.16158","created_at":"2026-05-17T23:38:46.128031+00:00"},{"alias_kind":"pith_short_12","alias_value":"Y4CLHGPBMYXO","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"Y4CLHGPBMYXO2I43","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"Y4CLHGPB","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":31,"internal_anchor_count":31,"sample":[{"citing_arxiv_id":"2605.10347","citing_title":"How Mobile World Model Guides GUI Agents?","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2501.16150","citing_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions","ref_index":160,"is_internal_anchor":true},{"citing_arxiv_id":"2503.14075","citing_title":"Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2505.03364","citing_title":"DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05765","citing_title":"X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2511.03293","citing_title":"UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16402","citing_title":"WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18535","citing_title":"Beyond Scaling: Agents Are Heading to the Edge","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19260","citing_title":"AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16883","citing_title":"SE-GA: Memory-Augmented Self-Evolution for GUI Agents","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15224","citing_title":"ICRL: Learning to Internalize Self-Critique with Reinforcement Learning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15542","citing_title":"DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2411.18279","citing_title":"Large Language Model-Brained GUI Agents: A Survey","ref_index":161,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10371","citing_title":"AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09574","citing_title":"Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2602.22942","citing_title":"ClawMobile: Rethinking Smartphone-Native Agentic Systems","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2603.26041","citing_title":"Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03486","citing_title":"VisionClaw: Always-On AI Agents through Smart Glasses","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2404.07972","citing_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10347","citing_title":"How Mobile World Model Guides GUI Agents?","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09443","citing_title":"Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26148","citing_title":"Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05765","citing_title":"X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07110","citing_title":"Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability","ref_index":73,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW","json":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW.json","graph_json":"https://pith.science/api/pith-number/Y4CLHGPBMYXO2I43ZRXH3JHZIW/graph.json","events_json":"https://pith.science/api/pith-number/Y4CLHGPBMYXO2I43ZRXH3JHZIW/events.json","paper":"https://pith.science/paper/Y4CLHGPB"},"agent_actions":{"view_html":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW","download_json":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW.json","view_paper":"https://pith.science/paper/Y4CLHGPB","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2401.16158&json=true","fetch_graph":"https://pith.science/api/pith-number/Y4CLHGPBMYXO2I43ZRXH3JHZIW/graph.json","fetch_events":"https://pith.science/api/pith-number/Y4CLHGPBMYXO2I43ZRXH3JHZIW/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW/action/timestamp_anchor","attest_storage":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW/action/storage_attestation","attest_author":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW/action/author_attestation","sign_citation":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW/action/citation_signature","submit_replication":"https://pith.science/pith/Y4CLHGPBMYXO2I43ZRXH3JHZIW/action/replication_record"}},"created_at":"2026-05-17T23:38:46.128031+00:00","updated_at":"2026-05-17T23:38:46.128031+00:00"}