{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:ZJVZ5MFZB6R3ITGTIVTVV3VS7I","short_pith_number":"pith:ZJVZ5MFZ","schema_version":"1.0","canonical_sha256":"ca6b9eb0b90fa3b44cd345675aeeb2fa1bfa4a3ac13e11b833231c3b236253ef","source":{"kind":"arxiv","id":"2410.05243","version":3},"attestation_state":"computed","paper":{"title":"Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.AI","authors_text":"Boyuan Zheng, Boyu Gou, Cheng Chang, Huan Sun, Ruohan Wang, Yanan Xie, Yiheng Shu, Yu Su","submitted_at":"2024-10-07T17:47:50Z","abstract_excerpt":"Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment f"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2410.05243","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.AI","submitted_at":"2024-10-07T17:47:50Z","cross_cats_sorted":["cs.CL","cs.CV"],"title_canon_sha256":"bba2772fdcea59c8366a2a30164329af2a062e8ad75c26e3a96b57fed4d95022","abstract_canon_sha256":"7ec6a812a76db0e31b4a3564f47a0d6172a60ad8d2b3157f1ef15062d373bf5d"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.567466Z","signature_b64":"/K5AtKT9SwVfvh1TIueLeebD3xp8hnpC1SJQ5uN+I2DzXyl6uUqxd4kd1PB+Q/e+p+pNrLdID77AqllOLqD6Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"ca6b9eb0b90fa3b44cd345675aeeb2fa1bfa4a3ac13e11b833231c3b236253ef","last_reissued_at":"2026-05-17T23:38:46.566939Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.566939Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.AI","authors_text":"Boyuan Zheng, Boyu Gou, Cheng Chang, Huan Sun, Ruohan Wang, Yanan Xie, Yiheng Shu, Yu Su","submitted_at":"2024-10-07T17:47:50Z","abstract_excerpt":"Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment f"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical results on six benchmarks spanning three categories show that UGround substantially outperforms existing visual grounding models for GUI agents by up to 20% absolute, and agents with UGround outperform state-of-the-art agents despite existing agents using additional text-based input while ours only uses visual perception.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That web-based synthetic data combined with slight adaptation of the LLaVA architecture produces a model that generalizes robustly to real-world, diverse GUI platforms and referring expressions beyond the training distribution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UGround is a universal visual grounding model for GUI agents that uses only screenshots to locate elements and outperforms existing agents despite lacking text-based inputs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"723b68956eff88a163c56064327dc455131934a534f6ccd8ca33139eeba23d91"},"source":{"id":"2410.05243","kind":"arxiv","version":3},"verdict":{"id":"5107a70c-78dc-4128-af30-0aba481607bd","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T21:05:31.956964Z","strongest_claim":"Empirical results on six benchmarks spanning three categories show that UGround substantially outperforms existing visual grounding models for GUI agents by up to 20% absolute, and agents with UGround outperform state-of-the-art agents despite existing agents using additional text-based input while ours only uses visual perception.","one_line_summary":"UGround is a universal visual grounding model for GUI agents that uses only screenshots to locate elements and outperforms existing agents despite lacking text-based inputs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That web-based synthetic data combined with slight adaptation of the LLaVA architecture produces a model that generalizes robustly to real-world, diverse GUI platforms and referring expressions beyond the training distribution.","pith_extraction_headline":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems."},"references":{"count":12,"sample":[{"doi":"","year":null,"title":"click ... then type","work_id":"23c8d22f-9413-422d-bc04-11dfde6dd5dd","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"We filter out any actions that do not have associated coordinate data, ensuring that only steps with specific visual grounding targets are included in the dataset","work_id":"4a69b9a8-e8ca-4f5d-bf41-eb09660d0534","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"To enhance diversity, two captions per element are randomly selected from the available set of functional captions during data construction","work_id":"aade4fc7-8978-4d0c-9f3f-3708f5f7362e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"UIBert: We use the training set elements from UIBert without any additional special processing, directly utilizing the referring expressions provided by this dataset","work_id":"dc13223e-fa8f-4b7b-a524-d6ff01a0d8f8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"These annotations contribute to a more diverse set of referring expressions, particularly for action-oriented grounding tasks","work_id":"1c8757f6-dfdc-4a18-a3e3-510b215dae6f","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":12,"snapshot_sha256":"e398405f90bd02eab9e7bb9af6de474af3a75184101af7b9e3ea13cc34100804","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.05243","created_at":"2026-05-17T23:38:46.567027+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.05243v3","created_at":"2026-05-17T23:38:46.567027+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.05243","created_at":"2026-05-17T23:38:46.567027+00:00"},{"alias_kind":"pith_short_12","alias_value":"ZJVZ5MFZB6R3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ZJVZ5MFZB6R3ITGT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ZJVZ5MFZ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2505.10887","citing_title":"InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17439","citing_title":"DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18652","citing_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17439","citing_title":"DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16883","citing_title":"SE-GA: Memory-Augmented Self-Evolution for GUI Agents","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15542","citing_title":"DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2411.18279","citing_title":"Large Language Model-Brained GUI Agents: A Survey","ref_index":221,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04227","citing_title":"Mobile GUI Agents under Real-world Threats: Are We There Yet?","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2509.06477","citing_title":"MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2504.14239","citing_title":"InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2412.04454","citing_title":"Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2507.05791","citing_title":"GTA1: GUI Test-time Scaling Agent","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2512.19396","citing_title":"EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21620","citing_title":"UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2602.22942","citing_title":"ClawMobile: Rethinking Smartphone-Native Agentic Systems","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13527","citing_title":"MMSkills: Towards Multimodal Skills for General Visual Agents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10458","citing_title":"GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2603.26041","citing_title":"Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13527","citing_title":"MMSkills: Towards Multimodal Skills for General Visual Agents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2410.23218","citing_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","ref_index":143,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12501","citing_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27955","citing_title":"GUI Agents with Reinforcement Learning: Toward Digital Inhabitants","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08560","citing_title":"ZAYA1-VL-8B Technical Report","ref_index":156,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00642","citing_title":"Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25380","citing_title":"Benchmarking and Improving GUI Agents in High-Dynamic Environments","ref_index":8,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I","json":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I.json","graph_json":"https://pith.science/api/pith-number/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/graph.json","events_json":"https://pith.science/api/pith-number/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/events.json","paper":"https://pith.science/paper/ZJVZ5MFZ"},"agent_actions":{"view_html":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I","download_json":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I.json","view_paper":"https://pith.science/paper/ZJVZ5MFZ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.05243&json=true","fetch_graph":"https://pith.science/api/pith-number/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/graph.json","fetch_events":"https://pith.science/api/pith-number/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/action/storage_attestation","attest_author":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/action/author_attestation","sign_citation":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/action/citation_signature","submit_replication":"https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I/action/replication_record"}},"created_at":"2026-05-17T23:38:46.567027+00:00","updated_at":"2026-05-17T23:38:46.567027+00:00"}