{"paper":{"title":"Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.AI","authors_text":"Boyuan Zheng, Boyu Gou, Cheng Chang, Huan Sun, Ruohan Wang, Yanan Xie, Yiheng Shu, Yu Su","submitted_at":"2024-10-07T17:47:50Z","abstract_excerpt":"Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment f"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical results on six benchmarks spanning three categories show that UGround substantially outperforms existing visual grounding models for GUI agents by up to 20% absolute, and agents with UGround outperform state-of-the-art agents despite existing agents using additional text-based input while ours only uses visual perception.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That web-based synthetic data combined with slight adaptation of the LLaVA architecture produces a model that generalizes robustly to real-world, diverse GUI platforms and referring expressions beyond the training distribution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UGround is a universal visual grounding model for GUI agents that uses only screenshots to locate elements and outperforms existing agents despite lacking text-based inputs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"723b68956eff88a163c56064327dc455131934a534f6ccd8ca33139eeba23d91"},"source":{"id":"2410.05243","kind":"arxiv","version":3},"verdict":{"id":"5107a70c-78dc-4128-af30-0aba481607bd","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T21:05:31.956964Z","strongest_claim":"Empirical results on six benchmarks spanning three categories show that UGround substantially outperforms existing visual grounding models for GUI agents by up to 20% absolute, and agents with UGround outperform state-of-the-art agents despite existing agents using additional text-based input while ours only uses visual perception.","one_line_summary":"UGround is a universal visual grounding model for GUI agents that uses only screenshots to locate elements and outperforms existing agents despite lacking text-based inputs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That web-based synthetic data combined with slight adaptation of the LLaVA architecture produces a model that generalizes robustly to real-world, diverse GUI platforms and referring expressions beyond the training distribution.","pith_extraction_headline":"A model trained on web synthetic GUI data enables agents to ground referring expressions to pixels using only screenshots, outperforming text-augmented systems."},"references":{"count":12,"sample":[{"doi":"","year":null,"title":"click ... then type","work_id":"23c8d22f-9413-422d-bc04-11dfde6dd5dd","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"We filter out any actions that do not have associated coordinate data, ensuring that only steps with specific visual grounding targets are included in the dataset","work_id":"4a69b9a8-e8ca-4f5d-bf41-eb09660d0534","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"To enhance diversity, two captions per element are randomly selected from the available set of functional captions during data construction","work_id":"aade4fc7-8978-4d0c-9f3f-3708f5f7362e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"UIBert: We use the training set elements from UIBert without any additional special processing, directly utilizing the referring expressions provided by this dataset","work_id":"dc13223e-fa8f-4b7b-a524-d6ff01a0d8f8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"These annotations contribute to a more diverse set of referring expressions, particularly for action-oriented grounding tasks","work_id":"1c8757f6-dfdc-4a18-a3e3-510b215dae6f","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":12,"snapshot_sha256":"e398405f90bd02eab9e7bb9af6de474af3a75184101af7b9e3ea13cc34100804","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}