{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:CVVQ5U6ZAPZQOEDEYH7QLSO6PQ","short_pith_number":"pith:CVVQ5U6Z","schema_version":"1.0","canonical_sha256":"156b0ed3d903f3071064c1ff05c9de7c107098706c7beb3b249c632e0ef6faf4","source":{"kind":"arxiv","id":"2401.10935","version":2},"attestation_state":"computed","paper":{"title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone.","cross_cats":["cs.AI"],"primary_cat":"cs.HC","authors_text":"Fangzhi Xu, Jianbing Zhang, Kanzhi Cheng, Qiushi Sun, Yantao Li, Yougang Chu, Zhiyong Wu","submitted_at":"2024-01-17T08:10:35Z","abstract_excerpt":"Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elemen"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2401.10935","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"9c130e118c2a1c05f74f4892c3b481834ad6af8d966941736349a41f6527fb8a","abstract_canon_sha256":"c715e54ca2fc5df30d79d57a56510b383f51991a97577d6e6757f967c307f952"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.419307Z","signature_b64":"VXDlMSGk/xs/Y01obyKhmgS3c6nzgZYZVhU0eyq7bv0PYAyr/lbLbpVwL2y99E6E/7H5WWRpqlUaAKNSeTVEAA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"156b0ed3d903f3071064c1ff05c9de7c107098706c7beb3b249c632e0ef6faf4","last_reissued_at":"2026-05-17T23:38:14.418669Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.418669Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone.","cross_cats":["cs.AI"],"primary_cat":"cs.HC","authors_text":"Fangzhi Xu, Jianbing Zhang, Kanzhi Cheng, Qiushi Sun, Yantao Li, Yougang Chu, Zhiyong Wu","submitted_at":"2024-01-17T08:10:35Z","abstract_excerpt":"Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elemen"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the automatically curated GUI grounding data is sufficiently high-quality and representative to enable effective transfer to real agent tasks across environments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"31450ec01983bb27c2184222e9efbc8aa01f89caefedb5a5a3348063ccf3ae81"},"source":{"id":"2401.10935","kind":"arxiv","version":2},"verdict":{"id":"a13589ef-02ee-4f0f-b4b4-ed487bf180e8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T10:04:24.837735Z","strongest_claim":"advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the automatically curated GUI grounding data is sufficiently high-quality and representative to enable effective transfer to real agent tasks across environments.","pith_extraction_headline":"Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone."},"references":{"count":81,"sample":[{"doi":"","year":1972,"title":"Aho and Jeffrey D","work_id":"b1f5cb43-a3c7-4ea0-85e7-9ccc9dfe1588","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1983,"title":"Publications Manual , year = \"1983\", publisher =","work_id":"aca2b566-99e0-4ebb-9c7a-a81219531259","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/322234.322243","year":1981,"title":"Chandra and Dexter C","work_id":"c3270592-bd69-4213-95e1-4aaf8312be9b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scalable training of","work_id":"aef70eae-f816-4598-84ec-429a2c09f5fc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1997,"title":"Dan Gusfield , title =. 1997","work_id":"852d89f5-1e7b-4296-b4f2-71e578b5e9f6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":81,"snapshot_sha256":"47ca75fa2081800e3c2abcf76335eada545a4a8ec4fd2b8a871ee32595ca72c7","internal_anchors":24},"formal_canon":{"evidence_count":2,"snapshot_sha256":"7bb9bea7937b18735f61bdd93645b1bbe4fb4fbd19c2ddf38f9ed35cbe88971d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2401.10935","created_at":"2026-05-17T23:38:14.418811+00:00"},{"alias_kind":"arxiv_version","alias_value":"2401.10935v2","created_at":"2026-05-17T23:38:14.418811+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2401.10935","created_at":"2026-05-17T23:38:14.418811+00:00"},{"alias_kind":"pith_short_12","alias_value":"CVVQ5U6ZAPZQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"CVVQ5U6ZAPZQOEDE","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"CVVQ5U6Z","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2411.18279","citing_title":"Large Language Model-Brained GUI Agents: A Survey","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04227","citing_title":"Mobile GUI Agents under Real-world Threats: Are We There Yet?","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2508.19679","citing_title":"InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2509.06477","citing_title":"MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2504.14239","citing_title":"InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2510.24168","citing_title":"MGA: Memory-Driven GUI Agent for Observation-Centric Interaction","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2601.12538","citing_title":"Agentic Reasoning for Large Language Models","ref_index":221,"is_internal_anchor":true},{"citing_arxiv_id":"2507.05791","citing_title":"GTA1: GUI Test-time Scaling Agent","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05459","citing_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10371","citing_title":"AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02345","citing_title":"UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10458","citing_title":"GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2405.14573","citing_title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02544","citing_title":"UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2410.23218","citing_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2404.07972","citing_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26352","citing_title":"UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06664","citing_title":"BAMI: Training-Free Bias Mitigation in GUI Grounding","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13019","citing_title":"See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04838","citing_title":"Less Detail, Better Answers: Degradation-Driven Prompting for VQA","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10479","citing_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14262","citing_title":"GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05271","citing_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2508.18265","citing_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21375","citing_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","ref_index":18,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ","json":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ.json","graph_json":"https://pith.science/api/pith-number/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/graph.json","events_json":"https://pith.science/api/pith-number/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/events.json","paper":"https://pith.science/paper/CVVQ5U6Z"},"agent_actions":{"view_html":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ","download_json":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ.json","view_paper":"https://pith.science/paper/CVVQ5U6Z","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2401.10935&json=true","fetch_graph":"https://pith.science/api/pith-number/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/graph.json","fetch_events":"https://pith.science/api/pith-number/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/action/storage_attestation","attest_author":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/action/author_attestation","sign_citation":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/action/citation_signature","submit_replication":"https://pith.science/pith/CVVQ5U6ZAPZQOEDEYH7QLSO6PQ/action/replication_record"}},"created_at":"2026-05-17T23:38:14.418811+00:00","updated_at":"2026-05-17T23:38:14.418811+00:00"}