{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:YLYZW2EP2NILYIOZY2B3LFFY3C","short_pith_number":"pith:YLYZW2EP","schema_version":"1.0","canonical_sha256":"c2f19b688fd350bc21d9c683b594b8d89422b33c9315cffdad8c87e6c88f5d5b","source":{"kind":"arxiv","id":"2605.12549","version":1},"attestation_state":"computed","paper":{"title":"What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GUI grounding in VLMs follows a two-stage process where the prefill stage selects candidate UI elements that the decoding stage cannot correct.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fei Shen, Fei Yu, Haizhou Li, Jiaping Lin, Junzhe Li, Ming Li, Ping Nie","submitted_at":"2026-05-10T07:04:07Z","abstract_excerpt":"Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate U"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.12549","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2026-05-10T07:04:07Z","cross_cats_sorted":[],"title_canon_sha256":"07560c2ec7a79bc099ccfc84abaad10b7e4cf1a639680238e83555b1289482bf","abstract_canon_sha256":"39b07e6b1acba186aa8e866003a84fa071b9bb523e7eec9096a8a203f7dc0a28"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:10:02.165684Z","signature_b64":"/JvCSgk0PKh4+P4kaqEBtWwztQWjOBmYB4akp5oHmKhj420P8BbPEvs8mS8u+pUQnrCgqNjja7HmnCeF45ShCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c2f19b688fd350bc21d9c683b594b8d89422b33c9315cffdad8c87e6c88f5d5b","last_reissued_at":"2026-05-18T03:10:02.164922Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:10:02.164922Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GUI grounding in VLMs follows a two-stage process where the prefill stage selects candidate UI elements that the decoding stage cannot correct.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fei Shen, Fei Yu, Haizhou Li, Jiaping Lin, Junzhe Li, Ming Li, Ping Nie","submitted_at":"2026-05-10T07:04:07Z","abstract_excerpt":"Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate U"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That visual tokens receiving consistently high attention from the query (final) position across layers form a reliable preliminary target hypothesis, and that re-appending them with instruction hidden states enables effective re-thinking without adding noise or bias.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GUI grounding in VLMs follows a two-stage process where the prefill stage selects candidate UI elements that the decoding stage cannot correct.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"452350322c55c0b566339e277999efbd40db970fa609bed1cebb22830c1d5ddb"},"source":{"id":"2605.12549","kind":"arxiv","version":1},"verdict":{"id":"069a3fcc-e36a-4495-8068-ffee7fd5b005","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T21:40:43.236562Z","strongest_claim":"we show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding.","one_line_summary":"GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That visual tokens receiving consistently high attention from the query (final) position across layers form a reliable preliminary target hypothesis, and that re-appending them with instruction hidden states enables effective re-thinking without adding noise or bias.","pith_extraction_headline":"GUI grounding in VLMs follows a two-stage process where the prefill stage selects candidate UI elements that the decoding stage cannot correct."},"references":{"count":41,"sample":[{"doi":"","year":2025,"title":"Gui agents: A survey","work_id":"fcdddb35-71ff-4f05-bf1a-5c51dc16199d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Large language model-brained gui agents: A survey","work_id":"782b1dd3-0702-49e8-a23c-abe09cdc169a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Gui agents with foundation models: A comprehensive survey","work_id":"48bc4d35-0dc9-45c8-b563-cea72d9dd42e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"GTA1: GUI test-time scaling agent","work_id":"7afb8428-931a-4e14-8225-dff91d7551d8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Gui-g2: Gaussian reward modeling for gui grounding","work_id":"6c47592b-9cee-4952-b93c-d1ff45b9c46a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":41,"snapshot_sha256":"a845c91400c928b767e8c42c298c3a28279ef4a46f137b11f4089b947a555a6e","internal_anchors":4},"formal_canon":{"evidence_count":1,"snapshot_sha256":"809290e4a5c0fd8aca6d7dd71e8cc6803b902d017b7f3ab3f424f16e1e252fe5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.12549","created_at":"2026-05-18T03:10:02.165026+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.12549v1","created_at":"2026-05-18T03:10:02.165026+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.12549","created_at":"2026-05-18T03:10:02.165026+00:00"},{"alias_kind":"pith_short_12","alias_value":"YLYZW2EP2NIL","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YLYZW2EP2NILYIOZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YLYZW2EP","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C","json":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C.json","graph_json":"https://pith.science/api/pith-number/YLYZW2EP2NILYIOZY2B3LFFY3C/graph.json","events_json":"https://pith.science/api/pith-number/YLYZW2EP2NILYIOZY2B3LFFY3C/events.json","paper":"https://pith.science/paper/YLYZW2EP"},"agent_actions":{"view_html":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C","download_json":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C.json","view_paper":"https://pith.science/paper/YLYZW2EP","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.12549&json=true","fetch_graph":"https://pith.science/api/pith-number/YLYZW2EP2NILYIOZY2B3LFFY3C/graph.json","fetch_events":"https://pith.science/api/pith-number/YLYZW2EP2NILYIOZY2B3LFFY3C/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C/action/storage_attestation","attest_author":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C/action/author_attestation","sign_citation":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C/action/citation_signature","submit_replication":"https://pith.science/pith/YLYZW2EP2NILYIOZY2B3LFFY3C/action/replication_record"}},"created_at":"2026-05-18T03:10:02.165026+00:00","updated_at":"2026-05-18T03:10:02.165026+00:00"}