Pith Number

pith:ZJVZ5MFZ

pith:2024:ZJVZ5MFZB6R3ITGTIVTVV3VS7I

not attested · not anchored · not stored · refs resolved

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyuan Zheng, Boyu Gou, Cheng Chang, Huan Sun, Ruohan Wang, Yanan Xie, Yiheng Shu, Yu Su

A model trained on web-based synthetic GUI data lets agents ground referring expressions to pixel coordinates using only screenshots, outperforming text-augmented systems.

arxiv:2410.05243 v3 · 2024-10-07 · cs.AI · cs.CL · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{ZJVZ5MFZB6R3ITGTIVTVV3VS7I}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Empirical results on six benchmarks spanning three categories show that UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and that agents using UGround outperform state-of-the-art agents even though those agents take additional text-based input while UGround relies on visual perception alone.

C2 weakest assumption

That web-based synthetic data, combined with a slight adaptation of the LLaVA architecture, produces a model that generalizes robustly to diverse real-world GUI platforms and to referring expressions beyond the training distribution.

C3 one-line summary

UGround is a universal visual grounding model for GUI agents that uses only screenshots to locate elements and outperforms existing agents despite lacking text-based inputs.

References

12 extracted · 12 resolved · 0 Pith anchors

[1] click ... then type

[2] We filter out any actions that do not have associated coordinate data, ensuring that only steps with specific visual grounding targets are included in the dataset

[3] To enhance diversity, two captions per element are randomly selected from the available set of functional captions during data construction

[4] UIBert: We use the training set elements from UIBert without any additional special processing, directly utilizing the referring expressions provided by this dataset

[5] These annotations contribute to a more diverse set of referring expressions, particularly for action-oriented grounding tasks 2023

Cited by

33 papers in Pith

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

Receipt and verification

First computed	2026-05-17T23:38:46.566939Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

ca6b9eb0b90fa3b44cd345675aeeb2fa1bfa4a3ac13e11b833231c3b236253ef

Aliases

arxiv: 2410.05243 · arxiv_version: 2410.05243v3 · doi: 10.48550/arxiv.2410.05243 · pith_short_12: ZJVZ5MFZB6R3 · pith_short_16: ZJVZ5MFZB6R3ITGT · pith_short_8: ZJVZ5MFZ
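
The identifiers on this page appear to be related to the canonical hash: base32-encoding the first 16 bytes of the SHA-256 digest and dropping the padding reproduces the 26-character Pith Number, and the `pith_short_*` aliases are its 8-, 12-, and 16-character prefixes. This relationship is inferred from the values shown here, not from a published specification, so treat the sketch below as an observation. The function names `pith_id_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "ca6b9eb0b90fa3b44cd345675aeeb2fa1bfa4a3ac13e11b833231c3b236253ef"


def pith_id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of the digest and strip '=' padding.

    Assumption: this rule is inferred from the values on this page only.
    """
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_id: str) -> dict:
    """Return the prefix aliases listed under 'Aliases' (8, 12, 16 characters)."""
    return {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}


pith_id = pith_id_from_hash(CANONICAL_HASH)
print(pith_id)  # ZJVZ5MFZB6R3ITGTIVTVV3VS7I
print(short_aliases(pith_id)["pith_short_8"])  # ZJVZ5MFZ
```

Because the short forms are plain prefixes, a resolver that accepts them would need some rule for prefix collisions; nothing on this page says how that is handled.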

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZJVZ5MFZB6R3ITGTIVTVV3VS7I \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ca6b9eb0b90fa3b44cd345675aeeb2fa1bfa4a3ac13e11b833231c3b236253ef
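
The same check can be run offline against a saved copy of the canonical record. The sketch below mirrors the serialization used in the one-liner above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) and assumes those settings are the full canonicalization rule. The function name `canonical_sha256` is hypothetical, and the record literal is copied from the "Canonical record JSON" shown on this page.

```python
import hashlib
import json

# Expected digest as listed under "Canonical hash" on this page.
EXPECTED = "ca6b9eb0b90fa3b44cd345675aeeb2fa1bfa4a3ac13e11b833231c3b236253ef"


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "7ec6a812a76db0e31b4a3564f47a0d6172a60ad8d2b3157f1ef15062d373bf5d",
        "cross_cats_sorted": ["cs.CL", "cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2024-10-07T17:47:50Z",
        "title_canon_sha256": "bba2772fdcea59c8366a2a30164329af2a062e8ad75c26e3a96b57fed4d95022",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.05243", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Sorting keys and fixing the separators is what makes the digest independent of how the JSON was formatted in transit, which is why the `jq -c` step in the shell version does not change the result.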

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "7ec6a812a76db0e31b4a3564f47a0d6172a60ad8d2b3157f1ef15062d373bf5d",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2024-10-07T17:47:50Z",
    "title_canon_sha256": "bba2772fdcea59c8366a2a30164329af2a062e8ad75c26e3a96b57fed4d95022"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.05243",
    "kind": "arxiv",
    "version": 3
  }
}