Pith Number

pith:QHECGYVL

pith:2025:QHECGYVLDT7OHOERAC35FBGPR7
not attested · not anchored · not stored · refs resolved

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Byeongung Jo, Insik Shin, Jaeyoung Wi, Joo Hyung Lee, Sangeun Oh, Seungwoo Baek, Sunjae Lee, Tae Hoon Min, Youngmin Im

MobiBench is a modular offline benchmark for mobile GUI agents whose verdicts reach 94.72 percent agreement with human evaluators.

arxiv:2512.12634 v3 · 2025-12-14 · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{QHECGYVLDT7OHOERAC35FBGPR7}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1: strongest claim

MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks.

C2: weakest assumption

That the multi-path annotations comprehensively capture all valid alternative actions that human evaluators would accept, without systematic omissions that could affect agreement rates.

C3: one-line summary

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.

References

63 extracted · 63 resolved · 11 Pith anchors

[1] Agent S2: A compositional generalist-specialist framework for computer use agents 2025
[2] Language Models are Few-Shot Learners 2020 · arXiv:2005.14165
[3] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2021. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual 2021
[4] arXiv preprint arXiv:2407.17490, 2024
[5] The BrowserGym ecosystem for web agent research. arXiv preprint arXiv:2412.05467, 2024

Cited by

4 papers in Pith

Receipt and verification
First computed: 2026-05-18T03:09:32.623517Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe

Aliases

arxiv: 2512.12634 · arxiv_version: 2512.12634v3 · doi: 10.48550/arxiv.2512.12634 · pith_short_12: QHECGYVLDT7O · pith_short_16: QHECGYVLDT7OHOER · pith_short_8: QHECGYVL
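
The identifiers on this page appear to be related in a simple way: Base32-encoding the bytes of the canonical hash (RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number, and the pith_short_* aliases are its 8-, 12- and 16-character prefixes. This is an observation from the values shown here, not a documented rule, so the function names and the 26-character length below are assumptions made for illustration. A minimal Python sketch:

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe"
PITH_NUMBER = "QHECGYVLDT7OHOERAC35FBGPR7"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: inferred from the values on this page, not from a published spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict[str, str]:
    """Return the 8-, 12- and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)   # True for the values on this page
print(short_aliases(derived))
```

If the relationship holds in general, a short alias identifies a record only as far as its prefix is unique, so the full 26-character form is the safer key for lookups.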
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/QHECGYVLDT7OHOERAC35FBGPR7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1974e3286eef3c7f833714c06a065c85214ebf5b5ac70cc3196a453ce2f2dbe1",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2025-12-14T10:41:39Z",
    "title_canon_sha256": "446390125615900b34aef5e039d642b5de22165a8a8970ee94f2aacaa577efac"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2512.12634",
    "kind": "arxiv",
    "version": 3
  }
}
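
The canonicalisation step in the verification command above (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes, then SHA-256) can also be written as a small Python function and run offline against the record printed on this page. This is a sketch: the function name canonical_sha256 is hypothetical, and whether the digest equals the canonical hash depends on the record above being exactly what the endpoint serves, so the comparison is printed, not assumed.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record with the same serialisation as the shell one-liner."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# The canonical record as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "1974e3286eef3c7f833714c06a065c85214ebf5b5ac70cc3196a453ce2f2dbe1",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2025-12-14T10:41:39Z",
        "title_canon_sha256": "446390125615900b34aef5e039d642b5de22165a8a8970ee94f2aacaa577efac",
    },
    "schema_version": "1.0",
    "source": {"id": "2512.12634", "kind": "arxiv", "version": 3},
}

EXPECTED = "81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe"

digest = canonical_sha256(record)
print(digest)
print("matches canonical hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the order in which a mirror stores or transmits the fields does not affect the digest; only the values and the key names do.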