Pith Number

pith:QHECGYVL

pith:2025:QHECGYVLDT7OHOERAC35FBGPR7

not attested · not anchored · not stored · refs resolved

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Byeongung Jo, Insik Shin, Jaeyoung Wi, Joo Hyung Lee, Sangeun Oh, Seungwoo Baek, Sunjae Lee, Tae Hoon Min, Youngmin Im

MobiBench provides a modular offline benchmark for mobile GUI agents that matches human evaluators at 94.72 percent agreement.

arxiv:2512.12634 v3 · 2025-12-14 · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{QHECGYVLDT7OHOERAC35FBGPR7}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
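
The record does not spell out the merge algorithm, so the following is only a minimal sketch of one way a merge can be made deterministic: deduplicate events by identifier, sort them into a total order, and fold them into the record. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are assumptions made for illustration, not the Pith event schema.

```python
from typing import Any, Iterable

# Hypothetical event shape: {"id": str, "ts": str, "field": str, "value": Any}.
# The real Pith event schema and merge rules are not given in this record.


def merge_state(record: dict[str, Any], events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a copy of the record, independent of delivery order."""
    # Drop duplicate deliveries; assumes events sharing an id are identical.
    unique = {e["id"]: e for e in events}
    # Sort by (timestamp, id) so every mirror applies events in the same order.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state = dict(record)
    for event in ordered:
        state[event["field"]] = event["value"]  # last writer in the total order wins
    return state


record = {"source": "arxiv:2512.12634"}
events = [
    {"id": "e2", "ts": "2026-05-18T03:10:00Z", "field": "citations", "value": "open"},
    {"id": "e1", "ts": "2026-05-18T03:09:32Z", "field": "bundle", "value": "live"},
]

# Two mirrors that receive the events in different orders reach the same state.
assert merge_state(record, events) == merge_state(record, reversed(events))
```

Sorting on a key that includes the event identifier breaks timestamp ties, which is what keeps the result identical across mirrors in this sketch.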

Claims

C1 · strongest claim

MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks.

C2 · weakest assumption

That the multi-path annotations comprehensively capture all valid alternative actions that human evaluators would accept, without systematic omissions that could affect agreement rates.

C3 · one line summary

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
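
To make the claims above concrete, here is a toy sketch of how a multi-path check and an agreement rate could be computed. It assumes a step passes when the agent's action matches any annotated alternative, and that agreement is the fraction of steps where that verdict equals a human verdict. The data layout and action strings are invented for illustration; the paper's actual matching rules may differ.

```python
# Hypothetical data layout: each step lists the annotated acceptable actions,
# the agent's action, and a human evaluator's verdict. Action strings are made up.


def step_is_correct(agent_action: str, accepted_actions: set[str]) -> bool:
    """Multi-path check: the step passes if it matches any annotated alternative."""
    return agent_action in accepted_actions


def agreement_rate(steps: list[dict]) -> float:
    """Fraction of steps where the offline verdict equals the human verdict."""
    matches = sum(
        step_is_correct(s["agent"], s["accepted"]) == s["human_ok"] for s in steps
    )
    return matches / len(steps)


steps = [
    {"agent": "tap:search", "accepted": {"tap:search", "tap:menu"}, "human_ok": True},
    # A valid action that the annotations omit: the offline verdict disagrees.
    {"agent": "scroll:down", "accepted": {"tap:settings"}, "human_ok": True},
    {"agent": "tap:back", "accepted": {"tap:home"}, "human_ok": False},
    {"agent": "type:hello", "accepted": {"type:hello"}, "human_ok": True},
]

print(agreement_rate(steps))  # 0.75
```

The second step illustrates the concern in C2: an acceptable action missing from the annotations lowers agreement even though the agent behaved correctly.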

References

63 extracted · 63 resolved · 11 Pith anchors

[1] Agent S2: A compositional generalist-specialist framework for computer use agents 2025

[2] Language Models are Few-Shot Learners 2020 · arXiv:2005.14165

[3] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2021. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual 2021

[4] arXiv preprint arXiv:2407.17490 2024

[5] The BrowserGym ecosystem for web agent research. arXiv preprint arXiv:2412.05467 2024

Cited by

4 papers in Pith

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

Receipt and verification

First computed	2026-05-18T03:09:32.623517Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe
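
For this record, the Pith Number coincides with the first 26 characters of the Base32 (RFC 4648 alphabet) encoding of the canonical hash. The page does not state this as the derivation rule, so treat the sketch below as an observation about this one record; the helper name `base32_prefix` is illustrative.

```python
import base64

CANONICAL_HASH = "81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe"
PITH_NUMBER = "QHECGYVLDT7OHOERAC35FBGPR7"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Holds for this record; not confirmed as a general rule by the source.
assert base32_prefix(CANONICAL_HASH) == PITH_NUMBER
```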

Aliases

arxiv: 2512.12634 · arxiv_version: 2512.12634v3 · doi: 10.48550/arxiv.2512.12634 · pith_short_12: QHECGYVLDT7O · pith_short_16: QHECGYVLDT7OHOER · pith_short_8: QHECGYVL
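
The aliases listed above follow a visible pattern: the short forms are prefixes of the full Pith Number, and the arXiv-based forms are built from the arXiv identifier and version. A small sketch that reproduces them, assuming that pattern holds (the function name `build_aliases` is illustrative, not part of any Pith API):

```python
PITH_NUMBER = "QHECGYVLDT7OHOERAC35FBGPR7"
ARXIV_ID = "2512.12634"
ARXIV_VERSION = 3


def build_aliases(pith_number: str, arxiv_id: str, version: int) -> dict[str, str]:
    """Rebuild the alias table from the full number and the arXiv source fields."""
    aliases = {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
    }
    # Short forms are plain prefixes of the full number.
    for n in (8, 12, 16):
        aliases[f"pith_short_{n}"] = pith_number[:n]
    return aliases


for key, value in sorted(build_aliases(PITH_NUMBER, ARXIV_ID, ARXIV_VERSION).items()):
    print(f"{key}: {value}")
```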

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/QHECGYVLDT7OHOERAC35FBGPR7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 81c82362ab1cfee3b89100b7d284cf8fdd2331bbd8ad4e3c50084163ad07cdbe

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "1974e3286eef3c7f833714c06a065c85214ebf5b5ac70cc3196a453ce2f2dbe1",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2025-12-14T10:41:39Z",
    "title_canon_sha256": "446390125615900b34aef5e039d642b5de22165a8a8970ee94f2aacaa577efac"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2512.12634",
    "kind": "arxiv",
    "version": 3
  }
}
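
The canonicalization used in the verification command (sorted keys, compact separators, UTF-8 bytes, SHA-256) can also be applied offline to a saved copy of the record. The sketch below mirrors the inline Python from that command; the function name `canonical_sha256` is illustrative. It demonstrates that key order and whitespace do not affect the digest, and it makes no claim about the digest of any particular record.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The same fields written in two different key orders.
a = {"source": {"kind": "arxiv", "id": "2512.12634", "version": 3}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"version": 3, "id": "2512.12634", "kind": "arxiv"}}

# Canonicalization removes ordering differences before hashing.
assert canonical_sha256(a) == canonical_sha256(b)
```

To check a real record, load the JSON with `json.load` and compare the result of `canonical_sha256` with the published canonical hash.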