Pith Number

pith:STZE3XGY

pith:2025:STZE3XGYUA5FI64VOLYHDLMIWB
not attested · not anchored · not stored · refs resolved

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Chenxi Wang, Fei Huang, Jialong Wu, Jingren Zhou, Kuan Li, Pengjun Xie, Peng Xia, Qiuchen Wang, Ruixue Ding, Xinyu Geng, Xinyu Wang, Yida Zhao, Yong Jiang, Zhen Zhang

WebWatcher trains a vision-language agent on synthetic multimodal trajectories and reinforcement learning to outperform baselines on complex VQA tasks.

arxiv:2508.05748 v3 · 2025-08-07 · cs.IR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{STZE3XGYUA5FI64VOLYHDLMIWB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Experimental results show that WebWatcher significantly outperforms the proprietary baseline, the RAG workflow, and open-source agents on four challenging VQA benchmarks, which paves the way for solving complex multimodal information-seeking tasks.

C2 weakest assumption

That high-quality synthetic multimodal trajectories enable efficient cold-start training for agents that require stronger reasoning across perception, logic, and knowledge, and that reinforcement learning further enhances generalization to complex tasks.

C3 one line summary

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.

References

31 extracted · 31 resolved · 11 Pith anchors

[1] Qwen2.5-VL Technical Report · arXiv:2502.13923
[2] Why reasoning matters? a survey of advancements in multimodal reasoning (v1)
[3] Evaluating Large Language Models Trained on Code · arXiv:2107.03374
[4] M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought
[5] arXiv preprint arXiv:2302.11713

Formal links

2 machine-checked theorem links

Cited by

32 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:50.509905Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

94f24ddcd8a03a547b9572f071ad88b064a7504c02a0adb1f23fbe038cec5ac2

Aliases

arxiv: 2508.05748 · arxiv_version: 2508.05748v3 · doi: 10.48550/arxiv.2508.05748 · pith_short_12: STZE3XGYUA5F · pith_short_16: STZE3XGYUA5FI64V · pith_short_8: STZE3XGY
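
Comparing the values on this page suggests how the identifiers relate: the full Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each short alias is a prefix of that number. This is an observation drawn from this record only, not a documented rule of the schema, and the function name `pith_number_from_hash` is hypothetical. A minimal Python sketch under that assumption:

```python
import base64


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex SHA-256 digest and keep the first `length` characters.

    Assumption: this rule is inferred from the values shown on this record,
    not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Canonical hash as displayed on this page.
canonical_hash = "94f24ddcd8a03a547b9572f071ad88b064a7504c02a0adb1f23fbe038cec5ac2"

number = pith_number_from_hash(canonical_hash)

# The short aliases are plain prefixes of the full number.
aliases = {f"pith_short_{n}": number[:n] for n in (8, 12, 16)}

print(number)
print(aliases)
```

If the inferred rule holds for other records too, a mirror could recompute the number and its aliases from the canonical hash alone, without trusting the alias list it was given.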
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 94f24ddcd8a03a547b9572f071ad88b064a7504c02a0adb1f23fbe038cec5ac2
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e5f2ae3615b247e22deaa32da02a6ac383263c0d2ad78dace4e467850ce21504",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.IR",
    "submitted_at": "2025-08-07T18:03:50Z",
    "title_canon_sha256": "a543c002b68a22ea3cccb801774aeff5d9c3a7cd3a2ef1ba117c6e419776e988"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2508.05748",
    "kind": "arxiv",
    "version": 3
  }
}
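
The one-line check in the Agent API section can also be run offline against a saved copy of the record. The sketch below mirrors the same serialization (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) and hashes the result. `canonical_sha256` is a hypothetical helper name, and it is an assumption that the JSON as displayed here is identical to the served `canonical_record`, so treat the comparison at the end as a check rather than a guaranteed match.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e5f2ae3615b247e22deaa32da02a6ac383263c0d2ad78dace4e467850ce21504",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.IR",
        "submitted_at": "2025-08-07T18:03:50Z",
        "title_canon_sha256": "a543c002b68a22ea3cccb801774aeff5d9c3a7cd3a2ef1ba117c6e419776e988",
    },
    "schema_version": "1.0",
    "source": {"id": "2508.05748", "kind": "arxiv", "version": 3},
}

expected = "94f24ddcd8a03a547b9572f071ad88b064a7504c02a0adb1f23fbe038cec5ac2"
digest = canonical_sha256(record)

print(digest)
print("matches canonical hash:", digest == expected)
```

Because keys are sorted before hashing, the order in which a mirror stores or transmits the fields does not change the digest; only the values and the serialization settings do.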