{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:STZE3XGYUA5FI64VOLYHDLMIWB","short_pith_number":"pith:STZE3XGY","schema_version":"1.0","canonical_sha256":"94f24ddcd8a03a547b9572f071ad88b064a7504c02a0adb1f23fbe038cec5ac2","source":{"kind":"arxiv","id":"2508.05748","version":3},"attestation_state":"computed","paper":{"title":"WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"WebWatcher trains a vision-language agent on synthetic multimodal trajectories and reinforcement learning to outperform baselines on complex VQA tasks.","cross_cats":[],"primary_cat":"cs.IR","authors_text":"Chenxi Wang, Fei Huang, Jialong Wu, Jingren Zhou, Kuan Li, Pengjun Xie, Peng Xia, Qiuchen Wang, Ruixue Ding, Xinyu Geng, Xinyu Wang, Yida Zhao, Yong Jiang, Zhen Zhang","submitted_at":"2025-08-07T18:03:50Z","abstract_excerpt":"Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools compared to text-based agents. To address this limitation, we introduce WebWatcher, a multi-modal Agent for Deep Research equipped with enhanced visual-langua"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2508.05748","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.IR","submitted_at":"2025-08-07T18:03:50Z","cross_cats_sorted":[],"title_canon_sha256":"a543c002b68a22ea3cccb801774aeff5d9c3a7cd3a2ef1ba117c6e419776e988","abstract_canon_sha256":"e5f2ae3615b247e22deaa32da02a6ac383263c0d2ad78dace4e467850ce21504"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.510335Z","signature_b64":"rEASVho+LW+yev6fT7vfCjjrLX516s8BEJg+h+p9BwDlyFjqRe+VxmlmgiVLj5khgMOM/lifmWZBEbwjok+RAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"94f24ddcd8a03a547b9572f071ad88b064a7504c02a0adb1f23fbe038cec5ac2","last_reissued_at":"2026-05-17T23:38:50.509905Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.509905Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"WebWatcher trains a vision-language agent on synthetic multimodal trajectories and reinforcement learning to outperform baselines on complex VQA tasks.","cross_cats":[],"primary_cat":"cs.IR","authors_text":"Chenxi Wang, Fei Huang, Jialong Wu, Jingren Zhou, Kuan Li, Pengjun Xie, Peng Xia, Qiuchen Wang, Ruixue Ding, Xinyu Geng, Xinyu Wang, Yida Zhao, Yong Jiang, Zhen Zhang","submitted_at":"2025-08-07T18:03:50Z","abstract_excerpt":"Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools compared to text-based agents. To address this limitation, we introduce WebWatcher, a multi-modal Agent for Deep Research equipped with enhanced visual-langua"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experimental results show that WebWatcher significantly outperforms proprietary baseline, RAG workflow and open-source agents in four challenging VQA benchmarks, which paves the way for solving complex multimodal information-seeking tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That high-quality synthetic multimodal trajectories enable efficient cold start training for agents requiring stronger reasoning in perception, logic, knowledge, and that reinforcement learning further enhances generalization to complex tasks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"WebWatcher trains a vision-language agent on synthetic multimodal trajectories and reinforcement learning to outperform baselines on complex VQA tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"22ee8a2002767720a94e9009d3be2c17acd2b73820827819092101246b76f308"},"source":{"id":"2508.05748","kind":"arxiv","version":3},"verdict":{"id":"26567cb1-5767-41dd-a7e2-bd7bde24e2af","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T18:53:03.016124Z","strongest_claim":"Experimental results show that WebWatcher significantly outperforms proprietary baseline, RAG workflow and open-source agents in four challenging VQA benchmarks, which paves the way for solving complex multimodal information-seeking tasks.","one_line_summary":"WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That high-quality synthetic multimodal trajectories enable efficient cold start training for agents requiring stronger reasoning in perception, logic, knowledge, and that reinforcement learning further enhances generalization to complex tasks.","pith_extraction_headline":"WebWatcher trains a vision-language agent on synthetic multimodal trajectories and reinforcement learning to outperform baselines on complex VQA tasks."},"references":{"count":31,"sample":[{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":1,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":null,"title":"Why reasoning matters? a survey of advancements in multimodal reasoning (v1)","work_id":"776ae2d4-b7ef-445f-be80-6568ff81d28e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":3,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"","year":null,"title":"M3 cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought","work_id":"4d99c33a-262f-4545-baae-925205f5b2bc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"arXiv preprint arXiv:2302.11713 , year=","work_id":"b7899c66-9ee1-45bf-8a1d-223b8959dee6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":31,"snapshot_sha256":"76492630528aa15e1a63ce6c7c470f818a70fd03e06a51739bacc263ea189853","internal_anchors":11},"formal_canon":{"evidence_count":2,"snapshot_sha256":"5a401a4a930993de4bc33549fba4b44051404b0a8b11f11f265c0734f743c0e0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2508.05748","created_at":"2026-05-17T23:38:50.509968+00:00"},{"alias_kind":"arxiv_version","alias_value":"2508.05748v3","created_at":"2026-05-17T23:38:50.509968+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2508.05748","created_at":"2026-05-17T23:38:50.509968+00:00"},{"alias_kind":"pith_short_12","alias_value":"STZE3XGYUA5F","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"STZE3XGYUA5FI64V","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"STZE3XGY","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":32,"sample":[{"citing_arxiv_id":"2603.28767","citing_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17946","citing_title":"SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13193","citing_title":"FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17531","citing_title":"Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17946","citing_title":"SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27859","citing_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02547","citing_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","ref_index":283,"is_internal_anchor":true},{"citing_arxiv_id":"2509.07969","citing_title":"Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2509.08827","citing_title":"A Survey of Reinforcement Learning for Large Reasoning Models","ref_index":159,"is_internal_anchor":true},{"citing_arxiv_id":"2601.22149","citing_title":"DynaWeb: Model-Based Reinforcement Learning of Web Agents","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2511.05271","citing_title":"DeepEyesV2: Toward Agentic Multimodal Model","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24251","citing_title":"Latent Visual Reasoning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04751","citing_title":"Evaluating the Search Agent in a Parallel World","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28767","citing_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13193","citing_title":"FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13034","citing_title":"ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13831","citing_title":"Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04017","citing_title":"GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12497","citing_title":"From Web to Pixels: Bringing Agentic Search into Visual Perception","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27859","citing_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08762","citing_title":"Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07177","citing_title":"HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27859","citing_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20486","citing_title":"ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19264","citing_title":"DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents","ref_index":71,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB","json":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB.json","graph_json":"https://pith.science/api/pith-number/STZE3XGYUA5FI64VOLYHDLMIWB/graph.json","events_json":"https://pith.science/api/pith-number/STZE3XGYUA5FI64VOLYHDLMIWB/events.json","paper":"https://pith.science/paper/STZE3XGY"},"agent_actions":{"view_html":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB","download_json":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB.json","view_paper":"https://pith.science/paper/STZE3XGY","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2508.05748&json=true","fetch_graph":"https://pith.science/api/pith-number/STZE3XGYUA5FI64VOLYHDLMIWB/graph.json","fetch_events":"https://pith.science/api/pith-number/STZE3XGYUA5FI64VOLYHDLMIWB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB/action/storage_attestation","attest_author":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB/action/author_attestation","sign_citation":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB/action/citation_signature","submit_replication":"https://pith.science/pith/STZE3XGYUA5FI64VOLYHDLMIWB/action/replication_record"}},"created_at":"2026-05-17T23:38:50.509968+00:00","updated_at":"2026-05-17T23:38:50.509968+00:00"}