SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

Ben Chen; Chenyi Lei; Huangyu Dai; Lingtao Mao; Wenwu Ou; Xinyu Sun; Zihan Liang

arxiv: 2605.17946 · v2 · pith:TACJYF3Lnew · submitted 2026-05-18 · 💻 cs.AI · cs.CV · cs.LG

Pith reviewed 2026-05-21 08:44 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords multimodal benchmark · short-video frame search · gaming domain · agentic search · knowledge-intensive QA · visual grounding · RAG evaluation · tool-use behavior

The pith

A new benchmark reveals that practical AI agents reach only 79.1% on ambiguous game video frames while oracle knowledge hits 95.4%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SVFSearch introduces the first open benchmark for short-video frame search in the Chinese gaming domain, built around 5,000 four-choice questions drawn from paused scenes in real clips. It supplies a fixed offline retrieval environment with a game-domain text corpus and topic-linked image gallery so that models can be tested for retrieval actions, tool use, and reasoning without relying on live web search. Systematic evaluation of direct QA, RAG workflows, Plan-Act-Replan agents, and learned search models shows a clear hierarchy: top open-source direct models reach 66.4%, best practical agents reach 79.1%, and oracle access to perfect knowledge reaches 95.4%. The results matter because short-video platforms routinely present visually ambiguous frames that demand vertical, fast-changing domain knowledge that general multimodal models still lack.

Core claim

The paper establishes that even the strongest practical agentic systems fall well short of oracle performance on short-video frame questions in gaming, while exposing concrete bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use patterns such as over-search and answer-only shortcuts.

What carries the argument

The SVFSearch benchmark together with its frozen retrieval environment, which pairs each of the 5,000 test examples with a game-domain text corpus, a topic-linked image gallery, and standardized text, image, and multimodal retrieval interfaces.
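As a rough illustration of what a frozen retrieval environment with text, image, and multimodal interfaces might look like, the following is a minimal sketch. The `Doc` schema, the `FrozenIndex` class, the toy two-dimensional embeddings, and the cosine ranking are all assumptions made for illustration; they are not the benchmark's released interfaces.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Doc:
    """One corpus entry. The schema is hypothetical; `kind` is "text" or "image"."""
    doc_id: str
    kind: str
    embedding: Tuple[float, ...]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FrozenIndex:
    """In-memory stand-in for a fixed offline index: contents never change after construction."""

    def __init__(self, docs: Sequence[Doc]) -> None:
        self._docs = tuple(docs)

    def search(self, query: Sequence[float], kind: Optional[str] = None, k: int = 3) -> List[str]:
        """Return ids of the top-k entries; kind=None searches text and images together."""
        pool = [d for d in self._docs if kind is None or d.kind == kind]
        ranked = sorted(pool, key=lambda d: cosine(query, d.embedding), reverse=True)
        return [d.doc_id for d in ranked[:k]]


# Toy corpus with made-up 2-D embeddings.
index = FrozenIndex([
    Doc("t1", "text", (1.0, 0.0)),
    Doc("t2", "text", (0.0, 1.0)),
    Doc("i1", "image", (0.9, 0.1)),
])
text_hits = index.search((1.0, 0.0), kind="text", k=1)  # text-only retrieval
mixed_hits = index.search((1.0, 0.0), k=2)              # retrieval across both modalities
```

Because the index is built once and never mutated, two methods issuing the same query see the same results, which is the property that makes cross-paradigm comparison fair.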

If this is right

Agentic planning of retrieval actions raises accuracy over direct question answering by roughly 13 points (66.4% for the best open-source direct-QA model versus 79.1% for the best practical agent).
Common failure modes include over-search, answer-only shortcuts, and retrieval that introduces misleading evidence.
Visual grounding and evidence-grounded reasoning remain limiting factors even when retrieval tools are available.
The 95.4% oracle ceiling indicates that current systems still miss substantial domain knowledge that the provided corpus contains.
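The gaps behind these statements follow from simple arithmetic on the three headline accuracies. The sketch below uses only the reported numbers; the variable names are illustrative, and the comparison assumes the three figures are directly comparable even though they may come from different backbones.

```python
# Headline accuracies reported for SVFSearch (percent).
direct_qa, agent, oracle = 66.4, 79.1, 95.4

agent_gain = round(agent - direct_qa, 1)      # points gained by the best practical agent
remaining_gap = round(oracle - agent, 1)      # points still separating agents from oracle knowledge
share_closed = round((agent - direct_qa) / (oracle - direct_qa), 3)  # fraction of the direct-to-oracle gap closed

print(agent_gain, remaining_gap, share_closed)  # prints: 12.7 16.3 0.438
```

On these numbers, practical agents close a little under half of the distance between direct answering and oracle knowledge.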

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the same controlled-retrieval design to other vertical short-video domains such as sports or product reviews would test whether the observed bottlenecks are domain-specific.
Improving the alignment between the image gallery and the text corpus could raise the practical-agent ceiling without any change to the underlying models.
The gap between 79.1% and 95.4% suggests that future agents may need tighter multimodal fusion inside the retrieval step itself rather than sequential tool calls.

Load-bearing premise

The 5,000 curated test examples and the supplied game-domain text corpus plus image gallery together form a representative and unbiased proxy for real short-video frame search tasks in gaming.

What would settle it

A practical agent that achieves above 90% accuracy on the SVFSearch test set while restricted to the provided offline text and image retrieval interfaces would show that the reported performance gap can be closed under the benchmark's own rules.
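To judge whether a given score clears such a threshold by more than sampling noise, one can attach an approximate confidence interval to an accuracy measured on 5,000 questions. The sketch below uses the normal-approximation (Wald) interval and treats test examples as independent; both are simplifying assumptions, and the function name is hypothetical.

```python
from math import sqrt


def wald_interval(accuracy: float, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    se = sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


n_test = 5000                                # size of the SVFSearch test split
low, high = wald_interval(0.791, n_test)     # best practical agent
half_width_points = 100 * (high - low) / 2   # interval half-width in percentage points

# Smallest number of correct answers that is strictly above 90% of the test set.
min_correct_above_90 = n_test * 9 // 10 + 1
```

Under these assumptions the half-width is roughly 1.1 points, so both the distance from 79.1% to a 90% target and the distance to the 95.4% oracle figure are far larger than sampling error on the test split alone.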

Figures

Figures reproduced from arXiv: 2605.17946 by Ben Chen, Chenyi Lei, Huangyu Dai, Lingtao Mao, Wenwu Ou, Xinyu Sun, Zihan Liang.

**Figure 1.** Overview of SVFSearch. Top row: benchmark construction from game-specific core elements, short-video frames, and web-sourced knowledge to QA splits and frozen retrieval resources. Bottom left: a Plan-Act-Replan agent that dynamically decides whether more information is needed, selects retrieval tools, and integrates returned evidence before answering. Bottom right: MMSearchR1-Game, which learns search-an…

**Figure 2.** Representative examples from SVFSearch. Examples show paused frames, video-side metadata, and multiple-choice QA instances. Stage 1: Core Element and Knowledge Construction. We first collect 221 popular games covering diverse genres. Based on in-platform user queries, we mine game-specific core elements for each game, including characters, equipment, maps, story events, skills, and gameplay mechanics. Thi…

**Figure 3.** Distribution analysis of SVFSearch. Test examples grouped by question theme, question type, and difficulty. The test split is dominated by character questions, factual Q&A types, and medium-difficulty examples, while retaining long-tail themes and harder cases for stratified analysis. using 256-dimensional features from a fine-tuned DINOv3-Base model, and a multimodal index using 512-dimensional Qwen3-VL-E…

**Figure 4.** Tool-use diagnostics. Left: PAR tool calls, accuracy, and average planning rounds across backbones. Right: item-level search rates and accuracy of MS-R1-style models. Search-rate bars on the right are not mutually exclusive. of examples where a method invokes at least one retrieval tool. Direct QA and Oracle Knowledge do not invoke SVFSearch retrieval tools, so their SR is marked as “—”. Qwen2.5-VL-7B-CoT …

**Figure 5.** Retrieval gains and search behavior. Left: accuracy decomposition from Direct QA to RAG Workflow, PAR, and Oracle Knowledge. Right: correctness and search-usage breakdown for prompt-only and trained MS-R1-style models. RL training changes tool-use behavior, but the effect depends strongly on the task and reward design. The released Qwen2.5-VL-7B MMSearch-R1 model searches on 72.8% of examples, yet remains …
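The search-rate diagnostic referred to in these captions, the share of examples on which a method invokes at least one retrieval tool, can be computed from per-example logs. The following is a minimal sketch; the `Record` fields and the toy log are hypothetical and do not reflect the paper's logging format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """Per-example outcome (hypothetical log format)."""
    correct: bool
    tool_calls: int


def accuracy(records: List[Record]) -> Optional[float]:
    """Fraction of correct answers, or None for an empty group."""
    return sum(r.correct for r in records) / len(records) if records else None


def search_rate(records: List[Record]) -> float:
    """Fraction of examples with at least one retrieval tool call."""
    return sum(r.tool_calls > 0 for r in records) / len(records)


def accuracy_by_search(records: List[Record]) -> Dict[str, Optional[float]]:
    """Accuracy split by whether the method searched on that example."""
    searched = [r for r in records if r.tool_calls > 0]
    skipped = [r for r in records if r.tool_calls == 0]
    return {"searched": accuracy(searched), "not_searched": accuracy(skipped)}


# Toy log of five examples.
records = [Record(True, 0), Record(False, 2), Record(True, 1), Record(True, 3), Record(False, 0)]
sr = search_rate(records)
split = accuracy_by_search(records)
```

Splitting accuracy by search usage is one way to surface patterns such as over-search, where a high search rate does not translate into higher accuracy on the searched examples.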

Original abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.
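The Plan-Act-Replan paradigm named in the abstract can be pictured as a loop that alternates between deciding whether more evidence is needed and calling a retrieval tool. The sketch below is a generic illustration under stated assumptions: the `planner`, `answerer`, and `text_search` callables are stubs invented here, and the round cap is an assumed guard against over-search, not a setting taken from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], List[str]]
# The planner returns (tool_name, query) to keep searching, or None to stop and answer.
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]
Answerer = Callable[[str, List[str]], str]


def plan_act_replan(question: str, tools: Dict[str, Tool], planner: Planner,
                    answerer: Answerer, max_rounds: int = 3) -> Tuple[str, int]:
    """Run plan -> act -> replan until the planner stops or the round cap is hit."""
    evidence: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        step = planner(question, evidence)         # plan, or replan on the evidence so far
        if step is None:                           # planner judges the evidence sufficient
            break
        tool_name, query = step
        evidence.extend(tools[tool_name](query))   # act: call the selected retrieval tool
        rounds += 1
    return answerer(question, evidence), rounds


# Stub components for a toy four-choice question (all hypothetical).
tools = {"text_search": lambda q: ["Hero X wields the Frost Blade."]}
planner = lambda q, ev: None if ev else ("text_search", q)
answerer = lambda q, ev: "B" if any("Frost Blade" in e for e in ev) else "A"

choice, rounds_used = plan_act_replan("Which weapon does Hero X wield?", tools, planner, answerer)
```

In a real system the planner and answerer would be calls to a multimodal model; the returned round count is the kind of quantity a tool-use diagnostic would aggregate.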

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

SVFSearch gives a clean first benchmark for short-video frame search in Chinese gaming with a fixed retrieval setup and shows real gaps between direct QA, agents, and oracles, but the test examples' representativeness is not yet proven.

SVFSearch is the first open benchmark aimed at paused-frame search in Chinese gaming short videos. It supplies 5,000 four-choice test questions drawn from real clips, a frozen game-domain text corpus, a topic-linked image gallery, and fixed interfaces for text, image, and multimodal retrieval. This controlled environment removes dependence on live web APIs and lets different methods be compared directly on the same data.

Referee Report

1 major / 2 minor

Summary. The paper introduces SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. It consists of 5,000 four-choice test examples and 4,198 auxiliary training examples centered on paused game scenes from real short-video clips. To enable fair evaluation, the benchmark supplies a frozen offline retrieval environment including a game-domain text corpus, a topic-linked image gallery, and text/image/multimodal retrieval interfaces. Evaluations of direct QA, RAG workflows, Plan-Act-Replan agents, and learned search models reveal large performance gaps: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis identifies bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behaviors such as over-search and answer-only shortcuts.

Significance. If the 5,000 examples and offline corpus constitute a representative proxy for real gaming short-video tasks, the work would be significant for exposing concrete limitations of current multimodal LLMs and agentic systems in visually ambiguous, knowledge-intensive vertical domains. A notable strength is the controlled, reproducible offline retrieval setup that avoids dependence on uncontrolled web APIs, enabling fair comparisons across paradigms. This provides a useful testbed for diagnosing and addressing bottlenecks in multimodal retrieval and reasoning.

major comments (1)

[Dataset Construction and Evaluation Setup] The central claim that the measured gaps (66.4% direct-QA vs. 79.1% practical agent vs. 95.4% oracle) reflect genuine bottlenecks in visual grounding and retrieval (rather than artifacts) depends on the 5,000 four-choice questions, game-domain text corpus, and topic-linked image gallery forming a representative and unbiased proxy for real short-video frame search tasks. The manuscript provides no quantitative validation of this assumption, such as inter-annotator agreement, diversity metrics across game genres, or distributional checks against real short-video data (see Dataset Construction and Evaluation sections).

minor comments (2)

[Abstract] The abstract reports specific performance numbers but does not name the exact models achieving 66.4% and 79.1%; adding these identifiers would improve clarity and reproducibility.
[Benchmark Construction] The description of the retrieval interfaces could benefit from a table summarizing the available tools, their inputs/outputs, and any constraints to aid readers in replicating the agentic setups.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and commit to revisions that strengthen the validation of our benchmark's representativeness.

Referee: [Dataset Construction and Evaluation Setup] The central claim that the measured gaps (66.4% direct-QA vs. 79.1% practical agent vs. 95.4% oracle) reflect genuine bottlenecks in visual grounding and retrieval (rather than artifacts) depends on the 5,000 four-choice questions, game-domain text corpus, and topic-linked image gallery forming a representative and unbiased proxy for real short-video frame search tasks. The manuscript provides no quantitative validation of this assumption, such as inter-annotator agreement, diversity metrics across game genres, or distributional checks against real short-video data (see Dataset Construction and Evaluation sections).

Authors: We agree that additional quantitative validation would help confirm that the observed performance gaps reflect genuine bottlenecks rather than dataset artifacts. The current manuscript describes the sourcing of examples from real short-video clips in the Chinese gaming domain, the curation of the frozen text corpus and topic-linked image gallery, and the four-choice question format, but does not report the specific metrics noted. In the revised version, we will expand the Dataset Construction section to include: (1) inter-annotator agreement computed on a 500-example subset independently annotated by three domain experts (reporting both exact match and relaxed agreement on choices); (2) diversity metrics such as the distribution of game genres (MOBA, FPS, RPG, etc.) and topic categories with percentages and entropy measures; and (3) distributional checks comparing statistics like average number of visual entities per frame, vocabulary overlap with long-tail terms, and video length against a larger sample of 10,000 real short-video frames from the source platform. These additions will support the claim that SVFSearch serves as a representative proxy while preserving the core evaluation results.

revision: yes
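Two of the validation metrics named in this response, agreement among three annotators and entropy of a genre distribution, can be sketched as follows. This is an illustrative reading only: "relaxed agreement" is interpreted here as at least two of three annotators choosing the same option, which is an assumption, and the labels and genre counts are made-up toy data.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Tuple

Labels = List[Tuple[str, str, str]]  # one (annotator1, annotator2, annotator3) choice per item


def exact_agreement(labels: Labels) -> float:
    """Fraction of items on which all three annotators chose the same option."""
    return sum(len(set(item)) == 1 for item in labels) / len(labels)


def majority_agreement(labels: Labels) -> float:
    """Fraction of items on which at least two annotators agree (assumed 'relaxed' criterion)."""
    return sum(Counter(item).most_common(1)[0][1] >= 2 for item in labels) / len(labels)


def shannon_entropy(counts: Dict[str, int]) -> float:
    """Entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


# Toy annotations for four items and a toy genre histogram.
labels = [("A", "A", "A"), ("B", "B", "C"), ("A", "C", "D"), ("D", "D", "D")]
exact = exact_agreement(labels)
relaxed = majority_agreement(labels)
genre_entropy = shannon_entropy({"MOBA": 2, "FPS": 1, "RPG": 1})
```

Higher entropy indicates a more even spread across genres; a chance-corrected statistic such as Fleiss' kappa would be a natural complement to the raw agreement rates.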

Circularity Check

0 steps flagged

No circularity in benchmark construction or evaluation

This is an empirical benchmark paper that releases a fixed test set, corpus, and retrieval interfaces, then measures model performance on them. No mathematical derivation chain, parameter fitting, or self-referential reduction exists; the reported gaps (66.4% direct QA vs 79.1% agent vs 95.4% oracle) are direct empirical measurements on externally supplied data rather than quantities derived from the results themselves. The representativeness concern is a validity issue, not a circularity issue under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the representativeness of the chosen frames, questions, and retrieval corpus rather than on new parameters or entities.

axioms (1)

domain assumption: The 5,000 four-choice examples centered on paused game scenes accurately capture the visual ambiguity and vertical knowledge demands of real short-video queries.
Stated in the abstract as the basis for the benchmark construction.

pith-pipeline@v0.9.0 · 5813 in / 1236 out tokens · 44237 ms · 2026-05-21T08:44:41.360162+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 14 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 2022

work page 2022
[2]

System Card: Claude Opus 4 & Claude Sonnet 4

Anthropic . System Card: Claude Opus 4 & Claude Sonnet 4 . https://www.anthropic.com/claude-4-system-card, 2025

work page 2025
[4]

Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, et al. Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai. Advances in Neural Information Processing Systems, 2024

work page 2024
[5]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[6]

Gemini 3.1 Pro Preview

Google AI for Developers . Gemini 3.1 Pro Preview . https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview, 2026

work page 2026
[7]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017

work page 2017
[10]

LangGraph Overview

LangChain Inc. LangGraph Overview . https://docs.langchain.com/oss/python/langgraph/overview, 2024

work page 2024
[11]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, 2023

work page 2023
[13]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 2023

work page 2023
[14]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

work page 2024
[15]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, 2019

work page 2019
[16]

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, Andr \'e Araujo, and Vittorio Ferrari. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3113--3124, 2023

work page 2023
[17]

GPT-5 System Card

OpenAI . GPT-5 System Card . https://openai.com/index/gpt-5-system-card/, 2025

work page 2025
[18]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 2023

work page 2023
[21]

Charxiv: Charting gaps in realistic chart understanding in multimodal llms

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 2024 b

work page 2024
[24]

Advances in neural information processing systems , year=

Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , year=

work page
[25]

International conference on machine learning , year=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , year=

work page
[26]

Advances in neural information processing systems , year=

Visual instruction tuning , author=. Advances in neural information processing systems , year=

work page
[27]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

work page
[28]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Qwen3-VL Technical Report

Qwen3-vl technical report , author=. arXiv preprint arXiv:2511.21631 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

2026 , howpublished =

work page 2026
[31]

2025 , howpublished =

work page 2025
[32]

Proceedings of the IEEE international conference on computer vision , year=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , year=

work page
[33]

Proceedings of the IEEE conference on computer vision and pattern recognition , year=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , year=

work page
[34]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

work page
[35]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , year=

Docvqa: A dataset for vqa on document images , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , year=

work page
[36]

Findings of the association for computational linguistics: ACL 2022 , year=

Chartqa: A benchmark for question answering about charts with visual and logical reasoning , author=. Findings of the association for computational linguistics: ACL 2022 , year=

work page 2022
[37]

Science China Information Sciences , year=

Ocrbench: on the hidden mystery of ocr in large multimodal models , author=. Science China Information Sciences , year=

work page
[38]

Advances in neural information processing systems , year=

Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in neural information processing systems , year=

work page
[39]

Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , year=

Ok-vqa: A visual question answering benchmark requiring external knowledge , author=. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , year=

work page
[40]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Mme: A comprehensive evaluation benchmark for multimodal large language models , author=. arXiv preprint arXiv:2306.13394 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

European conference on computer vision , year=

Mmbench: Is your multi-modal model an all-around player? , author=. European conference on computer vision , year=

work page
[42]

Advances in Neural Information Processing Systems , year=

Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai , author=. Advances in Neural Information Processing Systems , year=

work page
[43]

Advances in Neural Information Processing Systems , year=

Charxiv: Charting gaps in realistic chart understanding in multimodal llms , author=. Advances in Neural Information Processing Systems , year=

work page
[44]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. arXiv preprint arXiv:2308.02490 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts , author=. arXiv preprint arXiv:2310.02255 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

work page
[47]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year=

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year=

work page
[48]

Advances in neural information processing systems , year=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , year=

work page
[49]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

Webqa: Multihop and multimodal qa , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , year=

work page
[50]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han

Mmsearch: Benchmarking the potential of large models as multi-modal search engines , author=. arXiv preprint arXiv:2409.12959 , year=

work page arXiv
[51]

arXiv preprint arXiv:2508.21475 , year=

Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents , author=. arXiv preprint arXiv:2508.21475 , year=

work page arXiv
[52]

arXiv preprint arXiv:2411.02937 , year=

Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent , author=. arXiv preprint arXiv:2411.02937 , year=

work page arXiv
[53]

arXiv preprint arXiv:2410.08182 , year=

Mrag-bench: Vision-centric evaluation for retrieval-augmented multimodal models , author=. arXiv preprint arXiv:2410.08182 , year=

work page arXiv
[54]

Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , year=

MRAMG-Bench: a comprehensive benchmark for advancing multimodal retrieval-augmented multimodal generation , author=. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , year=

work page
[55]

ReAct: Synergizing Reasoning and Acting in Language Models

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

2023 , howpublished =

LangChain , author =. 2023 , howpublished =

work page 2023
[57]

2024 , howpublished =

work page 2024
[58]

2023 , howpublished =

LLM Powered Autonomous Agents , author =. 2023 , howpublished =

work page 2023
[59]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

R1-searcher: Incentivizing the search capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2503.05592 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , year=

Can pre-trained vision and language models answer visual information-seeking questions? , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , year=

work page 2023
[64]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[65]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning , author=. arXiv preprint arXiv:2503.19470 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

MMSearch-R1: Incentivizing LMMs to Search

Mmsearch-r1: Incentivizing lmms to search , author=. arXiv preprint arXiv:2506.20670 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[67]

arXiv preprint arXiv:2510.12801 , year=

Deepmmsearch-r1: Empowering multimodal llms in multimodal web search , author=. arXiv preprint arXiv:2510.12801 , year=

work page arXiv
[68]

arXiv preprint arXiv:2508.13186 , year=

Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents , author=. arXiv preprint arXiv:2508.13186 , year=

work page arXiv
[69]

BrowseComp- V\^

Zhang, Huanyao and Zhou, Jiepeng and Li, Bo and Zhou, Bowen and Shan, Yanzhe and Lu, Haishan and Cao, Zhiyong and Chen, Jiaoyang and Han, Yuqian and Sheng, Zinan and others , journal=. BrowseComp- V\^

work page
[70]

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Webwatcher: Breaking new frontier of vision-language deep research agent , author=. arXiv preprint arXiv:2508.05748 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[71]

Proceedings of the 33rd ACM International Conference on Multimedia , year=

Vqa2: visual question answering for video quality assessment , author=. Proceedings of the 33rd ACM International Conference on Multimedia , year=

work page
[72]

DINOv3

Dinov3 , author=. arXiv preprint arXiv:2508.10104 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 2022

work page 2022

[2] [2]

System Card: Claude Opus 4 & Claude Sonnet 4

Anthropic . System Card: Claude Opus 4 & Claude Sonnet 4 . https://www.anthropic.com/claude-4-system-card, 2025

work page 2025

[3] [4]

Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, et al. Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai. Advances in Neural Information Processing Systems, 2024

work page 2024

[4] [5]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023

[5] [6]

Google AI for Developers. Gemini 3.1 Pro Preview. https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview, 2026.

[6] [7]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[7] [10]

LangChain Inc. LangGraph Overview. https://docs.langchain.com/oss/python/langgraph/overview, 2024.

[8] [11]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.

[9] [13]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 2023.

[10] [14]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[11] [15]

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

[12] [16]

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3113–3124, 2023.

[13] [17]

OpenAI. GPT-5 System Card. https://openai.com/index/gpt-5-system-card/, 2025.

[14] [18]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 2023.

[15] [21]

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 2024.

[16] [24]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 2022.

[17] [25]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.

[18] [26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 2023.

[19] [27]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[20] [28]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.

[21] [29]

Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.

[22] [30]

2026 , howpublished =


[23] [31]

2025 , howpublished =


[24] [32]

Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision.

[25] [33]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[26] [34]

Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27] [35]

Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.

[28] [36]

Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, 2022.

[29] [37]

Ocrbench: On the hidden mystery of ocr in large multimodal models. Science China Information Sciences.

[30] [38]

Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems.

[31] [39]

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

[32] [40]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.

[33] [41]

Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision.

[34] [42]

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, et al. Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai. Advances in Neural Information Processing Systems, 2024.

[35] [43]

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 2024.

[36] [44]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv preprint arXiv:2308.02490.

[37] [45]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv preprint arXiv:2310.02255.

[38] [46]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] [47]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[40] [48]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 2023.

[41] [49]

Webqa: Multihop and multimodal qa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] [50]

Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959.

[43] [51]

Mmsearch-plus: Benchmarking provenance-aware search for multimodal browsing agents. arXiv preprint arXiv:2508.21475.

[44] [52]

Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent. arXiv preprint arXiv:2411.02937.

[45] [53]

Mrag-bench: Vision-centric evaluation for retrieval-augmented multimodal models. arXiv preprint arXiv:2410.08182.

[46] [54]

MRAMG-Bench: A comprehensive benchmark for advancing multimodal retrieval-augmented multimodal generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[47] [55]

ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629.

[48] [56]

LangChain. 2023.

[49] [57]

2024 , howpublished =


[50] [58]

LLM Powered Autonomous Agents. 2023.

[51] [59]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[52] [60]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[53] [61]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516.

[54] [62]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2503.05592.

[55] [63]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[56] [64]

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3113–3124, 2023.

[57] [65]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv preprint arXiv:2503.19470.

[58] [66]

MMSearch-R1: Incentivizing LMMs to Search. arXiv preprint arXiv:2506.20670.

[59] [67]

Deepmmsearch-r1: Empowering multimodal llms in multimodal web search. arXiv preprint arXiv:2510.12801.

[60] [68]

Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents. arXiv preprint arXiv:2508.13186.

[61] [69]

Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, et al. BrowseComp- V\^

[62] [70]

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent. arXiv preprint arXiv:2508.05748.

[63] [71]

Vqa2: Visual question answering for video quality assessment. In Proceedings of the 33rd ACM International Conference on Multimedia.

[64] [72]

DINOv3. arXiv preprint arXiv:2508.10104.