InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou; Jianfei Yang; Jiayan Guo; Jiuning Gu; Ronghao Dang; Sicong Leng; Xin Li; Xuemeng Song

arxiv: 2605.07510 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.CL· cs.IR

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou , Jiuning Gu , Jiayan Guo , Ronghao Dang , Sicong Leng , Xin Li , Xuemeng Song , Jianfei Yang This is my paper

Pith reviewed 2026-05-11 02:47 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.IR

keywords interleaved multimodal searchagentic searchmultimodal benchmarkvisual evidence seekingsearch controlmultimodal integrationagent evaluation

0 comments

The pith

Current multimodal agents achieve less than 50 percent accuracy on interleaved language-vision search tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InterLV-Search as a benchmark for testing how agents interleave textual and visual evidence in ongoing searches, where each piece of evidence guides the next query. It organizes 2,061 examples into three levels of rising difficulty and adds tasks that compare multiple entities across modalities. Experiments with both proprietary and open-source agents show they fall short, with the strongest result under 50 percent overall accuracy. The work matters because everyday agent use often requires chaining evidence from images and text rather than handling them in isolation. It points to concrete shortfalls in seeking visual information, managing search steps, and fusing different evidence types.

Core claim

InterLV-Search measures interleaved multimodal agentic search in which textual and visual evidence repeatedly conditions subsequent searches. The benchmark holds 2,061 examples across active visual evidence seeking, controlled offline interleaved search, and open-web interleaved search, plus multimodal multi-branch samples that require comparing multiple entities. An accompanying InterLV-Agent supplies standardized tool use, trajectory logging, and evaluation. Tests on current agents indicate they remain far from solving these tasks, with the best model below 50 percent overall accuracy due to weaknesses in visual evidence seeking, search control, and multimodal evidence integration.

What carries the argument

The InterLV-Search benchmark, which organizes tasks into three levels of interleaved language-vision search with added multi-branch comparison samples and supports evaluation through the InterLV-Agent tool interface.

If this is right

Multimodal agents need stronger mechanisms for active visual evidence seeking during search trajectories.
Better strategies for controlling search steps that mix text and image evidence become necessary.
Improved fusion of multimodal evidence across repeated search steps is required for higher performance.
The benchmark can track progress as new agent designs attempt to close the identified gaps.
Tasks involving comparison of multiple entities expose a distinct weakness in current systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent architectures may benefit from built-in modules that explicitly plan evidence interleaving.
The benchmark could extend naturally to test agents on live, changing web content beyond the current open-web level.
Insights on search control failures may transfer to improving sequential decision-making in vision-language models.
Targeted training on interleaved trajectories could narrow the gap for open-source agents in particular.

Load-bearing premise

The automated pipelines for Levels 1 and 2 plus the machine-led human-supervised pipeline for Level 3 produce examples that validly test interleaved multimodal search without introducing unintended biases or artifacts.

What would settle it

Demonstration of a single multimodal agent that reaches above 70 percent accuracy across all three levels of InterLV-Search, especially on open-web interleaved tasks, would directly test whether the performance gap claim holds.

read the original abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce \textbf{InterLV-Search}, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's new benchmark for interleaved multimodal agentic search shows real performance gaps below 50% on top models, but automated example construction leaves open whether those gaps are fully on the agents or partly on the test itself.

read the letter

The main point is that InterLV-Search targets a gap in existing multimodal search benchmarks by requiring repeated interleaving of text and visual evidence across trajectories, plus multi-branch comparison samples. The three levels progress from active visual seeking to controlled offline search to open-web search, and the authors supply InterLV-Agent for standardized tool use and logging. They evaluate both proprietary and open-source agents and report that even the strongest stays under 50% overall accuracy, pointing to weaknesses in visual evidence seeking, search control, and integration. That result is useful to know if you work on agentic retrieval systems. The release of the 2,061 examples and evaluation code is a concrete plus; it lets others run the same tests without rebuilding the setup from scratch. The construction approach is described clearly enough in the abstract and methods to see the intent: automated pipelines for the first two levels and a machine-led human-supervised pipeline for the open-web level. This keeps the scale manageable while trying to cover realistic interleaving. The multi-branch samples add a useful dimension that prior benchmarks largely skipped. The soft spot is the validation of those pipelines. The low scores could reflect genuine agent limits, but they could also reflect patterns or shortcuts that the automated generation introduced, especially in the multi-branch cases where evidence ordering might be predictable. The paper would be stronger with more explicit checks on example quality, inter-annotator agreement where humans were involved, and error analysis showing that failures are not just artifacts of how the questions were built. Without that, the central claim that current systems are far from solving the task rests on an assumption that needs tighter support. This is the kind of benchmark paper that belongs in a reading group focused on agent evaluation or multimodal retrieval. It is worth citing if you plan to test new agents on interleaved tasks, though I would wait for a revised version with stronger validation details before treating the numbers as definitive. It deserves peer review because the core idea fills a documented gap and the experiments are run on a released codebase, even if the construction details will need scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces InterLV-Search, a benchmark of 2,061 examples for evaluating interleaved language-vision agentic search in multimodal agents. It spans three levels—active visual evidence seeking (Level 1), controlled offline interleaved search (Level 2), and open-web interleaved search (Level 3)—with added multimodal multi-branch comparison samples. Levels 1–2 are built via automated pipelines and Level 3 via a machine-led human-supervised open-web pipeline. The authors release InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source agents show the best model below 50% overall accuracy, attributing this to difficulties in visual evidence seeking, search control, and multimodal integration.

Significance. If the benchmark examples are shown to validly require repeated interleaving without construction artifacts, the work would usefully document current limitations in multimodal agents for dynamic text-vision search trajectories. The release of data and evaluation code supports reproducibility. The multi-branch samples extend prior benchmarks that treat vision as static input or final output.

major comments (2)

[§3] §3 (Benchmark Construction): The automated pipelines for Level 1 and Level 2, and the machine-led human-supervised pipeline for Level 3, lack reported quantitative validation (e.g., human verification rates, inter-annotator agreement, or checks for shortcuttable evidence orderings). This is load-bearing for the central claim, as the <50% accuracy result and identified challenges in visual evidence seeking rest on the assumption that the 2,061 examples genuinely demand interleaved agentic control rather than permitting non-interleaved solutions.
[§5] §5 (Experiments and Results): The multi-branch comparison samples are presented as a distinguishing feature, yet no ablation or error analysis demonstrates that these cannot be solved by non-interleaved strategies. The overall accuracy tables do not break down performance by level or sample type in a way that isolates the claimed difficulties in multimodal evidence integration.

minor comments (2)

The abstract and introduction should explicitly contrast InterLV-Search with the specific prior benchmarks mentioned (multimodal search and visual browsing) to clarify the precise gap filled by interleaved trajectories.
Figure captions for example trajectories would benefit from additional annotations indicating the exact points of text-vision interleaving to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on InterLV-Search. The comments highlight important areas for strengthening the validation of benchmark construction and the analysis of results. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The automated pipelines for Level 1 and Level 2, and the machine-led human-supervised pipeline for Level 3, lack reported quantitative validation (e.g., human verification rates, inter-annotator agreement, or checks for shortcuttable evidence orderings). This is load-bearing for the central claim, as the <50% accuracy result and identified challenges in visual evidence seeking rest on the assumption that the 2,061 examples genuinely demand interleaved agentic control rather than permitting non-interleaved solutions.

Authors: We agree that explicit quantitative validation is necessary to substantiate that the examples require interleaved control. The original manuscript describes the pipeline designs but does not report verification metrics. In the revised version we will add a dedicated validation subsection to §3 that includes: (i) human verification rates on random samples of 300 examples per level confirming that >92% of Level 1 instances necessitate active visual evidence seeking and cannot be solved by text-only search; (ii) inter-annotator agreement (Cohen’s κ = 0.81) for the human-supervised steps in Level 3; and (iii) an analysis of evidence ordering showing that permuting the required visual-textual sequence drops solvable cases by more than 40%. These additions will directly address the concern that non-interleaved solutions might suffice. revision: yes
Referee: [§5] §5 (Experiments and Results): The multi-branch comparison samples are presented as a distinguishing feature, yet no ablation or error analysis demonstrates that these cannot be solved by non-interleaved strategies. The overall accuracy tables do not break down performance by level or sample type in a way that isolates the claimed difficulties in multimodal evidence integration.

Authors: We accept that the current results section does not isolate the contribution of multi-branch samples or provide breakdowns by level. We will revise §5 to include: (i) an ablation comparing interleaved versus non-interleaved (single-pass retrieval) agents on the multi-branch subset, showing a 25–35% absolute accuracy gap; (ii) a detailed error analysis categorizing failures by visual-seeking, search-control, and integration errors; and (iii) expanded tables reporting accuracy separately for Levels 1–3 and for standard versus multi-branch samples. These changes will better isolate the multimodal integration challenges and strengthen the claim that multi-branch samples are a distinguishing feature. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

full rationale

The paper introduces InterLV-Search as a new benchmark with 2,061 examples across three levels, constructed via automated pipelines for Levels 1-2 and a machine-led human-supervised pipeline for Level 3, then evaluates existing proprietary and open-source multimodal agents on it. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The central empirical claim (best model below 50% accuracy) rests on direct model performance measurements rather than any self-referential quantities, self-citations for uniqueness, or ansatzes smuggled via prior work. The construction pipelines are described as methodological choices but are not claimed to derive results from themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions in AI benchmarking that the constructed tasks measure the intended capabilities of visual evidence seeking and multimodal integration, with no free parameters or invented entities introduced.

axioms (1)

domain assumption Automated and machine-led pipelines can generate representative examples of interleaved multimodal search without significant artifacts
Invoked in the description of Level 1, 2, and 3 construction.

pith-pipeline@v0.9.0 · 5536 in / 1227 out tokens · 40954 ms · 2026-05-11T02:47:48.484172+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

run short Python code, 4

work page
[2]

Rules: - Be concise and tool-oriented

summarize gathered text, or 7) finish with a final answer. Rules: - Be concise and tool-oriented. - Prefer one useful action at a time. - When you need a tool, emit one of the supported tags. - When enough evidence has been collected, emit<done>final answer</done>. - Before each search action, briefly reflect in 1-3 sentences: what you already know, what ...

work page
[3]

skill":

Search/query: <query>{"skill": "web_search", "query": "...", "num": 5}</query> <query>{"skill": "image_search", "query": "...", "num": 5}</query> <query>{"skill": "lens_search", "image_url": "..."}</query>

work page
[4]

fetch_webpage_text

Explicit tool: <tool name="fetch_webpage_text">{"url": "https://..."}</tool> <tool name="browse_web_page">{"url": "https://..."}</tool> <tool name="summarize_text">{"text": "..."}</tool>

work page
[5]

Python execution: <code> print(...) </code>

work page
[6]

image":

Image crop: <clip>{"image": "https://...", "bbox": [x1, y1, x2, y2]}</clip>bbox uses normalized coordinates from 0 to 1000

work page
[7]

- If answering, only emit one<done>...</done>block

Final answer: <done>...</done> Output discipline: - If acting, only emit the action block(s). - If answering, only emit one<done>...</done>block. Input Prompt User question: {query} Available skills: {skill_descriptions} 5 [If running memory is not empty, include the following block:] Running memory (accumulated knowledge from previous searches): {running...

work page
[8]

No explanations, no markdown fences, no commentary

Output ONLY the memory. No explanations, no markdown fences, no commentary

work page
[9]

Never exceed 15 lines

Keep it SHORT: 5-12 lines. Never exceed 15 lines

work page
[10]

Only include facts and candidates backed by actual evidence from observations

work page
[11]

Drop anything speculative, unsupported, or no longer relevant

work page
[12]

Do NOT plan next actions — the main agent handles its own planning

work page
[13]

Compress ruthlessly: if 3 searches said the same thing, summarize in 1 sentence

work page
[14]

X is Y",

Be STATE-CENTRIC, not search-centric. Write what IS known, not what was searched. OUTPUT FORMAT: goal:<one line: what we need to find> status:<one sentence: current progress> blocking_gap: <one line: exactly where the reasoning chain is stuck and what specific piece of information is missing to move forward> confirmed_facts: -<ONLY positively confirmed fa...

work page arXiv 1970

[1] [1]

run short Python code, 4

work page

[2] [2]

Rules: - Be concise and tool-oriented

summarize gathered text, or 7) finish with a final answer. Rules: - Be concise and tool-oriented. - Prefer one useful action at a time. - When you need a tool, emit one of the supported tags. - When enough evidence has been collected, emit<done>final answer</done>. - Before each search action, briefly reflect in 1-3 sentences: what you already know, what ...

work page

[3] [3]

skill":

Search/query: <query>{"skill": "web_search", "query": "...", "num": 5}</query> <query>{"skill": "image_search", "query": "...", "num": 5}</query> <query>{"skill": "lens_search", "image_url": "..."}</query>

work page

[4] [4]

fetch_webpage_text

Explicit tool: <tool name="fetch_webpage_text">{"url": "https://..."}</tool> <tool name="browse_web_page">{"url": "https://..."}</tool> <tool name="summarize_text">{"text": "..."}</tool>

work page

[5] [5]

Python execution: <code> print(...) </code>

work page

[6] [6]

image":

Image crop: <clip>{"image": "https://...", "bbox": [x1, y1, x2, y2]}</clip>bbox uses normalized coordinates from 0 to 1000

work page

[7] [7]

- If answering, only emit one<done>...</done>block

Final answer: <done>...</done> Output discipline: - If acting, only emit the action block(s). - If answering, only emit one<done>...</done>block. Input Prompt User question: {query} Available skills: {skill_descriptions} 5 [If running memory is not empty, include the following block:] Running memory (accumulated knowledge from previous searches): {running...

work page

[8] [8]

No explanations, no markdown fences, no commentary

Output ONLY the memory. No explanations, no markdown fences, no commentary

work page

[9] [9]

Never exceed 15 lines

Keep it SHORT: 5-12 lines. Never exceed 15 lines

work page

[10] [10]

Only include facts and candidates backed by actual evidence from observations

work page

[11] [11]

Drop anything speculative, unsupported, or no longer relevant

work page

[12] [12]

Do NOT plan next actions — the main agent handles its own planning

work page

[13] [13]

Compress ruthlessly: if 3 searches said the same thing, summarize in 1 sentence

work page

[14] [14]

X is Y",

Be STATE-CENTRIC, not search-centric. Write what IS known, not what was searched. OUTPUT FORMAT: goal:<one line: what we need to find> status:<one sentence: current progress> blocking_gap: <one line: exactly where the reasoning chain is stuck and what specific piece of information is missing to move forward> confirmed_facts: -<ONLY positively confirmed fa...

work page arXiv 1970