Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Derek Hoiem; Michal Shlapentokh-Rothman; Prachi Garg; Yu-Xiong Wang

Keyframe retrieval for long-video QA works better when an LLM turns each query into tool calls and boolean merges of their rankings.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.5

2026-07-12 16:07 UTC pith:NGF6LUR4


arxiv 2605.23826 v2 pith:NGF6LUR4 submitted 2026-05-22 cs.CV cs.CL


Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem

classification cs.CV cs.CL

keywords: keyframe retrieval · long-video QA · tool use · LLM planner · boolean ranking merge · video grounding · Molmo-2 Moments

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-video question answering needs the right frames as visual evidence, but queries differ in what they ask for, so a single similarity score or a fixed decomposition often misses the mark. ToolMerge has an LLM planner break each query into calls to specialized visual tools and state how to combine those tools’ rankings with boolean operators such as AND and OR. The method is tested on a new benchmark, Molmo-2 Moments, where every question is tied by construction to a known time interval, so retrieval quality can be measured directly. Across question answering, question retrieval, and caption retrieval, ToolMerge matches or beats prior keyframe selectors and gains about five percent on caption retrieval. The practical claim is that flexible, query-specific tool plans plus explicit merge logic give more reliable keyframes than scoring every frame against one query or one fixed schema.

Core claim

The paper shows that decomposing a long-video query into tool calls whose per-tool rankings are merged by boolean operators produces keyframe rankings that are competitive with prior selectors and about five percent better on caption retrieval, while a new interval-anchored benchmark (M2M) makes that retrieval quality measurable without relying only on downstream QA accuracy.

What carries the argument

ToolMerge: an LLM planner that emits a set of visual-tool calls together with a boolean expression for merging their rankings into a single keyframe order.
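
A minimal sketch of what a boolean merge over per-tool rankings could look like. The only merge detail visible on this page is the Figure 2 caption fragment ("Each frame has a rank per tool call. AND operators choose the worst"). Treating OR as keeping the best rank is an assumption made here by symmetry. The nested-tuple plan encoding, the tool names, and the functions `merge_ranks` and `order_frames` are hypothetical and are not the authors' API.

```python
from typing import Dict, List, Tuple, Union

# A plan is either a tool-call name (leaf) or ("AND" | "OR", left, right).
# This encoding is hypothetical; the planner's real output format is not shown here.
Plan = Union[str, Tuple[str, "Plan", "Plan"]]


def merge_ranks(plan: Plan, ranks: Dict[str, List[int]]) -> List[int]:
    """Return one merged rank per frame (lower is better)."""
    if isinstance(plan, str):
        return ranks[plan]
    op, left, right = plan
    a = merge_ranks(left, ranks)
    b = merge_ranks(right, ranks)
    if op == "AND":
        # A frame must do well under both calls, so keep its worst rank.
        return [max(x, y) for x, y in zip(a, b)]
    if op == "OR":
        # Assumption: doing well under either call suffices, so keep the best rank.
        return [min(x, y) for x, y in zip(a, b)]
    raise ValueError(f"unknown operator: {op}")


def order_frames(merged: List[int]) -> List[int]:
    """Frame indices sorted by merged rank; ties broken by frame index."""
    return sorted(range(len(merged)), key=lambda i: (merged[i], i))


# Toy per-tool ranks for four frames (rank 1 = best); all values are made up.
ranks = {
    "detect_dog": [1, 2, 3, 4],
    "detect_ball": [4, 1, 2, 3],
    "read_text": [3, 4, 1, 2],
}
plan = ("OR", ("AND", "detect_dog", "detect_ball"), "read_text")
merged = merge_ranks(plan, ranks)        # [3, 2, 1, 2]
keyframe_order = order_frames(merged)    # [2, 1, 3, 0]
```

Under these assumptions the merge is a max/min over ranks, which keeps the output on the same rank scale as the inputs. Other readings of "boolean merge", such as set intersection of top-k lists, would behave differently.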

Load-bearing premise

An off-the-shelf language model, given a fixed tool inventory, will produce tool calls and boolean merge expressions that correctly capture what diverse long-video queries actually need to see.
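
Part of this premise can be checked mechanically: whether each planner output is at least structurally well-formed. The sketch below assumes a hypothetical nested-tuple plan encoding and a made-up tool inventory; neither comes from the paper. It measures structural validity only. Whether a well-formed plan actually captures what the query needs to see would still require a human or model audit.

```python
from typing import Iterable, Set

ALLOWED_OPS = {"AND", "OR"}  # the operators named on this page


def plan_is_valid(plan: object, tools: Set[str]) -> bool:
    """True if every leaf is a known tool and every node is a binary AND/OR."""
    if isinstance(plan, str):
        return plan in tools
    if isinstance(plan, tuple) and len(plan) == 3 and plan[0] in ALLOWED_OPS:
        return plan_is_valid(plan[1], tools) and plan_is_valid(plan[2], tools)
    return False


def validity_rate(plans: Iterable[object], tools: Set[str]) -> float:
    """Fraction of planner outputs that are structurally well-formed."""
    plans = list(plans)
    if not plans:
        return 0.0
    return sum(plan_is_valid(p, tools) for p in plans) / len(plans)


# Hypothetical tool inventory and planner outputs, for illustration only.
tools = {"detect_object", "read_text", "match_caption"}
plans = [
    ("AND", "detect_object", "read_text"),
    ("OR", "match_caption", ("AND", "detect_object", "read_text")),
    ("AND", "detect_object", "track_person"),  # unknown tool
    ("XOR", "detect_object", "read_text"),     # unsupported operator
]
rate = validity_rate(plans, tools)  # 0.5
```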

What would settle it

On M2M or a similar interval-anchored set, replace the LLM planner with random or fixed tool plans and check whether the five-percent caption-retrieval gain and overall competitiveness disappear; if they do, planning quality was load-bearing.


If this is right

Keyframe evidence for long-video QA can be improved without training a new end-to-end scorer, by routing each query through existing visual tools and a boolean merge.
Caption-style and compositional queries benefit most from query-specific tool plans rather than a single global similarity score.
Benchmarks that anchor every question to a known time interval allow retrieval methods to be judged directly, not only through final QA accuracy.
Adding or swapping visual tools becomes a planning change rather than a full system redesign.
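
To make the interval-anchored point concrete, here is a hedged sketch of how retrieval might be scored once each question carries a known time interval. This page does not say which metrics M2M actually reports; hit@k and precision@k are common choices assumed here, and the function names and timestamps are illustrative.

```python
from typing import List, Tuple


def hit_at_k(ranked_times: List[float], interval: Tuple[float, float], k: int) -> bool:
    """True if any of the top-k retrieved timestamps lies inside the interval."""
    start, end = interval
    return any(start <= t <= end for t in ranked_times[:k])


def precision_at_k(ranked_times: List[float], interval: Tuple[float, float], k: int) -> float:
    """Fraction of the top-k retrieved timestamps that lie inside the interval."""
    start, end = interval
    top = ranked_times[:k]
    if not top:
        return 0.0
    return sum(start <= t <= end for t in top) / len(top)


# Retrieved frame timestamps in seconds, best first; values are made up.
ranked_times = [12.0, 95.5, 101.0, 300.0]
interval = (90.0, 110.0)  # hypothetical ground-truth interval for one question

hit1 = hit_at_k(ranked_times, interval, k=1)       # False
hit2 = hit_at_k(ranked_times, interval, k=2)       # True
p4 = precision_at_k(ranked_times, interval, k=4)   # 0.5
```

Neither metric involves a downstream QA model, which is the property the benchmark is meant to provide.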

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If planner quality is the main lever, cheaper or distilled planners could preserve most of the gain while cutting inference cost.
Boolean merge operators may transfer to other multi-tool multimodal retrieval settings beyond video keyframes.
Failure modes will likely cluster on queries whose visual requirements cannot be expressed as a short boolean combination of the available tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Clean systems idea—LLM-planned multi-tool keyframe retrieval with boolean rank merges—plus a retrieval-first benchmark; the 5% caption gain is the claim to verify, not a foundational leap.


The thing worth knowing is simple: ToolMerge treats keyframe selection as planned multi-tool search. An LLM breaks a query into tool calls and emits a boolean expression over the per-tool rankings, instead of scoring every frame against one query or forcing a fixed schema through one visual tool. The authors also ship Molmo-2 Moments (M2M), where every question is tied to a time interval by construction, so retrieval can be scored directly rather than only through downstream QA. That pairing is the real product.

What is new is not “decomposition” in the abstract—prior work already does single-query scoring or fixed-schema multi-part queries—but the combination of free-form LLM tool planning with explicit boolean merge of rankings, plus a benchmark built for retrieval rather than only answer accuracy. The abstract’s results are modest and honest: competitive on QA and question retrieval, about 5% better on caption retrieval. Code and data are promised on GitHub, which is the right move for a systems paper.

Soft spots are real but proportional. The load-bearing step is planner quality: if the LLM picks the wrong tools or a bad merge expression on compositional or ambiguous long-video queries, the “decomposition and merging” story collapses and you may just be riding stronger single-tool scorers. End-task metrics alone do not isolate that. Free parameters (planner model/prompt, tool inventory, cutoffs, number of frames) are the usual systems knobs; they are not hidden, but they mean the 5% edge needs ablations and planner validity rates before you trust the attribution. The manuscript text we have is badly corrupted, so I cannot audit tables or failure cases line by line—that is a process problem for us, not a verdict on the work.

This is for people building long-video QA pipelines who care about verifiable evidence frames, not for theory readers. It is within-subfield engineering progress with a useful eval artifact. I would send it to peer review: the claim is falsifiable, the benchmark is the right evaluation move, and the method is clear enough to referee. Engage if you work on video retrieval or multimodal evidence; skim the abstract and M2M construction if you only need the idea. Not a must-read for a general vision reading group unless the group is deep in long-video systems.

Referee Report

3 major / 4 minor

Summary. The paper proposes ToolMerge, a keyframe retrieval pipeline for long-video QA in which an LLM planner decomposes each natural-language query into a set of visual tool calls and emits a boolean expression that merges the resulting per-tool rankings. To support direct retrieval evaluation, the authors introduce Molmo-2 Moments (M2M), a benchmark whose questions are constructed to be anchored to specific temporal intervals. Empirically, ToolMerge is reported to be competitive with prior keyframe selectors on QA, question retrieval, and caption retrieval, with a roughly 5% gain on caption retrieval. Code and data are released.

Significance. If the reported gains are real and attributable to the planner-plus-boolean-merge design rather than to stronger single-tool scorers or hyperparameter choices, ToolMerge is a useful systems contribution: it replaces fixed schemas and single-query scoring with a more flexible, query-adaptive retrieval interface. The M2M benchmark is a concrete evaluation asset that addresses a genuine gap (retrieval metrics that are not confounded by downstream QA models). Public code and data further raise the work’s value for follow-on research. The contribution is empirical and systems-oriented rather than theoretical; its lasting impact depends on whether the decomposition–merge benefit can be isolated and reproduced.

major comments (3)

The central design claim is that LLM-driven decomposition into tool calls plus boolean ranking merge improves keyframe retrieval over single-query or fixed-schema baselines. End-task metrics (QA / question retrieval / caption retrieval) alone do not isolate this claim: there are no reported planner validity rates (fraction of tool calls and merge expressions that correctly capture the query’s visual requirements), no human or automatic audit of merge expressions, and no ablation that holds the tool inventory fixed while removing multi-tool boolean fusion (e.g., single best tool, score sum/max without boolean structure, or a fixed schema). Without those measurements, the ~5% caption-retrieval edge cannot be attributed to the proposed mechanism rather than to stronger scorers or prompt engineering. This is load-bearing for the paper’s title and abstract framing.
The free parameters of the system—LLM planner model and prompt, visual tool inventory and per-tool scorers, top-k / merge thresholds, and number of keyframes returned—are not subjected to a systematic sensitivity or leave-one-component-out analysis in the reported results. Because the method is defined by the interaction of these choices, competitiveness on M2M and related sets could be driven by a favorable tool set or cutoff rather than by decomposition-and-merging per se. A minimal set of ablations (planner model swap, tool-set ablation, merge-operator ablation, k-sweep) is needed to support the claim that the architecture, not a particular configuration, is responsible for the gains.
M2M is presented as the primary direct-retrieval benchmark, yet the manuscript does not fully specify construction details that determine its difficulty and leakage risk: how intervals are chosen relative to video length, how questions are generated or filtered to avoid trivial lexical matches, inter-annotator agreement on anchors, and whether any training or few-shot material for the planner overlaps M2M sources. Without these, it is hard to judge whether the reported ranking improvements generalize beyond the benchmark’s construction process.
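
The merge-operator ablation requested in the comments above could be set up along the following lines. This is a sketch under stated assumptions, not the authors' protocol. It holds the per-tool ranks fixed and swaps only the fusion rule. The three rules (single tool, mean-rank fusion without boolean structure, and an AND-style worst-rank merge) and all tool names and values are illustrative.

```python
from typing import Dict, List


def order_by(scores: List[float]) -> List[int]:
    """Frame indices sorted by ascending score; ties broken by frame index."""
    return sorted(range(len(scores)), key=lambda i: (scores[i], i))


def single_tool(ranks: Dict[str, List[int]], tool: str) -> List[int]:
    """Baseline: use one tool's ranking and ignore the others."""
    return order_by(ranks[tool])


def mean_rank(ranks: Dict[str, List[int]]) -> List[int]:
    """Baseline: average ranks across tools, with no boolean structure."""
    n = len(next(iter(ranks.values())))
    return order_by([sum(r[i] for r in ranks.values()) / len(ranks) for i in range(n)])


def boolean_and(ranks: Dict[str, List[int]]) -> List[int]:
    """AND-style merge: each frame keeps its worst rank across tools."""
    n = len(next(iter(ranks.values())))
    return order_by([max(r[i] for r in ranks.values()) for i in range(n)])


# Same tool outputs for every condition (rank 1 = best); values are made up.
ranks = {"tool_a": [1, 2, 3, 4], "tool_b": [4, 3, 1, 2]}

orders = {
    "single_tool_a": single_tool(ranks, "tool_a"),  # [0, 1, 2, 3]
    "mean_rank": mean_rank(ranks),                  # [2, 0, 1, 3]
    "boolean_and": boolean_and(ranks),              # [1, 2, 0, 3]
}
```

Because the three rules can put different frames first on identical inputs, scoring each ordering against the same ground truth would separate the effect of the merge rule from the effect of the tools themselves.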

minor comments (4)

The provided full-text rendering is heavily corrupted (garbled Unicode blocks), which obscures section numbering, table contents, and equation-level detail. A clean, machine-readable PDF/source is required for a complete line-by-line audit of results tables and method pseudocode.
Abstract and method sketch use “boolean operators” without a precise formal definition of how ranked lists are combined under AND/OR/NOT (e.g., rank fusion, score thresholding, set intersection of top-k). Clarifying the merge semantics would aid reproducibility.
Related-work positioning against fixed-schema multi-tool systems and pure LLM-as-retriever baselines could be tightened so that the novelty of the planned boolean merge is clearer.
Report variance or multiple seeds for the LLM planner where stochastic decoding is used; single-run point estimates make the 5% caption-retrieval margin hard to interpret.
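
As an illustration of the last comment, the sketch below compares a margin between two methods with their across-seed spread. The scores are placeholders, not numbers from the paper, and the "margin exceeds twice the larger standard deviation" rule is a crude heuristic assumed here rather than a formal significance test.

```python
import statistics
from typing import Dict, List


def margin_summary(method_runs: List[float], baseline_runs: List[float]) -> Dict[str, float]:
    """Mean margin between two methods plus each method's across-seed spread."""
    return {
        "margin": statistics.mean(method_runs) - statistics.mean(baseline_runs),
        "method_sd": statistics.stdev(method_runs),
        "baseline_sd": statistics.stdev(baseline_runs),
    }


# Placeholder scores for three seeds each; NOT results from the paper.
method_runs = [0.60, 0.62, 0.64]
baseline_runs = [0.56, 0.57, 0.58]
summary = margin_summary(method_runs, baseline_runs)

# Crude reading: the margin is more convincing when it clearly exceeds both spreads.
margin_exceeds_spread = summary["margin"] > 2 * max(
    summary["method_sd"], summary["baseline_sd"]
)
```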

Circularity Check

0 steps flagged

Empirical systems paper with no derivation that reduces predictions to fitted inputs or self-definitional claims.


ToolMerge is a systems method (LLM planner decomposes a query into tool calls; per-tool rankings are merged with boolean operators) evaluated on external retrieval and QA metrics, including the authors' M2M benchmark. The abstract and method frame results as competitive empirical performance (notably ~5% on caption retrieval), not as first-principles predictions or uniqueness theorems. M2M's design note that questions are 'anchored to a specific time interval by construction' defines evaluation ground truth; it does not make ToolMerge's rankings true by definition. There is no fitted parameter renamed as a prediction, no load-bearing uniqueness claim imported from overlapping authors, and no ansatz smuggled in via self-citation that forces the reported numbers. Design choices (tool inventory, merge operators, planner) are free parameters of a system, not circular reductions. Circularity burden is zero under the stated criteria.

Axiom & Free-Parameter Ledger

4 free parameters · 4 axioms · 2 invented entities

Load-bearing content is mostly engineering assumptions and design choices, not physical axioms. The claim rests on (1) an LLM planner being competent at tool decomposition and boolean merge planning, (2) a fixed inventory of visual tools whose rankings are meaningful, (3) boolean combination of rankings being a sufficient merge language, and (4) M2M's construction truly anchoring questions to intervals so retrieval metrics are valid. Free parameters are the usual systems knobs (model choice, top-k, tool set, prompts). No new physical entities are postulated.

free parameters (4)

LLM planner model and prompt
Choice of planner and prompt determines tool calls and merge operators; not derived, selected by authors.
Visual tool inventory and per-tool scoring models
Which tools exist and how each scores frames is a design choice that shapes all rankings.
Top-k / ranking cutoffs and merge thresholds
How many frames each tool returns and how boolean merges are applied to ranked lists are operational parameters.
Number of selected keyframes returned to the QA model
Downstream QA and retrieval metrics depend on how many frames are kept after merging.

axioms (4)

Domain assumption: An LLM can decompose natural-language video queries into a small set of tool calls plus boolean merge operators that reflect the query's visual logic.
Core planning assumption of ToolMerge; if false, decomposition does not help retrieval.
Domain assumption: Per-tool frame rankings are sufficiently calibrated that boolean combination of ranks (AND/OR-style) yields a useful joint ranking.
The merge step assumes ranks from heterogeneous tools are combinable without heavy reweighting.
Domain assumption: M2M questions are anchored to specific time intervals by construction, so interval-based retrieval metrics measure the intended evidence frames.
Validity of the new benchmark as a direct retrieval evaluation depends on this construction claim.
Domain assumption: Keyframe selection is a sufficient interface for providing verifiable visual evidence to long-video QA.
Stated motivation in the abstract; this framing excludes continuous temporal reasoning and multimodal non-frame evidence.

invented entities (2)

ToolMerge planner-merge pipeline (no independent evidence)
Purpose: Decompose queries into tool calls and merge per-tool rankings with boolean operators for keyframe retrieval.
Primary proposed method; evaluated empirically, not an external physical entity.
Molmo-2 Moments (M2M) benchmark (no independent evidence)
Purpose: Provide questions each tied to a known time interval so keyframe retrieval can be scored directly.
New evaluation resource constructed for this paper; usefulness depends on release quality and annotation fidelity.

pith-pipeline@v1.1.0-grok45 · 8767 in / 3042 out tokens · 32176 ms · 2026-07-12T16:07:21.316600+00:00 · methodology


Original abstract

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM) based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge.

Figures

Figures reproduced from arXiv: 2605.23826 by Derek Hoiem, Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang.

**Figure 2.** Merging Example. Each frame has a rank per tool call. AND operators choose the worst …

Review history (2 revisions)
