Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
Pith reviewed 2026-05-25 04:16 UTC · model grok-4.3
The pith
An LLM decomposes a video query into multiple tool calls whose rankings are merged with boolean operators to select keyframes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ToolMerge lets an LLM planner turn an arbitrary query into a set of tool calls and a boolean merging plan; the per-tool rankings are then combined according to that plan to produce a final keyframe list. On the Molmo-2 Moments benchmark, where every question is tied to a specific time interval, this decomposition-plus-merge strategy matches or exceeds prior keyframe selectors across QA, question retrieval, and caption retrieval, with the largest recorded lift appearing on caption retrieval.
What carries the argument
ToolMerge: LLM-driven decomposition of queries into tool calls followed by boolean-operator merging of per-tool rankings.
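One way to picture the merge step is the minimal Python sketch below. It is an illustration under stated assumptions, not the paper's implementation: the summary above says only that per-tool rankings are combined with boolean operators, so the fuzzy-logic reading used here (min for AND, max for OR, complement for NOT, applied to min-max normalized scores) and every name in the block (`normalize`, `merge`, `top_k_frames`, the nested-tuple plan format, the tool names `tool_a` and `tool_b`) are hypothetical.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical plan format: a tool name (leaf) or ("AND" | "OR" | "NOT", *children).
Plan = Union[str, Tuple]


def normalize(scores: List[float]) -> List[float]:
    """Min-max scale one tool's per-frame scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant scores carry no ranking signal
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def merge(plan: Plan, tool_scores: Dict[str, List[float]]) -> List[float]:
    """Evaluate a boolean plan frame by frame (assumed fuzzy semantics)."""
    if isinstance(plan, str):
        return normalize(tool_scores[plan])
    op, *children = plan
    parts = [merge(child, tool_scores) for child in children]
    if op == "AND":
        return [min(vals) for vals in zip(*parts)]
    if op == "OR":
        return [max(vals) for vals in zip(*parts)]
    if op == "NOT":
        if len(parts) != 1:
            raise ValueError("NOT takes exactly one operand")
        return [1.0 - v for v in parts[0]]
    raise ValueError(f"unknown operator: {op}")


def top_k_frames(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy scores for five frames from two hypothetical tools.
tool_scores = {
    "tool_a": [0.1, 0.9, 0.8, 0.2, 0.4],
    "tool_b": [0.7, 0.3, 0.9, 0.1, 0.5],
}
merged = merge(("AND", "tool_a", "tool_b"), tool_scores)
keyframes = top_k_frames(merged, k=2)  # frames 2 and 4
```

With these toy scores the AND plan keeps frames 2 and 4, where both tools give comparatively high scores; a plan of `("OR", "tool_a", "tool_b")` would also admit frames that only one tool favors.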
If this is right
- Caption retrieval performance rises relative to methods that score every frame against one query or use a fixed decomposition schema.
- The same planner can be reused across QA, question retrieval, and caption tasks without task-specific retraining.
- Boolean merging lets the system express logical combinations such as 'frames that satisfy both tool A and tool B' that single-tool approaches cannot represent.
- Direct evaluation becomes possible once questions are anchored to explicit time intervals rather than whole videos.
Where Pith is reading between the lines
- The boolean merging step may transfer to other retrieval settings where multiple weak signals must be combined without learning a joint scorer.
- If the planner's decomposition quality varies with query complexity, performance gaps could widen on queries that require many tools or nested logic.
- Replacing the current visual tools with newer or domain-specific ones would test whether the gain comes mainly from the decomposition logic or from the tools themselves.
Load-bearing premise
The LLM planner can consistently turn any query into suitable tool calls and into merging rules that improve retrieval over simpler baselines.
What would settle it
Run the same queries on a held-out video set and measure whether the boolean-merged ranking still beats the single-tool baseline on recall at the ground-truth time interval.
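The check described above could be scored along the lines of the sketch below. It assumes one plausible definition of recall at the ground-truth interval, namely the fraction of queries for which at least one of the top-k retrieved frames has a timestamp inside the annotated interval. The paper's exact metric may differ, and `interval_hit_rate` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def interval_hit_rate(
    retrieved: Sequence[Sequence[float]],
    intervals: Sequence[Tuple[float, float]],
    k: int,
) -> float:
    """Fraction of queries with a top-k frame inside the ground-truth interval.

    retrieved[i] holds frame timestamps (seconds) for query i, best first.
    intervals[i] is the (start, end) ground-truth interval for query i.
    """
    if len(retrieved) != len(intervals):
        raise ValueError("one interval is required per query")
    if not retrieved:
        return 0.0
    hits = 0
    for frames, (start, end) in zip(retrieved, intervals):
        # A query counts as a hit if any of its top-k frames lands in the interval.
        if any(start <= t <= end for t in frames[:k]):
            hits += 1
    return hits / len(retrieved)


# Toy comparison on three queries (timestamps in seconds, made up for illustration).
intervals = [(10.0, 20.0), (40.0, 55.0), (5.0, 8.0)]
merged_ranking = [[12.0, 90.0], [70.0, 41.5], [30.0, 6.0]]
single_tool_ranking = [[12.0, 90.0], [70.0, 80.0], [30.0, 31.0]]

merged_score = interval_hit_rate(merged_ranking, intervals, k=2)
single_score = interval_hit_rate(single_tool_ranking, intervals, k=2)
```

Running both rankings through the same function on a held-out set keeps the comparison symmetric: only the ranking changes, not the interval annotations or the cutoff `k`.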
Original abstract
Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, where it outperforms other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ToolMerge, a keyframe retrieval method for long-video QA that uses an LLM-based planner to decompose queries into tool calls and specify boolean operators for merging per-tool rankings. It introduces the Molmo-2 Moments (M2M) benchmark, where questions are anchored to specific time intervals by construction, and reports competitive performance against prior keyframe selectors across QA, question retrieval, and caption retrieval tasks, with a 5% margin over other methods on caption retrieval.
Significance. If the empirical gains hold under scrutiny, ToolMerge could advance adaptive keyframe selection by combining multiple visual tools through LLM-driven decomposition and merging, offering greater flexibility than fixed-schema or single-tool baselines for diverse queries in long-video understanding. The public release of code and data at the provided GitHub link is a clear strength supporting reproducibility.
major comments (2)
- [Abstract] Abstract: the headline claim of a 5% improvement on caption retrieval (and competitiveness on other tasks) is presented without any experimental details, error analysis, ablation studies, or baseline descriptions, preventing assessment of whether gains derive from the LLM planner, the boolean merging step, or the underlying tools.
- [Abstract] Abstract: no human evaluation of decomposition quality, no ablation isolating the boolean-merge contribution, and no tests on out-of-distribution or ambiguous queries are described, so the central assumption that the LLM planner reliably chooses effective tool calls and merge operators cannot be validated from the reported aggregate metrics alone.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We agree that additional context would aid assessment of the results and will revise the manuscript to address this. Our point-by-point responses to the major comments follow.
- Referee: [Abstract] Abstract: the headline claim of a 5% improvement on caption retrieval (and competitiveness on other tasks) is presented without any experimental details, error analysis, ablation studies, or baseline descriptions, preventing assessment of whether gains derive from the LLM planner, the boolean merging step, or the underlying tools.
  Authors: We agree that the abstract presents the headline result in a concise form without accompanying experimental details or baseline descriptions. Abstracts are subject to strict length limits, so such information is typically reserved for the main text. The manuscript describes the experimental tasks (QA, question retrieval, and caption retrieval), the M2M benchmark construction, and reports the competitive results including the 5% gain on caption retrieval. We will revise the abstract to include a brief additional sentence referencing the main comparison methods and the evaluated tasks.
  Revision: yes
- Referee: [Abstract] Abstract: no human evaluation of decomposition quality, no ablation isolating the boolean-merge contribution, and no tests on out-of-distribution or ambiguous queries are described, so the central assumption that the LLM planner reliably chooses effective tool calls and merge operators cannot be validated from the reported aggregate metrics alone.
  Authors: The manuscript does not include human evaluation of decomposition quality, ablations that isolate the boolean-merge contribution, or explicit tests on out-of-distribution or ambiguous queries. Validation is provided through aggregate performance on the M2M benchmark (where questions are anchored to time intervals by construction) and the other retrieval tasks. We will add a limitations paragraph to the revised manuscript that discusses reliance on these aggregate metrics and the scope of the claims regarding the LLM planner.
  Revision: partial
Circularity Check
No circularity: empirical method with no derivations or self-referential fits
The paper introduces ToolMerge as an LLM-based decomposition and merging approach for keyframe retrieval, evaluated empirically on the constructed M2M benchmark across QA, question retrieval, and caption retrieval tasks. No equations, fitted parameters presented as predictions, uniqueness theorems, or self-citations appear in the provided text. Claims rest on direct performance comparisons rather than any reduction of outputs to inputs by construction. The method is self-contained against external benchmarks with no load-bearing self-referential steps.