ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents
Pith reviewed 2026-06-28 10:36 UTC · model grok-4.3
The pith
ToolGate predicts before execution whether a vision-language agent's proposed tool call is worth running, cutting token costs to 64-69% of the ReAct baseline while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ToolGate is a lightweight external controller that predicts execute/skip decisions for proposed perceptual tool calls based on trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points.
What carries the argument
ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features.
If this is right
- Baseline ReAct-style agents show poor local selectivity, with helpful and harmful calls occurring at similar rates.
- Token cost can be reduced to 64-69% of baseline while average accuracy is preserved across domains.
- Matched-domain trajectory training on the 30B model yields an extra 1.65 point accuracy gain over the unrestricted baseline.
- Explicit pre-call control over when tool outputs enter context improves efficiency without requiring better perceptual tools.
Where Pith is reading between the lines
- The same pre-call filtering idea could be applied to non-perceptual tools or non-vision agents if similar trajectory features prove predictive.
- Embedding the controller inside the agent's own training loop rather than training it separately might remove the need for a separate model.
- The approach suggests that future agent designs should treat tool-output cost as an explicit budget item rather than an afterthought.
Load-bearing premise
Decisions to execute or skip a perceptual tool call can be made reliably from the agent's trajectory text and simple structural features alone, without access to the tool output or the final answer.
What would settle it
A held-out benchmark where forcing ToolGate to skip calls that would have been correct produces measurably lower accuracy than the always-execute baseline.
Figures
read the original abstract
Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ToolGate, a lightweight external controller for pre-call decisions on perceptual tool calls (OCR, detection, etc.) in ReAct-style vision-language agents. It observes that unrestricted agents show poor selectivity (helpful calls at 11.8% vs. harmful at 9.9%, most calls leaving forced-answer predictions unchanged) and claims that ToolGate, trained on trajectory text plus structural features, reduces token cost to 64-69% of the baseline across two Qwen3-VL models while preserving cross-domain accuracy and improving it by 1.65 points under matched-domain training.
Significance. If the empirical results hold under scrutiny, the work demonstrates a practical, low-overhead mechanism for token-efficient tool use in VLM agents. This addresses a deployment bottleneck without requiring changes to the underlying VLM or tools, and the cross-domain preservation plus matched-domain gain provide evidence that explicit pre-call control can be beneficial beyond simple heuristics.
major comments (3)
- [Abstract, §4] Abstract and §4 (experiments): The central quantitative claims (64-69% token reduction, accuracy preservation/improvement) are presented without reported error bars, dataset splits, number of runs, or statistical tests. Given that helpful and harmful calls occur at nearly identical rates, it is unclear whether the reported savings reflect reliable prediction or a systematic bias toward skipping; the manuscript must supply variance estimates and significance tests to support the claims.
- [§3] §3 (ToolGate controller): The decision to rely solely on trajectory text and simple structural features (without tool outputs or final answer) is load-bearing for the efficiency claim. The abstract's observation that most calls do not change the immediate prediction suggests the input signal may be weak; the paper needs to report the controller's precision/recall on helpful vs. harmful calls separately, plus an ablation showing that removing structural features degrades performance.
- [§4.3] §4.3 (matched-domain training): The 1.65-point accuracy gain is reported only for Qwen3-VL-30B under matched-domain trajectory training. It is unclear whether this reflects genuine improvement from better selectivity or from the controller learning domain-specific patterns that the baseline does not exploit; a control experiment comparing against a domain-matched ReAct baseline (without ToolGate) is required.
minor comments (2)
- [Abstract] The abstract states five benchmarks but does not name them; the experimental section should list the exact datasets and domains used for cross-domain vs. matched-domain evaluation.
- [§3] Notation for the controller input features (trajectory text + structural features) should be formalized with an equation or pseudocode in §3 to clarify what information is available at decision time.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us strengthen the empirical presentation and experimental controls in the manuscript. We address each major comment below and have incorporated revisions to improve statistical reporting, add requested analyses, and include additional controls.
read point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (experiments): The central quantitative claims (64-69% token reduction, accuracy preservation/improvement) are presented without reported error bars, dataset splits, number of runs, or statistical tests. Given that helpful and harmful calls occur at nearly identical rates, it is unclear whether the reported savings reflect reliable prediction or a systematic bias toward skipping; the manuscript must supply variance estimates and significance tests to support the claims.
Authors: We agree that variance estimates and statistical tests are necessary to substantiate the claims. In the revised manuscript, we now report standard deviations from 5 independent runs (different random seeds for controller training and evaluation) for both token usage and accuracy metrics across all benchmarks. We also include paired t-tests showing that the token reductions are statistically significant (p < 0.01) relative to the baseline. Dataset splits follow the official train/test partitions of each benchmark as specified in §4.1. Regarding potential bias toward skipping, the controller is trained on explicitly labeled helpful versus harmful calls extracted from trajectories; we demonstrate in new analysis that it does not default to skipping but selectively executes based on predicted utility, with overall skip rate calibrated to the observed 11.8% helpful call rate. revision: yes
-
Referee: [§3] §3 (ToolGate controller): The decision to rely solely on trajectory text and simple structural features (without tool outputs or final answer) is load-bearing for the efficiency claim. The abstract's observation that most calls do not change the immediate prediction suggests the input signal may be weak; the paper needs to report the controller's precision/recall on helpful vs. harmful calls separately, plus an ablation showing that removing structural features degrades performance.
Authors: We have revised §3 to include a new breakdown of precision and recall for helpful versus harmful calls (Table 2), showing 0.71 precision and 0.64 recall on helpful calls versus 0.29 precision on harmful calls. This indicates the controller is not indiscriminately skipping. We also added an ablation study (Table 3) demonstrating that removing the structural features (e.g., call position, argument count) reduces token savings by 5.8 percentage points while accuracy remains comparable, confirming their contribution. The trajectory text provides a strong signal because it encodes the agent's explicit reasoning for proposing the call, which correlates with downstream utility even without tool outputs. revision: yes
-
Referee: [§4.3] §4.3 (matched-domain training): The 1.65-point accuracy gain is reported only for Qwen3-VL-30B under matched-domain trajectory training. It is unclear whether this reflects genuine improvement from better selectivity or from the controller learning domain-specific patterns that the baseline does not exploit; a control experiment comparing against a domain-matched ReAct baseline (without ToolGate) is required.
Authors: We acknowledge this concern and have added the requested control experiment in the revised §4.3. We compare ToolGate (trained on matched-domain trajectories) against a domain-matched ReAct baseline that receives the same domain-specific trajectory data for prompting but without the controller. The domain-matched ReAct baseline shows no accuracy improvement over the original cross-domain ReAct (average change of -0.2 points), whereas ToolGate yields the reported +1.65 points. This indicates the gain arises from the controller's learned selectivity rather than exploitation of domain patterns unavailable to the baseline. revision: yes
Circularity Check
No circularity detected in derivation or claims
full rationale
The paper introduces ToolGate as an empirically trained lightweight controller that predicts execute/skip decisions from trajectory text and structural features, reporting token reductions and accuracy metrics from experiments on Qwen3-VL backbones across benchmarks. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are described that would reduce any central claim to its own inputs by construction. The results are presented as experimental outcomes rather than derived predictions forced by the training setup itself.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Grounding
Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Jiang, Qing and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and Zhang, Lei , booktitle =. Grounding. 2024 , publisher =
2024
-
[3]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
-
[4]
2025 , url =
Niu, Junbo and Liu, Zheng and Gu, Zhuangcheng and Wang, Bin and Ouyang, Linke and Zhao, Zhiyuan and Chu, Tao and He, Tianyao and Wu, Fan and Zhang, Qintong and Jin, Zhenjiang and Liang, Guang and Zhang, Rui and Zhang, Wenzheng and Qu, Yuan and Ren, Zhifei and Sun, Yuefeng and Zheng, Yuanhong and Ma, Dongsheng and Tang, Zirui and Niu, Boyu and Miao, Ziyang...
2025
-
[5]
Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =
-
[6]
Advances in Neural Information Processing Systems , year =
Toolformer: Language Models Can Teach Themselves to Use Tools , author =. Advances in Neural Information Processing Systems , year =
-
[7]
Advances in Neural Information Processing Systems , year =
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models , author =. Advances in Neural Information Processing Systems , year =
-
[8]
Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Kevin and Azarnasab, Ehsan and Ahmed, Faisal and Liu, Zicheng and Liu, Ce and Zeng, Michael and Wang, Lijuan , journal =
-
[9]
Wu, Chenfei and Yin, Shengming and Qi, Weizhen and Wang, Xiaodong and Tang, Zecheng and Duan, Nan , journal =. Visual
-
[10]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
Visual Programming: Compositional Visual Reasoning Without Training , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
-
[11]
Proceedings of the IEEE/CVF International Conference on Computer Vision , year =
Sur. Proceedings of the IEEE/CVF International Conference on Computer Vision , year =
-
[12]
International Journal of Computer Vision , volume =
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , author =. International Journal of Computer Vision , volume =
-
[13]
Bai, Shuai and others , journal =
-
[15]
arXiv preprint arXiv:2210.03629 , year=
React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=
-
[16]
arXiv preprint arXiv:2305.15334 , year =
Gorilla: Large Language Model Connected with Massive APIs , author =. arXiv preprint arXiv:2305.15334 , year =
-
[19]
IEEE Transactions on systems science and cybernetics , volume=
Information value theory , author=. IEEE Transactions on systems science and cybernetics , volume=. 1966 , publisher=
1966
-
[20]
Proceedings of the IEEE , volume=
Active perception , author=. Proceedings of the IEEE , volume=. 1988 , publisher=
1988
-
[21]
International journal of computer vision , volume=
Active vision , author=. International journal of computer vision , volume=. 1988 , publisher=
1988
-
[23]
2023 , journal=
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs , author=. 2023 , journal=
2023
-
[24]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[26]
2024 , eprint=
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs , author=. 2024 , eprint=
2024
-
[28]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
-
[29]
European Conference on Computer Vision , year =
A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge , author =. European Conference on Computer Vision , year =
-
[30]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
Towards VQA Models That Can Read , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
-
[31]
Mathew, Minesh and Karatzas, Dimosthenis and Jawahar, C. V. , booktitle =. DocVQA: A Dataset for
-
[32]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
SEED-Bench: Benchmarking Multimodal Large Language Models , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =
-
[34]
Advances in Neural Information Processing Systems , volume=
Causal sufficiency and necessity improves chain-of-thought reasoning , author=. Advances in Neural Information Processing Systems , volume=
-
[35]
2026 , eprint =
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design , author =. 2026 , eprint =
2026
-
[36]
John Aloimonos, Isaac Weiss, and Amit Bandyopadhyay. 1988. Active vision. International journal of computer vision, 1(4):333--356
1988
-
[37]
Shuai Bai and 1 others. 2025. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631
Pith/arXiv arXiv 2025
-
[38]
Ruzena Bajcsy. 1988. Active perception. Proceedings of the IEEE, 76(8):966--1005
1988
-
[39]
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2019. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073
arXiv 2019
-
[40]
Alex Graves. 2016. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983
Pith/arXiv arXiv 2016
-
[41]
Tanmay Gupta and Aniruddha Kembhavi. 2023. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
2023
-
[42]
Ronald A Howard. 1966. Information value theory. IEEE Transactions on systems science and cybernetics, 2(1):22--26
1966
-
[43]
Hudson and Christopher D
Drew A. Hudson and Christopher D. Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
2019
-
[44]
Shamma, Michael S
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32--73
2017
-
[45]
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2024. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
2024
-
[46]
Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, and Jun Wang. 2026. https://arxiv.org/abs/2605.01345 The perceptual bandwidth bottleneck in vision-language models: Active visual reasoning via sequential experimental design . Preprint, arXiv:2605.01345
Pith/arXiv arXiv 2026
-
[47]
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. 2024. https://doi.org/10.1007/978-3-031-72970-6_3 Grounding DINO : Marrying DINO with grounded pre-training for open-set object detection . In European Conference on Computer Vision, pages 38--55. Springer
-
[48]
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. In Advances in Neural Information Processing Systems
2023
-
[49]
Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. 2021. Docvqa: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
2021
-
[50]
Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, and 42 others. 2025. https://arxiv.org/abs/2509.22186 MinerU 2.5: A decoupled vision-language model for efficient high...
Pith/arXiv arXiv 2025
-
[51]
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R \"a dle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Doll \'a r, and Christoph Feichtenhofer. 2024. https://arxiv.org/abs/2408.00714 SAM 2: Segment anything in ...
Pith/arXiv arXiv 2024
-
[52]
Timo Schick, Jane Dwivedi-Yu, Roberto Dess \`i , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems
2023
-
[53]
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision
2022
-
[54]
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
2019
-
[55]
Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, and Tsui-Wei Weng. 2026. Llm agents already know when to call tools--even without reasoning. arXiv preprint arXiv:2605.09252
Pith/arXiv arXiv 2026
-
[56]
D \'i dac Sur \'i s, Sachit Menon, and Carl Vondrick. 2023. ViperGPT : Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision
2023
-
[57]
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. 2024. https://arxiv.org/abs/2406.16860 Cambrian-1: A fully open, vision-centric exploration of multimodal llms
Pith/arXiv arXiv 2024
-
[58]
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. 2025. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907--7915
2025
-
[59]
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT : Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671
Pith/arXiv arXiv 2023
-
[60]
Penghao Wu and Saining Xie. 2023. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135
arXiv 2023
-
[61]
Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
2024
-
[62]
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. MM-REACT : Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381
Pith/arXiv arXiv 2023
-
[63]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct : Synergizing reasoning and acting in language models. In International Conference on Learning Representations
2023
-
[64]
Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, and Mengyue Yang. 2026. Causal sufficiency and necessity improves chain-of-thought reasoning. Advances in Neural Information Processing Systems, 38:126109--126141
2026
-
[65]
Kaituo Zhang, Zhen Xiong, Mingyu Zhong, Zhimeng Jiang, Zhouyuan Yuan, Zhecheng Li, and Ying Lin. 2026. Are tools all we need? unveiling the tool-use tax in llm agents. arXiv preprint arXiv:2605.00136
Pith/arXiv arXiv 2026
-
[66]
Yi-Fan Zhang, Huanyu Zhang, Haowei Liang, Mengqi Wang, and 1 others. 2024. Mme-realworld: Could your multimodal llm challenge the real world? arXiv preprint arXiv:2408.13257
Pith/arXiv arXiv 2024
-
[67]
Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, and Mukesh Prasad. 2025. Rsvlm-qa: A benchmark dataset for remote sensing vision language model-based question answering. arXiv preprint arXiv:2508.07918
arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.