Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model

Bo Du; Fei Liao; Jian Wang; Juhua Liu; Kunfeng Chen; Luyao Zhuang

arxiv: 2604.07816 · v1 · submitted 2026-04-09 · 💻 cs.CL

Kunfeng Chen, Luyao Zhuang, Fei Liao, Juhua Liu, Jian Wang, Bo Du

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords tool retrieval · vague instructions · tool learning · instruction rewriting · bridge model · VGToolBench

The pith

A bridge model rewrites vague tool-use instructions into specific forms that standard retrievers can process effectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that real-world vague instructions for selecting tools damage the accuracy of tool retrieval systems trained on detailed benchmarks. It introduces VGToolBench to test this mismatch and proposes the Tool Retrieval Bridge approach, where a separate model rewrites vague inputs into clearer versions that better suit retriever preferences. Experiments demonstrate that this rewriting leads to consistent improvements across different retrieval methods, including doubling the performance of simple keyword-based retrievers.

Core claim

The central claim is that a bridge model can close the gap between ambiguous human instructions and the specific formats preferred by tool retrievers, resulting in substantial performance lifts such as BM25's NDCG rising from 9.73 to 19.59.

What carries the argument

The Tool Retrieval Bridge (TRB), a rewriting model that takes vague user instructions and produces more detailed versions aligned with retriever preferences.
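
As a rough illustration of the rewrite-then-retrieve idea, the sketch below places a stand-in bridge function in front of a small from-scratch BM25 scorer. Everything in it is assumed for illustration: the tool descriptions, the example instruction, the `bridge_rewrite` lookup table (a placeholder for the trained bridge model), and the BM25 parameters `k1` and `b`. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25:
    """Minimal BM25 scorer (Lucene-style IDF, which is always positive)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        dl = len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Indices of tools, best match first.
        return sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)

# Hypothetical tool descriptions (not taken from ToolBench or VGToolBench).
TOOLS = [
    "get_weather_forecast: returns the daily weather forecast for a city",
    "convert_currency: converts an amount between two currencies using exchange rates",
    "search_flights: finds flights between two airports on a date",
]

VAGUE = "Will I need an umbrella tomorrow?"

# Placeholder for the bridge model: a fixed lookup instead of a trained rewriter.
HYPOTHETICAL_REWRITES = {
    VAGUE: "Check the weather forecast for my city tomorrow to see if it will rain.",
}

def bridge_rewrite(instruction):
    return HYPOTHETICAL_REWRITES.get(instruction, instruction)

def retrieve(instruction, retriever, use_bridge=True):
    # The retriever itself is untouched; only its input changes.
    query = bridge_rewrite(instruction) if use_bridge else instruction
    return retriever.rank(query)

if __name__ == "__main__":
    retriever = BM25(TOOLS)
    print("without bridge:", retrieve(VAGUE, retriever, use_bridge=False))
    print("with bridge:   ", retrieve(VAGUE, retriever, use_bridge=True))
```

In this toy setup the vague instruction shares no terms with the weather tool's description, so BM25 scores that tool at zero, while the rewritten instruction ranks it first. The retriever is never retrained, which is the modularity the summary above describes.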

If this is right

Standard retrievers become viable for vague real-world instructions without any changes to their training.
The gains appear across multiple retrieval settings and algorithms, including both sparse and dense methods.
The approach separates instruction clarification from the retrieval step itself, allowing reuse of existing tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rewriting step could extend to other retrieval tasks where user queries are imprecise, such as web or document search.
Instead of a separate bridge, retrievers might be trained end-to-end on vague data, though the modular design keeps the core retriever unchanged.

Load-bearing premise

The bridge model can rewrite vague instructions without adding incorrect details that mislead the retriever away from the user's true intent.

What would settle it

An experiment in which retrieval accuracy on the rewritten instructions falls below accuracy on the original vague instructions would disprove the claimed benefit.
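
A minimal sketch of how such a comparison could be scored, assuming binary relevance (a tool is either in the ground-truth set or not) and the common log2-discounted form of NDCG. The function names and the toy tool IDs are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1).
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # Binary gains: 1.0 if the retrieved tool is a ground-truth tool, else 0.0.
    gains = [1.0 if tool_id in relevant_ids else 0.0 for tool_id in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def ndcg_delta(ranked_vague, ranked_rewritten, relevant_ids, k=5):
    # Positive: the rewrite helped. Negative: the rewrite hurt retrieval.
    return ndcg_at_k(ranked_rewritten, relevant_ids, k) - ndcg_at_k(ranked_vague, relevant_ids, k)

if __name__ == "__main__":
    relevant = {"weather_api"}
    vague_ranking = ["currency_api", "weather_api", "flight_api"]
    rewritten_ranking = ["weather_api", "currency_api", "flight_api"]
    print(ndcg_delta(vague_ranking, rewritten_ranking, relevant, k=5))
```

Averaging `ndcg_delta` over a test set gives a single number whose sign answers the question above: a negative mean would indicate that rewriting made retrieval worse than leaving the vague instruction alone.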

Figures

Figures reproduced from arXiv: 2604.07816 by Bo Du, Fei Liao, Jian Wang, Juhua Liu, Kunfeng Chen, Luyao Zhuang.

**Figure 1.** Instruction Comparison between ToolBench (Qin et al., 2023) and our VGToolBench. As seen, ToolBench contains more detailed and specific instructions (highlighted in red), while the instructions of our VGToolBench are vague and more aligned with real-world scenarios.

**Figure 2.** Retrieval Performance comparison of VGToolBench vs. ToolBench. The x-axis denotes the different types of sub-sets in VGToolBench and ToolBench, and the y-axis denotes the tool retrieval performance, evaluated by the average of NDCG@5 and NDCG@10, where the evaluation details can be found in Section 5.1. The numerical results represent the relative decrease compared to the results on ToolBench. We can obs…

**Figure 3.** Overview of our proposed TRB. (a) The pipeline of TRB, where the core is to introduce a bridge model to enhance the instruction into a more specific version. (b) The training scheme of the bridge model, which consists of a two-stage process: ❶ Supervised Fine-tuning, ❷ Reinforcement Learning. We first construct paired input–output instances by aligning data correspondences between VGToolBench and ToolBenc…

**Figure 4.** Effect of the iteration number T in iterative DPO. We show the retrieval performance (NDCG@5 and NDCG@10) of TRB with BM25 on VGToolBench (I3) across different iterations. Both metrics display a noticeable upward trend followed by a subsequent downturn as T increases, indicating that moderate iterative refinement improves retrieval quality whereas excessive iterations may hinder performance.

**Figure 6.** Case study between the vague instruction in …
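
The captions for Figures 3 and 4 mention a reinforcement-learning stage and iterative DPO, but this page does not say how preference data is assembled. One common way to turn a retrieval reward into DPO-style pairs is sketched below: sample several candidate rewrites per vague instruction, score each by downstream retrieval quality, and keep the best and worst as the chosen and rejected examples. The function name, the `min_margin` threshold, the `prompt`/`chosen`/`rejected` record layout, and the toy reward are all assumptions, not details taken from the paper.

```python
def build_preference_pairs(candidates_by_instruction, reward_fn, min_margin=0.0):
    """Build (prompt, chosen, rejected) records from reward-scored candidate rewrites.

    candidates_by_instruction: dict mapping a vague instruction to a list of rewrites.
    reward_fn(instruction, rewrite): hypothetical retrieval reward, e.g. NDCG of the
        ranking obtained when the rewrite is used as the query.
    min_margin: skip instructions whose best and worst rewrites score too similarly.
    """
    pairs = []
    for instruction, candidates in candidates_by_instruction.items():
        if len(candidates) < 2:
            continue  # a preference needs at least two alternatives
        scored = sorted(((reward_fn(instruction, c), c) for c in candidates),
                        key=lambda item: item[0])
        low_score, rejected = scored[0]
        high_score, chosen = scored[-1]
        if high_score - low_score > min_margin:
            pairs.append({"prompt": instruction, "chosen": chosen, "rejected": rejected})
    return pairs

if __name__ == "__main__":
    # Toy reward table standing in for a real retriever evaluation.
    toy_rewards = {
        "Check the weather forecast for my city tomorrow.": 0.9,
        "Tell me about umbrellas.": 0.1,
    }
    candidates = {"Will I need an umbrella tomorrow?": list(toy_rewards)}
    print(build_preference_pairs(candidates, lambda _, rewrite: toy_rewards[rewrite]))
```

In an iterative variant, the model trained on these pairs would generate the candidates for the next round. Figure 4's caption reports that retrieval quality rises and then falls as the number of rounds T grows.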

Original abstract

Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Due to the extensive and irregularly updated number of tools, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy would hinder the tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate human vague instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a benchmark for vague tool instructions and shows a rewriting bridge lifts retrieval scores substantially, but the gains rest on a simulated setup that may not test real intent preservation.

The main point is that vague user instructions degrade tool retrieval, and a simple bridge model that rewrites them into more specific versions delivers large measured gains on the new benchmark they introduce. VGToolBench starts from detailed academic instructions and creates vague variants to simulate real-world use. Their experiments confirm the performance drop and then show the bridge model recovers much of it across retrievers, including the reported jump for BM25 from 9.73 to 19.59 NDCG. They release the code and models, which makes the work easy to inspect or extend.

This is a practical step in the tool-learning area where most prior benchmarks assume precise queries. The approach is straightforward and the consistent improvements across settings are the clearest positive.

The soft spot is the lack of direct checks on whether the rewrites keep the original intent. Because the benchmark is built by removing details from known ground-truth instructions, the bridge could be learning to recover those specific details rather than handling genuine ambiguity. The abstract gives no fidelity scores, no human ratings on intent preservation, and no tests on naturally occurring vague queries outside the construction process. If the model adds ungrounded parameters or constraints, the NDCG numbers on this benchmark would look good while real-world results suffer.

This paper is for people working on tool-augmented LLMs who need retrieval that works with messy inputs. A reader focused on practical fixes for agent systems would get value from the benchmark construction and the baseline comparisons. It has enough substance and a clear problem to deserve peer review, though the authors should add intent checks and out-of-distribution tests before publication. I would send it to referees rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper constructs VGToolBench to simulate vague real-world tool-use instructions (by degrading detailed academic queries), shows that vagueness degrades standard retriever performance, and proposes Tool Retrieval Bridge (TRB): a bridge model that rewrites vague instructions into more specific ones before retrieval. Experiments across multiple retrievers (including BM25) report large gains, e.g., BM25 NDCG rising from 9.73 to 19.59 (111.51% relative improvement), with code released.
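
As a quick check of the arithmetic behind the quoted figures, the snippet below computes the relative improvement implied by the two average NDCG scores. The helper name is hypothetical. The two averages on their own give roughly 101.3%, which matches the "doubling" description earlier on this page. The 111.51% figure is stated in the abstract as an "up to" value, so it presumably refers to a specific setting rather than to these averages, though nothing on this page says which one.

```python
def relative_improvement(before, after):
    # Percentage gain of `after` over `before`.
    return (after - before) / before * 100.0

bm25_before, bm25_after = 9.73, 19.59
gain = relative_improvement(bm25_before, bm25_after)
print(f"{gain:.2f}%")  # prints 101.34%
```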

Significance. If the rewrites reliably preserve user intent without introducing ungrounded details, TRB would address a practically important mismatch between academic tool benchmarks and real-world vague queries, potentially improving tool-augmented LLM systems. The public release of code and models is a positive for reproducibility.

major comments (3)

[VGToolBench construction] Benchmark construction (VGToolBench section): The benchmark simulates vagueness by starting from detailed instructions and removing specifics; the bridge model is therefore trained to recover those held-out details. This setup risks the observed NDCG gains arising from the model guessing the original detailed version rather than resolving genuine ambiguity. The paper should report the exact simulation procedure, the distribution of removed information, and results on naturally occurring vague instructions outside this construction process.
[Experiments and analysis] Evaluation and analysis sections: No fidelity, entailment, or intent-preservation metrics are provided (e.g., whether the rewritten instruction entails the original vague query or human ratings of intent match). Without these, it is impossible to rule out that gains stem from the bridge model adding plausible but incorrect parameters or tool names that happen to align with the ground-truth tools in the benchmark.
[Main results table] Table reporting main results (e.g., the BM25 row): The absolute and relative improvements are large, but the paper provides no error analysis of cases where the rewrite introduces details that cause retrieval of incorrect tools, nor controls for the quality of the bridge model's outputs. This leaves the central claim that TRB 'mitigates the ambiguity of vague instructions' only partially supported.

minor comments (2)

[Introduction and preliminary analysis] The abstract and introduction refer to 'preliminary analyses' showing vagueness harms retrieval, but the corresponding section could more clearly separate the analysis from the main TRB experiments.
[Method] Notation for the bridge model input/output could be formalized (e.g., an equation defining the rewrite function) to improve clarity for readers.

Simulated Author's Rebuttal

3 responses · 1 unresolved

Thank you for the detailed review and constructive feedback on our manuscript. We appreciate the referee's insights into the potential limitations of our benchmark construction and evaluation. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: Benchmark construction (VGToolBench section): The benchmark simulates vagueness by starting from detailed instructions and removing specifics; the bridge model is therefore trained to recover those held-out details. This setup risks the observed NDCG gains arising from the model guessing the original detailed version rather than resolving genuine ambiguity. The paper should report the exact simulation procedure, the distribution of removed information, and results on naturally occurring vague instructions outside this construction process.

Authors: We will expand the VGToolBench section to include the exact simulation procedure for degrading detailed instructions into vague ones, along with the distribution of removed information categories (such as specific parameters, tool names, and constraints). This will clarify how the benchmark was built. Regarding results on naturally occurring vague instructions, our work focuses on a controlled simulation derived from existing academic benchmarks to isolate the impact of vagueness; we do not have paired data for real-world vague queries with verified tool ground truths, which would require new data collection efforts beyond the scope of this study. The substantial gains with BM25 (a lexical retriever) support that TRB learns to add generally useful details rather than merely reconstructing the originals. revision: partial
Referee: Evaluation and analysis sections: No fidelity, entailment, or intent-preservation metrics are provided (e.g., whether the rewritten instruction entails the original vague query or human ratings of intent match). Without these, it is impossible to rule out that gains stem from the bridge model adding plausible but incorrect parameters or tool names that happen to align with the ground-truth tools in the benchmark.

Authors: We agree that additional metrics are needed to better support the claims. In the revised manuscript, we will add automatic entailment evaluation (using an LLM-based judge to verify consistency between the original vague query and the rewritten instruction) as well as a human study on a sample of outputs assessing intent preservation and fidelity. These will help address concerns about potential introduction of ungrounded details. revision: yes
Referee: Table reporting main results (e.g., the BM25 row): The absolute and relative improvements are large, but the paper provides no error analysis of cases where the rewrite introduces details that cause retrieval of incorrect tools, nor controls for the quality of the bridge model's outputs. This leaves the central claim that TRB 'mitigates the ambiguity of vague instructions' only partially supported.

Authors: We will add an error analysis subsection examining failure cases where the bridge model's rewrites lead to incorrect tool retrieval, categorizing the types of introduced details that cause issues. We will also include controls, such as comparisons against outputs from an untrained or randomly initialized bridge model, to demonstrate that the trained TRB specifically contributes to the observed gains rather than any plausible rewrite. revision: yes

standing simulated objections not resolved

Providing empirical results on naturally occurring vague instructions outside the simulated VGToolBench construction process.

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper's chain consists of constructing VGToolBench by simulating vagueness, running preliminary retrieval analyses on it, training a separate bridge model to rewrite vague instructions, and measuring downstream NDCG gains on held-out retrieval settings. No equations, fitted parameters, or self-citations are shown that make the reported improvements (e.g., BM25 NDCG rising from 9.73 to 19.59) equivalent to the benchmark construction or training inputs by definition. The bridge model and final metrics are presented as independently trained and evaluated components, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a separate model can rewrite vague instructions without distorting intent. No free parameters or invented physical entities are introduced beyond the standard training of the bridge model (supervised fine-tuning followed by reinforcement learning).

axioms (1)

domain assumption: Vague instructions can be rewritten into specific ones that preserve user intent and improve retriever accuracy.
This is the core premise of the TRB method and is required for the claimed performance gains.

pith-pipeline@v0.9.0 · 5585 in / 1155 out tokens · 35214 ms · 2026-05-10T17:50:49.789590+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences... two-stage processes: ❶ we fine-tune the model... ❷ the SFT model is further optimized... by using the retrieval performance as the reward for reinforcement learning (RL).

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We first construct a new benchmark, VGToolBench, to simulate human vague instructions... TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 4 internal anchors

[1] Qwen Technical Report

Topic extraction and interactive knowledge graphs for learning resources. Sustainability 14, 226. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al., 2023. Qwen technical report. arXiv preprint arXiv:2309.16609. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Sh...

[2] The Llama 3 Herd of Models

Detecting cyberbullying using deep learning techniques: a pre-trained glove and focal loss technique. PeerJ Computer Science 10, e1961. Farghaly, H.M., Ali, A.A., Abd El-Hafeez, T., 2020a. Building an effective and accurate associative classifier based on support vector machine. Sylwan 164, 39–56. Farghaly, H.M., Ali, A.A., El-Hafeez, T.A., 2020b. Devel...

[3] Information Systems 100, 101785

Multi-label arabic text classification in online social networks. Information Systems 100, 101785. Patil, S.G., Mao, H., Yan, F., Ji, C.C.J., Suresh, V., Stoica, I., Gonzalez, J.E., 2025. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models, in: Forty-second International Conference on Machine L...

[4] Proximal Policy Optimization Algorithms

Revisiting, benchmarking and exploring api recommendation: How far are we? IEEE Transactions on Software Engineering 49, 1876–1897. Qian, C., He, B., Zhuang, Z., Deng, J., Qin, Y., Cong, X., Zhang, Z., Zhou, J., Lin, Y., Liu, Z., et al., 2024. Tell me more! towards implicit user intention understanding of language model driven agents, in: Proceedings ...

[5]

Dynamic two-way sign language interpretation, in: International Conference on Intelligent Manufacturing and Energy Sustainability, Springer. pp. 463–476. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al., 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv pr...