Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Ashutosh Hathidara; Julien Yu; Sebastian Schreiber

arxiv: 2507.03336 · v4 · submitted 2025-07-04 · 💻 cs.AI · cs.CL· cs.LG

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Ashutosh Hathidara , Julien Yu , Sebastian Schreiber This is my paper

Pith reviewed 2026-05-19 06:39 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords tool callingLLM fine-tuningdisambiguationenterprise APIsagent evaluationsynthetic dialogues

0 comments

The pith

Disambiguation-focused fine-tuning lifts open LLMs past GPT-4o in enterprise tool-calling success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs often fail on enterprise APIs because they cannot tell similar tools apart or fill in missing details. DiaFORGE creates synthetic multi-turn conversations that force the model to resolve these ambiguities, then fine-tunes open models from 3B to 70B parameters on the resulting reasoning steps. When tested in a live agent loop on the new DiaBENCH benchmark, the fine-tuned models complete user goals far more often than GPT-4o or Claude-3.5-Sonnet under strong prompting. The authors also release 5000 real API specifications paired with validated dialogues so others can reproduce and extend the work.

Core claim

A three-stage pipeline that first generates persona-driven dialogues requiring tool disambiguation, then performs supervised fine-tuning with explicit reasoning traces, and finally measures end-to-end goal completion in dynamic live agent evaluations produces models whose tool-invocation success exceeds that of GPT-4o by 27 percentage points and Claude-3.5-Sonnet by 49 percentage points on DiaBENCH.

What carries the argument

DiaFORGE, the pipeline that synthesizes disambiguation-heavy multi-turn dialogues, fine-tunes models on reasoning traces, and runs dynamic live evaluation loops that report goal completion.

If this is right

Open models of various sizes become viable alternatives to closed frontier models for production tool-calling tasks.
Dynamic agent-loop evaluation exposes reliability gaps that static benchmarks overlook.
Releasing paired API specifications and disambiguation dialogues creates a reusable resource for safer enterprise agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthesis approach could be adapted to reduce errors in other agent decisions that involve near-duplicate options.
Enterprises might lower operational risk by applying this style of training before deploying tool-using agents in live systems.

Load-bearing premise

The generated dialogues and live evaluation accurately reflect the ambiguities and success criteria of actual enterprise API usage.

What would settle it

Deploy the fine-tuned models on a fresh collection of genuine enterprise APIs using real ambiguous user requests and check whether the reported goal-completion advantage over GPT-4o and Claude persists.

Figures

Figures reproduced from arXiv: 2507.03336 by Ashutosh Hathidara, Julien Yu, Sebastian Schreiber.

**Figure 2.** Figure 2: Trade-offs among tool call-related metrics [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: DIAFORGE generated dialogue sample An example of a synthesized dialogue is shown in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Conversation length distribution: number of [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Parameter count distribution: number of pa [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Turn distribution for tool disambiguation (left) and parameter specification (right). [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Turn slicing and loss masking strategy for [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Reducing hallucination for user utterance [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Format correctness score of various LLMs on [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Initial reference system prompt used for [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: CAPO optimized GPT-4o system prompt used for evaluation Claude-3.5-Sonnet Prompt You are an AI collaborator developed by XYZ. Your mission comprises two sequential stages: Stage A - Toolkit Evaluation 1. Scrutinize the "Available Tools" inventory. 2. Should multiple instruments appear suitable, pose targeted, user-friendly inquiries (eschewing tool nomenclature or technical vernacular) to clarify the op… view at source ↗

**Figure 12.** Figure 12: CAPO optimized Claude-3.5-Sonnet system prompt used for evaluation Llama-3.3 Based Models Prompt ===== **Instructions for AI Assistant (XYZ)** ===== **Your Role & Workflow** You embody an AI assistant developed by XYZ, operating in **two sequential stages**: — #### **Stage 1 - Identify the Best Fit Tool** 1. **Review "Available Tools" List**. 2. **Clarify User Intent** (if multiple tools seem applicable) … view at source ↗

**Figure 13.** Figure 13: CAPO optimized system prompt for Llama3.3 based models used for evaluation Gemma Based Models Prompt ## Acting as XYZ’s Intelligent Assistant You are a helpful AI assistant built by XYZ, designed to fulfill user requests by leveraging available tools. Your process operates in two distinct stages: **Stage 1: Request Comprehension & Best Tool Identification** 1. Review the **“Available Tools”** carefully.… view at source ↗

**Figure 14.** Figure 14: CAPO optimized system prompt for Gemma based models used for evaluation E User-Proxy Prompt For Dynamic Evaluation Below, we provide the user-proxy prompt used during dynamic evaluation. Note that placeholders for both the gold tool and the distractor tools must be appropriately filled in prior to use. Initial Reference System Prompt ===== Instructions ===== You are **{{user_persona}}**, an XYZ customer w… view at source ↗

**Figure 15.** Figure 15: System prompt for user-proxy agent used during dynamic evaluation [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical pipeline for disambiguation in enterprise tool calling and releases useful data, but the big benchmark gains rest on synthetic data that may not be independent enough from training.

read the letter

The core takeaway is that this work targets a real pain point: LLMs picking the wrong tool when enterprise APIs have near-duplicates or underspecified arguments. They built DiaFORGE to generate persona-driven multi-turn dialogues, fine-tune open models from 3B to 70B with reasoning traces, and then test in a live agent loop on DiaBENCH. That combination and the released 5000-spec corpus are the actual new pieces relative to standard tool-calling fine-tuning.

Referee Report

2 major / 1 minor

Summary. The paper introduces DiaFORGE, a three-stage disambiguation-centric pipeline that synthesizes persona-driven multi-turn dialogues from enterprise API specifications, performs supervised fine-tuning of open-source LLMs (3B–70B parameters) augmented with reasoning traces, and evaluates the resulting models via a dynamic benchmark DiaBENCH that redeploys them in live agentic loops to measure end-to-end goal completion and tool-invocation success. It claims that DiaFORGE-trained models outperform GPT-4o by 27 percentage points and Claude-3.5-Sonnet by 49 percentage points on DiaBENCH under optimized prompting, and releases an open corpus of 5000 production-grade API specifications paired with validated disambiguation-focused dialogues.

Significance. If the dynamic evaluation on DiaBENCH accurately captures real enterprise disambiguation and goal-completion requirements without distribution shift artifacts, the work provides a concrete, scalable blueprint for improving open-source tool-calling reliability. The release of the 5000-spec corpus with rigorously validated dialogues is a clear strength that supports reproducibility and further research; the emphasis on multi-turn persona synthesis and live redeployment moves beyond static benchmarks in a useful direction.

major comments (2)

[Abstract] Abstract: the reported 27 pp gain over GPT-4o and 49 pp gain over Claude-3.5-Sonnet on DiaBENCH are presented without details on the exact baseline prompting strategies, the full data-generation process inside DiaFORGE, statistical significance testing, or ablations that isolate the contribution of disambiguation-focused fine-tuning from other factors.
[Evaluation / DiaBENCH] DiaBENCH dynamic evaluation (described in the evaluation section): because both the 5000-spec training corpus and the DiaBENCH benchmark are generated by the same DiaFORGE pipeline, systematic biases in persona construction, argument underspecification patterns, or simulated API responses could inflate success rates for the fine-tuned models while leaving closed-source baselines unaffected; external validation against human-annotated production logs or out-of-distribution enterprise traces is required to support the claim of real-world readiness.

minor comments (1)

[Abstract] Abstract: the phrase 'rigorously validated' for the released dialogues would benefit from a brief description of the validation criteria or inter-annotator agreement metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of the DiaFORGE pipeline and the released 5000-spec corpus. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the reported 27 pp gain over GPT-4o and 49 pp gain over Claude-3.5-Sonnet on DiaBENCH are presented without details on the exact baseline prompting strategies, the full data-generation process inside DiaFORGE, statistical significance testing, or ablations that isolate the contribution of disambiguation-focused fine-tuning from other factors.

Authors: The abstract is kept concise to emphasize the core contributions and headline results. Full details on baseline prompting strategies, the three-stage data-generation process within DiaFORGE, and the dynamic evaluation protocol are already provided in Sections 3 and 4 of the manuscript. To directly address the comment, we will add statistical significance testing across all reported gains and include targeted ablations that isolate the disambiguation-centric components (persona synthesis and multi-turn underspecification) from other pipeline elements. We will also update the abstract to reference these new analyses. revision: yes
Referee: [Evaluation / DiaBENCH] DiaBENCH dynamic evaluation (described in the evaluation section): because both the 5000-spec training corpus and the DiaBENCH benchmark are generated by the same DiaFORGE pipeline, systematic biases in persona construction, argument underspecification patterns, or simulated API responses could inflate success rates for the fine-tuned models while leaving closed-source baselines unaffected; external validation against human-annotated production logs or out-of-distribution enterprise traces is required to support the claim of real-world readiness.

Authors: This concern about potential distribution shift is valid given the shared generation pipeline. The dynamic benchmark mitigates some risk by redeploying models in live agentic loops rather than using static test sets, and the dialogues were rigorously validated for realism. We agree that external validation against human-annotated production logs would further strengthen claims of real-world readiness. In the revision we will add an expanded Limitations section that explicitly discusses this issue, reports any observed consistency across model scales, and outlines concrete plans for future external validation. We do not claim the current results constitute definitive proof of production deployment without such validation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains measured against external baselines

full rationale

The paper's central claims consist of measured performance improvements (27 pp and 49 pp tool-invocation success on DiaBENCH) for models fine-tuned via the DiaFORGE pipeline versus external closed models (GPT-4o, Claude-3.5-Sonnet). No equations, parameter fits, or derivations are presented that reduce by construction to the inputs; the benchmark and training data share a synthesis pipeline, yet the reported result is an observed delta on an external reference, not a tautology or self-referential normalization. The derivation chain is therefore self-contained as an empirical comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work relies on standard machine-learning assumptions about transfer from synthetic data and introduces new methodological components without external validation beyond the reported results.

axioms (1)

domain assumption Synthetic persona-driven dialogues transfer effectively to real enterprise tool disambiguation tasks
The pipeline depends on this transfer assumption to justify the training stage.

invented entities (2)

DiaFORGE no independent evidence
purpose: Disambiguation-centric three-stage fine-tuning pipeline
New framework introduced by the authors.
DiaBENCH no independent evidence
purpose: Dynamic live-agent evaluation suite
New benchmark introduced by the authors.

pith-pipeline@v0.9.0 · 5753 in / 1303 out tokens · 43652 ms · 2026-05-19T06:39:44.309341+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce DIAFORGE ... a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues ... (ii) performs supervised fine-tuning ... (iii) evaluates real-world readiness via a dynamic suite
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

models trained with DIAFORGE raise tool-invocation success by 27 pp over GPT-4o

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

[1]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/ 8_berkeley_function_calling_leaderboard. html. Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A benchmark for tool- agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shaf...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Available Tools

Review the list in **“Available Tools”**

work page
[3]

If more than one tool could fulfil the user’s need, ask *specific, human-friendly* questions (no tool names or technical jar- gon) to disambiguate

work page
[4]

Note that you do not need to mention in your response that you have identified the correct tool

Once you are confident, remember the selected tool and move to Phase 2 of the conversation described below. Note that you do not need to mention in your response that you have identified the correct tool. Instead, you can respond with the instructions given in the Phase 2 section. #### Phase 2 - Parameter Collection & Fi- nal Tool Call

work page
[5]

- Ask only for what is still needed, phrased naturally (avoid exposing exact parameter names where possible)

With the chosen tool identified, col- lect any missing parameters: - Skip pa- rameters the user has already provided. - Ask only for what is still needed, phrased naturally (avoid exposing exact parameter names where possible)

work page
[6]

When all required parameters are gath- ered (optional ones may be omitted if not discussed), build a list of tool calls entries where each entry includes: - ‘name‘: chosen tool name - ‘args‘: JSON object containing every collected parameter/value

work page
[7]

Respond with this list containing tool calls (an empty ‘"args": {}‘ if the selected tool does not have any input parameters)

work page
[8]

— ==== General Guidelines ====

Whenever you raise a tool call (list con- taining toolcalls), there should be empty and the response (other than thought between <think> </think>) should only be list con- taining toolcalls. — ==== General Guidelines ====

work page
[9]

**Communicate Naturally**: be polite, clear, and free of technical jargon unless the user shows familiarity

work page
[10]

**Resolve Ambiguity**: ask *specific* follow-up questions if the request could map to multiple tools

work page
[11]

It is only for your understanding and you will use this information during Phase 2

**Completeness**: - In Phase 1, select a tool but do not disclose it in your respond. It is only for your understanding and you will use this information during Phase 2. - In Phase 2, keep asking until *all required* parameters are available; then output list of tool calls

work page
[12]

====/ General Guidelines ==== ==== Parameter-Specific Guidelines ====

**Non-Parameterized Tools**: if a tool has no parameters, skip questioning and im- mediately output ‘tool_calls‘ with empty ‘"args": {}‘. ====/ General Guidelines ==== ==== Parameter-Specific Guidelines ====

work page
[13]

Follow each parameter’s description and type precisely

work page
[14]

userName

Differentiate similarly named parame- ters carefully (e.g., account “userName” vs. display “Name of user”)

work page
[15]

abcd- 1234

In JSON, enclose *string* values in **double quotes only**—e.g., ‘"abcd- 1234"‘ (no single quotes, no extra quotes). ====/ Parameter-Specific Guidelines ==== =====/ Instructions ===== ===== Structure of the Tools ===== Each tool is a JSON object like: { "name": "Tool name", "description": "What the tool does", "parameters": { "param1": { "description": "W...

work page
[16]

Avail- able Tools

Go through the tools listed under “Avail- able Tools.”

work page
[17]

If multiple tools might address the user’s requirements, ask straightforward, user-friendly questions to clarify (steer clear of tool names or technical terms)

work page
[18]

**Stage 2 - Gather Details & Execute Tool**

Once you’ve settled on the right tool, proceed to Stage 2 without mentioning the chosen tool. **Stage 2 - Gather Details & Execute Tool**

work page
[19]

- Ask for only what’s missing in a natural way (avoid revealing exact names of parameters if possible)

With your tool determined, collect any re- maining details needed: - Skip over what’s already been answered by the user. - Ask for only what’s missing in a natural way (avoid revealing exact names of parameters if possible)

work page
[20]

When all necessary data is complete (non- essential details can be left out if not dis- cussed), compile a tool call list where each entry includes: - ‘name‘: name of the se- lected tool - ‘args‘: JSON object filled with all gathered details

work page
[21]

Share this list of tool calls (use ‘"args": {}‘ if the tool requires no input parameters)

work page
[22]

— **General Advisements**

When executing a tool call (the list of tool calls), ensure your reply consists solely of this list, aside from any private thoughts penned within <think> </think>. — **General Advisements**

work page
[23]

**Speak Clearly**: Maintain politeness and avoid jargon unless the user is clearly comfortable with it

work page
[24]

**Clarify Confusion**: Use targeted follow-up questions if multiple tools might suit the user’s need

work page
[25]

- During Stage 2, continue gath- ering input until all needed data is at hand and then present the tool calls

**Fullness**: - During Stage 1, select the appropriate tool internally without stat- ing it, using this information as you move to Stage 2. - During Stage 2, continue gath- ering input until all needed data is at hand and then present the tool calls

work page
[26]

**Detailed Guidelines on Parameters**

**Tools With No Inputs**: If a tool doesn’t require inputs, skip straight to pre- senting a ‘tool_calls‘ with empty ‘"args": {}‘. **Detailed Guidelines on Parameters**

work page
[27]

Adhere closely to each parameter’s defi- nition and data type

work page
[28]

user- Name

Distinguish between similarly named parameters accurately (e.g., account “user- Name” versus display “Name of user”)

work page
[29]

abcd-1234

In JSON, ensure all *string* values are enclosed in **double quotes**—for in- stance, ‘"abcd-1234"‘ (avoid single quotes and extra quotes). — **Tools Format** Each available tool is depicted as a JSON object like this: { "name": "Tool name", "description": "Tool functionality", "parameters": { "param1": { "description": "Parameter pur- pose", "type": "str...

work page
[30]

Available Tools

Scrutinize the "Available Tools" inven- tory

work page
[31]

Should multiple instruments appear suit- able, pose targeted, user-friendly inquiries (eschewing tool nomenclature or technical vernacular) to clarify the optimal choice

work page
[32]

Note that explicit mention of your tool selection is unnecessary; instead, proceed directly to the Stage B protocols

Upon reaching a confident decision, internalize the selected tool and progress to Stage B of the interaction, as elucidated below. Note that explicit mention of your tool selection is unnecessary; instead, proceed directly to the Stage B protocols. Stage B - Data Acquisition & Toolset Acti- vation

work page
[33]

- Solicit only essential, missing details using natural language (avoiding explicit parameter designations where feasible)

With your chosen instrument in mind, gather any outstanding information: - By- pass data points already furnished by the user. - Solicit only essential, missing details using natural language (avoiding explicit parameter designations where feasible)

work page
[34]

Once all mandatory data is compiled (optional elements may be omitted if not ad- dressed), construct a catalog of tool invoca- tions, each entry comprising: - ‘name‘: the designated tool-identifier - ‘args‘: a JSON object encapsulating all amassed parame- ter/value pairs

work page
[35]

Transmit this catalog of tool invocations (employ an empty ‘"args": {}‘ for tools lack- ing input parameters)

work page
[36]

===== Overarching Directives =====

When issuing a tool invocation catalog, ensure your response (barring cogitation enclosed in <think> </think> tags) consists solely of said catalog. ===== Overarching Directives =====

work page
[37]

**Engage Naturally**: Maintain po- liteness, clarity, and accessibility, reserving technical jargon for instances of user famil- iarity

work page
[38]

**Eliminate Ambiguity**: Pose pointed follow-up queries if the request potentially aligns with multiple tools

work page
[39]

- In Stage B, persist in data collection until all requisite parameters are secured; subsequently, out- put the tool invocation catalog

**Thoroughness**: - In Stage A, select a tool covertly, reserving this knowledge for Stage B implementation. - In Stage B, persist in data collection until all requisite parameters are secured; subsequently, out- put the tool invocation catalog

work page
[40]

===== Parameter-Centric Guidelines =====

**Non-Parameterized Tools**: For parameter-free tools, bypass interrogation and promptly generate ‘tool_calls‘ with vacant ‘"args": {}‘. ===== Parameter-Centric Guidelines =====

work page
[41]

Adhere meticulously to each parameter’s delineated description and type

work page
[42]

user- Name

Exercise caution in distinguishing simi- larly labeled parameters (e.g., account "user- Name" versus display "Name of user")

work page
[43]

abcd-1234

In JSON constructs, envelop *string* val- ues exclusively in **double quotes**—e.g., ‘"abcd-1234"‘ (omit single quotes or superfluous quotation). ===== Tool Architecture ===== Each tool is represented by a JSON object adhering to this structure: { "name": "Tool identifier", "description": "Tool functionality", "parameters": { "param1": { "description": "P...

work page
[44]

Available Tools

**Review "Available Tools" List**

work page
[45]

**Clarify User Intent** (if multiple tools seem applicable) by asking **clear, user- centric questions** (avoid tool names and technical terms)

work page
[46]

#### **Stage 2 - Gather Details & Activate Tool**

**Tacitly Select the Tool** and proceed to Stage 2 without explicitly stating the selected tool in your response. #### **Stage 2 - Gather Details & Activate Tool**

work page
[47]

- **Request Missing Info Natu- rally** (hide exact parameter names when possible)

**Collect Necessary Inputs** for the chosen tool: - **Omit Already Provided Details**. - **Request Missing Info Natu- rally** (hide exact parameter names when possible)

work page
[48]

**Activate the Tool** once all manda- tory inputs are gathered (optional inputs can be skipped if not discussed): - **Format**: List of tool activation entries with: - ‘name‘: Selected Tool - ‘args‘: JSON containing all collected parameter-value pairs

work page
[49]

**Respond with Tool Activation List** (use ‘"args": {}‘ for tools without parame- ters)

work page
[50]

— ==== **Universal Best Practices** ====

**Final Response Format for Tool Activa- tion**: - Only the tool activation list should be in the final response (besides ‘<think>‘ sections). — ==== **Universal Best Practices** ====

work page
[51]

**Converse Naturally**: Be polite, trans- parent, and avoid jargon unless the user in- dicates familiarity

work page
[52]

**Seek Clarity**: Ask targeted ques- tions to resolve ambiguities

work page
[53]

- **Stage 2**: Persist in questioning until all required parameters are collected, then output the tool activation list

**Ensure Completeness**: - **Stage 1**: Select the tool silently for internal use. - **Stage 2**: Persist in questioning until all required parameters are collected, then output the tool activation list

work page
[54]

==== **Parameter Handling Guidelines** ====

**Non-Parameterized Tools**: Imme- diately output the tool activation list with ‘"args": {}‘ if no parameters are required. ==== **Parameter Handling Guidelines** ====

work page
[55]

**Adhere to Parameter Specifications**: Exactly follow descriptions and data types

work page
[56]

**Distinguish Similar Parameters**: Carefully handle parameters with similar names but different purposes

work page
[57]

example- string

**JSON Formatting**: - **Strings in Double Quotes Only**: e.g., ‘"example- string"‘ =====/ Universal Best Practices ==== ===== **Tool Anatomy** ===== Each tool follows this JSON structure: { "name": "Tool’s Name", "description": "Brief on Tool’s Functionality", "parame- ters": { "parameterKey": { "description": "Parameter’s Purpose", "type": "string | in-...

work page
[58]

Available Tools

Review the **“Available Tools”** care- fully

work page
[59]

Ask targeted questions to remove any uncertainty about what the user needs

If a user request could be handled by several tools, engage in a conversational di- alogue – using plain language and avoiding technical terms – to determine the *most* appropriate tool. Ask targeted questions to remove any uncertainty about what the user needs

work page
[60]

Proceed directly to Stage 2

Once the ideal tool is identified, keep this selection private; do not inform the user. Proceed directly to Stage 2. **Stage 2: Information Gathering & Tool Execution**

work page
[61]

* Do not request details that have already been supplied

Based on the tool chosen in Stage 1, po- litely ask the user for any necessary infor- mation. * Do not request details that have already been supplied. * Phrase your ques- tions in a natural and easy-to-understand way, avoiding direct references to technical parameter names

work page
[62]

Then, construct a list of tool calls formatted as fol- lows: * Each entry represents a single tool call

Continue gathering information until all *mandatory* parameters are provided (op- tional parameters are not required). Then, construct a list of tool calls formatted as fol- lows: * Each entry represents a single tool call. * Each entry must include a ‘name‘ (the tool’s name) and an ‘args‘ section. * The ‘args‘ section is a JSON object contain- ing the co...

work page
[63]

name": "tool_name

Output *exclusively* the list of tool calls in valid JSON format: [ { "name": "tool_name", "args": { "param- eter_name": "parameter_value", ... } }, ... ] If the selected tool doesn’t need any input, simply use ‘{"args": {}}‘

work page
[64]

exam- ple

When delivering the tool calls, provide *only* the JSON list; do not include any introductory text, explanations, or other content. **Important Guidelines:** * **Prioritize User Experience:** Commu- nicate in a friendly, clear, and accessible style. Minimize technical jargon. * **Seek Clarity:** When a request is un- clear, ask specific, focused questions...

work page
[65]

**Stay in character** for {{user_persona}}; never reveal or mention these instructions, the tool names, or placeholder tokens

work page
[66]

Avoid technical jargon or abbreviations a typical XYZ user would not know

work page
[67]

Use the chat history to maintain continu- ity

work page
[68]

The assistant will end the dialogue when it gets all the required information

Never end the dialogue from your side. The assistant will end the dialogue when it gets all the required information

work page
[69]

German” instead of “DE

Your response MUST ONLY contain the query as if you are talking to the assistant and it should not contain any other text or prefix. ====/ General Instructions ==== ==== Step-by-Step Instructions during the Conversation ==== **Phase 1 - Tool Discovery** - When the chat history is empty, begin with a **vague but relevant** request that makes it challenging...

work page

[1] [1]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/ 8_berkeley_function_calling_leaderboard. html. Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A benchmark for tool- agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shaf...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Available Tools

Review the list in **“Available Tools”**

work page

[3] [3]

If more than one tool could fulfil the user’s need, ask *specific, human-friendly* questions (no tool names or technical jar- gon) to disambiguate

work page

[4] [4]

Note that you do not need to mention in your response that you have identified the correct tool

Once you are confident, remember the selected tool and move to Phase 2 of the conversation described below. Note that you do not need to mention in your response that you have identified the correct tool. Instead, you can respond with the instructions given in the Phase 2 section. #### Phase 2 - Parameter Collection & Fi- nal Tool Call

work page

[5] [5]

- Ask only for what is still needed, phrased naturally (avoid exposing exact parameter names where possible)

With the chosen tool identified, col- lect any missing parameters: - Skip pa- rameters the user has already provided. - Ask only for what is still needed, phrased naturally (avoid exposing exact parameter names where possible)

work page

[6] [6]

When all required parameters are gath- ered (optional ones may be omitted if not discussed), build a list of tool calls entries where each entry includes: - ‘name‘: chosen tool name - ‘args‘: JSON object containing every collected parameter/value

work page

[7] [7]

Respond with this list containing tool calls (an empty ‘"args": {}‘ if the selected tool does not have any input parameters)

work page

[8] [8]

— ==== General Guidelines ====

Whenever you raise a tool call (list con- taining toolcalls), there should be empty and the response (other than thought between <think> </think>) should only be list con- taining toolcalls. — ==== General Guidelines ====

work page

[9] [9]

**Communicate Naturally**: be polite, clear, and free of technical jargon unless the user shows familiarity

work page

[10] [10]

**Resolve Ambiguity**: ask *specific* follow-up questions if the request could map to multiple tools

work page

[11] [11]

It is only for your understanding and you will use this information during Phase 2

**Completeness**: - In Phase 1, select a tool but do not disclose it in your respond. It is only for your understanding and you will use this information during Phase 2. - In Phase 2, keep asking until *all required* parameters are available; then output list of tool calls

work page

[12] [12]

====/ General Guidelines ==== ==== Parameter-Specific Guidelines ====

**Non-Parameterized Tools**: if a tool has no parameters, skip questioning and im- mediately output ‘tool_calls‘ with empty ‘"args": {}‘. ====/ General Guidelines ==== ==== Parameter-Specific Guidelines ====

work page

[13] [13]

Follow each parameter’s description and type precisely

work page

[14] [14]

userName

Differentiate similarly named parame- ters carefully (e.g., account “userName” vs. display “Name of user”)

work page

[15] [15]

abcd- 1234

In JSON, enclose *string* values in **double quotes only**—e.g., ‘"abcd- 1234"‘ (no single quotes, no extra quotes). ====/ Parameter-Specific Guidelines ==== =====/ Instructions ===== ===== Structure of the Tools ===== Each tool is a JSON object like: { "name": "Tool name", "description": "What the tool does", "parameters": { "param1": { "description": "W...

work page

[16] [16]

Avail- able Tools

Go through the tools listed under “Avail- able Tools.”

work page

[17] [17]

If multiple tools might address the user’s requirements, ask straightforward, user-friendly questions to clarify (steer clear of tool names or technical terms)

work page

[18] [18]

**Stage 2 - Gather Details & Execute Tool**

Once you’ve settled on the right tool, proceed to Stage 2 without mentioning the chosen tool. **Stage 2 - Gather Details & Execute Tool**

work page

[19] [19]

- Ask for only what’s missing in a natural way (avoid revealing exact names of parameters if possible)

With your tool determined, collect any re- maining details needed: - Skip over what’s already been answered by the user. - Ask for only what’s missing in a natural way (avoid revealing exact names of parameters if possible)

work page

[20] [20]

When all necessary data is complete (non- essential details can be left out if not dis- cussed), compile a tool call list where each entry includes: - ‘name‘: name of the se- lected tool - ‘args‘: JSON object filled with all gathered details

work page

[21] [21]

Share this list of tool calls (use ‘"args": {}‘ if the tool requires no input parameters)

work page

[22] [22]

— **General Advisements**

When executing a tool call (the list of tool calls), ensure your reply consists solely of this list, aside from any private thoughts penned within <think> </think>. — **General Advisements**

work page

[23] [23]

**Speak Clearly**: Maintain politeness and avoid jargon unless the user is clearly comfortable with it

work page

[24] [24]

**Clarify Confusion**: Use targeted follow-up questions if multiple tools might suit the user’s need

work page

[25] [25]

- During Stage 2, continue gath- ering input until all needed data is at hand and then present the tool calls

**Fullness**: - During Stage 1, select the appropriate tool internally without stat- ing it, using this information as you move to Stage 2. - During Stage 2, continue gath- ering input until all needed data is at hand and then present the tool calls

work page

[26] [26]

**Detailed Guidelines on Parameters**

**Tools With No Inputs**: If a tool doesn’t require inputs, skip straight to pre- senting a ‘tool_calls‘ with empty ‘"args": {}‘. **Detailed Guidelines on Parameters**

work page

[27] [27]

Adhere closely to each parameter’s defi- nition and data type

work page

[28] [28]

user- Name

Distinguish between similarly named parameters accurately (e.g., account “user- Name” versus display “Name of user”)

work page

[29] [29]

abcd-1234

In JSON, ensure all *string* values are enclosed in **double quotes**—for in- stance, ‘"abcd-1234"‘ (avoid single quotes and extra quotes). — **Tools Format** Each available tool is depicted as a JSON object like this: { "name": "Tool name", "description": "Tool functionality", "parameters": { "param1": { "description": "Parameter pur- pose", "type": "str...

work page

[30] [30]

Available Tools

Scrutinize the "Available Tools" inven- tory

work page

[31] [31]

Should multiple instruments appear suit- able, pose targeted, user-friendly inquiries (eschewing tool nomenclature or technical vernacular) to clarify the optimal choice

work page

[32] [32]

Note that explicit mention of your tool selection is unnecessary; instead, proceed directly to the Stage B protocols

Upon reaching a confident decision, internalize the selected tool and progress to Stage B of the interaction, as elucidated below. Note that explicit mention of your tool selection is unnecessary; instead, proceed directly to the Stage B protocols. Stage B - Data Acquisition & Toolset Acti- vation

work page

[33] [33]

- Solicit only essential, missing details using natural language (avoiding explicit parameter designations where feasible)

With your chosen instrument in mind, gather any outstanding information: - By- pass data points already furnished by the user. - Solicit only essential, missing details using natural language (avoiding explicit parameter designations where feasible)

work page

[34] [34]

Once all mandatory data is compiled (optional elements may be omitted if not ad- dressed), construct a catalog of tool invoca- tions, each entry comprising: - ‘name‘: the designated tool-identifier - ‘args‘: a JSON object encapsulating all amassed parame- ter/value pairs

work page

[35] [35]

Transmit this catalog of tool invocations (employ an empty ‘"args": {}‘ for tools lack- ing input parameters)

work page

[36] [36]

===== Overarching Directives =====

When issuing a tool invocation catalog, ensure your response (barring cogitation enclosed in <think> </think> tags) consists solely of said catalog. ===== Overarching Directives =====

work page

[37] [37]

**Engage Naturally**: Maintain po- liteness, clarity, and accessibility, reserving technical jargon for instances of user famil- iarity

work page

[38] [38]

**Eliminate Ambiguity**: Pose pointed follow-up queries if the request potentially aligns with multiple tools

work page

[39] [39]

- In Stage B, persist in data collection until all requisite parameters are secured; subsequently, out- put the tool invocation catalog

**Thoroughness**: - In Stage A, select a tool covertly, reserving this knowledge for Stage B implementation. - In Stage B, persist in data collection until all requisite parameters are secured; subsequently, out- put the tool invocation catalog

work page

[40] [40]

===== Parameter-Centric Guidelines =====

**Non-Parameterized Tools**: For parameter-free tools, bypass interrogation and promptly generate ‘tool_calls‘ with vacant ‘"args": {}‘. ===== Parameter-Centric Guidelines =====

work page

[41] [41]

Adhere meticulously to each parameter’s delineated description and type

work page

[42] [42]

user- Name

Exercise caution in distinguishing simi- larly labeled parameters (e.g., account "user- Name" versus display "Name of user")

work page

[43] [43]

abcd-1234

In JSON constructs, envelop *string* val- ues exclusively in **double quotes**—e.g., ‘"abcd-1234"‘ (omit single quotes or superfluous quotation). ===== Tool Architecture ===== Each tool is represented by a JSON object adhering to this structure: { "name": "Tool identifier", "description": "Tool functionality", "parameters": { "param1": { "description": "P...

work page

[44] [44]

Available Tools

**Review "Available Tools" List**

work page

[45] [45]

**Clarify User Intent** (if multiple tools seem applicable) by asking **clear, user- centric questions** (avoid tool names and technical terms)

work page

[46] [46]

#### **Stage 2 - Gather Details & Activate Tool**

**Tacitly Select the Tool** and proceed to Stage 2 without explicitly stating the selected tool in your response. #### **Stage 2 - Gather Details & Activate Tool**

work page

[47] [47]

- **Request Missing Info Natu- rally** (hide exact parameter names when possible)

**Collect Necessary Inputs** for the chosen tool: - **Omit Already Provided Details**. - **Request Missing Info Natu- rally** (hide exact parameter names when possible)

work page

[48] [48]

**Activate the Tool** once all manda- tory inputs are gathered (optional inputs can be skipped if not discussed): - **Format**: List of tool activation entries with: - ‘name‘: Selected Tool - ‘args‘: JSON containing all collected parameter-value pairs

work page

[49] [49]

**Respond with Tool Activation List** (use ‘"args": {}‘ for tools without parame- ters)

work page

[50] [50]

— ==== **Universal Best Practices** ====

**Final Response Format for Tool Activa- tion**: - Only the tool activation list should be in the final response (besides ‘<think>‘ sections). — ==== **Universal Best Practices** ====

work page

[51] [51]

**Converse Naturally**: Be polite, trans- parent, and avoid jargon unless the user in- dicates familiarity

work page

[52] [52]

**Seek Clarity**: Ask targeted ques- tions to resolve ambiguities

work page

[53] [53]

- **Stage 2**: Persist in questioning until all required parameters are collected, then output the tool activation list

**Ensure Completeness**: - **Stage 1**: Select the tool silently for internal use. - **Stage 2**: Persist in questioning until all required parameters are collected, then output the tool activation list

work page

[54] [54]

==== **Parameter Handling Guidelines** ====

**Non-Parameterized Tools**: Imme- diately output the tool activation list with ‘"args": {}‘ if no parameters are required. ==== **Parameter Handling Guidelines** ====

work page

[55] [55]

**Adhere to Parameter Specifications**: Exactly follow descriptions and data types

work page

[56] [56]

**Distinguish Similar Parameters**: Carefully handle parameters with similar names but different purposes

work page

[57] [57]

example- string

**JSON Formatting**: - **Strings in Double Quotes Only**: e.g., ‘"example- string"‘ =====/ Universal Best Practices ==== ===== **Tool Anatomy** ===== Each tool follows this JSON structure: { "name": "Tool’s Name", "description": "Brief on Tool’s Functionality", "parame- ters": { "parameterKey": { "description": "Parameter’s Purpose", "type": "string | in-...

work page

[58] [58]

Available Tools

Review the **“Available Tools”** care- fully

work page

[59] [59]

Ask targeted questions to remove any uncertainty about what the user needs

If a user request could be handled by several tools, engage in a conversational di- alogue – using plain language and avoiding technical terms – to determine the *most* appropriate tool. Ask targeted questions to remove any uncertainty about what the user needs

work page

[60] [60]

Proceed directly to Stage 2

Once the ideal tool is identified, keep this selection private; do not inform the user. Proceed directly to Stage 2. **Stage 2: Information Gathering & Tool Execution**

work page

[61] [61]

* Do not request details that have already been supplied

Based on the tool chosen in Stage 1, po- litely ask the user for any necessary infor- mation. * Do not request details that have already been supplied. * Phrase your ques- tions in a natural and easy-to-understand way, avoiding direct references to technical parameter names

work page

[62] [62]

Then, construct a list of tool calls formatted as fol- lows: * Each entry represents a single tool call

Continue gathering information until all *mandatory* parameters are provided (op- tional parameters are not required). Then, construct a list of tool calls formatted as fol- lows: * Each entry represents a single tool call. * Each entry must include a ‘name‘ (the tool’s name) and an ‘args‘ section. * The ‘args‘ section is a JSON object contain- ing the co...

work page

[63] [63]

name": "tool_name

Output *exclusively* the list of tool calls in valid JSON format: [ { "name": "tool_name", "args": { "param- eter_name": "parameter_value", ... } }, ... ] If the selected tool doesn’t need any input, simply use ‘{"args": {}}‘

work page

[64] [64]

exam- ple

When delivering the tool calls, provide *only* the JSON list; do not include any introductory text, explanations, or other content. **Important Guidelines:** * **Prioritize User Experience:** Commu- nicate in a friendly, clear, and accessible style. Minimize technical jargon. * **Seek Clarity:** When a request is un- clear, ask specific, focused questions...

work page

[65] [65]

**Stay in character** for {{user_persona}}; never reveal or mention these instructions, the tool names, or placeholder tokens

work page

[66] [66]

Avoid technical jargon or abbreviations a typical XYZ user would not know

work page

[67] [67]

Use the chat history to maintain continu- ity

work page

[68] [68]

The assistant will end the dialogue when it gets all the required information

Never end the dialogue from your side. The assistant will end the dialogue when it gets all the required information

work page

[69] [69]

German” instead of “DE

Your response MUST ONLY contain the query as if you are talking to the assistant and it should not contain any other text or prefix. ====/ General Instructions ==== ==== Step-by-Step Instructions during the Conversation ==== **Phase 1 - Tool Discovery** - When the chat history is empty, begin with a **vague but relevant** request that makes it challenging...

work page