pith. sign in

arxiv: 2507.03336 · v4 · submitted 2025-07-04 · 💻 cs.AI · cs.CL· cs.LG

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Pith reviewed 2026-05-19 06:39 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords tool callingLLM fine-tuningdisambiguationenterprise APIsagent evaluationsynthetic dialogues
0
0 comments X

The pith

Disambiguation-focused fine-tuning lifts open LLMs past GPT-4o in enterprise tool-calling success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs often fail on enterprise APIs because they cannot tell similar tools apart or fill in missing details. DiaFORGE creates synthetic multi-turn conversations that force the model to resolve these ambiguities, then fine-tunes open models from 3B to 70B parameters on the resulting reasoning steps. When tested in a live agent loop on the new DiaBENCH benchmark, the fine-tuned models complete user goals far more often than GPT-4o or Claude-3.5-Sonnet under strong prompting. The authors also release 5000 real API specifications paired with validated dialogues so others can reproduce and extend the work.

Core claim

A three-stage pipeline that first generates persona-driven dialogues requiring tool disambiguation, then performs supervised fine-tuning with explicit reasoning traces, and finally measures end-to-end goal completion in dynamic live agent evaluations produces models whose tool-invocation success exceeds that of GPT-4o by 27 percentage points and Claude-3.5-Sonnet by 49 percentage points on DiaBENCH.

What carries the argument

DiaFORGE, the pipeline that synthesizes disambiguation-heavy multi-turn dialogues, fine-tunes models on reasoning traces, and runs dynamic live evaluation loops that report goal completion.

If this is right

  • Open models of various sizes become viable alternatives to closed frontier models for production tool-calling tasks.
  • Dynamic agent-loop evaluation exposes reliability gaps that static benchmarks overlook.
  • Releasing paired API specifications and disambiguation dialogues creates a reusable resource for safer enterprise agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis approach could be adapted to reduce errors in other agent decisions that involve near-duplicate options.
  • Enterprises might lower operational risk by applying this style of training before deploying tool-using agents in live systems.

Load-bearing premise

The generated dialogues and live evaluation accurately reflect the ambiguities and success criteria of actual enterprise API usage.

What would settle it

Deploy the fine-tuned models on a fresh collection of genuine enterprise APIs using real ambiguous user requests and check whether the reported goal-completion advantage over GPT-4o and Claude persists.

Figures

Figures reproduced from arXiv: 2507.03336 by Ashutosh Hathidara, Julien Yu, Sebastian Schreiber.

Figure 1
Figure 1. Figure 1: Data Generation Engine for Disambiguation-Centric [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Trade-offs among tool call-related metrics [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: DIAFORGE generated dialogue sample An example of a synthesized dialogue is shown in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Conversation length distribution: number of [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Parameter count distribution: number of pa [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Turn distribution for tool disambiguation (left) and parameter specification (right). [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Turn slicing and loss masking strategy for [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Reducing hallucination for user utterance [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Format correctness score of various LLMs on [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Initial reference system prompt used for [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: CAPO optimized GPT-4o system prompt used for evaluation Claude-3.5-Sonnet Prompt You are an AI collaborator developed by XYZ. Your mission comprises two sequential stages: Stage A - Toolkit Evaluation 1. Scrutinize the "Available Tools" inven￾tory. 2. Should multiple instruments appear suit￾able, pose targeted, user-friendly inquiries (eschewing tool nomenclature or technical vernacular) to clarify the op… view at source ↗
Figure 12
Figure 12. Figure 12: CAPO optimized Claude-3.5-Sonnet system prompt used for evaluation Llama-3.3 Based Models Prompt ===== **Instructions for AI Assistant (XYZ)** ===== **Your Role & Workflow** You embody an AI assistant developed by XYZ, operating in **two sequential stages**: — #### **Stage 1 - Identify the Best Fit Tool** 1. **Review "Available Tools" List**. 2. **Clarify User Intent** (if multiple tools seem applicable) … view at source ↗
Figure 13
Figure 13. Figure 13: CAPO optimized system prompt for Llama￾3.3 based models used for evaluation Gemma Based Models Prompt ## Acting as XYZ’s Intelligent Assistant You are a helpful AI assistant built by XYZ, designed to fulfill user requests by leveraging available tools. Your process operates in two distinct stages: **Stage 1: Request Comprehension & Best Tool Identification** 1. Review the **“Available Tools”** care￾fully.… view at source ↗
Figure 14
Figure 14. Figure 14: CAPO optimized system prompt for Gemma based models used for evaluation E User-Proxy Prompt For Dynamic Evaluation Below, we provide the user-proxy prompt used during dynamic evaluation. Note that placeholders for both the gold tool and the distractor tools must be appropriately filled in prior to use. Initial Reference System Prompt ===== Instructions ===== You are **{{user_persona}}**, an XYZ customer w… view at source ↗
Figure 15
Figure 15. Figure 15: System prompt for user-proxy agent used during dynamic evaluation [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
read the original abstract

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DiaFORGE, a three-stage disambiguation-centric pipeline that synthesizes persona-driven multi-turn dialogues from enterprise API specifications, performs supervised fine-tuning of open-source LLMs (3B–70B parameters) augmented with reasoning traces, and evaluates the resulting models via a dynamic benchmark DiaBENCH that redeploys them in live agentic loops to measure end-to-end goal completion and tool-invocation success. It claims that DiaFORGE-trained models outperform GPT-4o by 27 percentage points and Claude-3.5-Sonnet by 49 percentage points on DiaBENCH under optimized prompting, and releases an open corpus of 5000 production-grade API specifications paired with validated disambiguation-focused dialogues.

Significance. If the dynamic evaluation on DiaBENCH accurately captures real enterprise disambiguation and goal-completion requirements without distribution shift artifacts, the work provides a concrete, scalable blueprint for improving open-source tool-calling reliability. The release of the 5000-spec corpus with rigorously validated dialogues is a clear strength that supports reproducibility and further research; the emphasis on multi-turn persona synthesis and live redeployment moves beyond static benchmarks in a useful direction.

major comments (2)
  1. [Abstract] Abstract: the reported 27 pp gain over GPT-4o and 49 pp gain over Claude-3.5-Sonnet on DiaBENCH are presented without details on the exact baseline prompting strategies, the full data-generation process inside DiaFORGE, statistical significance testing, or ablations that isolate the contribution of disambiguation-focused fine-tuning from other factors.
  2. [Evaluation / DiaBENCH] DiaBENCH dynamic evaluation (described in the evaluation section): because both the 5000-spec training corpus and the DiaBENCH benchmark are generated by the same DiaFORGE pipeline, systematic biases in persona construction, argument underspecification patterns, or simulated API responses could inflate success rates for the fine-tuned models while leaving closed-source baselines unaffected; external validation against human-annotated production logs or out-of-distribution enterprise traces is required to support the claim of real-world readiness.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'rigorously validated' for the released dialogues would benefit from a brief description of the validation criteria or inter-annotator agreement metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of the DiaFORGE pipeline and the released 5000-spec corpus. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 27 pp gain over GPT-4o and 49 pp gain over Claude-3.5-Sonnet on DiaBENCH are presented without details on the exact baseline prompting strategies, the full data-generation process inside DiaFORGE, statistical significance testing, or ablations that isolate the contribution of disambiguation-focused fine-tuning from other factors.

    Authors: The abstract is kept concise to emphasize the core contributions and headline results. Full details on baseline prompting strategies, the three-stage data-generation process within DiaFORGE, and the dynamic evaluation protocol are already provided in Sections 3 and 4 of the manuscript. To directly address the comment, we will add statistical significance testing across all reported gains and include targeted ablations that isolate the disambiguation-centric components (persona synthesis and multi-turn underspecification) from other pipeline elements. We will also update the abstract to reference these new analyses. revision: yes

  2. Referee: [Evaluation / DiaBENCH] DiaBENCH dynamic evaluation (described in the evaluation section): because both the 5000-spec training corpus and the DiaBENCH benchmark are generated by the same DiaFORGE pipeline, systematic biases in persona construction, argument underspecification patterns, or simulated API responses could inflate success rates for the fine-tuned models while leaving closed-source baselines unaffected; external validation against human-annotated production logs or out-of-distribution enterprise traces is required to support the claim of real-world readiness.

    Authors: This concern about potential distribution shift is valid given the shared generation pipeline. The dynamic benchmark mitigates some risk by redeploying models in live agentic loops rather than using static test sets, and the dialogues were rigorously validated for realism. We agree that external validation against human-annotated production logs would further strengthen claims of real-world readiness. In the revision we will add an expanded Limitations section that explicitly discusses this issue, reports any observed consistency across model scales, and outlines concrete plans for future external validation. We do not claim the current results constitute definitive proof of production deployment without such validation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains measured against external baselines

full rationale

The paper's central claims consist of measured performance improvements (27 pp and 49 pp tool-invocation success on DiaBENCH) for models fine-tuned via the DiaFORGE pipeline versus external closed models (GPT-4o, Claude-3.5-Sonnet). No equations, parameter fits, or derivations are presented that reduce by construction to the inputs; the benchmark and training data share a synthesis pipeline, yet the reported result is an observed delta on an external reference, not a tautology or self-referential normalization. The derivation chain is therefore self-contained as an empirical comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work relies on standard machine-learning assumptions about transfer from synthetic data and introduces new methodological components without external validation beyond the reported results.

axioms (1)
  • domain assumption Synthetic persona-driven dialogues transfer effectively to real enterprise tool disambiguation tasks
    The pipeline depends on this transfer assumption to justify the training stage.
invented entities (2)
  • DiaFORGE no independent evidence
    purpose: Disambiguation-centric three-stage fine-tuning pipeline
    New framework introduced by the authors.
  • DiaBENCH no independent evidence
    purpose: Dynamic live-agent evaluation suite
    New benchmark introduced by the authors.

pith-pipeline@v0.9.0 · 5753 in / 1303 out tokens · 43652 ms · 2026-05-19T06:39:44.309341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

  1. [1]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/ 8_berkeley_function_calling_leaderboard. html. Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A benchmark for tool- agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shaf...

  2. [2]

    Available Tools

    Review the list in **“Available Tools”**

  3. [3]

    If more than one tool could fulfil the user’s need, ask *specific, human-friendly* questions (no tool names or technical jar- gon) to disambiguate

  4. [4]

    Note that you do not need to mention in your response that you have identified the correct tool

    Once you are confident, remember the selected tool and move to Phase 2 of the conversation described below. Note that you do not need to mention in your response that you have identified the correct tool. Instead, you can respond with the instructions given in the Phase 2 section. #### Phase 2 - Parameter Collection & Fi- nal Tool Call

  5. [5]

    - Ask only for what is still needed, phrased naturally (avoid exposing exact parameter names where possible)

    With the chosen tool identified, col- lect any missing parameters: - Skip pa- rameters the user has already provided. - Ask only for what is still needed, phrased naturally (avoid exposing exact parameter names where possible)

  6. [6]

    When all required parameters are gath- ered (optional ones may be omitted if not discussed), build a list of tool calls entries where each entry includes: - ‘name‘: chosen tool name - ‘args‘: JSON object containing every collected parameter/value

  7. [7]

    Respond with this list containing tool calls (an empty ‘"args": {}‘ if the selected tool does not have any input parameters)

  8. [8]

    — ==== General Guidelines ====

    Whenever you raise a tool call (list con- taining toolcalls), there should be empty and the response (other than thought between <think> </think>) should only be list con- taining toolcalls. — ==== General Guidelines ====

  9. [9]

    **Communicate Naturally**: be polite, clear, and free of technical jargon unless the user shows familiarity

  10. [10]

    **Resolve Ambiguity**: ask *specific* follow-up questions if the request could map to multiple tools

  11. [11]

    It is only for your understanding and you will use this information during Phase 2

    **Completeness**: - In Phase 1, select a tool but do not disclose it in your respond. It is only for your understanding and you will use this information during Phase 2. - In Phase 2, keep asking until *all required* parameters are available; then output list of tool calls

  12. [12]

    ====/ General Guidelines ==== ==== Parameter-Specific Guidelines ====

    **Non-Parameterized Tools**: if a tool has no parameters, skip questioning and im- mediately output ‘tool_calls‘ with empty ‘"args": {}‘. ====/ General Guidelines ==== ==== Parameter-Specific Guidelines ====

  13. [13]

    Follow each parameter’s description and type precisely

  14. [14]

    userName

    Differentiate similarly named parame- ters carefully (e.g., account “userName” vs. display “Name of user”)

  15. [15]

    abcd- 1234

    In JSON, enclose *string* values in **double quotes only**—e.g., ‘"abcd- 1234"‘ (no single quotes, no extra quotes). ====/ Parameter-Specific Guidelines ==== =====/ Instructions ===== ===== Structure of the Tools ===== Each tool is a JSON object like: { "name": "Tool name", "description": "What the tool does", "parameters": { "param1": { "description": "W...

  16. [16]

    Avail- able Tools

    Go through the tools listed under “Avail- able Tools.”

  17. [17]

    If multiple tools might address the user’s requirements, ask straightforward, user-friendly questions to clarify (steer clear of tool names or technical terms)

  18. [18]

    **Stage 2 - Gather Details & Execute Tool**

    Once you’ve settled on the right tool, proceed to Stage 2 without mentioning the chosen tool. **Stage 2 - Gather Details & Execute Tool**

  19. [19]

    - Ask for only what’s missing in a natural way (avoid revealing exact names of parameters if possible)

    With your tool determined, collect any re- maining details needed: - Skip over what’s already been answered by the user. - Ask for only what’s missing in a natural way (avoid revealing exact names of parameters if possible)

  20. [20]

    When all necessary data is complete (non- essential details can be left out if not dis- cussed), compile a tool call list where each entry includes: - ‘name‘: name of the se- lected tool - ‘args‘: JSON object filled with all gathered details

  21. [21]

    Share this list of tool calls (use ‘"args": {}‘ if the tool requires no input parameters)

  22. [22]

    — **General Advisements**

    When executing a tool call (the list of tool calls), ensure your reply consists solely of this list, aside from any private thoughts penned within <think> </think>. — **General Advisements**

  23. [23]

    **Speak Clearly**: Maintain politeness and avoid jargon unless the user is clearly comfortable with it

  24. [24]

    **Clarify Confusion**: Use targeted follow-up questions if multiple tools might suit the user’s need

  25. [25]

    - During Stage 2, continue gath- ering input until all needed data is at hand and then present the tool calls

    **Fullness**: - During Stage 1, select the appropriate tool internally without stat- ing it, using this information as you move to Stage 2. - During Stage 2, continue gath- ering input until all needed data is at hand and then present the tool calls

  26. [26]

    **Detailed Guidelines on Parameters**

    **Tools With No Inputs**: If a tool doesn’t require inputs, skip straight to pre- senting a ‘tool_calls‘ with empty ‘"args": {}‘. **Detailed Guidelines on Parameters**

  27. [27]

    Adhere closely to each parameter’s defi- nition and data type

  28. [28]

    user- Name

    Distinguish between similarly named parameters accurately (e.g., account “user- Name” versus display “Name of user”)

  29. [29]

    abcd-1234

    In JSON, ensure all *string* values are enclosed in **double quotes**—for in- stance, ‘"abcd-1234"‘ (avoid single quotes and extra quotes). — **Tools Format** Each available tool is depicted as a JSON object like this: { "name": "Tool name", "description": "Tool functionality", "parameters": { "param1": { "description": "Parameter pur- pose", "type": "str...

  30. [30]

    Available Tools

    Scrutinize the "Available Tools" inven- tory

  31. [31]

    Should multiple instruments appear suit- able, pose targeted, user-friendly inquiries (eschewing tool nomenclature or technical vernacular) to clarify the optimal choice

  32. [32]

    Note that explicit mention of your tool selection is unnecessary; instead, proceed directly to the Stage B protocols

    Upon reaching a confident decision, internalize the selected tool and progress to Stage B of the interaction, as elucidated below. Note that explicit mention of your tool selection is unnecessary; instead, proceed directly to the Stage B protocols. Stage B - Data Acquisition & Toolset Acti- vation

  33. [33]

    - Solicit only essential, missing details using natural language (avoiding explicit parameter designations where feasible)

    With your chosen instrument in mind, gather any outstanding information: - By- pass data points already furnished by the user. - Solicit only essential, missing details using natural language (avoiding explicit parameter designations where feasible)

  34. [34]

    Once all mandatory data is compiled (optional elements may be omitted if not ad- dressed), construct a catalog of tool invoca- tions, each entry comprising: - ‘name‘: the designated tool-identifier - ‘args‘: a JSON object encapsulating all amassed parame- ter/value pairs

  35. [35]

    Transmit this catalog of tool invocations (employ an empty ‘"args": {}‘ for tools lack- ing input parameters)

  36. [36]

    ===== Overarching Directives =====

    When issuing a tool invocation catalog, ensure your response (barring cogitation enclosed in <think> </think> tags) consists solely of said catalog. ===== Overarching Directives =====

  37. [37]

    **Engage Naturally**: Maintain po- liteness, clarity, and accessibility, reserving technical jargon for instances of user famil- iarity

  38. [38]

    **Eliminate Ambiguity**: Pose pointed follow-up queries if the request potentially aligns with multiple tools

  39. [39]

    - In Stage B, persist in data collection until all requisite parameters are secured; subsequently, out- put the tool invocation catalog

    **Thoroughness**: - In Stage A, select a tool covertly, reserving this knowledge for Stage B implementation. - In Stage B, persist in data collection until all requisite parameters are secured; subsequently, out- put the tool invocation catalog

  40. [40]

    ===== Parameter-Centric Guidelines =====

    **Non-Parameterized Tools**: For parameter-free tools, bypass interrogation and promptly generate ‘tool_calls‘ with vacant ‘"args": {}‘. ===== Parameter-Centric Guidelines =====

  41. [41]

    Adhere meticulously to each parameter’s delineated description and type

  42. [42]

    user- Name

    Exercise caution in distinguishing simi- larly labeled parameters (e.g., account "user- Name" versus display "Name of user")

  43. [43]

    abcd-1234

    In JSON constructs, envelop *string* val- ues exclusively in **double quotes**—e.g., ‘"abcd-1234"‘ (omit single quotes or superfluous quotation). ===== Tool Architecture ===== Each tool is represented by a JSON object adhering to this structure: { "name": "Tool identifier", "description": "Tool functionality", "parameters": { "param1": { "description": "P...

  44. [44]

    Available Tools

    **Review "Available Tools" List**

  45. [45]

    **Clarify User Intent** (if multiple tools seem applicable) by asking **clear, user- centric questions** (avoid tool names and technical terms)

  46. [46]

    #### **Stage 2 - Gather Details & Activate Tool**

    **Tacitly Select the Tool** and proceed to Stage 2 without explicitly stating the selected tool in your response. #### **Stage 2 - Gather Details & Activate Tool**

  47. [47]

    - **Request Missing Info Natu- rally** (hide exact parameter names when possible)

    **Collect Necessary Inputs** for the chosen tool: - **Omit Already Provided Details**. - **Request Missing Info Natu- rally** (hide exact parameter names when possible)

  48. [48]

    **Activate the Tool** once all manda- tory inputs are gathered (optional inputs can be skipped if not discussed): - **Format**: List of tool activation entries with: - ‘name‘: Selected Tool - ‘args‘: JSON containing all collected parameter-value pairs

  49. [49]

    **Respond with Tool Activation List** (use ‘"args": {}‘ for tools without parame- ters)

  50. [50]

    — ==== **Universal Best Practices** ====

    **Final Response Format for Tool Activa- tion**: - Only the tool activation list should be in the final response (besides ‘<think>‘ sections). — ==== **Universal Best Practices** ====

  51. [51]

    **Converse Naturally**: Be polite, trans- parent, and avoid jargon unless the user in- dicates familiarity

  52. [52]

    **Seek Clarity**: Ask targeted ques- tions to resolve ambiguities

  53. [53]

    - **Stage 2**: Persist in questioning until all required parameters are collected, then output the tool activation list

    **Ensure Completeness**: - **Stage 1**: Select the tool silently for internal use. - **Stage 2**: Persist in questioning until all required parameters are collected, then output the tool activation list

  54. [54]

    ==== **Parameter Handling Guidelines** ====

    **Non-Parameterized Tools**: Imme- diately output the tool activation list with ‘"args": {}‘ if no parameters are required. ==== **Parameter Handling Guidelines** ====

  55. [55]

    **Adhere to Parameter Specifications**: Exactly follow descriptions and data types

  56. [56]

    **Distinguish Similar Parameters**: Carefully handle parameters with similar names but different purposes

  57. [57]

    example- string

    **JSON Formatting**: - **Strings in Double Quotes Only**: e.g., ‘"example- string"‘ =====/ Universal Best Practices ==== ===== **Tool Anatomy** ===== Each tool follows this JSON structure: { "name": "Tool’s Name", "description": "Brief on Tool’s Functionality", "parame- ters": { "parameterKey": { "description": "Parameter’s Purpose", "type": "string | in-...

  58. [58]

    Available Tools

    Review the **“Available Tools”** care- fully

  59. [59]

    Ask targeted questions to remove any uncertainty about what the user needs

    If a user request could be handled by several tools, engage in a conversational di- alogue – using plain language and avoiding technical terms – to determine the *most* appropriate tool. Ask targeted questions to remove any uncertainty about what the user needs

  60. [60]

    Proceed directly to Stage 2

    Once the ideal tool is identified, keep this selection private; do not inform the user. Proceed directly to Stage 2. **Stage 2: Information Gathering & Tool Execution**

  61. [61]

    * Do not request details that have already been supplied

    Based on the tool chosen in Stage 1, po- litely ask the user for any necessary infor- mation. * Do not request details that have already been supplied. * Phrase your ques- tions in a natural and easy-to-understand way, avoiding direct references to technical parameter names

  62. [62]

    Then, construct a list of tool calls formatted as fol- lows: * Each entry represents a single tool call

    Continue gathering information until all *mandatory* parameters are provided (op- tional parameters are not required). Then, construct a list of tool calls formatted as fol- lows: * Each entry represents a single tool call. * Each entry must include a ‘name‘ (the tool’s name) and an ‘args‘ section. * The ‘args‘ section is a JSON object contain- ing the co...

  63. [63]

    name": "tool_name

    Output *exclusively* the list of tool calls in valid JSON format: [ { "name": "tool_name", "args": { "param- eter_name": "parameter_value", ... } }, ... ] If the selected tool doesn’t need any input, simply use ‘{"args": {}}‘

  64. [64]

    exam- ple

    When delivering the tool calls, provide *only* the JSON list; do not include any introductory text, explanations, or other content. **Important Guidelines:** * **Prioritize User Experience:** Commu- nicate in a friendly, clear, and accessible style. Minimize technical jargon. * **Seek Clarity:** When a request is un- clear, ask specific, focused questions...

  65. [65]

    **Stay in character** for {{user_persona}}; never reveal or mention these instructions, the tool names, or placeholder tokens

  66. [66]

    Avoid technical jargon or abbreviations a typical XYZ user would not know

  67. [67]

    Use the chat history to maintain continu- ity

  68. [68]

    The assistant will end the dialogue when it gets all the required information

    Never end the dialogue from your side. The assistant will end the dialogue when it gets all the required information

  69. [69]

    German” instead of “DE

    Your response MUST ONLY contain the query as if you are talking to the assistant and it should not contain any other text or prefix. ====/ General Instructions ==== ==== Step-by-Step Instructions during the Conversation ==== **Phase 1 - Tool Discovery** - When the chat history is empty, begin with a **vague but relevant** request that makes it challenging...