I.3 LLM-as-Judge Evaluation Details This appendix provides full details of the LLM-as-judge protocol used in Section 4.4 to assess the semantic quality of synthetic dialogues

Impact of Fine-Grained Planner: Row 3 shows that replacing our planner with ToolFlow’s monolithic planner collapses performance to 7 · 2025

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

cs.CL · 2026-04-03 · conditional · novelty 6.0

ToolWeave synthesizes realistic multi-turn tool-calling dialogues via dependent workflows and parameter provenance tracking, yielding LLMs that score higher on benchmarks such as 39.75% on BFCL-V3 multi-turn versus 23.50% on prior SOTA data.

