ToolWeave synthesizes realistic multi-turn tool-calling dialogues using dependent workflows and parameter provenance tracking. LLMs trained on the resulting data score higher on benchmarks, for example 39.75% on BFCL-V3 multi-turn versus 23.50% with prior SOTA data.
I.3 LLM-as-Judge Evaluation Details
This appendix provides full details of the LLM-as-judge protocol used in Section 4.4 to assess the semantic quality of synthetic dialogues
Fields: cs.CL (1). Years: 2026 (1). Verdicts: CONDITIONAL (1).
ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues