Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Pith reviewed 2026-05-19 22:26 UTC · model grok-4.3
pith:77ABU4DH
The pith
FireFly inverts the data synthesis pipeline to generate verified tool-calling trajectories directly from real API explorations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By first letting a strong LLM explore real MCP servers along DAG structures guided by a pairwise tool graph, and then synthesizing tasks backward from the observed API call outcomes, the method produces labels that are verified by construction. The resulting dataset contains 5,144 tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on this data matches Claude Sonnet 4.6 on the held-out test set and shows gains on Tau2-Bench, MCPMark, and MCP-Atlas.
What carries the argument
The backward synthesis of tasks from observed real-API outcomes after graph-guided DAG exploration, which guarantees label correctness because the outcomes come from actual executions rather than assumed solvability.
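A minimal sketch of the backward step, assuming a trace is a list of already-executed calls. The names `ObservedCall`, `Task`, `synthesize_task`, and `verify` are hypothetical and not taken from the paper, and the templated prompt stands in for the LLM-written task description.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ObservedCall:
    tool: str
    arguments: dict[str, Any]
    output: Any  # recorded from a real execution, never guessed


@dataclass(frozen=True)
class Task:
    prompt: str
    expected_tools: tuple[str, ...]
    label: Any


def synthesize_task(trace: list[ObservedCall]) -> Task:
    """Build a task backward from a trace that has already been executed."""
    if not trace:
        raise ValueError("cannot synthesize a task from an empty trace")
    final = trace[-1]
    # Assumption: a template replaces the LLM that would phrase the task.
    prompt = (
        f"Use the available tools to obtain the result of {final.tool} "
        f"with arguments {final.arguments}."
    )
    return Task(
        prompt=prompt,
        expected_tools=tuple(call.tool for call in trace),
        label=final.output,  # the label is the observed outcome itself
    )


def verify(task: Task, answer: Any) -> bool:
    """An answer is correct only if it equals the recorded outcome."""
    return answer == task.label
```

Because the label is copied from the recorded output, a task cannot be created without an execution that produced it. The sketch says nothing about whether that output stays valid on a later run.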
If this is right
- Verified trajectory data can be created at scale without depending on synthetic environments that diverge from real API behavior.
- Fully offline and reproducible reinforcement learning becomes possible by caching all exploration results for replay during training and evaluation.
- Smaller models can reach competitive tool-calling performance when trained on high-quality verified trajectories produced this way.
- Structured sampling of semantically coherent workflows allows exploration to scale to spaces with roughly one thousand tools.
Where Pith is reading between the lines
- The same backward-from-outcomes approach could be tested in other agent domains where real-system executions are available but task design is difficult.
- Graph-guided exploration of tool relationships may surface multi-step workflows that manual task authoring would overlook.
- The cached simulator could be extended to support continual updates from fresh live-API probes while preserving offline training.
Load-bearing premise
Tasks built backward from observed API outcomes keep their labels correct and useful when the same tasks run again in the retrieval-augmented simulator or on live APIs.
What would settle it
Run the generated tasks on the original live APIs without the retrieval-augmented simulator and check whether the recorded outcomes still match the assigned labels and whether the tasks remain solvable as intended.
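The check described above could be scripted roughly as follows. This is a sketch under stated assumptions: `outcome_agreement` and the `run_live` callable are hypothetical, and a live call that raises is counted as a disagreement rather than retried.

```python
from typing import Any, Callable, Iterable


def outcome_agreement(
    tasks: Iterable[tuple[str, Any]],
    run_live: Callable[[str], Any],
) -> tuple[float, list[str]]:
    """Compare recorded outcomes with fresh live outcomes, task by task.

    tasks yields (task_id, recorded_outcome) pairs. run_live re-executes a
    task against the live API and returns its outcome.
    """
    total = 0
    mismatched: list[str] = []
    for task_id, recorded in tasks:
        total += 1
        try:
            live = run_live(task_id)
        except Exception:
            # Assumption: auth, rate-limit, or missing-resource failures
            # count as disagreement.
            mismatched.append(task_id)
            continue
        if live != recorded:
            mismatched.append(task_id)
    if total == 0:
        raise ValueError("no tasks to compare")
    return (total - len(mismatched)) / total, mismatched
```

Returning the mismatched task ids alongside the rate makes it possible to inspect which servers drift, not only how often.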
Original abstract
Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for verification. We present FireFly, a pipeline for generating verified tool-call data from real-world MCP servers. Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction. To handle the scale of real-world tool spaces (~1,000 tools), we build a pairwise tool graph and sample sub-DAGs to focus exploration on semantically coherent workflows. To address environment drift in live APIs, we construct a retrieval-augmented simulator that caches all exploration results and replays them during training and evaluation, enabling fully offline and reproducible RL. Applying this pipeline yields 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on FireFly matches Claude Sonnet 4.6 on our held-out test set and shows improvements on multiple tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.
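The abstract does not say how sub-DAGs are sampled from the pairwise tool graph. The following is one possible sketch, assuming the graph is a dictionary mapping each tool to the tools that can consume its output; `sample_sub_dag` and the random frontier expansion are illustrative choices, not the paper's algorithm.

```python
import random


def sample_sub_dag(
    graph: dict[str, set[str]],
    seed_tool: str,
    max_nodes: int,
    rng: random.Random,
) -> tuple[list[str], list[tuple[str, str]]]:
    """Sample a small acyclic workflow around seed_tool.

    Nodes are added by random frontier expansion. Only edges that point from
    an earlier-selected node to a later-selected node are kept, so the
    selection order is a topological order and the result has no cycles.
    """
    order = [seed_tool]
    frontier = [seed_tool]
    while frontier and len(order) < max_nodes:
        src = frontier.pop(rng.randrange(len(frontier)))
        candidates = sorted(t for t in graph.get(src, ()) if t not in order)
        rng.shuffle(candidates)
        for dst in candidates:
            if len(order) >= max_nodes:
                break
            order.append(dst)
            frontier.append(dst)
    position = {tool: i for i, tool in enumerate(order)}
    edges = [
        (u, v)
        for u in order
        for v in sorted(graph.get(u, ()))
        if v in position and position[u] < position[v]
    ]
    return order, edges
```

Bounding `max_nodes` keeps each exploration focused on a handful of related tools, which is the role the abstract assigns to sub-DAG sampling in a space of roughly 1,000 tools.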
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FireFly, a pipeline for generating large-scale verified tool-call data from real-world MCP servers. The approach inverts the typical synthesis process by first using a strong LLM to explore real APIs along graph-guided DAG structures derived from a pairwise tool graph, then synthesizing tasks backward from observed outcomes to ensure label correctness by construction. This produces 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained using GRPO on this dataset achieves performance matching Claude Sonnet 4.6 on a held-out test set and demonstrates improvements on tool-calling benchmarks such as Tau2-Bench, MCPMark, and MCP-Atlas. The pipeline includes a retrieval-augmented simulator to handle environment drift and enable offline RL.
Significance. If the verification guarantees and simulator fidelity hold, this work addresses a central bottleneck in training tool-calling agents by enabling scalable generation of high-quality, real-API-derived trajectories with verifiable labels. The scale (over 5,000 tasks across nearly 1,000 tools) and the reported ability of a 4B model to match a frontier model on held-out data would represent a meaningful advance, particularly if the offline reproducible RL setup proves robust.
major comments (2)
- Abstract, pipeline inversion paragraph: the central guarantee that tasks synthesized backward from observed outcomes have correct labels 'by construction' holds only if every possible execution path was observed during graph-guided exploration and the retrieval-augmented simulator replays identical outcomes for any valid tool-call sequence. Real MCP servers frequently exhibit state, authentication, or rate-limit effects not captured by pairwise tool graphs or cached sub-DAGs; a label derived from one trace can therefore yield an incorrect reward when the 4B model follows a different but still valid sequence. This assumption is load-bearing for the verification claim and requires either quantitative coverage metrics for the exploration DAGs or an ablation comparing simulator versus live-API outcomes.
- Abstract: the reported performance (5,144 tasks, model matching Claude Sonnet 4.6, gains on Tau2-Bench/MCPMark/MCP-Atlas) is presented without accompanying details on data error rates, number of failed explorations, or statistical significance of benchmark differences. These omissions make it difficult to assess whether the results support the claim of reliable verified data at scale.
minor comments (2)
- The abstract introduces 'MCP servers' without spelling out the acronym on first use; adding a brief parenthetical definition would improve accessibility.
- The description of the retrieval-augmented simulator would benefit from a short statement on cache invalidation policy or how sub-DAG replay ensures exact reproducibility across training runs.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the discussion of verification guarantees, add quantitative details on exploration coverage and data quality, and improve the reporting of experimental results. Below we respond to each major comment.
Referee: Abstract, pipeline inversion paragraph: the central guarantee that tasks synthesized backward from observed outcomes have correct labels 'by construction' holds only if every possible execution path was observed during graph-guided exploration and the retrieval-augmented simulator replays identical outcomes for any valid tool-call sequence. Real MCP servers frequently exhibit state, authentication, or rate-limit effects not captured by pairwise tool graphs or cached sub-DAGs; a label derived from one trace can therefore yield an incorrect reward when the 4B model follows a different but still valid sequence. This assumption is load-bearing for the verification claim and requires either quantitative coverage metrics for the exploration DAGs or an ablation comparing simulator versus live-API outcomes.
Authors: We appreciate this observation on the scope of our verification claim. The 'by construction' correctness applies to the 5,144 tasks synthesized directly from observed execution traces produced by the graph-guided sub-DAG exploration; for these tasks the outcomes were recorded from real API calls, so the labels are faithful to what was observed. The retrieval-augmented simulator caches exactly those observed (tool, input, output) tuples and replays them during GRPO training and evaluation to ensure reproducibility and offline operation. We acknowledge that real MCP servers can exhibit additional state, authentication, or rate-limit effects that are not fully captured by the pairwise tool graph or cached sub-DAGs, and that a model-generated sequence outside the explored traces may receive a simulator reward that differs from live execution. To address the request for quantitative evidence, the revised manuscript adds (i) coverage statistics in Section 4.2 showing that each tool appears in an average of 3.2 distinct sub-DAG contexts and (ii) a new ablation (Section 5.4) comparing simulator versus live-API outcomes on a held-out set of 300 tasks, reporting 89% outcome agreement. We have also expanded the limitations paragraph to discuss residual non-determinism.
Revision: yes
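To make the replay behavior concrete, here is a minimal sketch of a cache keyed on (tool, input). `ReplaySimulator` is a hypothetical name, and the exact-match lookup is an assumption: the page does not describe how the retrieval-augmented simulator handles calls that were never observed, so this sketch only records them as misses.

```python
import json
from typing import Any


class ReplaySimulator:
    """Replays recorded (tool, input, output) tuples without live calls."""

    def __init__(self) -> None:
        self._cache: dict[tuple[str, str], Any] = {}
        self.misses: list[tuple[str, str]] = []

    @staticmethod
    def _key(tool: str, arguments: dict[str, Any]) -> tuple[str, str]:
        # Sorting keys makes the lookup independent of argument order.
        return tool, json.dumps(arguments, sort_keys=True)

    def record(self, tool: str, arguments: dict[str, Any], output: Any) -> None:
        """Store an outcome observed during live exploration."""
        self._cache[self._key(tool, arguments)] = output

    def call(self, tool: str, arguments: dict[str, Any]) -> Any:
        """Return the recorded outcome, or raise if the call was never seen."""
        key = self._key(tool, arguments)
        if key not in self._cache:
            self.misses.append(key)
            raise KeyError(f"no recorded outcome for {tool} with {key[1]}")
        return self._cache[key]
```

The `misses` list is where the referee's concern shows up in practice: a valid call sequence that exploration never took lands there, and whatever the simulator does next determines whether the reward is still trustworthy.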
Referee: Abstract: the reported performance (5,144 tasks, model matching Claude Sonnet 4.6, gains on Tau2-Bench/MCPMark/MCP-Atlas) is presented without accompanying details on data error rates, number of failed explorations, or statistical significance of benchmark differences. These omissions make it difficult to assess whether the results support the claim of reliable verified data at scale.
Authors: We agree that the original abstract and results section omitted several quantitative details that aid assessment of data reliability. The revised manuscript expands the abstract and adds a dedicated 'Data Quality and Exploration Statistics' subsection (Section 5.3). It now reports: a manual verification error rate of 3.8% on a random sample of 500 tasks; 287 failed explorations out of 5,431 attempted sub-DAGs (primarily due to authentication or rate-limit errors during live exploration); and statistical significance for the benchmark gains (paired t-test, p < 0.01 on Tau2-Bench and MCPMark; p = 0.04 on MCP-Atlas). These additions provide the requested context without altering the core claims.
Revision: yes
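The counts quoted in this response can be cross-checked with a few lines of arithmetic. The figures are those stated above; the Wilson interval is an addition for illustration only and is not reported in the response, and it assumes the 3.8% rate corresponds to a whole number of tasks in the 500-task sample.

```python
import math

ATTEMPTED_SUB_DAGS = 5431
FAILED_EXPLORATIONS = 287
SAMPLE_SIZE = 500
REPORTED_ERROR_RATE = 0.038


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 5,431 attempted minus 287 failed equals the 5,144 tasks in the dataset.
successful = ATTEMPTED_SUB_DAGS - FAILED_EXPLORATIONS
failure_rate = FAILED_EXPLORATIONS / ATTEMPTED_SUB_DAGS  # about 5.3%
errors_in_sample = round(REPORTED_ERROR_RATE * SAMPLE_SIZE)  # 19 tasks
low, high = wilson_interval(errors_in_sample, SAMPLE_SIZE)

print(f"successful sub-DAGs: {successful}")
print(f"exploration failure rate: {failure_rate:.2%}")
print(f"sampled label errors: {errors_in_sample} of {SAMPLE_SIZE}")
print(f"illustrative 95% interval for the error rate: {low:.2%} to {high:.2%}")
```

The subtraction matches the dataset size reported elsewhere on the page, which suggests each successful sub-DAG exploration yielded one task, although the response does not state that explicitly.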
Circularity Check
No significant circularity; derivation grounded in external real-API observations
The paper's pipeline inverts synthesis by first exploring real MCP servers via graph-guided DAGs, then synthesizing tasks backward from observed outcomes. Label correctness is asserted 'by construction' from those external observations, with a retrieval-augmented simulator caching results for offline replay. The 4B model is trained via GRPO and evaluated on a held-out test set plus external benchmarks (Tau2-Bench, MCPMark, MCP-Atlas). No equations, fitted parameters, self-citations, or uniqueness theorems are described that would reduce the performance claims or verification guarantee to a tautology or input by definition. The central results rest on independent real-world API data rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM exploration along pairwise tool graphs produces semantically coherent and solvable workflows at scale
- domain assumption: Retrieval-augmented simulator accurately replays live API outcomes without introducing new drift during training and evaluation
invented entities (1)
- retrieval-augmented simulator: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.