Pith Number

pith:77ABU4DH

pith:2026:77ABU4DH6AOOQG3MZ2K5VSA7MQ
not attested · not anchored · not stored · refs resolved

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

Benoit Dumoulin, Chen Luo, Dakuo Wang, Hanqing Lu, Hui Liu, Jing Huang, Jiri Gesi, Qi He, Xianfeng Tang, Yimeng Zhang, Yingzhou Lu, Yisi Sang, Yuxuan Lu, Zhenwei Dai, Ziyi Wang

FireFly inverts the data synthesis pipeline to generate verified tool-calling trajectories directly from real API explorations.

arxiv:2605.17558 v1 · 2026-05-17 · cs.SE · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{77ABU4DH6AOOQG3MZ2K5VSA7MQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

A 4B-parameter model trained with GRPO on FireFly matches Claude Sonnet 4.6 on our held-out test set and shows improvements on multiple tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.

C2 · weakest assumption

The assumption that backward synthesis from observed API outcomes produces tasks whose labels remain correct and useful when the same tasks are later executed in the retrieval-augmented simulator or on live APIs (abstract, paragraph describing the pipeline inversion).

C3 · one line summary

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from observed outcomes to train a 4B model that matches Claude Sonnet 4.6 on the held-out test set and improves on tool-calling benchmarks.
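
To make "pairwise tool graph" and "sub-DAG sampling" concrete, here is a minimal sketch of the general idea only. It is not the paper's implementation: the tool names, the edge meaning (tool A's output can feed tool B), the growth strategy, and the helper names `sample_sub_dag` and `execution_order` are all hypothetical.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical pairwise tool graph: an edge a -> b means a's output can feed b.
TOOL_GRAPH = {
    "search_flights": ["get_fare", "book_flight"],
    "get_fare": ["book_flight"],
    "book_flight": ["send_receipt"],
    "send_receipt": [],
}

def sample_sub_dag(graph, size, rng):
    """Grow a connected sub-DAG of at most `size` tools from a random start tool."""
    start = rng.choice(sorted(graph))
    chosen = {start}
    frontier = [start]
    while frontier and len(chosen) < size:
        node = frontier.pop(rng.randrange(len(frontier)))
        for nxt in graph[node]:
            if nxt not in chosen and len(chosen) < size:
                chosen.add(nxt)
                frontier.append(nxt)
    # Keep only the edges that connect two chosen tools.
    return {n: [m for m in graph[n] if m in chosen] for n in chosen}

def execution_order(sub_dag):
    """Return the tools in an order where every producer precedes its consumers."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    preds = {n: set() for n in sub_dag}
    for n, succs in sub_dag.items():
        for m in succs:
            preds[m].add(n)
    return list(TopologicalSorter(preds).static_order())

rng = random.Random(0)
sub = sample_sub_dag(TOOL_GRAPH, size=3, rng=rng)
print(execution_order(sub))
```

In a pipeline of the kind the summary describes, the sampled tools would be executed against real servers in this order, and a task would then be written backward from the recorded outcomes.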

References

25 extracted · 25 resolved · 10 Pith anchors

[1] MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers, May 2026
[2] $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment 2025 · arXiv:2506.07982
[3] Evaluating Large Language Models Trained on Code 2021 · arXiv:2107.03374
[4] Scaling Agent Learning via Experience Synthesis, November 2025
[5] API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs 2023 · arXiv:2304.08244

Formal links

1 machine-checked theorem link

Receipt and verification
First computed 2026-05-20T00:04:45.792098Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

ffc01a7067f01ce81b6cce95dac81f642fb04a3ef22b72379075eb9836c1f7d6

Aliases

arxiv: 2605.17558 · arxiv_version: 2605.17558v1 · doi: 10.48550/arxiv.2605.17558 · pith_short_12: 77ABU4DH6AOO · pith_short_16: 77ABU4DH6AOOQG3M · pith_short_8: 77ABU4DH
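
In this record, the 26-character Pith Number matches the unpadded Base32 (RFC 4648) encoding of the first 16 bytes of the canonical hash, and the short aliases are its 8-, 12-, and 16-character prefixes. The sketch below checks that relationship. The rule is inferred from the values on this page, not from a published specification, and `pith_from_hash` is a hypothetical helper name.

```python
import base64

CANONICAL_HASH = "ffc01a7067f01ce81b6cce95dac81f642fb04a3ef22b72379075eb9836c1f7d6"

def pith_from_hash(hex_digest: str) -> str:
    # Assumption: Base32 of the first 16 bytes of the SHA-256 digest, padding stripped.
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")

pith = pith_from_hash(CANONICAL_HASH)
print(pith)  # 77ABU4DH6AOOQG3MZ2K5VSA7MQ
for n in (8, 12, 16):
    print(f"pith_short_{n}: {pith[:n]}")
```

If the rule holds in general, any alias can be recomputed from the canonical hash alone, without a lookup service.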
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/77ABU4DH6AOOQG3MZ2K5VSA7MQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ffc01a7067f01ce81b6cce95dac81f642fb04a3ef22b72379075eb9836c1f7d6
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "592a532a7cc2094cb4cf0ef1e6f7f2da87901236124d12dad3c23475b383bb3d",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2026-05-17T17:38:17Z",
    "title_canon_sha256": "5605cc685a86d0ae2548fada9dbc1ca5b84d98ffe19dffd9dfa6b056758910b6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.17558",
    "kind": "arxiv",
    "version": 1
  }
}
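
The check under "Verify this Pith Number yourself" can also be run offline against the record printed above. This sketch applies the same canonicalization as that command (sorted keys, compact separators, no ASCII escaping, UTF-8) and compares the digest with the published canonical hash. It assumes the JSON shown on this page is exactly the `canonical_record` the API returns, which is not verified here. `canonical_sha256` is a hypothetical helper name.

```python
import hashlib
import json

EXPECTED = "ffc01a7067f01ce81b6cce95dac81f642fb04a3ef22b72379075eb9836c1f7d6"

RECORD = {
    "metadata": {
        "abstract_canon_sha256": "592a532a7cc2094cb4cf0ef1e6f7f2da87901236124d12dad3c23475b383bb3d",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.SE",
        "submitted_at": "2026-05-17T17:38:17Z",
        "title_canon_sha256": "5605cc685a86d0ae2548fada9dbc1ca5b84d98ffe19dffd9dfa6b056758910b6",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.17558", "kind": "arxiv", "version": 1},
}

def canonical_sha256(record: dict) -> str:
    # Same serialization settings as the shell one-liner on this page.
    canon = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or transmits the fields.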