pith. machine review for the scientific record. sign in

arxiv: 2605.11706 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

GRAFT: Graph-Tokenized LLMs for Tool Planning

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords tool planningLLMgraph tokenizationdependency learningon-policy distillationsequence matchingtool graphs
0
0 comments X

The pith

Mapping each tool to a special token inside the LLM lets the model learn dependency graphs directly in its representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing approaches feed tool graphs to LLMs through retrieval or prompts, but this external matching often produces sequences that violate the graph's directed edges and compounds mistakes from any early wrong choice. GRAFT instead assigns every tool a dedicated special token and trains the model to capture the directed dependencies among those tokens inside its own representation space. It adds on-policy distillation that lets the model learn from trajectories it generates itself, supplying step-by-step signals aligned with the underlying subtask structure. Experiments report gains in exact sequence match and dependency legality over prior methods.

Core claim

GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space; it further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals.

What carries the argument

Dedicated special tokens, one per tool node, whose embeddings are trained to encode the directed dependency relations of the tool graph.

If this is right

  • Exact sequence matching accuracy rises because the model no longer relies on post-hoc retrieval to enforce constraints.
  • Dependency legality improves since violations are penalized directly in the token embedding space rather than corrected after the fact.
  • Error accumulation is reduced because an early token choice already reflects the learned graph structure for subsequent steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenization pattern could be tested on other graph-structured planning domains such as workflow orchestration or chemical reaction sequences.
  • Scaling the number of special tokens to thousands of tools would test whether the representation-space approach remains more efficient than retrieval as graph size grows.
  • Combining the internal tokens with lightweight external verification might further raise legality without losing the speed of fully internalized planning.

Load-bearing premise

Embedding tool dependencies as relations among special tokens inside the model will keep generated plans inside valid graph states even after an early choice.

What would settle it

Run GRAFT and a strong external-matching baseline on the same held-out tool graphs and count the fraction of generated plans that violate at least one directed dependency; if the fractions are statistically indistinguishable, the internalization benefit is not supported.

Figures

Figures reproduced from arXiv: 2605.11706 by Hongzhi Yin, Junliang Yu, Quoc Viet Hung Nguyen, Tong Chen, Xinyi Gao, Xinyu Ren.

Figure 1
Figure 1. Figure 1: Existing methods use the tool graph externally: they retrieve tools [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Efficiency and teacher prompt analysis. (a) compares total token cost and average inference [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Hallucination analysis and hyperparameter sensitivity analysis of different methods using [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The distribution of tool sequence length for different datasets. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative examples from the experimental datasets. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
read the original abstract

Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces GRAFT, a framework that internalizes tool graphs into LLMs by mapping each tool to a dedicated special token and learning directed dependencies directly in the representation space, combined with on-policy tool context distillation on self-sampled trajectories. It claims this avoids the alignment failures and error accumulation of external retrieval/serialization approaches, achieving state-of-the-art results on exact sequence matching and dependency legality across tool-planning benchmarks, supported by ablations isolating the tokenization and distillation components.

Significance. If the reported gains hold under the provided experimental controls, the work offers a concrete mechanism for embedding graph constraints inside the LLM rather than relying on post-hoc matching, which could improve reliability for multi-step tool workflows. The use of on-policy distillation and explicit ablations on tokenization versus distillation stages are positive features that allow direct assessment of the proposed internalization hypothesis.

major comments (2)
  1. [§3.2] §3.2: The on-policy distillation loss is defined using stepwise planning signals from sampled trajectories, but the manuscript does not specify how the teacher distribution is constructed when the sampled plan enters an invalid state; this detail is load-bearing for the claim that distillation prevents drift relative to standard supervised fine-tuning.
  2. [§5.1] §5.1, Table 1: The exact-match and legality improvements are reported against retrieval and serialization baselines, yet the paper does not include a control that keeps vocabulary size fixed while removing the directed-dependency learning objective; without this, it is unclear whether the gains are attributable to the graph internalization or simply to the addition of special tokens.
minor comments (3)
  1. [Abstract] Abstract: The abstract asserts SOTA performance without any numerical values, dataset names, or baseline identifiers; adding one or two key numbers would improve readability.
  2. [§4.3] §4.3: The description of how tool nodes are mapped to special tokens does not state whether these tokens are initialized from existing embeddings or randomly; this should be clarified for reproducibility.
  3. [Figure 2] Figure 2: The architecture diagram would benefit from explicit arrows indicating the flow of the on-policy distillation loss back to the representation space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive major comments. We address each point below with clarifications and targeted revisions to the manuscript. Both comments can be resolved through additions to the text without requiring new experiments.

read point-by-point responses
  1. Referee: §3.2: The on-policy distillation loss is defined using stepwise planning signals from sampled trajectories, but the manuscript does not specify how the teacher distribution is constructed when the sampled plan enters an invalid state; this detail is load-bearing for the claim that distillation prevents drift relative to standard supervised fine-tuning.

    Authors: We agree this detail should be explicit for reproducibility. The original manuscript omitted the precise procedure for invalid states. In the revision, we will add to §3.2 that when a sampled trajectory reaches an invalid state, the teacher distribution is constructed from the last valid prefix by masking all actions that violate the internalized graph constraints (using the special token representations to enforce directed dependencies). Distillation then proceeds only over the legal next-token distribution at that prefix. This mechanism directly supports the claim that on-policy distillation reduces drift compared to standard SFT. We will include a short algorithmic description and update the loss equation accordingly. revision: yes

  2. Referee: §5.1, Table 1: The exact-match and legality improvements are reported against retrieval and serialization baselines, yet the paper does not include a control that keeps vocabulary size fixed while removing the directed-dependency learning objective; without this, it is unclear whether the gains are attributable to the graph internalization or simply to the addition of special tokens.

    Authors: This concern about isolating the source of gains is reasonable. Our §5.2 ablations already separate the effects of graph tokenization (which includes learning directed dependencies in representation space) from distillation, showing additive benefits. However, we acknowledge that an explicit control with identical vocabulary size but no dependency objective would further clarify attribution. Such a control would amount to adding unused special tokens without graph structure, which we view as less informative than the existing comparisons to non-tokenized baselines. In the revision we will add a paragraph in §5.1 explaining that the special tokens are not generic vocabulary additions but are tied to the graph internalization objective, and we will reference the ablation results to support that the observed improvements stem from dependency learning rather than token count alone. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces GRAFT as an architectural framework that maps tools to special tokens and applies on-policy distillation, with all components defined independently of the reported performance metrics. The abstract and method description present the approach as a direct response to limitations in external graph-matching paradigms, without any equations or steps that reduce the claimed improvements to quantities fitted from the target results. Experimental claims rest on benchmark comparisons and ablations rather than self-referential definitions or self-citation chains that would force the outcomes by construction. This is a standard self-contained proposal with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the unproven premise that internal token-level learning of directed dependencies will outperform external graph matching and that on-policy distillation will mitigate error accumulation; these are introduced without independent evidence or prior validation in the abstract.

invented entities (1)
  • dedicated special tool tokens no independent evidence
    purpose: to internalize the tool graph directly into the LLM representation space
    Each tool node is mapped to a special token so that directed dependencies can be learned within the model's embeddings and attention patterns.

pith-pipeline@v0.9.0 · 5532 in / 1245 out tokens · 61024 ms · 2026-05-13T07:17:27.192066+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

  1. [1]

    Llm with tools: A survey.arXiv preprint arXiv:2409.18807,

    Zhuocheng Shen. Llm with tools: A survey.arXiv preprint arXiv:2409.18807, 2024

  2. [2]

    Llm-based agents for tool learning: A survey: W

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey: W. xu et al.Data Science and Engineering, pages 1–31, 2025

  3. [3]

    Can graph learning improve planning in llm-based agents?Advances in Neural Information Processing Systems, 37:5338–5383, 2024

    Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, et al. Can graph learning improve planning in llm-based agents?Advances in Neural Information Processing Systems, 37:5338–5383, 2024

  4. [4]

    ToolNet: Connecting large language models with massive tools via tool graph,

    Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. Toolnet: Connecting large language models with massive tools via tool graph.arXiv preprint arXiv:2403.00839, 2024

  5. [5]

    Controlllm: Augment language models with tools by searching on graphs

    Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, and Wenhai Wang. Controlllm: Augment language models with tools by searching on graphs. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 89–105, Cham...

  6. [6]

    Cohn, and Janet B

    Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, and Janet B. Pierrehumbert. Graph-enhanced large language models in asynchronous plan reasoning. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Co...

  7. [7]

    GTool: Graph enhanced tool planning with large language model

    Wenjie Chen, Di Yao, Wenbin Li, Xuying Meng, Chang Gong, and Jingping Bi. GTool: Graph enhanced tool planning with large language model. InThe Fourteenth International Conference on Learning Representations, 2026

  8. [8]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InAdvances in Neural Information Processing Systems, 2015

  9. [9]

    Sequence level training with recurrent neural networks

    Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. InInternational Conference on Learning Representations, 2016

  10. [10]

    Professor forcing: A new algorithm for training recurrent networks

    Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. InAdvances in Neural Information Processing Systems, 2016

  11. [11]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self- distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  12. [12]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

  13. [13]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  14. [14]

    Taskbench: Benchmarking large language models for task automation.Advances in Neural Information Processing Systems, 37:4540–4574, 2024

    Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. Taskbench: Benchmarking large language models for task automation.Advances in Neural Information Processing Systems, 37:4540–4574, 2024

  15. [15]

    Toolllm: Facilitating large language models to master 16000+ real-world apis

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. InThe twelfth international conference on learning representations, 2023

  16. [16]

    Gpt4tools: Teaching large lan- guage model to use tools via self-instruction.arXiv preprint arXiv:2305.18752, 2023

    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction.arXiv preprint arXiv:2305.18752, 2023

  17. [17]

    Toolgen: Unified tool re- trieval and calling via generation

    Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. Toolgen: Unified tool re- trieval and calling via generation. InThe Thirteenth International Conference on Learning Representations, 2025

  18. [18]

    arXiv preprint arXiv:2404.11584 , year=

    Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey.arXiv preprint arXiv:2404.11584, 2024. 10

  19. [19]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  20. [20]

    Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

    Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.arXiv preprint arXiv:2509.08755, 2025

  21. [21]

    In-the-flow agentic system optimization for effective planning and tool use.arXiv preprint arXiv:2510.05592, 2025

    Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use.arXiv preprint arXiv:2510.05592, 2025

  22. [22]

    Gordon, and J

    Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011

  23. [23]

    Williams

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 8:229–256, 1992

  24. [24]

    Jordan, and Pieter Abbeel

    John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. InProceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015

  25. [25]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  26. [26]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

  27. [27]

    A survey of graph meets large language model: Progress and future directions.arXiv preprint arXiv:2311.12399, 2023

    Yuhan Li, Zhixun Li, Peisong Wang, Jia Li, Xiangguo Sun, Hong Cheng, and Jeffrey Xu Yu. A survey of graph meets large language model: Progress and future directions.arXiv preprint arXiv:2311.12399, 2023

  28. [28]

    Toward graph-tokenizing large language models with reconstructive graph instruction tuning

    Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, and Chuan Shi. Toward graph-tokenizing large language models with reconstructive graph instruction tuning. InProceedings of the ACM Web Conference 2026, pages 430–441, 2026

  29. [29]

    Large language models on graphs: A comprehensive survey.IEEE Transactions on Knowledge and Data Engineering, 36(12):8622–8642, 2024

    Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. Large language models on graphs: A comprehensive survey.IEEE Transactions on Knowledge and Data Engineering, 36(12):8622–8642, 2024

  30. [30]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InThe twelfth international conference on learning representations, 2024

  31. [31]

    Planning, creation, usage: Benchmarking llms for comprehensive tool utilization in real-world complex scenarios

    Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, et al. Planning, creation, usage: Benchmarking llms for comprehensive tool utilization in real-world complex scenarios. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4363–4400, 2024

  32. [32]

    Llms in the imaginarium: tool learning through simulated trial and error

    Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, and Yu Su. Llms in the imaginarium: tool learning through simulated trial and error. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10583–10604, 2024

  33. [33]

    Toolink: Linking toolkit creation and using through chain-of-solving on open-source model

    Cheng Qian, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Toolink: Linking toolkit creation and using through chain-of-solving on open-source model. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 831–854, 2024

  34. [34]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  35. [35]

    Llama 3.2 3b instruct model card

    Meta AI. Llama 3.2 3b instruct model card. https://huggingface.co/meta-llama/Llama-3. 2-3B-Instruct, 2024. Accessed: 2026-05-07

  36. [36]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  37. [37]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021

  38. [38]

    Max Len

    Anonymous. Graphtoolbench: Benchmarking llms for sequential graph comprehension and conflict identification in tool learning, 2026. OpenReview. 11 Appendix Contents A Algorithm 13 B Datasets 13 C Evaluation Metrics 14 D Additional Results 15 E Prompts 15 12 A Algorithm Algorithm 1 outlines the optimization pipeline of GRAFT. The first stage constructs gra...