Recognition: no theorem link
GRAFT: Graph-Tokenized LLMs for Tool Planning
Pith reviewed 2026-05-13 07:17 UTC · model grok-4.3
The pith
Mapping each tool to a special token inside the LLM lets the model learn dependency graphs directly in its representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space; it further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals.
What carries the argument
Dedicated special tokens, one per tool node, whose embeddings are trained to encode the directed dependency relations of the tool graph.
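The review does not say how the directed relations are parameterized over the tool-token embeddings. One minimal realization, assuming an asymmetric bilinear score so that score(u → v) and score(v → u) can differ (all names and shapes here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: each tool gets a dedicated embedding row, and a directed edge
# u -> v is scored with an asymmetric bilinear form. Because `rel` is not
# symmetric, score(u, v) != score(v, u) in general, which is what lets
# the representation space encode *directed* dependencies.
num_tools, dim = 4, 8
tool_emb = rng.normal(size=(num_tools, dim))   # one row per tool token
rel = rng.normal(size=(dim, dim))              # asymmetric relation matrix

def edge_score(u: int, v: int) -> float:
    """Score for the directed dependency u -> v."""
    return float(tool_emb[u] @ rel @ tool_emb[v])

# Directionality check: swapping arguments changes the score.
assert abs(edge_score(0, 1) - edge_score(1, 0)) > 1e-9
```

In a trained model, `tool_emb` would be the learned special-token embeddings and `rel` a learned parameter; here both are random draws just to show the asymmetry.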
If this is right
- Exact sequence matching accuracy rises because the model no longer relies on post-hoc retrieval to enforce constraints.
- Dependency legality improves since violations are penalized directly in the token embedding space rather than corrected after the fact.
- Error accumulation is reduced because an early token choice already reflects the learned graph structure for subsequent steps.
Where Pith is reading between the lines
- The same tokenization pattern could be tested on other graph-structured planning domains such as workflow orchestration or chemical reaction sequences.
- Scaling the number of special tokens to thousands of tools would test whether the representation-space approach remains more efficient than retrieval as graph size grows.
- Combining the internal tokens with lightweight external verification might further raise legality without losing the speed of fully internalized planning.
Load-bearing premise
Embedding tool dependencies as relations among special tokens inside the model will keep generated plans inside valid graph states even after an early choice.
What would settle it
Run GRAFT and a strong external-matching baseline on the same held-out tool graphs and count the fraction of generated plans that violate at least one directed dependency; if the fractions are statistically indistinguishable, the internalization benefit is not supported.
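The proposed measurement is straightforward to script: scan each generated plan and flag it if any tool appears before one of its prerequisites. The dependency map and plans below are toy stand-ins, not the paper's benchmarks:

```python
# Hypothetical legality check: `deps` maps each tool to the set of tools
# that must appear earlier in the plan (names are illustrative).
deps = {
    "train": {"load_data"},
    "evaluate": {"train"},
    "report": {"evaluate"},
}

def violates(plan: list[str]) -> bool:
    """True if the plan breaks at least one directed dependency."""
    seen: set[str] = set()
    for tool in plan:
        if not deps.get(tool, set()) <= seen:  # a prerequisite is missing
            return True
        seen.add(tool)
    return False

plans = [
    ["load_data", "train", "evaluate", "report"],  # legal
    ["train", "load_data", "evaluate", "report"],  # 'train' before 'load_data'
]
violation_rate = sum(violates(p) for p in plans) / len(plans)
print(violation_rate)  # 0.5
```

Running this over held-out plans from GRAFT and from the external-matching baseline, then comparing the two violation rates, is exactly the experiment that would settle the claim.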
Original abstract
Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GRAFT, a framework that internalizes tool graphs into LLMs by mapping each tool to a dedicated special token and learning directed dependencies directly in the representation space, combined with on-policy tool context distillation on self-sampled trajectories. It claims this avoids the alignment failures and error accumulation of external retrieval/serialization approaches, achieving state-of-the-art results on exact sequence matching and dependency legality across tool-planning benchmarks, supported by ablations isolating the tokenization and distillation components.
Significance. If the reported gains hold under the provided experimental controls, the work offers a concrete mechanism for embedding graph constraints inside the LLM rather than relying on post-hoc matching, which could improve reliability for multi-step tool workflows. The use of on-policy distillation and explicit ablations on tokenization versus distillation stages are positive features that allow direct assessment of the proposed internalization hypothesis.
major comments (2)
- [§3.2] The on-policy distillation loss is defined using stepwise planning signals from sampled trajectories, but the manuscript does not specify how the teacher distribution is constructed when the sampled plan enters an invalid state; this detail is load-bearing for the claim that distillation prevents drift relative to standard supervised fine-tuning.
- [§5.1, Table 1] The exact-match and legality improvements are reported against retrieval and serialization baselines, yet the paper does not include a control that keeps vocabulary size fixed while removing the directed-dependency learning objective; without this, it is unclear whether the gains are attributable to the graph internalization or simply to the addition of special tokens.
minor comments (3)
- [Abstract] The abstract asserts SOTA performance without any numerical values, dataset names, or baseline identifiers; adding one or two key numbers would improve readability.
- [§4.3] The description of how tool nodes are mapped to special tokens does not state whether these tokens are initialized from existing embeddings or randomly; this should be clarified for reproducibility.
- [Figure 2] The architecture diagram would benefit from explicit arrows indicating the flow of the on-policy distillation loss back to the representation space.
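On the §4.3 initialization question, one common option the revision could adopt (an assumption on our part, not something the paper confirms) is to initialize each new tool token from the mean embedding of the subtokens in the tool's name rather than from a random draw. A toy sketch with a stand-in vocabulary:

```python
import numpy as np

# Illustrative only: `base_vocab` and `base_emb` stand in for a real
# tokenizer's vocabulary and its pretrained embedding matrix.
rng = np.random.default_rng(1)
base_vocab = {"image": 0, "classify": 1, "search": 2}
base_emb = rng.normal(size=(len(base_vocab), 16))

def init_tool_embedding(tool_name: str) -> np.ndarray:
    """Mean-of-subtoken initialization for a new tool token."""
    ids = [base_vocab[w] for w in tool_name.split("_") if w in base_vocab]
    if not ids:                      # fall back to a random init
        return rng.normal(size=base_emb.shape[1])
    return base_emb[ids].mean(axis=0)

vec = init_tool_embedding("image_classify")
assert np.allclose(vec, (base_emb[0] + base_emb[1]) / 2)
```

Mean initialization typically keeps the new tokens inside the pretrained embedding distribution, which can stabilize early fine-tuning; either way, the choice should be stated in the paper.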
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the constructive major comments. We address each point below with clarifications and targeted revisions to the manuscript. Both comments can be resolved through additions to the text without requiring new experiments.
Point-by-point responses
-
Referee: §3.2: The on-policy distillation loss is defined using stepwise planning signals from sampled trajectories, but the manuscript does not specify how the teacher distribution is constructed when the sampled plan enters an invalid state; this detail is load-bearing for the claim that distillation prevents drift relative to standard supervised fine-tuning.
Authors: We agree this detail should be explicit for reproducibility. The original manuscript omitted the precise procedure for invalid states. In the revision, we will add to §3.2 that when a sampled trajectory reaches an invalid state, the teacher distribution is constructed from the last valid prefix by masking all actions that violate the internalized graph constraints (using the special token representations to enforce directed dependencies). Distillation then proceeds only over the legal next-token distribution at that prefix. This mechanism directly supports the claim that on-policy distillation reduces drift compared to standard SFT. We will include a short algorithmic description and update the loss equation accordingly. revision: yes
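The masking rule the authors describe can be sketched as follows. Tool names, the dependency map, and the raw logits are illustrative, and the paper's actual teacher construction may differ; this assumes at least one legal next tool exists at the prefix:

```python
import numpy as np

# Sketch: at the last valid prefix, mask every tool whose prerequisites
# are not yet satisfied, then renormalize over the legal tools to obtain
# the teacher distribution used for distillation.
deps = {"train": {"load_data"}, "evaluate": {"train"}}
tools = ["load_data", "train", "evaluate"]

def teacher_distribution(logits: np.ndarray, executed: set[str]) -> np.ndarray:
    masked = logits.copy()
    for i, tool in enumerate(tools):
        if not deps.get(tool, set()) <= executed:  # illegal next step
            masked[i] = -np.inf
    # Softmax over the surviving (legal) logits only.
    z = np.exp(masked - masked[masked > -np.inf].max())
    return z / z.sum()

# With nothing executed yet, only 'load_data' is legal.
p = teacher_distribution(np.array([0.2, 1.5, 0.3]), executed=set())
print(p)  # [1. 0. 0.]
```

Distilling against `p` rather than the unmasked softmax is what confines the student to the legal next-token distribution at that prefix.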
-
Referee: §5.1, Table 1: The exact-match and legality improvements are reported against retrieval and serialization baselines, yet the paper does not include a control that keeps vocabulary size fixed while removing the directed-dependency learning objective; without this, it is unclear whether the gains are attributable to the graph internalization or simply to the addition of special tokens.
Authors: This concern about isolating the source of gains is reasonable. Our §5.2 ablations already separate the effects of graph tokenization (which includes learning directed dependencies in representation space) from distillation, showing additive benefits. However, we acknowledge that an explicit control with identical vocabulary size but no dependency objective would further clarify attribution. Such a control would amount to adding unused special tokens without graph structure, which we view as less informative than the existing comparisons to non-tokenized baselines. In the revision we will add a paragraph in §5.1 explaining that the special tokens are not generic vocabulary additions but are tied to the graph internalization objective, and we will reference the ablation results to support that the observed improvements stem from dependency learning rather than token count alone. revision: partial
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces GRAFT as an architectural framework that maps tools to special tokens and applies on-policy distillation, with all components defined independently of the reported performance metrics. The abstract and method description present the approach as a direct response to limitations in external graph-matching paradigms, without any equations or steps that reduce the claimed improvements to quantities fitted from the target results. Experimental claims rest on benchmark comparisons and ablations rather than self-referential definitions or self-citation chains that would force the outcomes by construction. This is a standard self-contained proposal with no load-bearing reductions to inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
- dedicated special tool tokens: no independent evidence
Reference graph
Works this paper leans on
- [1] Zhuocheng Shen. LLM with tools: A survey. arXiv preprint arXiv:2409.18807, 2024.
- [2] Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. LLM-based agents for tool learning: A survey. Data Science and Engineering, pages 1–31, 2025.
- [3] Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, et al. Can graph learning improve planning in LLM-based agents? Advances in Neural Information Processing Systems, 37:5338–5383, 2024.
- [4] Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. ToolNet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839, 2024.
- [5] Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, and Wenhai Wang. ControlLLM: Augment language models with tools by searching on graphs. In Computer Vision – ECCV 2024, pages 89–105, 2024.
- [6] Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, and Janet B. Pierrehumbert. Graph-enhanced large language models in asynchronous plan reasoning. In Proceedings of the 41st International Conference on Machine Learning, 2024.
- [7] Wenjie Chen, Di Yao, Wenbin Li, Xuying Meng, Chang Gong, and Jingping Bi. GTool: Graph enhanced tool planning with large language model. In The Fourteenth International Conference on Learning Representations, 2026.
- [8] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 2015.
- [9] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In International Conference on Learning Representations, 2016.
- [10] Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, 2016.
- [11] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
- [12] Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026.
- [13] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.
- [14] Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. TaskBench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37:4540–4574, 2024.
- [15] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations, 2023.
- [16] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. GPT4Tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752, 2023.
- [17] Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. ToolGen: Unified tool retrieval and calling via generation. In The Thirteenth International Conference on Learning Representations, 2025.
- [18] Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging AI agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584, 2024.
- [19] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- [20] Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning. arXiv preprint arXiv:2509.08755, 2025.
- [21] Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use. arXiv preprint arXiv:2510.05592, 2025.
- [22] Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
- [23]
- [24] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.
- [25] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [26] Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026.
- [27] Yuhan Li, Zhixun Li, Peisong Wang, Jia Li, Xiangguo Sun, Hong Cheng, and Jeffrey Xu Yu. A survey of graph meets large language model: Progress and future directions. arXiv preprint arXiv:2311.12399, 2023.
- [28] Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, and Chuan Shi. Toward graph-tokenizing large language models with reconstructive graph instruction tuning. In Proceedings of the ACM Web Conference 2026, pages 430–441, 2026.
- [29] Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. Large language models on graphs: A comprehensive survey. IEEE Transactions on Knowledge and Data Engineering, 36(12):8622–8642, 2024.
- [30] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.
- [31] Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, et al. Planning, creation, usage: Benchmarking LLMs for comprehensive tool utilization in real-world complex scenarios. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4363–4400, 2024.
- [32] Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, and Yu Su. LLMs in the imaginarium: Tool learning through simulated trial and error. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10583–10604, 2024.
- [33] Cheng Qian, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Toolink: Linking toolkit creation and using through chain-of-solving on open-source model. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 831–854, 2024.
- [34] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [35] Meta AI. Llama 3.2 3B Instruct model card. https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct, 2024. Accessed: 2026-05-07.
- [36] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [37] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [38] Anonymous. GraphToolBench: Benchmarking LLMs for sequential graph comprehension and conflict identification in tool learning. OpenReview, 2026.