Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3
The pith
Small language models can internalize a fixed tool catalog through QLoRA fine-tuning, enabling structured planning without any tool descriptions in the inference prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that QLoRA fine-tuning on tool-use data lets small models achieve better planning under description-free inference than baselines supplied with full tool schemas. On AssetOpsBench the strongest Gemma model records an AT-F1 of 0.65 and an overall judge score of 3.88 against the informed baseline values of 0.47 and 2.88. Qwen3-4B delivers a judge score of 3.78 while using 62 percent less memory and running 2.5 times faster than Gemma. Ablation experiments indicate that LoRA rank trades planning quality against retention of general capabilities, with rank 32 giving the highest planning scores.
What carries the argument
QLoRA fine-tuning on a mixture of tool-knowledge statements, question-to-plan pairs, and execution traces, which embeds the tool catalog directly into model parameters for later description-free use.
If this is right
- Description-free inference becomes feasible for tool-planning tasks in agentic systems.
- Prompt token counts drop by 82.6 percent compared with supplying the full tool catalog at every step.
- Planning scores measured by both automatic metrics and LLM judges exceed those of informed but unfine-tuned baselines.
- LoRA rank can be chosen to favor either planning quality or preservation of general knowledge.
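The token saving in the second bullet is a relative reduction in prompt length. As a minimal sketch of that arithmetic, the helper below computes the percentage saved when the tool catalog is dropped from the prompt. The function name and the token counts are hypothetical and chosen only for illustration; this page does not report absolute prompt lengths.

```python
def prompt_reduction_pct(full_tokens: int, description_free_tokens: int) -> float:
    """Percent of prompt tokens saved relative to the full-catalog prompt."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return 100.0 * (full_tokens - description_free_tokens) / full_tokens


# Hypothetical counts, NOT measurements from the paper.
saved = prompt_reduction_pct(full_tokens=1000, description_free_tokens=174)
print(f"{saved:.1f}% fewer prompt tokens")
```

Because the saving applies to every planning call, the absolute benefit grows with the number of steps an agent takes, not just with the size of the catalog.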
Where Pith is reading between the lines
- Models trained this way would require retraining or continued adaptation if the underlying tool set changes after deployment.
- The same internalization approach could be applied to other fixed knowledge domains to shrink context length in specialized agents.
- On-device or low-memory deployments might become practical once tool knowledge no longer occupies prompt space.
Load-bearing premise
The complete set of tools is fixed and fully known during training so that the model can absorb that exact catalog into its weights.
What would settle it
Measuring whether planning quality collapses when the inference prompt introduces a new tool absent from the training catalog.
Original abstract
Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5× faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality-retention trade-off, with r = 32 maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that QLoRA fine-tuning on ~1,700 tool-use examples from AssetOpsBench allows small models (Gemma 4 E4B and Qwen3-4B) to internalize tool knowledge, enabling structured planning without tool descriptions in the prompt. The fine-tuned models outperform an informed unfine-tuned baseline under description-free inference (best Gemma: AT-F1 0.65 and judge score 3.88 vs. baseline 0.47 and 2.88), while cutting input length by 82.6%. Ablations show LoRA rank trades off planning quality against retention of general capabilities, with the work scoped to fixed tool catalogs.
Significance. If the empirical results hold under fuller controls, the work shows a viable path to reduce token overhead in agentic tool-use pipelines for closed, fixed tool sets by moving catalog knowledge into weights. The concrete metrics, LoRA-rank ablation, and explicit scoping to fixed catalogs are strengths. Significance is limited by the absence of tests for novel or changing tool sets and by missing reproducibility details.
major comments (2)
- [Evaluation and baseline description] The comparison to the informed unfine-tuned baseline is central to the claim of internalization, yet the manuscript provides insufficient detail on baseline construction, including exactly which tool schemas and usage patterns are supplied to it and whether they match the training distribution exactly.
- [Experimental setup] No information is given on train/test splits, cross-validation, or checks for data contamination between the ~1,700 fine-tuning examples and the evaluation set. This is load-bearing because performance under description-free inference could reflect memorization of the specific closed catalog rather than robust internalization.
minor comments (2)
- Clarify the precise model identifier 'Gemma 4 E4B' (likely a variant of Gemma-2-4B or similar) and report the exact parameter counts and quantization settings used.
- [Results] The abstract and results would benefit from reporting variance or multiple random seeds for the AT-F1 and judge scores rather than single-point estimates.
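The last minor comment asks for spread across random seeds instead of single-point estimates. A minimal sketch of such a summary, using only the Python standard library, is given below. The per-seed values are placeholders invented for illustration and are not results from the paper; the `summarize` helper is likewise hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder per-seed scores, NOT results from the paper.
placeholder_scores = [0.2, 0.4, 0.6]
mean, std = summarize(placeholder_scores)
print(f"score = {mean:.2f} +/- {std:.2f} over {len(placeholder_scores)} seeds")
```

Reporting the sample standard deviation alongside the mean makes it possible to judge whether a gap such as 0.65 versus 0.47 is large relative to run-to-run noise.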
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight areas where additional experimental details will strengthen the manuscript. We address each major comment below and have revised the paper accordingly.
- Referee: [Evaluation and baseline description] The comparison to the informed unfine-tuned baseline is central to the claim of internalization, yet the manuscript provides insufficient detail on baseline construction, including exactly which tool schemas and usage patterns are supplied to it and whether they match the training distribution exactly.
  Authors: We agree that the baseline description requires more precision. In the revised manuscript we have expanded the 'Experimental Setup' and 'Baseline' subsections to specify that the informed unfine-tuned baseline is given the identical AssetOpsBench tool catalog, including full schemas and the same usage patterns that appear in the fine-tuning examples. This ensures the baseline receives exactly the information present in the training distribution but without any parameter updates.
  Revision: yes
- Referee: [Experimental setup] No information is given on train/test splits, cross-validation, or checks for data contamination between the ~1,700 fine-tuning examples and the evaluation set. This is load-bearing because performance under description-free inference could reflect memorization of the specific closed catalog rather than robust internalization.
  Authors: We acknowledge the omission. The revised manuscript now includes a new 'Data Preparation' paragraph stating that the approximately 1,700 examples were partitioned into an 80/20 train/test split with no overlap between sets. We added explicit checks confirming that evaluation prompts contain no verbatim or near-verbatim overlap with training instances. While the work is scoped to fixed catalogs, these details address the concern that gains might stem solely from memorization.
  Revision: yes
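The rebuttal describes an 80/20 split with checks for verbatim and near-verbatim overlap but does not say how those checks were run. One plausible way to do it is sketched below, assuming word-trigram Jaccard similarity as the near-duplicate signal and a threshold of 0.8. Both of those choices, and all function names, are assumptions made for illustration and are not the authors' procedure.

```python
import random


def shingles(text, n=3):
    """Set of word n-grams used as a cheap near-duplicate fingerprint."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle deterministically and split into (train, test)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def find_leaks(train, test, threshold=0.8):
    """Return (test_item, train_item, score) triples at or above the threshold."""
    train_fingerprints = [(t, shingles(t)) for t in train]
    leaks = []
    for query in test:
        query_fp = shingles(query)
        for train_item, train_fp in train_fingerprints:
            score = jaccard(query_fp, train_fp)
            if score >= threshold:
                leaks.append((query, train_item, score))
    return leaks
```

Calling `find_leaks(train, test)` on the two halves returned by `split_examples` yields an empty list when no evaluation prompt is a near-copy of a training prompt. Paraphrased questions can still slip under a lexical threshold, so a check like this bounds verbatim leakage and does not rule out memorization of plan templates.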
Circularity Check
No circularity: purely empirical fine-tuning study with benchmark-driven results
The paper is an empirical ML study that fine-tunes Gemma and Qwen models on ~1700 AssetOpsBench examples and reports direct performance metrics (AT-F1, judge scores) under description-free inference. No equations, derivations, or first-principles predictions appear in the provided text or abstract. Claims rest on experimental comparisons against an informed baseline, with explicit statements about the fixed tool catalog; these are falsifiable via benchmarks rather than reducing to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central result (fine-tuned models outperforming baseline while shortening prompts) is measured, not derived by construction from its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank r
- Training set size
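To make the first free parameter concrete: a LoRA adapter on a weight matrix of shape d_out × d_in adds two low-rank factors, A (r × d_in) and B (d_out × r), so the trainable parameter count grows linearly with the rank r. The sketch below shows that scaling. The layer width and the ranks other than 32 are hypothetical and are not taken from Gemma, Qwen3, or the paper's ablation grid.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical square projection; the width is an illustrative assumption.
HIDDEN = 2048
for r in (8, 16, 32):
    added = lora_trainable_params(HIDDEN, HIDDEN, r)
    full = HIDDEN * HIDDEN
    print(f"r={r:>2}: {added:,} adapter params ({100 * added / full:.2f}% of the full matrix)")
```

Doubling the rank doubles the adapter's capacity per adapted layer, which is consistent with the reported trade-off: more room to absorb the tool catalog, and more room to drift from the base model's general knowledge.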
axioms (1)
- Domain assumption: A fixed tool catalog can be internalized into model weights via supervised fine-tuning on planning traces.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We formalize description-free tool planning for a fixed tool catalog... the tool catalog is assumed to be fixed during evaluation. We do not attempt to generalize to unseen tools."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "QLoRA fine-tuning on ~1,700 examples enables ~4B parameter models to produce correct MCP tool-use plans without tool descriptions in the prompt"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix
Table IV. Three training datasets
- Tool Knowledge: N ≈ 500; source: Gemini 2.5 Flash; description: tool taxonomy, ownership, args, routing, hard negatives
- Planning: N ≈ 1,200; source: gold plans + paraphrases; description: Scenario...