Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models
Pith reviewed 2026-05-18 08:57 UTC · model grok-4.3
The pith
Small language models improve tool-use accuracy by up to 17 percent when tool schema components are renamed to match names familiar from pretraining, with no retraining of the models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Schema misalignment arises when small language models generate tool names that do not exist in the supplied schema because of differing naming conventions internalized during pretraining. PA-Tool addresses this by generating candidate names for tool components and choosing the variants with highest peakedness, a measure of pretraining familiarity, thereby producing schemas that align with the models' existing knowledge and enabling more accurate tool selection and parameter prediction.
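The failure mode described above can be made concrete with a minimal sketch. It assumes that a call counts as misaligned whenever its predicted tool name is not an exact match for any name in the supplied schema; the paper's actual error criterion may differ, and `ToolCall`, `is_schema_misaligned`, and `misalignment_rate` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A model's predicted tool call (hypothetical container for this sketch)."""
    name: str
    arguments: dict = field(default_factory=dict)


def is_schema_misaligned(call: ToolCall, schema_tool_names: set[str]) -> bool:
    """True when the predicted tool name is absent from the supplied schema.

    Assumption: exact string match; a real evaluation might normalize case
    or separators before comparing.
    """
    return call.name not in schema_tool_names


def misalignment_rate(calls: list[ToolCall], schema_tool_names: set[str]) -> float:
    """Fraction of predicted calls whose tool name is not in the schema."""
    if not calls:
        return 0.0
    misaligned = sum(is_schema_misaligned(c, schema_tool_names) for c in calls)
    return misaligned / len(calls)
```

Comparing this rate before and after renaming the schema would be one way to track the kind of error reduction the abstract reports.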
What carries the argument
PA-Tool, a training-free procedure that generates candidate tool schema names and selects those with highest peakedness to align with pretraining knowledge.
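A rough sketch of the generate-then-select step follows. The review does not give the paper's peakedness formula, so this sketch assumes an edit-distance reading loosely modeled on contamination-detection practice: a candidate's peakedness is the share of sampled names that lie within `alpha * len(candidate)` character edits of it. The function names, the `alpha` threshold, and the tie-breaking rule are all assumptions, and sampling the candidate names from the model is left outside the sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def peakedness(candidate: str, samples: list[str], alpha: float = 0.2) -> float:
    """Share of samples within alpha * len(candidate) edits of the candidate.

    Assumption: this is an illustrative definition, not the paper's formula.
    """
    if not samples:
        return 0.0
    limit = alpha * len(candidate)
    close = sum(edit_distance(candidate, s) <= limit for s in samples)
    return close / len(samples)


def select_name(samples: list[str], alpha: float = 0.2) -> str:
    """Return the sampled name with the highest peakedness (first wins ties)."""
    if not samples:
        raise ValueError("need at least one sampled candidate name")
    unique = list(dict.fromkeys(samples))  # de-duplicate, keep sampling order
    return max(unique, key=lambda name: peakedness(name, samples, alpha))
```

Under this reading, a name that the model produces repeatedly, or nearly so, scores high, while a one-off variant scores low. The selected name would then replace the original one in the schema handed to the model.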
If this is right
- Small language models can handle tool-augmented subtasks more reliably in multi-agent systems without retraining costs.
- Schema misalignment errors drop sharply when tool names are chosen to match pretraining patterns rather than arbitrary conventions.
- Peakedness provides a practical, computable signal for aligning external interfaces with a model's internalized knowledge.
- Parameter prediction accuracy improves as a direct result of fewer hallucinated tool names.
- Resource-efficient models become viable for complex tool-use workflows once schemas are adapted to them.
Where Pith is reading between the lines
- The same peakedness selection could be applied to other structured output tasks where models invent names or formats outside the allowed set.
- Many apparent capability limits in small models may actually reflect interface mismatches that can be fixed at the schema level rather than by scaling or fine-tuning.
- Routine generation of multiple schema variants followed by peakedness ranking could become a standard preprocessing step for tool deployment.
- The approach may extend naturally to larger agentic setups involving dozens of tools or dynamic tool libraries.
Load-bearing premise
Peakedness measured on candidate tool names reliably signals the model's pretraining familiarity with those names, and greater familiarity directly improves downstream tool selection and parameter prediction.
What would settle it
A controlled test that presents the same tools under multiple naming variants of differing peakedness and checks whether tool-use accuracy rises consistently with the highest-peakedness choice on held-out tasks.
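One way such a test could be scored is sketched below, assuming that each tool has been evaluated under several naming variants and that a peakedness value and a held-out accuracy are already available for every variant. `Variant` and `top_choice_agreement` are hypothetical names, and the agreement fraction is only one possible summary; a rank correlation over all variants would be another.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """One naming variant of a tool (hypothetical record for this sketch)."""
    name: str
    peakedness: float
    accuracy: float  # tool-use accuracy measured on held-out tasks


def top_choice_agreement(results: dict[str, list[Variant]]) -> float:
    """Fraction of tools whose highest-peakedness variant is also the most accurate."""
    if not results or any(not v for v in results.values()):
        raise ValueError("every tool needs at least one measured variant")
    hits = 0
    for variants in results.values():
        chosen = max(variants, key=lambda v: v.peakedness)
        best_accuracy = max(v.accuracy for v in variants)
        # Count a hit when the peakedness-selected variant ties the best accuracy.
        hits += chosen.accuracy == best_accuracy
    return hits / len(results)
```

A value near 1.0 would be consistent with the load-bearing premise; a value near what random selection among variants would give would count against it.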
Original abstract
Small language models (SLMs) enable scalable tool-augmented multi-agent systems where multiple SLMs handle subtasks orchestrated by a powerful coordinator. However, they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible tool names that are absent from the provided tool schema, due to different naming conventions internalized during pretraining. Rather than training models to adapt to unfamiliar schemas, we propose adapting schemas to align with models' pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness, a signal used in contamination detection that indicates pretraining familiarity, to rename tool components. By generating multiple candidates and selecting the candidate with the highest peakedness, PA-Tool identifies pretraining-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17%, with schema misalignment errors reduced by 80%. PA-Tool enables small models to substantially improve tool-use accuracy without retraining, showing that schema-level interventions can unlock the tool-use potential of resource-efficient models. Our code is available at https://github.com/holi-lab/PA-Tool.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that small language models struggle with tool-use tasks due to schema misalignment, where they hallucinate tool names absent from the provided schema because of pretraining naming conventions. Rather than retraining the models, the authors introduce PA-Tool, a training-free method that generates multiple candidate names for tool components and selects the one with highest peakedness (a contamination-detection signal for pretraining familiarity). Experiments on MetaTool and RoTBench report up to 17% gains in tool-use accuracy and 80% reduction in schema misalignment errors, enabling better performance from SLMs without adaptation.
Significance. If the results hold, this is a significant contribution for scalable tool-augmented multi-agent systems, as it offers a low-cost, training-free intervention at the schema level that unlocks tool-use capabilities in resource-efficient SLMs. The approach leverages an existing metric (peakedness) in a novel way and includes open-sourced code, which supports reproducibility. It usefully reframes the problem from model-centric to schema-centric adaptation.
major comments (2)
- [PA-Tool method description and Experiments] The central mechanism relies on peakedness computed over isolated candidate tool names indicating pretraining familiarity that then improves downstream selection and parameter prediction. However, the manuscript provides no direct evidence (e.g., likelihood measurements or ablation on full prompts) that the selected names actually produce higher probability or reduced hallucination when inserted into complete tool-calling contexts containing the schema and query. This leaves the causal claim load-bearing but under-supported.
- [Experiments section] The reported improvements (up to 17% accuracy lift and 80% error reduction) are presented without accompanying details on data splits, number of runs, variance, or statistical tests. This makes it impossible to assess whether the gains are robust across model families or tool domains, directly affecting the soundness of the headline empirical claims.
minor comments (2)
- [Abstract] The abstract introduces 'peakedness' without a short inline definition or citation to its origin in contamination detection literature; a brief parenthetical would improve accessibility.
- [Method and Results] Notation for tool schema components (names, parameters) could be made more consistent between the method description and the experimental tables to avoid minor reader confusion.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: The central mechanism relies on peakedness computed over isolated candidate tool names indicating pretraining familiarity that then improves downstream selection and parameter prediction. However, the manuscript provides no direct evidence (e.g., likelihood measurements or ablation on full prompts) that the selected names actually produce higher probability or reduced hallucination when inserted into complete tool-calling contexts containing the schema and query. This leaves the causal claim load-bearing but under-supported.
  Authors: We acknowledge that the current manuscript does not include direct likelihood measurements or ablations on full prompts to demonstrate that the peakedness-selected names yield higher probability in complete tool-calling contexts. The primary evidence presented is the substantial reduction in schema misalignment errors (80%) and the corresponding gains in end-to-end tool-use accuracy, which we interpret as support for the mechanism. To address this gap directly, we will add a new ablation in the revised version that computes token-level likelihoods for selected versus non-selected names when inserted into full schemas and queries, providing more explicit causal evidence.
  Revision: yes
- Referee: The reported improvements (up to 17% accuracy lift and 80% error reduction) are presented without accompanying details on data splits, number of runs, variance, or statistical tests. This makes it impossible to assess whether the gains are robust across model families or tool domains, directly affecting the soundness of the headline empirical claims.
  Authors: We agree that the experimental section would benefit from greater transparency regarding reproducibility and statistical robustness. In the revised manuscript, we will expand the Experiments section to specify the exact data splits used for MetaTool and RoTBench, report results over multiple runs with standard deviations or variance measures, and include appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) to assess significance. We will also clarify performance across the different model families and tool domains evaluated.
  Revision: yes
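For readers who want a feel for the kind of paired significance check promised in the second response, here is a minimal standard-library sketch. It is not the paired t-test or Wilcoxon test the authors name; it swaps in an exact paired sign-flip (randomization) test, which needs no distribution tables. The function name and the 20-pair cap are assumptions made for this illustration, and the inputs are expected to be matched scores for the original and renamed schemas (for example, one pair per run or per model).

```python
from itertools import product


def paired_sign_flip_pvalue(baseline: list[float], treated: list[float]) -> float:
    """Two-sided exact p-value for the summed paired difference under sign flips.

    Null hypothesis: each paired difference is equally likely to be positive
    or negative, so flipping its sign leaves the distribution unchanged.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty, equal-length score lists")
    if len(baseline) > 20:
        raise ValueError("exact enumeration is only practical for <= 20 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2^n sign assignments and count those at least as extreme.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        extreme += flipped >= observed - 1e-12  # small tolerance for float noise
        total += 1
    return extreme / total
```

With n pairs the smallest attainable two-sided p-value is 2 / 2^n, so a handful of runs cannot reach conventional significance thresholds; this is one reason the number of runs matters for the referee's concern.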
Circularity Check
No circularity: empirical selection heuristic validated on external benchmarks
Full rationale
The paper presents PA-Tool as a training-free procedure that generates candidate tool names and selects the one with highest peakedness (a pre-existing contamination-detection statistic). Results are reported as measured accuracy lifts on the external MetaTool and RoTBench suites. No equations, fitted parameters, or self-citation chains are shown that would make the reported gains equivalent to the inputs by construction. The derivation therefore remains self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Peakedness computed on candidate strings is a valid proxy for whether the model encountered similar strings during pretraining.