Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models

Haesung Pyun; Jonggeun Lee; Jongwook Han; Woojung Song; Yohan Jo

arxiv: 2510.07248 · v3 · submitted 2025-10-08 · 💻 cs.CL

Jonggeun Lee, Woojung Song, Jongwook Han, Haesung Pyun, Yohan Jo

Pith reviewed 2026-05-18 08:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords small language models · tool use · schema misalignment · pretraining alignment · peakedness · training-free adaptation · multi-agent systems · hallucination reduction

The pith

Small language models improve tool-use accuracy up to 17 percent by renaming tool schemas to match their pretraining familiarity instead of retraining the models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Small language models struggle with tool-use tasks in part because they hallucinate tool names that are absent from the provided schema. This failure, called schema misalignment, stems from naming conventions learned during pretraining. Rather than retraining the models to fit new schemas, the paper shows that renaming the schema components to match the models' existing knowledge produces large gains. The method generates multiple candidate names for each tool element and selects the candidate with the strongest pretraining familiarity signal. On MetaTool and RoTBench, this approach cuts schema misalignment errors by 80 percent and raises tool selection and parameter prediction accuracy by as much as 17 percent, all without any model updates. The finding matters for building scalable multi-agent systems in which many small, efficient models are coordinated by a larger one.

Core claim

Schema misalignment arises when small language models generate tool names that do not exist in the supplied schema because of differing naming conventions internalized during pretraining. PA-Tool addresses this by generating candidate names for tool components and choosing the variants with highest peakedness, a measure of pretraining familiarity, thereby producing schemas that align with the models' existing knowledge and enabling more accurate tool selection and parameter prediction.
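The failure mode and the rename-based fix described above can be sketched as follows. This is a minimal illustration, not the paper's code: the `ToolCall` type, the helper functions, and the tool names are all hypothetical, and the rename map is assumed to be given rather than produced by PA-Tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A model-produced tool call (hypothetical structure for illustration)."""
    name: str
    arguments: dict


def is_schema_misaligned(call: ToolCall, schema_names: set[str]) -> bool:
    """True when the model called a tool name that the schema does not define."""
    return call.name not in schema_names


def rename_schema(schema_names: set[str], renames: dict[str, str]) -> set[str]:
    """Apply original -> aligned renames; names without an entry are kept."""
    return {renames.get(name, name) for name in schema_names}


def restore_call(call: ToolCall, renames: dict[str, str]) -> ToolCall:
    """Map an aligned name in a model call back to the original tool name."""
    inverse = {aligned: original for original, aligned in renames.items()}
    return ToolCall(inverse.get(call.name, call.name), call.arguments)


# Hypothetical example: the schema defines one name, the model emits another.
original_schema = {"WeatherInfoRetriever", "calc"}
renames = {"WeatherInfoRetriever": "get_weather"}  # illustrative rename only
model_call = ToolCall("get_weather", {"city": "Seoul"})

aligned_schema = rename_schema(original_schema, renames)
print(is_schema_misaligned(model_call, original_schema))  # True
print(is_schema_misaligned(model_call, aligned_schema))   # False
print(restore_call(model_call, renames).name)             # WeatherInfoRetriever
```

Keeping the rename map separate from the schema means the backend that executes the tool never needs to know about the aligned names; only the prompt shown to the model changes.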

What carries the argument

PA-Tool, a training-free procedure that generates candidate tool schema names and selects those with highest peakedness to align with pretraining knowledge.
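A rough sketch of the select-by-peakedness step is given below. The summary does not define peakedness precisely, so the definition used here is an assumption: the fraction of sampled names that lie within a small edit distance of a candidate, in the spirit of output-distribution contamination detection. The function names, the `alpha` threshold, and the sample strings are hypothetical; in practice the samples would come from repeatedly prompting the target model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def peakedness(candidate: str, samples: list[str], alpha: float = 0.2) -> float:
    """Fraction of samples within alpha * len(candidate) edits of the candidate.

    ASSUMPTION: an illustrative definition, not taken from the paper.
    """
    threshold = alpha * len(candidate)
    close = sum(1 for s in samples if edit_distance(candidate, s) <= threshold)
    return close / len(samples)


def select_name(samples: list[str], alpha: float = 0.2) -> str:
    """Return the distinct sample with the highest peakedness (first seen wins ties)."""
    candidates = list(dict.fromkeys(samples))  # de-duplicate, keep order
    return max(candidates, key=lambda c: peakedness(c, samples, alpha))


# Hypothetical samples standing in for names generated by the target model.
samples = ["get_weather", "get_weather", "weather_lookup",
           "fetch_forecast", "get_weather"]
print(select_name(samples))  # get_weather
```

The intuition under this assumed definition is that a name the model keeps regenerating, up to small variations, sits at a peak of its output distribution and is therefore likely to be familiar from pretraining.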

If this is right

Small language models can handle tool-augmented subtasks more reliably in multi-agent systems without retraining costs.
Schema misalignment errors drop sharply when tool names are chosen to match pretraining patterns rather than arbitrary conventions.
Peakedness provides a practical, computable signal for aligning external interfaces with a model's internalized knowledge.
Parameter prediction accuracy improves as a direct result of fewer hallucinated tool names.
Resource-efficient models become viable for complex tool-use workflows once schemas are adapted to them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same peakedness selection could be applied to other structured output tasks where models invent names or formats outside the allowed set.
Many apparent capability limits in small models may actually reflect interface mismatches that can be fixed at the schema level rather than by scaling or fine-tuning.
Routine generation of multiple schema variants followed by peakedness ranking could become a standard preprocessing step for tool deployment.
The approach may extend naturally to larger agentic setups involving dozens of tools or dynamic tool libraries.

Load-bearing premise

Peakedness measured on candidate tool names reliably signals the model's pretraining familiarity with those names, and greater familiarity directly improves downstream tool selection and parameter prediction.

What would settle it

A controlled test that presents the same tools under multiple naming variants of differing peakedness and checks whether tool-use accuracy rises consistently with the highest-peakedness choice on held-out tasks.

Original abstract

Small language models (SLMs) enable scalable tool-augmented multi-agent systems where multiple SLMs handle subtasks orchestrated by a powerful coordinator. However, they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible tool names that are absent from the provided tool schema, due to different naming conventions internalized during pretraining. Rather than training models to adapt to unfamiliar schemas, we propose adapting schemas to align with models' pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness, a signal used in contamination detection that indicates pretraining familiarity, to rename tool components. By generating multiple candidates and selecting the candidate with the highest peakedness, PA-Tool identifies pretraining-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17%, with schema misalignment errors reduced by 80%. PA-Tool enables small models to substantially improve tool-use accuracy without retraining, showing that schema-level interventions can unlock the tool-use potential of resource-efficient models. Our code is available at https://github.com/holi-lab/PA-Tool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that small language models struggle with tool-use tasks due to schema misalignment, where they hallucinate tool names absent from the provided schema because of pretraining naming conventions. Rather than retraining the models, the authors introduce PA-Tool, a training-free method that generates multiple candidate names for tool components and selects the one with highest peakedness (a contamination-detection signal for pretraining familiarity). Experiments on MetaTool and RoTBench report up to 17% gains in tool-use accuracy and 80% reduction in schema misalignment errors, enabling better performance from SLMs without adaptation.

Significance. If the results hold, this is a significant contribution for scalable tool-augmented multi-agent systems, as it offers a low-cost, training-free intervention at the schema level that unlocks tool-use capabilities in resource-efficient SLMs. The approach leverages an existing metric (peakedness) in a novel way and includes open-sourced code, which supports reproducibility. It usefully reframes the problem from model-centric to schema-centric adaptation.

major comments (2)

[PA-Tool method description and Experiments] The central mechanism relies on peakedness computed over isolated candidate tool names indicating pretraining familiarity that then improves downstream selection and parameter prediction. However, the manuscript provides no direct evidence (e.g., likelihood measurements or ablation on full prompts) that the selected names actually produce higher probability or reduced hallucination when inserted into complete tool-calling contexts containing the schema and query. This leaves the causal claim load-bearing but under-supported.
[Experiments section] The reported improvements (up to 17% accuracy lift and 80% error reduction) are presented without accompanying details on data splits, number of runs, variance, or statistical tests. This makes it impossible to assess whether the gains are robust across model families or tool domains, directly affecting the soundness of the headline empirical claims.

minor comments (2)

[Abstract] The abstract introduces 'peakedness' without a short inline definition or citation to its origin in contamination detection literature; a brief parenthetical would improve accessibility.
[Method and Results] Notation for tool schema components (names, parameters) could be made more consistent between the method description and the experimental tables to avoid minor reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: The central mechanism relies on peakedness computed over isolated candidate tool names indicating pretraining familiarity that then improves downstream selection and parameter prediction. However, the manuscript provides no direct evidence (e.g., likelihood measurements or ablation on full prompts) that the selected names actually produce higher probability or reduced hallucination when inserted into complete tool-calling contexts containing the schema and query. This leaves the causal claim load-bearing but under-supported.

Authors: We acknowledge that the current manuscript does not include direct likelihood measurements or ablations on full prompts to demonstrate that the peakedness-selected names yield higher probability in complete tool-calling contexts. The primary evidence presented is the substantial reduction in schema misalignment errors (80%) and the corresponding gains in end-to-end tool-use accuracy, which we interpret as support for the mechanism. To address this gap directly, we will add a new ablation in the revised version that computes token-level likelihoods for selected versus non-selected names when inserted into full schemas and queries, providing more explicit causal evidence. revision: yes
Referee: The reported improvements (up to 17% accuracy lift and 80% error reduction) are presented without accompanying details on data splits, number of runs, variance, or statistical tests. This makes it impossible to assess whether the gains are robust across model families or tool domains, directly affecting the soundness of the headline empirical claims.

Authors: We agree that the experimental section would benefit from greater transparency regarding reproducibility and statistical robustness. In the revised manuscript, we will expand the Experiments section to specify the exact data splits used for MetaTool and RoTBench, report results over multiple runs with standard deviations or variance measures, and include appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) to assess significance. We will also clarify performance across the different model families and tool domains evaluated. revision: yes
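As one illustration of the kind of significance check discussed in this exchange, the sketch below runs a paired sign-flip permutation test on per-task correctness. This is a swapped-in technique, used here instead of the paired t-test or Wilcoxon test named in the response because it needs only the standard library. The function name and the 0/1 scores are hypothetical placeholders, not results from the paper.

```python
import random


def paired_permutation_test(baseline, treated, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    Returns an estimated p-value for the null hypothesis that swapping the
    two conditions within each pair does not change the mean difference.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Placeholder per-task correctness (1 = correct), NOT data from the paper.
original_schema_scores = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
renamed_schema_scores = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
p_value = paired_permutation_test(original_schema_scores, renamed_schema_scores)
print(f"estimated p-value: {p_value:.4f}")
```

Pairing by task matters here because the same benchmark items are evaluated under both schemas, so per-item difficulty cancels out of the comparison.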

Circularity Check

0 steps flagged

No circularity: empirical selection heuristic validated on external benchmarks

The paper presents PA-Tool as a training-free procedure that generates candidate tool names and selects the one with highest peakedness (a pre-existing contamination-detection statistic). Results are reported as measured accuracy lifts on the external MetaTool and RoTBench suites. No equations, fitted parameters, or self-citation chains are shown that would make the reported gains equivalent to the inputs by construction. The derivation therefore remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the empirical claim that peakedness correlates with pretraining exposure for tool-related strings and that this correlation transfers to improved tool-use behavior. No free parameters are introduced in the abstract description. No new entities are postulated.

axioms (1)

Domain assumption: Peakedness computed on candidate strings is a valid proxy for whether the model encountered similar strings during pretraining.
Invoked when the method selects the highest-peakedness candidate as the aligned name.
