NTILC: Neural Tool Invocation via Learned Compression

Andrew Krikorian; Jason J. Corso; Yayuan Li

arxiv: 2606.06566 · v1 · pith:SZ52XIDCnew · submitted 2026-06-04 · 💻 cs.SE · cs.AI

NTILC: Neural Tool Invocation via Learned Compression

Andrew Krikorian , Yayuan Li , Jason J. Corso This is my paper

Pith reviewed 2026-06-28 00:03 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords tool callingfunction callingneural retrievalcontext compressionembedding spacecomposite lossagentic modelstool selection

0 comments

The pith

NTILC learns shared embeddings to retrieve tools externally instead of listing full specifications in every prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic models face rising costs as tool registries grow because every callable API must be described in the prompt. NTILC learns a joint embedding space for user intents and tool signatures so that relevant tools can be fetched from an external store rather than carried in context. A composite training objective combines semantic similarity with explicit constraints from each tool's argument schema and return type. Once a tool is selected, the model receives only its schema and generates arguments under that constraint. The result is a sharp drop in tokens used and in generation latency while preserving the ability to invoke the correct function.

Core claim

NTILC maps both user intent and tool specifications into a shared embedding space, enabling tool selection via external retrieval rather than in-context lookup. The language model is conditioned only on the selected tool schema, allowing for precise, constrained argument generation. Central to the approach is a signature-aware composite objective that augments semantic similarity with constraints derived from tool signatures by combining Circle Loss with a Functional Margin Loss to enforce separation between tools that are semantically similar but incompatible under their execution signatures.

What carries the argument

Signature-aware composite objective that combines Circle Loss with Functional Margin Loss to separate tools by both semantic similarity and execution-signature compatibility.

If this is right

Context token consumption drops by more than 95 percent relative to long-context in-context tool baselines.
Inference latency falls by as much as 74 percent on the same tasks.
The model receives only the schema of the retrieved tool and therefore generates arguments under explicit type and arity constraints.
Evaluation covers both public tool-selection benchmarks and function-calling datasets.
The same framework applies to any registry whose members can be described by a signature of arguments and return types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method opens the possibility of maintaining registries that are orders of magnitude larger than any single context window can hold.
External retrieval could be updated in real time without retraining the language model itself.
The same embedding-plus-signature technique might be reused for other selection problems such as API endpoint routing or plugin discovery.
If retrieval accuracy holds, the architecture decouples tool inventory size from prompt length, changing how agent systems are deployed at scale.

Load-bearing premise

The learned embeddings will continue to retrieve the correct tool with high accuracy when the full registry is no longer supplied inside the prompt.

What would settle it

A controlled test on a registry of several hundred tools where retrieval accuracy falls below the in-context baseline or where the reported 95 percent context reduction is not observed.

Figures

Figures reproduced from arXiv: 2606.06566 by Andrew Krikorian, Jason J. Corso, Yayuan Li.

**Figure 2.** Figure 2: NTILC inference and training pipelines. The backbone LLM [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Context length as a function of the number of available tools. ICT incurs linear growth [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative examples illustrating semantic blur. While baseline methods frequently misclassify user intents by selecting tools with similar natural language descriptions but executionally incompatible arguments, NTILC leverages Functional Margin (FM) Loss to separate these confusable tools in the embedding space, ensuring accurate and functionally viable tool selection [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗

read the original abstract

Agentic tool-calling language models depend on large registries of callable APIs, functions, and local actions. Placing full tool specifications directly in the prompt incurs a cost that scales linearly with the size of the tool registry, rapidly consuming the context budget. As the registry grows, this leads to higher latency and degrades selection accuracy, particularly due to interference from irrelevant tools. We overcome these limitations by introducing NTILC, a neural tool selection and invocation framework that replaces in-context registry look-up with learned latent retrieval. NTILC maps both user intent and tool specifications into a shared embedding space, enabling tool selection via external retrieval rather than in-context lookup. The language model is conditioned only on the selected tool schema, allowing for precise, constrained argument generation. Central to our approach is a signature-aware composite objective, which augments semantic similarity with constraints derived from tool signatures (e.g., argument schema, type compatibility, and return types). By combining Circle Loss with a Functional Margin Loss, the model enforces separation between tools that are semantically similar but incompatible under their execution signatures. We evaluate NTILC on public tool-selection and function-calling datasets and report context token usage, retrieval accuracy, and selection latency metrics. Across these settings, NTILC reduces context window consumption by over 95% and inference latency by up to 74% compared to long-context ICT baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NTILC replaces full tool registries in context with embedding retrieval plus a signature-aware loss, but the abstract gives no accuracy numbers so the claimed efficiency gains cannot be judged as net wins.

read the letter

The core move is to stop stuffing every tool description into the prompt and instead learn a shared embedding space for user intent and tool signatures, then retrieve externally and condition the model only on the chosen schema. They add a composite loss that combines Circle Loss with a Functional Margin Loss to push apart tools that look similar semantically but differ in argument types or return values. That signature constraint is the part that feels new relative to plain semantic retrieval.

The approach targets a genuine pain point: context length and latency both blow up with registry size, and irrelevant tools in context can hurt selection. Mapping the problem to latent retrieval plus constrained generation is a reasonable direction, and the abstract indicates they ran it on public tool-selection datasets.

The obvious gap is that the abstract states they report retrieval accuracy and selection latency but supplies none of the actual figures, no baselines beyond the long-context ICT comparison, and no error analysis. The 95% context reduction and 74% latency drop are therefore unanchored; we cannot tell whether selection quality held up or traded off. The stress-test note correctly flags this as the load-bearing assumption.

This is for groups building production agent systems that already manage large tool sets and are hitting context limits. It is worth a serious referee because the problem is concrete, the method is specified enough to implement, and the missing accuracy metrics are straightforward to add. I would not cite it yet, but I would want to see the full results before deciding.

Referee Report

2 major / 1 minor

Summary. The paper introduces NTILC, a neural tool selection and invocation framework that replaces in-context tool registry lookup with learned latent retrieval in a shared embedding space. Tool selection uses external retrieval, after which the LM is conditioned only on the selected tool schema for argument generation. The core technical contribution is a signature-aware composite objective combining Circle Loss with Functional Margin Loss to enforce separation based on argument schemas, type compatibility, and return types. The authors evaluate on public tool-selection and function-calling datasets and claim >95% reduction in context token usage and up to 74% reduction in inference latency relative to long-context in-context tool (ICT) baselines.

Significance. If the unstated accuracy numbers confirm that retrieval and selection quality remain comparable to ICT baselines, the result would be significant for scaling agentic systems to registries far larger than current context windows allow. The signature-aware loss is a concrete technical step beyond pure semantic retrieval and directly addresses a known failure mode (semantically similar but signature-incompatible tools).

major comments (2)

[Abstract] Abstract: the manuscript states that 'retrieval accuracy and selection latency metrics' are reported and that selection quality is preserved, yet supplies no numerical values, no baseline comparisons, and no error analysis. Without these figures the 95% context and 74% latency claims cannot be evaluated as net gains rather than accuracy-efficiency trade-offs.
[Evaluation] Evaluation section (implied by the abstract's reference to public datasets): the central claim that the signature-aware objective 'maintains high retrieval and selection accuracy' when tools are fetched externally is load-bearing for the efficiency results, but no accuracy deltas, precision@K, or error rates versus ICT baselines are provided.

minor comments (1)

[Abstract] The abstract mentions 'public tool-selection and function-calling datasets' but does not name them; the evaluation section should list the exact datasets and splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The feedback correctly identifies that the abstract and evaluation section would benefit from more explicit numerical presentation of accuracy metrics to allow direct assessment of the efficiency claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Abstract] Abstract: the manuscript states that 'retrieval accuracy and selection latency metrics' are reported and that selection quality is preserved, yet supplies no numerical values, no baseline comparisons, and no error analysis. Without these figures the 95% context and 74% latency claims cannot be evaluated as net gains rather than accuracy-efficiency trade-offs.

Authors: We agree that the abstract would be strengthened by incorporating the key numerical results it summarizes. The evaluation section of the manuscript reports retrieval accuracy, selection latency, and context usage on the public datasets, along with comparisons to ICT baselines and supporting analysis. To address the concern directly, we will revise the abstract to include the primary accuracy figures, baseline comparisons, and a concise reference to the error analysis so that the efficiency gains can be evaluated as net improvements. revision: yes
Referee: [Evaluation] Evaluation section (implied by the abstract's reference to public datasets): the central claim that the signature-aware objective 'maintains high retrieval and selection accuracy' when tools are fetched externally is load-bearing for the efficiency results, but no accuracy deltas, precision@K, or error rates versus ICT baselines are provided.

Authors: We acknowledge that the evaluation section would benefit from more prominent and explicit presentation of accuracy deltas, precision@K scores, and direct error-rate comparisons to the ICT baselines. While the section evaluates on public tool-selection and function-calling datasets and states that selection quality is preserved under the signature-aware objective, we will expand the section with additional tables and text that report these quantities side-by-side with the ICT baselines to make the supporting evidence clearer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and self-contained

full rationale

The paper introduces NTILC as a neural embedding-based retrieval method trained with a composite loss (Circle Loss + Functional Margin Loss) on tool signatures, then evaluates empirically on public datasets for context reduction and latency. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, uniqueness theorems, or ansatzes smuggled via citation appear in the provided text. The central claims rest on reported metrics from external datasets rather than any reduction of outputs to inputs by construction. This is the expected non-finding for a standard applied ML framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5774 in / 1002 out tokens · 26708 ms · 2026-06-28T00:03:54.976338+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 9 internal anchors

[1]

URLhttps://arxiv.org/abs/2401.08281. P. Hosseini, I. Castro, I. Ghinassi, and M. Purver. Efficient solutions for an intriguing failure of llms: Long context window does not mean llms can analyze long sequences flawlessly,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

URL https://arxiv.org/abs/2408.01866. Y . Huang, J. Shi, Y . Li, C. Fan, S. Wu, Q. Zhang, Y . Liu, P. Zhou, Y . Wan, N. Z. Gong, and L. Sun. Metatool benchmark for large language models: Deciding whether to use tools and which to use,

work page arXiv
[3]

URLhttps://arxiv.org/abs/2310.03128. J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus,

work page arXiv
[4]

URL https: //arxiv.org/abs/1702.08734. M. Kang, W.-N. Chen, D. Han, H. A. Inan, L. Wutschitz, Y . Chen, R. Sim, and S. Rajmohan. Acon: Optimizing context compression for long-horizon llm agents,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

URL https://arxiv.org/ abs/2510.00615. C. Li, Z. Tang, Z. Li, M. Xue, K. Bao, T. Ding, R. Sun, B. Wang, X. Wang, J. Lin, and D. Liu. Teaching language models to reason with tools,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

URLhttps://arxiv.org/abs/2510.20342. M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li. Api-bank: A comprehensive benchmark for tool-augmented llms,

work page arXiv
[7]

URLhttps://arxiv.org/abs/2304.08244. N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

URLhttps://arxiv.org/abs/2307.03172. G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y . LeCun, and T. Scialom. Augmented language models: a survey,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

URLhttps://arxiv.org/abs/2302.07842. A. Parisi, Y . Zhao, and N. Fiedel. Talm: Tool augmented language models,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

URL https: //arxiv.org/abs/2205.12255. S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez. Gorilla: Large language model connected with massive apis,

work page arXiv
[11]

URLhttps://arxiv.org/abs/2305.15334. S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

doi: 10.54364/aaiml.2026.61268

ISSN 2582-9793. doi: 10.54364/aaiml.2026.61268. URL http://dx.doi.org/10.54364/AAIML. 2026.61268. A. Plaat, M. Van Duijn, N. Van Stein, M. Preuss, P. Van der Putten, and K. J. Batenburg. Agentic large language models, a survey.Journal of Artificial Intelligence Research, 84, Dec

work page doi:10.54364/aaiml.2026.61268 2026
[13]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

ISSN 1076-9757. doi: 10.1613/jair.1.18675. URLhttp://dx.doi.org/10.1613/jair.1.18675. Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023a. URL https://arx...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1613/jair.1.18675
[14]

URL https://arxiv.org/ abs/2410.10348. Y . Sun, C. Cheng, Y . Zhang, C. Zhang, L. Zheng, Z. Wang, and Y . Wei. Circle loss: A unified perspective of pair similarity optimization,

work page arXiv
[15]

URL https://arxiv.org/abs/2002.10857. R. Wang, X. Han, L. Ji, S. Wang, T. Baldwin, and H. Li. Toolgen: Unified tool retrieval and calling via generation,

work page arXiv 2002
[16]

URLhttps://arxiv.org/abs/2410.03439. W. Wang, J. Min, and W. Zou. Intelligence degradation in long-context llms: Critical threshold determination via natural length distribution analysis,

work page arXiv
[17]

URL https://arxiv.org/abs/ 2601.15300. B. T. Willard and R. Louf. Efficient guided generation for large language models,

work page arXiv
[18]

URL https://arxiv.org/abs/2307.09702. J. Ye, G. Li, S. Gao, C. Huang, Y . Wu, S. Li, X. Fan, S. Dou, T. Ji, Q. Zhang, T. Gui, and X. Huang. Tooleyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

URLhttps://arxiv.org/abs/2401.00741. Y . Ye, Y . Zhao, K. Duan, Z. Zheng, K. Kawaguchi, C. Xie, and M. Q. Shieh. In-context reinforcement learning for tool use in large language models,

work page arXiv

[1] [1]

URLhttps://arxiv.org/abs/2401.08281. P. Hosseini, I. Castro, I. Ghinassi, and M. Purver. Efficient solutions for an intriguing failure of llms: Long context window does not mean llms can analyze long sequences flawlessly,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

URL https://arxiv.org/abs/2408.01866. Y . Huang, J. Shi, Y . Li, C. Fan, S. Wu, Q. Zhang, Y . Liu, P. Zhou, Y . Wan, N. Z. Gong, and L. Sun. Metatool benchmark for large language models: Deciding whether to use tools and which to use,

work page arXiv

[3] [3]

URLhttps://arxiv.org/abs/2310.03128. J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus,

work page arXiv

[4] [4]

URL https: //arxiv.org/abs/1702.08734. M. Kang, W.-N. Chen, D. Han, H. A. Inan, L. Wutschitz, Y . Chen, R. Sim, and S. Rajmohan. Acon: Optimizing context compression for long-horizon llm agents,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

URL https://arxiv.org/ abs/2510.00615. C. Li, Z. Tang, Z. Li, M. Xue, K. Bao, T. Ding, R. Sun, B. Wang, X. Wang, J. Lin, and D. Liu. Teaching language models to reason with tools,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

URLhttps://arxiv.org/abs/2510.20342. M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li. Api-bank: A comprehensive benchmark for tool-augmented llms,

work page arXiv

[7] [7]

URLhttps://arxiv.org/abs/2304.08244. N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

URLhttps://arxiv.org/abs/2307.03172. G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y . LeCun, and T. Scialom. Augmented language models: a survey,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

URLhttps://arxiv.org/abs/2302.07842. A. Parisi, Y . Zhao, and N. Fiedel. Talm: Tool augmented language models,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

URL https: //arxiv.org/abs/2205.12255. S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez. Gorilla: Large language model connected with massive apis,

work page arXiv

[11] [11]

URLhttps://arxiv.org/abs/2305.15334. S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

doi: 10.54364/aaiml.2026.61268

ISSN 2582-9793. doi: 10.54364/aaiml.2026.61268. URL http://dx.doi.org/10.54364/AAIML. 2026.61268. A. Plaat, M. Van Duijn, N. Van Stein, M. Preuss, P. Van der Putten, and K. J. Batenburg. Agentic large language models, a survey.Journal of Artificial Intelligence Research, 84, Dec

work page doi:10.54364/aaiml.2026.61268 2026

[13] [13]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

ISSN 1076-9757. doi: 10.1613/jair.1.18675. URLhttp://dx.doi.org/10.1613/jair.1.18675. Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023a. URL https://arx...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1613/jair.1.18675

[14] [14]

URL https://arxiv.org/ abs/2410.10348. Y . Sun, C. Cheng, Y . Zhang, C. Zhang, L. Zheng, Z. Wang, and Y . Wei. Circle loss: A unified perspective of pair similarity optimization,

work page arXiv

[15] [15]

URL https://arxiv.org/abs/2002.10857. R. Wang, X. Han, L. Ji, S. Wang, T. Baldwin, and H. Li. Toolgen: Unified tool retrieval and calling via generation,

work page arXiv 2002

[16] [16]

URLhttps://arxiv.org/abs/2410.03439. W. Wang, J. Min, and W. Zou. Intelligence degradation in long-context llms: Critical threshold determination via natural length distribution analysis,

work page arXiv

[17] [17]

URL https://arxiv.org/abs/ 2601.15300. B. T. Willard and R. Louf. Efficient guided generation for large language models,

work page arXiv

[18] [18]

URL https://arxiv.org/abs/2307.09702. J. Ye, G. Li, S. Gao, C. Huang, Y . Wu, S. Li, X. Fan, S. Dou, T. Ji, Q. Zhang, T. Gui, and X. Huang. Tooleyes: Fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

URLhttps://arxiv.org/abs/2401.00741. Y . Ye, Y . Zhao, K. Duan, Z. Zheng, K. Kawaguchi, C. Xie, and M. Q. Shieh. In-context reinforcement learning for tool use in large language models,

work page arXiv