Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

Ayal Yakobe; Tanmay Agarwal; Yuval Shemla

arxiv: 2605.17774 · v1 · pith:SRI7TKFOnew · submitted 2026-05-18 · 💻 cs.CL

Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords: tool use, QLoRA, fine-tuning, small language models, description-free inference, agent planning, parameter-efficient training, benchmark evaluation

The pith

Small language models can internalize a fixed tool catalog through QLoRA fine-tuning, enabling structured planning without any tool descriptions in the inference prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether parameter-efficient fine-tuning can move tool-use knowledge from the prompt into the weights of small models. The authors fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on roughly 1,700 examples that cover tool knowledge, question-to-plan mappings, and execution-style traces. At inference the prompt contains no tool catalog at all, yet the fine-tuned models produce higher-quality plans than an unfine-tuned baseline that is given the complete catalog. The fine-tuned models cut input length by 82.6 percent while raising both structural and LLM-judge planning scores. These outcomes suggest that, for a fixed set of tools, shifting knowledge into parameters can improve efficiency and planning quality together.
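To make the two inference settings concrete, here is a minimal Python sketch that contrasts a prompt carrying a tool catalog with a description-free prompt and computes the relative input reduction. Everything in it is an assumption for illustration: the tool names, the prompt templates, and the whitespace token count are hypothetical and are not taken from the paper or from AssetOpsBench.

```python
# Hypothetical tool catalog; names and descriptions are illustrative only.
TOOL_CATALOG = {
    "get_sensor_history": "Return time-series readings for a sensor between two timestamps.",
    "list_work_orders": "List maintenance work orders for an asset, optionally filtered by status.",
    "forecast_failure": "Estimate failure probability for an asset over a given horizon.",
}


def build_informed_prompt(question: str) -> str:
    """Prompt that carries the full tool catalog (the informed-baseline setting)."""
    lines = ["You can use the following tools:"]
    for name, description in TOOL_CATALOG.items():
        lines.append(f"- {name}: {description}")
    lines.append(f"Question: {question}")
    lines.append("Plan:")
    return "\n".join(lines)


def build_description_free_prompt(question: str) -> str:
    """Prompt with no tool catalog (the description-free setting)."""
    return f"Question: {question}\nPlan:"


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would give different numbers."""
    return len(text.split())


def input_reduction(informed: str, description_free: str) -> float:
    """Fraction of input tokens saved by dropping the catalog from the prompt."""
    return 1.0 - count_tokens(description_free) / count_tokens(informed)


if __name__ == "__main__":
    question = "Which pumps need maintenance this week?"
    informed = build_informed_prompt(question)
    bare = build_description_free_prompt(question)
    print(f"informed tokens: {count_tokens(informed)}")
    print(f"description-free tokens: {count_tokens(bare)}")
    print(f"reduction: {input_reduction(informed, bare):.1%}")
```

Under this definition, a reduction of 82.6 percent means the description-free prompt is 17.4 percent as long as the informed one. The actual figure depends on the tokenizer and on the size of the real catalog.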

Core claim

The central claim is that QLoRA fine-tuning on tool-use data lets small models plan better under description-free inference than an unfine-tuned baseline supplied with full tool schemas. On AssetOpsBench the best Gemma run records an AT-F1 of 0.65 and an overall judge score of 3.88, against 0.47 and 2.88 for the informed baseline. Qwen3-4B reaches an overall judge score of 3.78 while using 62 percent less memory and running 2.5 times faster than Gemma, though it shows greater catastrophic forgetting on general multiple-choice benchmarks. Ablation experiments indicate that LoRA rank trades planning quality against retention of general capabilities, with rank 32 giving the highest planning scores and smaller ranks preserving more general knowledge.
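This page does not define AT-F1. As a generic illustration of what an F1-style structural score over tool selections can look like, the sketch below computes a set-based F1 between the tools named in a predicted plan and those in a gold plan. This is an assumption for illustration, not the paper's metric; the function name and the handling of empty plans are hypothetical choices.

```python
def tool_set_f1(predicted_tools, gold_tools):
    """Set-based F1 between predicted and gold tool names.

    Illustrative only: the paper's AT-F1 may be defined differently
    (for example over agent-tool pairs or ordered steps).
    """
    predicted, gold = set(predicted_tools), set(gold_tools)
    if not predicted and not gold:
        return 1.0  # assumption: two empty plans count as a perfect match
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical tool names, used only to exercise the function.
    predicted = ["get_sensor_history", "forecast_failure"]
    gold = ["get_sensor_history", "list_work_orders"]
    print(tool_set_f1(predicted, gold))
```

A score like this rewards naming the right tools and penalizes both missing and spurious ones, which is why it can be computed without an LLM judge.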

What carries the argument

QLoRA fine-tuning on a mixture of tool-knowledge statements, question-to-plan pairs, and execution traces, which embeds the tool catalog directly into model parameters for later description-free use.

If this is right

Description-free inference becomes feasible for tool-planning tasks in agentic systems.
Prompt token counts drop by 82.6 percent compared with supplying the full tool catalog at every step.
Planning scores measured by both automatic metrics and LLM judges exceed those of informed but unfine-tuned baselines.
LoRA rank can be chosen to favor either planning quality or preservation of general knowledge.
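For readers unfamiliar with what the rank setting changes, the sketch below counts the trainable parameters a LoRA adapter adds. For a linear layer of shape d_out × d_in, LoRA trains two low-rank factors with r × d_in and d_out × r entries, so the count grows linearly with r. The layer shapes used here are hypothetical placeholders, not the dimensions of Gemma 4 E4B or Qwen3-4B, and the sketch does not by itself explain the quality-retention trade-off reported in the paper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: r*d_in + d_out*r."""
    return rank * (d_in + d_out)


def adapter_size(layer_shapes, rank: int) -> int:
    """Total LoRA parameters over a list of (d_in, d_out) layer shapes."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)


# Hypothetical shapes for illustration: four square projections of width 2560.
HYPOTHETICAL_LAYER_SHAPES = [(2560, 2560)] * 4

if __name__ == "__main__":
    for rank in (8, 16, 32):
        total = adapter_size(HYPOTHETICAL_LAYER_SHAPES, rank)
        print(f"rank={rank}: {total} trainable adapter parameters")
```

Doubling the rank doubles the adapter's capacity, which gives some intuition for why a higher rank can fit the tool catalog more closely while also moving the model further from its base behavior.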

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models trained this way would require retraining or continued adaptation if the underlying tool set changes after deployment.
The same internalization approach could be applied to other fixed knowledge domains to shrink context length in specialized agents.
On-device or low-memory deployments might become practical once tool knowledge no longer occupies prompt space.

Load-bearing premise

The complete set of tools is fixed and fully known during training so that the model can absorb that exact catalog into its weights.

What would settle it

Measuring whether planning quality collapses when the inference prompt introduces a new tool absent from the training catalog.

Figures

Figures reproduced from arXiv: 2605.17774 by Ayal Yakobe, Tanmay Agarwal, Yuval Shemla.

**Figure 2.** LLM-as-judge scores (1–5 scale) across five evaluation dimensions.

**Figure 3.** Training and evaluation loss over 436 steps (2 epochs). Both models ...

**Figure 5.** Overall MCQ accuracy before and after tool-use fine-tuning.

**Figure 6.** Effect of training data composition on planning quality. Plan-only ...

**Figure 7.** Profiling comparison between Gemma 4 E4B and Qwen3-4B.

**Figure 8.** Per-benchmark MCQ accuracy for base and fine-tuned models.

Original abstract

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5× faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality-retention trade-off, with r=32 maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QLoRA fine-tuning lets 4B models plan without tool descriptions on a fixed catalog, beating the prompted baseline with big token cuts, but the gains look like recall of the training set rather than general tool reasoning.

This paper shows that QLoRA fine-tuning on roughly 1,700 AssetOpsBench examples lets Gemma 4 E4B and Qwen3-4B generate tool plans without any catalog in the prompt. The best Gemma run reaches 0.65 AT-F1 and a 3.88 judge score versus 0.47 and 2.88 for the informed unfine-tuned baseline, while shrinking input length by 82.6 percent. Qwen3-4B runs faster and uses less memory than Gemma but forgets more on general benchmarks. The LoRA rank ablation is a useful addition, showing r=32 maximizes planning quality while smaller ranks keep more general knowledge intact.

Referee Report

2 major / 2 minor

Summary. The paper claims that QLoRA fine-tuning on ~1,700 tool-use examples from AssetOpsBench allows small models (Gemma 4 E4B and Qwen3-4B) to internalize tool knowledge, enabling structured planning without tool descriptions in the prompt. Evaluated under description-free inference, the fine-tuned models outperform an unfine-tuned baseline that receives the full tool descriptions (best Gemma: AT-F1 0.65 and judge score 3.88 vs. baseline 0.47 and 2.88), while cutting input length by 82.6%. Ablations show LoRA rank trades off planning quality against retention of general capabilities. The work is scoped to fixed tool catalogs.

Significance. If the empirical results hold under fuller controls, the work shows a viable path to reduce token overhead in agentic tool-use pipelines for closed, fixed tool sets by moving catalog knowledge into weights. The concrete metrics, LoRA-rank ablation, and explicit scoping to fixed catalogs are strengths. Significance is limited by the absence of tests for novel or changing tool sets and by missing reproducibility details.

major comments (2)

[Evaluation and baseline description] The comparison to the informed unfine-tuned baseline is central to the claim of internalization, yet the manuscript provides insufficient detail on baseline construction, including exactly which tool schemas and usage patterns are supplied to it and whether they match the training distribution exactly.
[Experimental setup] No information is given on train/test splits, cross-validation, or checks for data contamination between the ~1,700 fine-tuning examples and the evaluation set. This is load-bearing because performance under description-free inference could reflect memorization of the specific closed catalog rather than robust internalization.

minor comments (2)

Clarify the precise model identifier 'Gemma 4 E4B' (reference [7] points to the Gemma 4 model overview) and report the exact parameter counts and quantization settings used.
[Results] The abstract and results would benefit from reporting variance or multiple random seeds for the AT-F1 and judge scores rather than single-point estimates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional experimental details will strengthen the manuscript. We address each major comment below and have revised the paper accordingly.

Referee: [Evaluation and baseline description] The comparison to the informed unfine-tuned baseline is central to the claim of internalization, yet the manuscript provides insufficient detail on baseline construction, including exactly which tool schemas and usage patterns are supplied to it and whether they match the training distribution exactly.

Authors: We agree that the baseline description requires more precision. In the revised manuscript we have expanded the 'Experimental Setup' and 'Baseline' subsections to specify that the informed unfine-tuned baseline is given the identical AssetOpsBench tool catalog, including full schemas and the same usage patterns that appear in the fine-tuning examples. This ensures the baseline receives exactly the information present in the training distribution but without any parameter updates.
Revision: yes
Referee: [Experimental setup] No information is given on train/test splits, cross-validation, or checks for data contamination between the ~1,700 fine-tuning examples and the evaluation set. This is load-bearing because performance under description-free inference could reflect memorization of the specific closed catalog rather than robust internalization.

Authors: We acknowledge the omission. The revised manuscript now includes a new 'Data Preparation' paragraph stating that the approximately 1,700 examples were partitioned into an 80/20 train/test split with no overlap between sets. We added explicit checks confirming that evaluation prompts contain no verbatim or near-verbatim overlap with training instances. While the work is scoped to fixed catalogs, these details address the concern that gains might stem solely from memorization.
Revision: yes
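The kind of check described in this response can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: it shuffles examples with a fixed seed into an 80/20 split and flags test prompts whose word-trigram Jaccard similarity to any training prompt exceeds a threshold. The function names, the trigram choice, and the 0.8 threshold are all assumptions.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle deterministically and return (train, test) with no shared items."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def word_ngrams(text, n=3):
    """Lower-cased word n-grams; texts shorter than n words yield a single tuple."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(train, test, threshold=0.8):
    """Return (test_text, train_text) pairs at or above the similarity threshold."""
    train_grams = [(text, word_ngrams(text)) for text in train]
    flagged = []
    for test_text in test:
        test_grams = word_ngrams(test_text)
        for train_text, grams in train_grams:
            if jaccard(test_grams, grams) >= threshold:
                flagged.append((test_text, train_text))
    return flagged


if __name__ == "__main__":
    # Hypothetical prompts, used only to exercise the functions.
    prompts = [f"list open work orders for pump {i}" for i in range(10)]
    train, test = split_examples(prompts)
    print(len(train), len(test))
    print(find_near_duplicates(train, test))
```

A lexical check like this catches verbatim and lightly edited duplicates but not paraphrases, so it bounds only one kind of leakage between training and evaluation prompts.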

Circularity Check

0 steps flagged

No circularity: purely empirical fine-tuning study with benchmark-driven results

full rationale

The paper is an empirical ML study that fine-tunes Gemma and Qwen models on ~1700 AssetOpsBench examples and reports direct performance metrics (AT-F1, judge scores) under description-free inference. No equations, derivations, or first-principles predictions appear in the provided text or abstract. Claims rest on experimental comparisons against an informed baseline, with explicit statements about the fixed tool catalog; these are falsifiable via benchmarks rather than reducing to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central result (fine-tuned models outperforming baseline while shortening prompts) is measured, not derived by construction from its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions about fine-tuning transferring task-specific knowledge and on the existence of a fixed tool catalog that can be internalized.

free parameters (2)

LoRA rank r
Ablation shows r=32 maximizes planning quality while smaller ranks preserve more general knowledge; value is selected post-experiment.
Training set size
Approximately 1700 examples chosen for fine-tuning spanning tool knowledge and traces.

axioms (1)

domain assumption: A fixed tool catalog can be internalized into model weights via supervised fine-tuning on planning traces.
Invoked to justify shifting knowledge from prompt context to parameters.

pith-pipeline@v0.9.0 · 5859 in / 1281 out tokens · 66737 ms · 2026-05-20T11:46:16.505990+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "We formalize description-free tool planning for a fixed tool catalog... the tool catalog is assumed to be fixed during evaluation. We do not attempt to generalize to unseen tools."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "QLoRA fine-tuning on ~1,700 examples enables ~4B parameter models to produce correct MCP tool-use plans without tool descriptions in the prompt"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 7 internal anchors

[1] Model Context Protocol, “What is the Model Context Protocol (MCP)?” Model Context Protocol Documentation. [Online]. Available: https://modelcontextprotocol.io/docs/getting-started/intro. Accessed: Apr. 30, 2026.

[2] D. C. Patel, S. Lin, J. T. Rayfield, N. Zhou, R. Vaculin, N. Martinez, F. O’Donncha, and J. Kalagnanam, “AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance,” submitted to ICLR 2026, OpenReview, Sept. 2025. [Online]. Available: https://openreview.net/forum?id=ld6JUQbhes

[3] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” in Advances in Neural Information Processing Systems, vol. 36, 2023.

[4] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” in International Conference on Learning Representations, 2022.

[5], [6] Google DeepMind, “Gemini 2.5 Flash and 2.5 Flash Image Model Card,” [Online]. Available: https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf. Accessed: Apr. 30, 2026.

[7] Google AI for Developers, “Gemma 4 model overview,” Google AI for Developers Documentation, 2026. [Online]. Available: https://ai.google.dev/gemma/docs/core. Accessed: Apr. 30, 2026.

[8] J. Han, L. Du, H. Du, X. Zhou, Y. Wu, W. Zheng, and D. Han, “SLIM: Let LLM Learn More and Forget Less with Soft LoRA and Identity Mixture,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025.

[9] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,” arXiv preprint arXiv:2302.04761, 2023.

[10] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large Language Model Connected with Massive APIs,” arXiv preprint arXiv:2305.15334, 2023.

[11] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs,” arXiv preprint arXiv:2307.16789, 2023.

[12] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[13] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes,” arXiv preprint arXiv:2305.02301, 2023.

[14] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” arXiv preprint arXiv:2306.05685, 2023.

[15] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,” arXiv preprint arXiv:2009.03300, 2020.

[16] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge,” arXiv preprint arXiv:1803.05457, 2018.

[17] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “HellaSwag: Can a Machine Really Finish Your Sentence?,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

[18] Qwen Team, “Qwen3 Technical Report,” Alibaba Group, 2025. [Online]. Available: https://qwenlm.github.io/blog/qwen3/. Accessed: May 6, 2026.

Appendix, Table IV: Three training datasets (extracted text is truncated)

| Dataset | N | Source | Description |
| --- | --- | --- | --- |
| Tool Knowledge | ∼500 | Gemini 2.5 Flash | Tool taxonomy, ownership, args, routing, hard negatives |
| Planning | ∼1,200 | Gold plans + paraphrases | Scenario... |
