pith. machine review for the scientific record.

arxiv: 2605.07990 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG · cs.SE

Recognition: 2 theorem links · Lean Theorem

Tool Calling is Linearly Readable and Steerable in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.SE
keywords tool calling · activation steering · linear representations · mechanistic interpretability · language model agents · error detection
0 comments

The pith

The identity of the tool a language model chooses is linearly readable from its activations and can be switched by adding a mean difference vector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models internally represent the tool they will call in a linear way that can be read directly from hidden states. Researchers calculate the average activation pattern for each tool across prompts and use the difference between two such patterns as a steering vector. Adding this vector during a forward pass causes the model to select the alternate tool at 77-100 percent accuracy on simple prompts, after which it generates JSON arguments that match the new tool's schema. The same averages also identify likely mistakes in advance because queries with small gaps between the top two tools produce far more errors. The pattern appears across model sizes from hundreds of millions to tens of billions of parameters and persists even in base models before instruction tuning.
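
As a concrete reading of that recipe, here is a minimal sketch of mean-difference steering, assuming a Hugging Face causal LM. The model id, layer index, steering scale, and toy prompt lists are illustrative stand-ins rather than the paper's settings.

```python
# Minimal sketch of mean-difference steering, assuming a Hugging Face causal LM.
# LAYER, ALPHA, the model id, and the toy prompts are illustrative stand-ins,
# not the paper's settings (the paper sweeps layers and reports a jump near α≈0.7).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # one of the probed model families
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 20   # hypothetical mid layer
ALPHA = 1.0  # hypothetical steering scale

MENU = "Tools: get_weather(city), send_email(to, body). Call exactly one.\nUser: "
prompts_tool_a = [MENU + q for q in ["What's the forecast in Paris?",
                                     "Is it raining in Tokyo right now?"]]
prompts_tool_b = [MENU + q for q in ["Email Bob that the meeting moved.",
                                     "Send a note to alice@example.com."]]

@torch.no_grad()
def last_token_state(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final prompt token of one layer."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # (d_model,)

def tool_mean(prompts: list[str]) -> torch.Tensor:
    """Per-tool mean activation: the free parameter named in the ledger below."""
    return torch.stack([last_token_state(p) for p in prompts]).mean(dim=0)

steer = tool_mean(prompts_tool_b) - tool_mean(prompts_tool_a)  # A -> B vector

def add_steer(_module, _inputs, output):
    """Forward hook: add the vector to the residual stream at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += ALPHA * steer.to(hidden.dtype)

# Output of decoder layer LAYER-1 corresponds to hidden_states[LAYER].
handle = model.model.layers[LAYER - 1].register_forward_hook(add_steer)
ids = tok(prompts_tool_a[0], return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=40, do_sample=False)
print(tok.decode(gen[0, ids.shape[1]:]))  # expect a send_email call, not get_weather
handle.remove()
```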

Core claim

The identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain airline tools reaches 61-89% accuracy, ruling out a pure topic explanation. Even base models encode the chosen tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%.
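
The one-direction decomposition can be sketched directly. The snippet below reuses model, tok, and steer from the sketch under the pith; the target tool name send_email, and hence its first token, is an illustrative assumption.

```python
# Decompose the steering vector along the output-layer (unembedding) row for
# the target tool's first token. Reuses model, tok, steer from the steering
# sketch above; "send_email" is an illustrative target tool.
W_U = model.get_output_embeddings().weight                 # (vocab, d_model)
first_id = tok.encode("send_email", add_special_tokens=False)[0]
u_hat = W_U[first_id].float()
u_hat = u_hat / u_hat.norm()                               # unit output-row direction

v = steer.float()
v_par = (v @ u_hat) * u_hat                                # component along the row
v_orth = v - v_par                                         # orthogonal remainder

# Matched-magnitude test: swap steer in the hook for u_hat * v_par.norm()
# (reported to reach 93-100%) or for v_orth (reported to leave the choice
# almost untouched).
print(f"share of |v|^2 along the row: {(v_par.norm() / v.norm()).item() ** 2:.3f}")
```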

What carries the argument

the mean-difference vector between average internal activations for each tool
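
In symbols, under our notation rather than the paper's, the carrier is:

```latex
% P_T: prompts that elicit tool T; h^{(l)}: layer-l hidden state at the last
% prompt token; alpha: steering scale. Notation ours, not the paper's.
\mu_T^{(\ell)} = \frac{1}{|\mathcal{P}_T|} \sum_{p \in \mathcal{P}_T} h^{(\ell)}(p),
\qquad
v_{A \to B} = \mu_B^{(\ell)} - \mu_A^{(\ell)},
\qquad
\tilde{h}^{(\ell)} = h^{(\ell)} + \alpha \, v_{A \to B}.
```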

Load-bearing premise

The vector difference between average activations for different tools specifically encodes tool identity rather than correlated features such as query topic or prompt syntax.

What would settle it

If adding the mean-difference vector between two tools no longer produces the new tool name and matching JSON schema at high rates on a fresh set of prompts outside the original calculation set, the linear steerability claim would fail.
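
One minimal way to run that check, reusing model and tok from the steering sketch with its hook registered; plain substring matching on the tool name is a crude stand-in for the paper's name-plus-JSON-schema scoring.

```python
# Held-out falsification check: with the steering hook registered, measure how
# often prompts disjoint from the mean-computation set switch to the target
# tool. Reuses model and tok from the steering sketch.
import torch

@torch.no_grad()
def switch_rate(fresh_prompts: list[str], target_tool: str) -> float:
    hits = 0
    for p in fresh_prompts:
        ids = tok(p, return_tensors="pt").input_ids
        gen = model.generate(ids, max_new_tokens=40, do_sample=False)
        hits += target_tool in tok.decode(gen[0, ids.shape[1]:])
    return hits / len(fresh_prompts)

# If this collapses toward chance on prompts outside the mean-computation set,
# the linear steerability claim fails by the criterion above.
```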

Figures

Figures reproduced from arXiv: 2605.07990 by Seonglae Cho (2), Zekun Wu (1, 2), Yufei Yang (3), Ze Wang (1), Adriano Koshiyama (1, 2), Sahan Bulathwela (1), Maria Perez-Ortiz (1) ((1) University College London, (2) Holistic AI, (3) Imperial College London).

Figure 1. Overview of the three-stage circuit and steering demonstration. Adding a mean-difference vector redirects… [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2. Left: 3D PCA of 15-tool activations, one model per family. Right: cumulative variance converges to… [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3. Mean steering accuracy vs model scale across… [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4. Linearity of tool selection. Cosine similar… [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5. Steering vectors align with unembedding di… [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 7. Steering accuracy across 12 models × 4 benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8. ToolBench per-domain steering (Gemma 3 4B). Sports fails due to near-synonymous tools. Per-domain switch accuracy: Finance 75%, Health 75%, Sports 0%, Science 80%, Music 55%, Travel 95%; cross-domain 100%. Accompanying SAE feature table (truncated at source):
  Layer | SAE | Feature | Neuronpedia label
  2 | tc-262k | 159677 | "tool description"
  9 | tc-262k | 73435 | "tool selection"
  9 | res-16k | 5625 | "function calls"
  16 | tc-262k | 177725 | "tool calls"
  17 | res-16k | 1645 | "function calls"
  17 | res-16k | 15177 | "tools and their functions"
  22 | tc-262k | 19924 | "tool use invocation"
  22 | res-16k | 3122 | "function calls"
  25 | t…
  view at source ↗
Figure 9. k90 as the number of real ToolBench APIs grows from 5 to 2,000, across 12 instruction-tuned models from three families (Gemma 3, Qwen 3 / Qwen 2.5, Llama 3.1; stratified sampling across 49 domains, with descriptions). All models compress tool activations into k90 that grows far slower than K, but the absolute level depends more on architecture (attention heads) than parameter count: the Gemma 3 family (bl… view at source ↗
Figure 10. Phase transition at α≈0.7. 4B+ models jump to 95–100% with no collapse. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11. K-tool scaling. Larger models sustain accuracy better. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 13. Base vs IT. Left: 5-tool accuracy (base im… [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14. Attribution patching peak layer. IT peaks at… [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 12. SAE feature sharpening across layers. Spe… [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
Figure 15. Cross-tool divergence by layer depth. Di… [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 17. 3D PCA of 39 tools from 3 sources (15 syn… [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18. Linearity interpolation for all 4 tested tool… [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗
read the original abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $\tau$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
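
As a concrete reading of the abstract's gap-based error flag, a minimal sketch: it assumes last_token_state from the steering sketch under the pith, a means dict from tool name to per-tool mean vector, and an illustrative cutoff (the paper reports error rates by gap quantile, not a fixed threshold).

```python
# Gap-based error flagging: rank tools by cosine similarity of the hidden state
# to each per-tool mean; a small top-1/top-2 margin predicts wrong calls
# (14-21x more on Gemma 3 12B/27B per the abstract). Assumes last_token_state
# from the steering sketch; `means` maps tool name -> mean activation vector.
import torch
import torch.nn.functional as F

def gap_flag(prompt: str, means: dict[str, torch.Tensor], cutoff: float = 0.02):
    h = last_token_state(prompt).float()
    sims = sorted(((F.cosine_similarity(h, m.float(), dim=0).item(), name)
                   for name, m in means.items()), reverse=True)
    margin = sims[0][0] - sims[1][0]
    return sims[0][1], margin, margin < cutoff  # (predicted tool, gap, risky?)
```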

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The manuscript claims that tool identity in language models is linearly readable and steerable via internal activations. Across 12 instruction-tuned models (Gemma 3, Qwen 3/2.5, Llama 3.1; 270M–27B), mean-difference vectors between per-tool average activations steer tool selection at 77–100% accuracy on name-only single-turn prompts (93–100% for 4B+ models), with autoregressive JSON arguments matching the new tool's schema. The effect concentrates in the output-layer row for the tool's first token (unit vector along this direction reaches 93–100% while the orthogonal residual leaves choice nearly untouched), localizes to a small set of mid- and late-layer attention heads via patching, and is not reducible to topic (within-topic probe on 14 airline tools yields 61–89% top-1 across five 4B–14B models). Base models already encode the correct tool internally (69–82% cosine readout on BFCL) despite 2–10% generation accuracy, and activation gaps predict errors (14–21× higher error rate for smallest vs. largest top-1/top-2 gaps on Gemma 3 12B/27B). All measurements are in single-turn fixed-menu settings.

Significance. If the results hold under the reported controls, the work provides concrete mechanistic evidence that tool selection is mediated by linear directions in activation space, with the output-layer decomposition and within-topic probe addressing key alternative explanations (topic/syntax confounds and post-hoc fitting). The base-model readout result is particularly notable, indicating the representation forms during pretraining and is later wired to output by instruction tuning. The error-prediction finding via activation gaps has direct practical value for reliable tool use. These elements together strengthen the central claim beyond correlational probing and could inform interpretability, safety interventions, and steering of agentic systems.

major comments (2)
  1. [Experiments (within-topic probe)] The within-topic probe (14 airline tools, 61–89% top-1) is load-bearing for ruling out topic as the driver; however, the manuscript should explicitly report the number of queries per tool, the exact layer(s) used for the mean vectors, and whether the same held-out set was used for both mean computation and evaluation to confirm no data leakage.
  2. [Results (steering experiments)] The claim that 'flipping the name is enough' for JSON schema correctness relies on autoregressive continuation; the paper should quantify schema-match rates separately from name accuracy and report whether any residual JSON errors occur even when the steered name is accepted.
minor comments (4)
  1. [Abstract] The abstract reports ranges (77–100%, 93–100% at 4B+) without per-model breakdowns or trial counts; adding a table with exact accuracies, model sizes, and number of prompts per condition would improve reproducibility.
  2. [Localization results] Activation patching localization to 'a small set of mid- and late-layer attention heads' would benefit from a figure or table listing the specific head indices and layers across models.
  3. [Base model analysis] The base-model readout (69–82%) vs. generation (2–10%) is a strong point; clarify whether the cosine readout uses the same mean vectors as the steered models or a separate base-model probe.
  4. [Throughout] Minor typographical inconsistencies in model naming (e.g., 'Gemma 3' vs. 'Gemma-3') should be standardized throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We appreciate the constructive comments on improving the clarity and reproducibility of the within-topic probe and steering results. We address each major comment below and will update the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments (within-topic probe)] The within-topic probe (14 airline tools, 61–89% top-1) is load-bearing for ruling out topic as the driver; however, the manuscript should explicitly report the number of queries per tool, the exact layer(s) used for the mean vectors, and whether the same held-out set was used for both mean computation and evaluation to confirm no data leakage.

    Authors: We agree that these experimental details are necessary for full reproducibility and to confirm the absence of data leakage. We will revise the manuscript to explicitly report the number of queries per tool in the within-topic probe, the exact layer(s) from which the mean activation vectors were computed, and confirmation that the mean vectors were derived from a separate split with evaluation performed on a held-out set with no overlap. revision: yes

  2. Referee: [Results (steering experiments)] The claim that 'flipping the name is enough' for JSON schema correctness relies on autoregressive continuation; the paper should quantify schema-match rates separately from name accuracy and report whether any residual JSON errors occur even when the steered name is accepted.

    Authors: We agree that separating schema-match rates from name accuracy and reporting residual errors would strengthen the steering results. We will revise the relevant Results section to include explicit quantification of schema-match rates independent of name accuracy and to report the incidence of any residual JSON errors in cases where the steered tool name is accepted. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives its claims from direct empirical measurements of internal activations across multiple models, causal steering via mean-difference vector addition on held-out single-turn prompts, within-topic controls across 14 airline tools, activation patching to localize heads, and base-model readout comparisons. None of these steps reduce to their inputs by construction: the mean vectors are computed on separate data and tested causally, the within-topic probe explicitly addresses confounds, and the output-layer decomposition is an independent verification. No self-citation chains, ansatzes, or fitted predictions masquerading as results appear in the load-bearing sections. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical computation of per-tool mean activation vectors from model forward passes on example prompts; these vectors are data-dependent and serve as the steering direction.

free parameters (1)
  • per-tool mean activation vectors
    Computed as averages of internal activations over prompts that elicit each tool; used to form the difference vector for steering.
axioms (1)
  • domain assumption: Tool identity is represented as a linear direction in the model's residual stream or attention outputs.
    The entire steering and readout procedure assumes that adding a fixed vector can reliably alter tool choice without destroying other capabilities.

pith-pipeline@v0.9.0 · 5748 in / 1522 out tokens · 44401 ms · 2026-05-11T03:04:19.844351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 15 internal anchors

  1. [1]

     Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In NeurIPS

  2. [2]

     Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. In NeurIPS

  3. [4]

     Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. 2026. ASA: Training-free representation engineering for tool-calling agents. arXiv preprint arXiv:2602.04935

  4. [6]

     Emmanuel Ameisen and others. 2025. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread

  5. [9]

     Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems

  6. [10]

     Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and others. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789

  7. [12]

     Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334

  8. [14]

     Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, and others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread

  9. [16]

     Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265

  10. [17]

     Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593

  11. [18]

     Jason Wei and others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems

  12. [20]

     Minghao Li and others. 2023. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244

  13. [21]

     Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems

  14. [24]

     Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, and others. 2022. Toy models of superposition. Transformer Circuits Thread

  15. [25]

     Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, and others. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread

  16. [27]

     Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems

  17. [28]

     Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, and others. 2023. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405

  18. [30]

     Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. 2023. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001

  19. [34]

     Qwen Team. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115

  20. [35]

     Aaron Grattafiori and others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  21. [36]

     Neel Nanda and Joseph Bloom. 2022. TransformerLens: A library for mechanistic interpretability of GPT-style language models. https://github.com/TransformerLensOrg/TransformerLens

  22. [37]

     Joseph Bloom and others. 2024. SAELens: A library for sparse autoencoder training, analysis, and interpretability. https://github.com/jbloomAUS/SAELens

  23. [38]

     Tom Lieberum, Senthooran Rajamanoharan, János Kramár, and others. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147

  24. [39]

     Goodfire: Interpretability infrastructure for…

  25. [40]

     Fanjia Yan, Huanzhi Mao, Charlie Ji, Shishir Patil, Ion Stoica, Joseph E Gonzalez, and Hao Zhang. 2024. Berkeley function calling leaderboard. In NeurIPS

  26. [41]

     Emmanuel Ameisen and others. 2025. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/methods.html

  27. [42]

     Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Lee Sharkey, and others. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717

  28. [43]

     Joseph Bloom and others. 2024. SAELens: A library for sparse autoencoder training, analysis, and interpretability. https://github.com/jbloomAUS/SAELens

  29. [44]

     Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, and others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread

  30. [45]

     Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. In NeurIPS

  31. [46]

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

  32. [47]

     Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, and others. 2022. Toy models of superposition. Transformer Circuits Thread

  33. [48]

    Joshua Engels, Isaac Liao, Eric J Michaud, Wes Gurnee, and Max Tegmark. 2024. Not all language model features are linear. arXiv preprint arXiv:2405.14860

  34. [49]

     Gemma Team. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295

  35. [50]

     Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767

  36. [51]

     Aaron Grattafiori and others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  37. [52]

     Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems

  38. [53]

    Kait Healy, Bharathi Srinivasan, Visakh Madathil, and Jing Wu. 2026. Internal representations as indicators of hallucinations in agent tool selection. arXiv preprint arXiv:2601.05214

  39. [54]

    Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. 2023. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001

  40. [55]

     Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023a. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems

  41. [56]

     Minghao Li and others. 2023b. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244

  42. [57]

     Tom Lieberum, Senthooran Rajamanoharan, János Kramár, and others. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147

  43. [58]

     Johnny Lin. 2024. NeuronPedia: A platform for mechanistic interpretability. https://www.neuronpedia.org/

  44. [59]

    Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824

  45. [60]

     Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In NeurIPS

  46. [61]

     Neel Nanda and Joseph Bloom. 2022. TransformerLens: A library for mechanistic interpretability of GPT-style language models. https://github.com/TransformerLensOrg/TransformerLens

  47. [62]

    Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941

  48. [63]

    Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658

  49. [64]

     Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334

  50. [65]

     Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and others. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789

  51. [66]

     Qwen Team. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  52. [67]

     Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435

  53. [68]

     Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems

  54. [69]

     Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, and others. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread

  55. [70]

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154

  56. [71]

    Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. 2024. Function vectors in large language models. arXiv preprint arXiv:2310.15213

  57. [72]

    Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte Castricato. 2024. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248

  58. [73]

     Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265

  59. [74]

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593

  60. [75]

    Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. 2026. ASA : Training-free representation engineering for tool-calling agents. arXiv preprint arXiv:2602.04935

  61. [76]

    Fanjia Yan, Huanzhi Mao, Charlie Ji, Shishir Patil, Ion Stoica, Joseph E Gonzalez, and Hao Zhang. 2024. Berkeley function calling leaderboard. In NeurIPS

  62. [77]

     Shunyu Yao and others. 2024. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045

  63. [78]

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, and 1 others. 2023. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405