pith. machine review for the scientific record.

arxiv: 2605.07990 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG · cs.SE

Recognition: 2 theorem links · Lean Theorem

Tool Calling is Linearly Readable and Steerable in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.SE
keywords tool calling · activation steering · linear representations · mechanistic interpretability · language model agents · error detection
0 comments

The pith

The identity of the tool a language model chooses is linearly readable from its activations and can be switched by adding a mean difference vector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models internally represent the tool they will call in a linear way that can be read directly from hidden states. Researchers calculate the average activation pattern for each tool across prompts and use the difference between two such patterns as a steering vector. Adding this vector during a forward pass causes the model to select the alternate tool at 77-100 percent accuracy on simple prompts, after which it generates JSON arguments that match the new tool's schema. The same averages also identify likely mistakes in advance because queries with small gaps between the top two tools produce far more errors. The pattern appears across model sizes from hundreds of millions to tens of billions of parameters and persists even in base models before instruction tuning.
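
As a concrete reading of that recipe, here is a minimal sketch of mean-difference steering, assuming a Hugging Face causal LM. The model id, layer index, steering scale, and toy prompt lists are illustrative stand-ins rather than the paper's settings.

```python
# Minimal sketch of mean-difference steering, assuming a Hugging Face causal LM.
# LAYER, ALPHA, the model id, and the toy prompts are illustrative stand-ins,
# not the paper's settings (the paper sweeps layers and reports a jump near α≈0.7).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # one of the probed model families
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 20   # hypothetical mid layer
ALPHA = 1.0  # hypothetical steering scale

MENU = "Tools: get_weather(city), send_email(to, body). Call exactly one.\nUser: "
prompts_tool_a = [MENU + q for q in ["What's the forecast in Paris?",
                                     "Is it raining in Tokyo right now?"]]
prompts_tool_b = [MENU + q for q in ["Email Bob that the meeting moved.",
                                     "Send a note to alice@example.com."]]

@torch.no_grad()
def last_token_state(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final prompt token of one layer."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # (d_model,)

def tool_mean(prompts: list[str]) -> torch.Tensor:
    """Per-tool mean activation: the free parameter named in the ledger below."""
    return torch.stack([last_token_state(p) for p in prompts]).mean(dim=0)

steer = tool_mean(prompts_tool_b) - tool_mean(prompts_tool_a)  # A -> B vector

def add_steer(_module, _inputs, output):
    """Forward hook: add the vector to the residual stream at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += ALPHA * steer.to(hidden.dtype)

# Output of decoder layer LAYER-1 corresponds to hidden_states[LAYER].
handle = model.model.layers[LAYER - 1].register_forward_hook(add_steer)
ids = tok(prompts_tool_a[0], return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=40, do_sample=False)
print(tok.decode(gen[0, ids.shape[1]:]))  # expect a send_email call, not get_weather
handle.remove()
```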

Core claim

The identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain airline tools reaches 61-89% accuracy, ruling out a pure topic explanation. Even base models encode the chosen tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%.
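
The one-direction decomposition can be sketched directly. The snippet below reuses model, tok, and steer from the sketch under the pith; the target tool name send_email, and hence its first token, is an illustrative assumption.

```python
# Decompose the steering vector along the output-layer (unembedding) row for
# the target tool's first token. Reuses model, tok, steer from the steering
# sketch above; "send_email" is an illustrative target tool.
W_U = model.get_output_embeddings().weight                 # (vocab, d_model)
first_id = tok.encode("send_email", add_special_tokens=False)[0]
u_hat = W_U[first_id].float()
u_hat = u_hat / u_hat.norm()                               # unit output-row direction

v = steer.float()
v_par = (v @ u_hat) * u_hat                                # component along the row
v_orth = v - v_par                                         # orthogonal remainder

# Matched-magnitude test: swap steer in the hook for u_hat * v_par.norm()
# (reported to reach 93-100%) or for v_orth (reported to leave the choice
# almost untouched).
print(f"share of |v|^2 along the row: {(v_par.norm() / v.norm()).item() ** 2:.3f}")
```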

What carries the argument

the mean-difference vector between average internal activations for each tool
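
In symbols, under our notation rather than the paper's, the carrier is:

```latex
% P_T: prompts that elicit tool T; h^{(l)}: layer-l hidden state at the last
% prompt token; alpha: steering scale. Notation ours, not the paper's.
\mu_T^{(\ell)} = \frac{1}{|\mathcal{P}_T|} \sum_{p \in \mathcal{P}_T} h^{(\ell)}(p),
\qquad
v_{A \to B} = \mu_B^{(\ell)} - \mu_A^{(\ell)},
\qquad
\tilde{h}^{(\ell)} = h^{(\ell)} + \alpha \, v_{A \to B}.
```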

Load-bearing premise

The vector difference between average activations for different tools specifically encodes tool identity rather than correlated features such as query topic or prompt syntax.

What would settle it

If adding the mean-difference vector between two tools no longer produces the new tool name and matching JSON schema at high rates on a fresh set of prompts outside the original calculation set, the linear steerability claim would fail.
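
One minimal way to run that check, reusing model and tok from the steering sketch with its hook registered; plain substring matching on the tool name is a crude stand-in for the paper's name-plus-JSON-schema scoring.

```python
# Held-out falsification check: with the steering hook registered, measure how
# often prompts disjoint from the mean-computation set switch to the target
# tool. Reuses model and tok from the steering sketch.
import torch

@torch.no_grad()
def switch_rate(fresh_prompts: list[str], target_tool: str) -> float:
    hits = 0
    for p in fresh_prompts:
        ids = tok(p, return_tensors="pt").input_ids
        gen = model.generate(ids, max_new_tokens=40, do_sample=False)
        hits += target_tool in tok.decode(gen[0, ids.shape[1]:])
    return hits / len(fresh_prompts)

# If this collapses toward chance on prompts outside the mean-computation set,
# the linear steerability claim fails by the criterion above.
```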

Figures

Figures reproduced from arXiv: 2605.07990 by Seonglae Cho (2), Zekun Wu (1, 2), Yufei Yang (3), Ze Wang (1), Adriano Koshiyama (1, 2), Sahan Bulathwela (1), Maria Perez-Ortiz (1) ((1) University College London, (2) Holistic AI, (3) Imperial College London).

Figure 1. Overview of the three-stage circuit and steering demonstration. Adding a mean-difference vector redirects… [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2. Left: 3D PCA of 15-tool activations, one model per family. Right: cumulative variance converges to… [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3. Mean steering accuracy vs model scale across… [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4. Linearity of tool selection. Cosine similar… [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5. Steering vectors align with unembedding di… [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 7. Steering accuracy across 12 models × 4 benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8. ToolBench per-domain steering (Gemma 3 4B). Sports fails due to near-synonymous tools. Per-domain switch accuracy: Finance 75%, Health 75%, Sports 0%, Science 80%, Music 55%, Travel 95%; cross-domain 100%. Accompanying SAE feature table (truncated at source):
  Layer | SAE | Feature | Neuronpedia label
  2 | tc-262k | 159677 | "tool description"
  9 | tc-262k | 73435 | "tool selection"
  9 | res-16k | 5625 | "function calls"
  16 | tc-262k | 177725 | "tool calls"
  17 | res-16k | 1645 | "function calls"
  17 | res-16k | 15177 | "tools and their functions"
  22 | tc-262k | 19924 | "tool use invocation"
  22 | res-16k | 3122 | "function calls"
  25 | t…
  view at source ↗
Figure 9. k90 as the number of real ToolBench APIs grows from 5 to 2,000, across 12 instruction-tuned models from three families (Gemma 3, Qwen 3 / Qwen 2.5, Llama 3.1; stratified sampling across 49 domains, with descriptions). All models compress tool activations into k90 that grows far slower than K, but the absolute level depends more on architecture (attention heads) than parameter count: the Gemma 3 family (bl… view at source ↗
Figure 10. Phase transition at α≈0.7. 4B+ models jump to 95–100% with no collapse. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11. K-tool scaling. Larger models sustain accuracy better. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 13. Base vs IT. Left: 5-tool accuracy (base im… [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14. Attribution patching peak layer. IT peaks at… [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 12. SAE feature sharpening across layers. Spe… [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
Figure 15. Cross-tool divergence by layer depth. Di… [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 17. 3D PCA of 39 tools from 3 sources (15 syn… [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18. Linearity interpolation for all 4 tested tool… [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗
read the original abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $\tau$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
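
As a concrete reading of the abstract's gap-based error flag, a minimal sketch: it assumes last_token_state from the steering sketch under the pith, a means dict from tool name to per-tool mean vector, and an illustrative cutoff (the paper reports error rates by gap quantile, not a fixed threshold).

```python
# Gap-based error flagging: rank tools by cosine similarity of the hidden state
# to each per-tool mean; a small top-1/top-2 margin predicts wrong calls
# (14-21x more on Gemma 3 12B/27B per the abstract). Assumes last_token_state
# from the steering sketch; `means` maps tool name -> mean activation vector.
import torch
import torch.nn.functional as F

def gap_flag(prompt: str, means: dict[str, torch.Tensor], cutoff: float = 0.02):
    h = last_token_state(prompt).float()
    sims = sorted(((F.cosine_similarity(h, m.float(), dim=0).item(), name)
                   for name, m in means.items()), reverse=True)
    margin = sims[0][0] - sims[1][0]
    return sims[0][1], margin, margin < cutoff  # (predicted tool, gap, risky?)
```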

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The manuscript claims that tool identity in language models is linearly readable and steerable via internal activations. Across 12 instruction-tuned models (Gemma 3, Qwen 3/2.5, Llama 3.1; 270M–27B), mean-difference vectors between per-tool average activations steer tool selection at 77–100% accuracy on name-only single-turn prompts (93–100% for 4B+ models), with autoregressive JSON arguments matching the new tool's schema. The effect concentrates in the output-layer row for the tool's first token (unit vector along this direction reaches 93–100% while the orthogonal residual leaves choice nearly untouched), localizes to a small set of mid- and late-layer attention heads via patching, and is not reducible to topic (within-topic probe on 14 airline tools yields 61–89% top-1 across five 4B–14B models). Base models already encode the correct tool internally (69–82% cosine readout on BFCL) despite 2–10% generation accuracy, and activation gaps predict errors (14–21× higher error rate for smallest vs. largest top-1/top-2 gaps on Gemma 3 12B/27B). All measurements are in single-turn fixed-menu settings.

Significance. If the results hold under the reported controls, the work provides concrete mechanistic evidence that tool selection is mediated by linear directions in activation space, with the output-layer decomposition and within-topic probe addressing key alternative explanations (topic/syntax confounds and post-hoc fitting). The base-model readout result is particularly notable, indicating the representation forms during pretraining and is later wired to output by instruction tuning. The error-prediction finding via activation gaps has direct practical value for reliable tool use. These elements together strengthen the central claim beyond correlational probing and could inform interpretability, safety interventions, and steering of agentic systems.

major comments (2)
  1. [Experiments (within-topic probe)] The within-topic probe (14 airline tools, 61–89% top-1) is load-bearing for ruling out topic as the driver; however, the manuscript should explicitly report the number of queries per tool, the exact layer(s) used for the mean vectors, and whether the same held-out set was used for both mean computation and evaluation to confirm no data leakage.
  2. [Results (steering experiments)] The claim that 'flipping the name is enough' for JSON schema correctness relies on autoregressive continuation; the paper should quantify schema-match rates separately from name accuracy and report whether any residual JSON errors occur even when the steered name is accepted.
minor comments (4)
  1. [Abstract] The abstract reports ranges (77–100%, 93–100% at 4B+) without per-model breakdowns or trial counts; adding a table with exact accuracies, model sizes, and number of prompts per condition would improve reproducibility.
  2. [Localization results] Activation patching localization to 'a small set of mid- and late-layer attention heads' would benefit from a figure or table listing the specific head indices and layers across models.
  3. [Base model analysis] The base-model readout (69–82%) vs. generation (2–10%) is a strong point; clarify whether the cosine readout uses the same mean vectors as the steered models or a separate base-model probe.
  4. [Throughout] Minor typographical inconsistencies in model naming (e.g., 'Gemma 3' vs. 'Gemma-3') should be standardized throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We appreciate the constructive comments on improving the clarity and reproducibility of the within-topic probe and steering results. We address each major comment below and will update the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments (within-topic probe)] The within-topic probe (14 airline tools, 61–89% top-1) is load-bearing for ruling out topic as the driver; however, the manuscript should explicitly report the number of queries per tool, the exact layer(s) used for the mean vectors, and whether the same held-out set was used for both mean computation and evaluation to confirm no data leakage.

    Authors: We agree that these experimental details are necessary for full reproducibility and to confirm the absence of data leakage. We will revise the manuscript to explicitly report the number of queries per tool in the within-topic probe, the exact layer(s) from which the mean activation vectors were computed, and confirmation that the mean vectors were derived from a separate split with evaluation performed on a held-out set with no overlap. revision: yes

  2. Referee: [Results (steering experiments)] The claim that 'flipping the name is enough' for JSON schema correctness relies on autoregressive continuation; the paper should quantify schema-match rates separately from name accuracy and report whether any residual JSON errors occur even when the steered name is accepted.

    Authors: We agree that separating schema-match rates from name accuracy and reporting residual errors would strengthen the steering results. We will revise the relevant Results section to include explicit quantification of schema-match rates independent of name accuracy and to report the incidence of any residual JSON errors in cases where the steered tool name is accepted. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives its claims from direct empirical measurements of internal activations across multiple models, causal steering via mean-difference vector addition on held-out single-turn prompts, within-topic controls across 14 airline tools, activation patching to localize heads, and base-model readout comparisons. None of these steps reduce to their inputs by construction: the mean vectors are computed on separate data and tested causally, the within-topic probe explicitly addresses confounds, and the output-layer decomposition is an independent verification. No self-citation chains, ansatzes, or fitted predictions masquerading as results appear in the load-bearing sections. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical computation of per-tool mean activation vectors from model forward passes on example prompts; these vectors are data-dependent and serve as the steering direction.

free parameters (1)
  • per-tool mean activation vectors
    Computed as averages of internal activations over prompts that elicit each tool; used to form the difference vector for steering.
axioms (1)
  • domain assumption: Tool identity is represented as a linear direction in the model's residual stream or attention outputs.
    The entire steering and readout procedure assumes that adding a fixed vector can reliably alter tool choice without destroying other capabilities.

pith-pipeline@v0.9.0 · 5748 in / 1522 out tokens · 44401 ms · 2026-05-11T03:04:19.844351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 15 internal anchors

  1. [1]

     Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In NeurIPS

  2. [2]

     Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. In NeurIPS

  3. [4]

     Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. 2026. ASA: Training-free representation engineering for tool-calling agents. arXiv preprint arXiv:2602.04935

  4. [6]

     Emmanuel Ameisen and others. 2025. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread

  5. [9]

     Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems

  6. [10]

     Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and others. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789

  7. [12]

     Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334

  8. [14]

     Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, and others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread

  9. [16]

     Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265

  10. [17]

     Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593

  11. [18]

     Jason Wei and others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems

  12. [20]

     Minghao Li and others. 2023. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244

  13. [21]

     Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems

  14. [24]

     Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, and others. 2022. Toy models of superposition. Transformer Circuits Thread

  15. [25]

     Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, and others. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread

  16. [27]

     Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems

  17. [28]

     Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, and others. 2023. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405

  18. [30]

     Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. 2023. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001

  19. [34]

     Qwen Team. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115

  20. [35]

     Aaron Grattafiori and others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  21. [36]

     Neel Nanda and Joseph Bloom. 2022. TransformerLens: A library for mechanistic interpretability of GPT-style language models. https://github.com/TransformerLensOrg/TransformerLens

  22. [37]

     Joseph Bloom and others. 2024. SAELens: A library for sparse autoencoder training, analysis, and interpretability. https://github.com/jbloomAUS/SAELens

  23. [38]

     Tom Lieberum, Senthooran Rajamanoharan, János Kramár, and others. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147

  24. [39]

     Goodfire: Interpretability infrastructure for…

  25. [40]

     Fanjia Yan, Huanzhi Mao, Charlie Ji, Shishir Patil, Ion Stoica, Joseph E Gonzalez, and Hao Zhang. 2024. Berkeley function calling leaderboard. In NeurIPS

  26. [41]

     Emmanuel Ameisen and others. 2025. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/methods.html

  27. [42]

     Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Lee Sharkey, and others. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717

  28. [43]

     Joseph Bloom and others. 2024. SAELens: A library for sparse autoencoder training, analysis, and interpretability. https://github.com/jbloomAUS/SAELens

  29. [44]

     Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, and others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread

  30. [45]

     Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. In NeurIPS

  31. [46]

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

  32. [47]

     Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, and others. 2022. Toy models of superposition. Transformer Circuits Thread

  33. [48]

    Joshua Engels, Isaac Liao, Eric J Michaud, Wes Gurnee, and Max Tegmark. 2024. Not all language model features are linear. arXiv preprint arXiv:2405.14860

  34. [49]

     Gemma Team. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295

  35. [50]

     Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767

  36. [51]

     Aaron Grattafiori and others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  37. [52]

     Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems

  38. [53]

    Kait Healy, Bharathi Srinivasan, Visakh Madathil, and Jing Wu. 2026. Internal representations as indicators of hallucinations in agent tool selection. arXiv preprint arXiv:2601.05214

  39. [54]

    Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. 2023. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001

  40. [55]

     Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023a. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems

  41. [56]

     Minghao Li and others. 2023b. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244

  42. [57]

     Tom Lieberum, Senthooran Rajamanoharan, János Kramár, and others. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147

  43. [58]

     Johnny Lin. 2024. NeuronPedia: A platform for mechanistic interpretability. https://www.neuronpedia.org/

  44. [59]

    Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824

  45. [60]

     Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In NeurIPS

  46. [61]

     Neel Nanda and Joseph Bloom. 2022. TransformerLens: A library for mechanistic interpretability of GPT-style language models. https://github.com/TransformerLensOrg/TransformerLens

  47. [62]

    Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941

  48. [63]

    Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658

  49. [64]

     Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334

  50. [65]

     Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and others. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789

  51. [66]

     Qwen Team. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  52. [67]

     Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435

  53. [68]

     Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems

  54. [69]

     Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, and others. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread

  55. [70]

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154

  56. [71]

    Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. 2024. Function vectors in large language models. arXiv preprint arXiv:2310.15213

  57. [72]

    Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte Castricato. 2024. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248

  58. [73]

     Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265

  59. [74]

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593

  60. [75]

    Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. 2026. ASA : Training-free representation engineering for tool-calling agents. arXiv preprint arXiv:2602.04935

  61. [76]

    Fanjia Yan, Huanzhi Mao, Charlie Ji, Shishir Patil, Ion Stoica, Joseph E Gonzalez, and Hao Zhang. 2024. Berkeley function calling leaderboard. In NeurIPS

  62. [77]

     Shunyu Yao and others. 2024. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045

  63. [78]

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, and 1 others. 2023. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405