pith. sign in

arxiv: 2605.18882 · v1 · pith:U3SDSGQSnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

Pith reviewed 2026-05-20 16:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM agentstool callingover-calling biassparse autoencodersmechanistic interpretabilitydecision biassteering methods
0
0 comments X

The pith

LLM agents over-call tools because their call/no-call mapping includes an activation-independent bias toward calling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLM agents show a consistent over-calling tendency on the When2Call benchmark, achieving high call accuracy but low no-call accuracy and overall accuracy of only 55-70 percent across six models. The authors trace this to the Intrinsic Bias Hypothesis, under which the decision process carries a fixed offset favoring calls even when call and no-call activations are balanced. They recover aligned feature directions with sparse autoencoders, measure the offset directly from the signed activation margin, and then apply a closed-form correction called Adaptive Margin-Calibrated Steering to remove it. This adjustment reduces unnecessary calls while preserving call accuracy and lifting overall performance. A sympathetic reader would care because the work converts an observed behavioral flaw into a measurable, correctable feature-level property.

Core claim

Across all six models the decision boundary is neutral only when no-call activation exceeds call activation by the estimated offset, confirming the presence of an activation-independent call bias. Applying Adaptive Margin-Calibrated Steering to cancel the offset along the SAE decoder directions mitigates over-calling and raises overall accuracy with negligible loss in call accuracy.

What carries the argument

Sparse Autoencoders that extract behavior-aligned feature bases for the call/no-call decision, reduced to a signed activation margin whose offset is estimated and then cancelled by Adaptive Margin-Calibrated Steering.

If this is right

  • The same activation-independent offset appears consistently across models from three different families.
  • Cancelling the offset improves overall accuracy while call accuracy remains nearly unchanged.
  • Over-calling can be treated as a mechanistic object rather than an empirical observation only.
  • The bias correction requires no retraining and works through a closed-form shift along existing SAE directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostic and correction approach could be tested on other agent decisions such as when to stop or when to plan.
  • If the offset originates in training data, similar biases may appear in non-agent LLM tasks that involve binary choices.
  • Future measurements could check whether the offset size scales with model size or training compute.

Load-bearing premise

The SAE-recovered feature directions are causally aligned with the actual call/no-call decision boundary rather than being an artifact of feature selection.

What would settle it

If Adaptive Margin-Calibrated Steering applied to the diagnosed offset fails to reduce over-calling rates or to improve overall accuracy on the When2Call benchmark, the Intrinsic Bias Hypothesis would be falsified.

Figures

Figures reproduced from arXiv: 2605.18882 by Mengnan Du, Na Zou, Sihang Li, Wei Shi, Xiang Wang, Xiting Wang, Ziheng Peng.

Figure 1
Figure 1. Figure 1: Overview of over-calling, intrinsic bias, and AMCS. (a) Across six target models, call accuracy is high, but no-call accuracy is much lower, reducing overall accuracy; a representative case shows the model calling a tool despite missing required information. (b) Intrinsic Bias Hypothesis: at activation parity (m = 0), the decision still favors CALL; the neutral boundary shifts to m⋆ < 0. (c) AMCS converts … view at source ↗
Figure 2
Figure 2. Figure 2: Top-ranked gating features discovered for Qwen3.5-9B. Bars show the mean activation [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Discriminative validation across six target models. We train 5-fold cross-validated logistic [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Response-conditioned activation geometry. Examples are grouped by the model’s emitted [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Input-conditioned activation geometry, restricted to examples for which the model emits a [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Logistic estimates of intrinsic CALL bias. Left: fitted decision curves with margin standardized within model, so the six curves can be visually compared. Right: raw neutral boundaries mˆ ⋆ . All boundaries are negative, meaning the model becomes decision-neutral only when NO_CALL￾feature mean activation exceeds CALL-feature mean activation. For each model, we fit Eq. 8 on cached SAE activations and observ… view at source ↗
Figure 7
Figure 7. Figure 7: SAE reconstruction loss across the two-stage training curriculum, grouped by model family. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: UMAP visualization of the SAE feature dictionary for each target model. Gray points [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Per-feature mean SAE activation on Tool-Call-failure (over-calling) vs. No-Call-success [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per-feature mean SAE activation on Tool-Call-failure (over-calling) vs. No-Call-success [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-feature mean SAE activation on Tool-Call-failure (over-calling) vs. No-Call-success [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Token-level attribution of the top-10 TOOL_CALL features on a representative context where the model correctly issues a tool call. Green intensity is proportional to each token’s contribu￾tion to the summed feature activation (x-axis). The strongest signal concentrates in the tool schema definitions and the user query, consistent with these features encoding tool-invocation intent. 19 [PITH_FULL_IMAGE:fi… view at source ↗
Figure 13
Figure 13. Figure 13: Token-level attribution of the top-10 NO_CALL features on a representative context where the model correctly withholds a tool call and requests missing information instead. Red intensity is proportional to each token’s contribution to the summed feature activation (x-axis). The signal concentrates in the underspecified user query and tool schema, consistent with these features encoding information-insuffi… view at source ↗
read the original abstract

LLM agents exhibit a consistent tendency to over-call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no-call accuracy, leaving overall accuracy in the 55%-70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no-call decision mapping carries an activation-independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision-neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin-Calibrated Steering (AMCS), a closed-form counter-bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over-calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over-calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent-sae/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM agents exhibit over-calling on the When2Call benchmark due to an Intrinsic Bias Hypothesis (IBH): an activation-independent call offset in the decision mapping. Using SAEs across six models from three families, the authors recover behavior-aligned call/no-call feature bases, reduce them to a signed activation margin, and directly estimate the offset. They report that models are decision-neutral only when no_call activation outweighs call activation. They then introduce Adaptive Margin-Calibrated Steering (AMCS) to cancel the offset, which mitigates over-calling, raises overall accuracy, and preserves call accuracy. Code is provided.

Significance. If the diagnosed offset proves robust to feature-selection choices and the AMCS intervention is shown to be independent of the SAE fitting process, the work would convert an observed empirical bias into a mechanistic, causally correctable object. The consistency of results across six models and the release of reproducible code are clear strengths. The approach could inform bias diagnosis in other agentic settings.

major comments (2)
  1. [Methods] Methods (SAE feature selection and margin reduction): The offset is recovered by selecting SAE features via correlation with call labels on When2Call and then computing a signed activation margin. Without reported ablations on alternative selection criteria (e.g., mutual information, activation magnitude), different sparsity levels, or balanced vs. imbalanced training data, it remains possible that the zero-crossing shift is an artifact of the pipeline rather than an intrinsic model property. This directly affects the load-bearing IBH claim.
  2. [Results] Results (causal steering test and accuracy numbers): The manuscript states that cancelling the offset improves overall accuracy with negligible drop in call accuracy, but provides no details on exact margin calculation, statistical significance tests, or controls for confounding factors in feature selection. These omissions make it difficult to assess whether the reported gains are robust or driven by the same data-dependent correlations used to define the offset.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'activation-independent call offset' is introduced without a concise mathematical definition; a short equation or explicit formula would clarify the quantity being estimated.
  2. [Introduction] The When2Call benchmark description would benefit from a brief table summarizing call vs. no-call accuracy for each of the six models to make the empirical phenomenon immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments on feature selection robustness and result details are well-taken and point to opportunities to strengthen the evidence for the Intrinsic Bias Hypothesis (IBH). We address each major comment below and will revise the manuscript accordingly to improve clarity and demonstrate robustness.

read point-by-point responses
  1. Referee: [Methods] Methods (SAE feature selection and margin reduction): The offset is recovered by selecting SAE features via correlation with call labels on When2Call and then computing a signed activation margin. Without reported ablations on alternative selection criteria (e.g., mutual information, activation magnitude), different sparsity levels, or balanced vs. imbalanced training data, it remains possible that the zero-crossing shift is an artifact of the pipeline rather than an intrinsic model property. This directly affects the load-bearing IBH claim.

    Authors: We agree that additional validation is needed to confirm the offset is not an artifact of the correlation-based selection. Correlation was chosen because it directly ties features to the behavioral labels in When2Call, but we acknowledge this could introduce pipeline dependence. In the revision we will add ablations using mutual information and activation-magnitude selection, vary SAE sparsity levels, and compare balanced versus imbalanced training sets for feature discovery. These results will be reported in an expanded Methods section and supplementary material, showing that the decision-neutral point (no_call activation outweighing call activation) remains consistent across criteria. revision: yes

  2. Referee: [Results] Results (causal steering test and accuracy numbers): The manuscript states that cancelling the offset improves overall accuracy with negligible drop in call accuracy, but provides no details on exact margin calculation, statistical significance tests, or controls for confounding factors in feature selection. These omissions make it difficult to assess whether the reported gains are robust or driven by the same data-dependent correlations used to define the offset.

    Authors: We accept that the current presentation lacks sufficient detail for full evaluation of robustness. The revised Results section will include the exact formula for the signed activation margin, report statistical significance (paired t-tests and Wilcoxon tests on accuracy deltas across models and runs), and add a control using a held-out validation split for feature selection before applying AMCS. This separation will show that accuracy gains from offset cancellation are not artifacts of the original selection correlations and that AMCS remains effective as an independent intervention. revision: yes

Circularity Check

1 steps flagged

Offset estimated from SAE margin on call-labeled data then presented as independent evidence for IBH

specific steps
  1. fitted input called prediction [Abstract]
    "Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision-neutral only when no_call activation outweighs call activation, consistent with IBH."

    The signed activation margin and offset are computed from SAE features chosen for correlation with call behavior on the benchmark; the reported neutrality condition (no_call must outweigh call) is therefore the direct numerical result of that estimation, rendering the 'consistent with IBH' statement a restatement of the fitted input rather than a separate verification.

full rationale

The derivation recovers SAE features aligned to call/no-call labels, reduces them to a signed margin, estimates the offset directly from that margin, and reports that decision neutrality occurs only when no_call activation exceeds call activation as confirmation of the Intrinsic Bias Hypothesis. This step is a direct readout of the fitted quantity rather than an independent prediction. The later AMCS causal intervention supplies partial independence, preventing a higher score, but the initial diagnostic claim remains partially constructed from the same feature-selection and reduction pipeline.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that SAE features capture decision-relevant directions and that the measured offset is a stable, causally actionable property rather than a fitting artifact.

free parameters (1)
  • call offset
    Estimated directly from the difference in SAE activation margins between call and no-call features across models.
axioms (1)
  • domain assumption SAE decoder directions align with behaviorally relevant call/no-call features
    Invoked when reducing SAE features to a signed activation margin for offset estimation.
invented entities (1)
  • Intrinsic Bias Hypothesis (IBH) offset no independent evidence
    purpose: Postulated activation-independent preference for tool calling
    New explanatory construct introduced to account for the observed accuracy gap; no independent falsifiable prediction outside the current experiments is stated.

pith-pipeline@v0.9.0 · 5762 in / 1191 out tokens · 65717 ms · 2026-05-20T16:03:08.566151+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 7 internal anchors

  1. [1]

    OpenAI GPT-5 System Card

    OpenAI. Openai GPT-5 system card.CoRR, abs/2601.03267, 2026

  2. [2]

    Claude opus 4.6 system card

    Anthropic. Claude opus 4.6 system card. Technical report, Anthropic, February 2026

  3. [3]

    Gemma 4 model card

    Google DeepMind. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/ model_card_4, April 2026. Accessed: 2026-04-08

  4. [4]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

  5. [5]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InNeurIPS, 2023

  6. [6]

    Toolllm: Facilitating large language models to master 16000+ real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world APIs. InICLR. OpenReview.net, 2024

  7. [7]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs.CoRR, abs/2305.15334, 2023

  8. [8]

    AFlow: Automating agentic workflow generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. InICLR. OpenReview.net, 2025

  9. [9]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report.CoRR, abs/2503.19786, 2025

  10. [10]

    Ministral 3

    Mistral AI. Ministral 3.CoRR, abs/2601.08584, 2026

  11. [11]

    When2call: When (not) to call tools

    Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2call: When (not) to call tools. InNAACL (Long Papers), pages 3391–3409. Association for Computational Linguistics, 2025

  12. [12]

    Sparse autoencoders find highly interpretable features in language models

    Robert Huben, Hoagy Cunningham, Logan Riggs, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. InICLR. OpenReview.net, 2024

  13. [13]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. 2025

  14. [14]

    Daniel Freeman, Theodore R

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

  15. [15]

    MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    Ehud Karpas, Omri Abend, Yonatan Berant, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, and Moshe Tennenholtz. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge...

  16. [16]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InICLR. OpenReview.net, 2023

  17. [17]

    Agentbench: Evaluating llms as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. InICLR. OpenReview.net, 2024. 10

  18. [18]

    Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E

    Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): from tool use to agentic evaluation of large language models. InICML, Proceedings of Machine Learning Research. PMLR / OpenReview.net, 2025

  19. [19]

    Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath

    Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J. Michaud, Stephen Casper, Max T...

  20. [20]

    Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

  21. [21]

    Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

    Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders.CoRR, abs/2407.14435, 2024

  22. [22]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling.CoRR, abs/2101.00027, 2021

  23. [23]

    classification

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR (Poster). OpenReview.net, 2019. 11 A SAE Training Details Table 2: SAE training configuration for each target model. d is the text-backbone residual width, L is the number of transformer blocks, ℓ is the zero-indexed hook block, M= 8d is the SAE dictionary size, andK=⌊d/32⌋is ...