pith. sign in

arxiv: 2606.12818 · v1 · pith:EMALY7ACnew · submitted 2026-06-11 · 💻 cs.CL · cs.AI

Localizing Anchoring Pathways in Language Models

Pith reviewed 2026-06-27 07:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords anchoring effectslanguage modelscircuit localizationattribution methodsnumerical reasoningmodel variantstransfer
0
0 comments X

The pith

Edge-level attribution recovers the anchor-sensitive signal in language models more faithfully than node-level methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how irrelevant numbers in prompts create anchoring effects that shift language model judgments in numerical reasoning tasks. It sets up a controlled multiple-choice format with shared answer options and defines a logit-difference metric between the correct option and the anchor option to track this behavior. Attribution-based circuit localization applied to base and instruction-tuned models shows that edge-level methods identify the relevant pathways more accurately than node-level methods. Low-anchor and high-anchor circuits transfer strongly inside the same model but only sparsely between base and instruction-tuned variants. The results give a mechanistic account of where anchoring-related decision signals are carried inside the models.

Core claim

Attribution-based circuit localization recovers the anchor-sensitive signal more faithfully with edge-level methods than with node-level methods. Low- and high-anchor circuits transfer strongly within a model, indicating shared pathway structure across anchor direction, while transfer across base and instruction-tuned variants is sparse, showing that post-training changes which pathways matter most.

What carries the argument

The logit-difference metric comparing correct and anchor answer options, used to validate tracking of behavioral anchoring and combined with attribution-based circuit localization to identify pathways.

If this is right

  • Low- and high-anchor circuits share pathway structure across anchor direction within a model.
  • Post-training alters which pathways carry the anchor-sensitive signal compared with base models.
  • Edge-level attribution methods give a more faithful recovery of the signal than node-level methods.
  • Anchoring-related decision signals localize to identifiable circuits inside the models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Causal interventions on the localized circuits would be expected to reduce observed anchoring effects.
  • The same localization approach could be tested on other documented biases in language model reasoning.
  • Sparse transfer across variants implies that circuit findings require re-localization after instruction tuning.

Load-bearing premise

The logit-difference metric between correct and anchor options faithfully isolates anchoring behavior without being confounded by the shared answer-option structure or other prompt features.

What would settle it

Ablating the localized circuits produces no measurable change in the model's preference for the anchor option over the correct option in the multiple-choice task.

Figures

Figures reproduced from arXiv: 2606.12818 by Hillary N. Owusu, Naomi H. Feldman, Sarah Wiegreffe.

Figure 1
Figure 1. Figure 1: Anchors (shown in purple) given in the prompt [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: EAP-IG attribution by source-layer relative [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Structural overlap and transfer of EAP-IG edge circuits. Panel (a) compares low- and high-anchor circuits [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript examines anchoring effects in language models' numerical reasoning using a controlled multiple-choice setup with shared answer options. It defines a logit-difference metric (correct option minus anchor option), states that this metric tracks behavioral anchoring, and applies attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models. Main results claim that edge-level attribution recovers the anchor-sensitive signal more faithfully than node-level methods, that low- and high-anchor circuits transfer strongly within a model, and that transfer is sparse across base and instruction-tuned variants.

Significance. If the logit-difference metric is shown to isolate anchoring without confounds from shared option structure, the results would provide a mechanistic account of how anchoring-related decision signals are carried in LMs, with useful comparisons of edge vs. node attribution and effects of post-training. The work applies standard attribution methods to a new behavioral metric and reports within- vs. cross-variant transfer patterns.

major comments (1)
  1. [Abstract and metric definition paragraph] Abstract and paragraph on metric definition: The claim that the logit-difference metric 'tracks behavioral anchoring' is asserted without reported quantitative validation numbers, error bars, ablation details, or explicit controls separating anchor sensitivity from shared answer-option structure (e.g., position biases or general option preferences). This is load-bearing for the central claims on edge-level fidelity and circuit transfer.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for stronger validation of the logit-difference metric. We agree this is central to the claims and will expand the manuscript with the requested quantitative details, ablations, and controls.

read point-by-point responses
  1. Referee: [Abstract and metric definition paragraph] Abstract and paragraph on metric definition: The claim that the logit-difference metric 'tracks behavioral anchoring' is asserted without reported quantitative validation numbers, error bars, ablation details, or explicit controls separating anchor sensitivity from shared answer-option structure (e.g., position biases or general option preferences). This is load-bearing for the central claims on edge-level fidelity and circuit transfer.

    Authors: We accept the critique that the abstract and metric definition section present the validation claim without sufficient supporting numbers or controls. The full paper reports behavioral correlations between the logit-difference and anchoring effects, but these lack the explicit ablations, error bars, and option-structure controls requested. In revision we will add a dedicated validation subsection with: (1) quantitative correlation coefficients and error bars across prompt variants, (2) ablations that shuffle or remove the anchor while preserving option structure, and (3) controls for position bias and general option preference by comparing against non-anchor numerical prompts. These additions will directly support the downstream claims on edge-level attribution and circuit transfer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent metric and standard methods

full rationale

The paper defines a logit-difference metric on shared answer options and validates it tracks behavioral anchoring before applying off-the-shelf attribution techniques for circuit localization. No equations, fitted parameters, or predictions reduce by construction to the same inputs; no load-bearing self-citations or uniqueness theorems from prior author work are invoked. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the logit-difference metric is presented as a definition rather than a fitted quantity.

pith-pipeline@v0.9.1-grok · 5676 in / 1128 out tokens · 18701 ms · 2026-06-27T07:03:07.818028+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 12 canonical work pages · 5 internal anchors

  1. [1]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Think you have solved question answering? try arc, the ai2 reasoning challenge. Preprint, arXiv:1803.05457. Atticus Geiger, Jacqueline Harding, and Thomas Icard

  2. [2]

    ArXiv:2508.11214 [cs]

    How Causal Abstraction Under- pins Computational Explanation.arXiv preprint. ArXiv:2508.11214 [cs]. Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora

  3. [3]

    Localizing Model Behavior with Path Patching

    Localizing Model Behavior with Path Patching.arXiv preprint. ArXiv:2304.05969 [cs]. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mi- tra, Archie Sravankumar, Artem Kor...

  4. [4]

    The Llama 3 Herd of Models

    The llama 3 herd of models.Preprint, arXiv:2407.21783. Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov

  5. [5]

    János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda

    Un- derstanding the anchoring effect of llm with synthetic data: Existence, mechanism, and potential mitiga- tions.Preprint, arXiv:2505.15392. János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda

  6. [6]

    arXiv preprint arXiv:2403.00745 , year=

    Atp*: An efficient and scalable method for localizing llm behaviour to components.Preprint, arXiv:2403.00745. Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik

  7. [7]

    Technical Report arXiv:2307.09458, arXiv

    Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. Technical Report arXiv:2307.09458, arXiv. ArXiv:2307.09458 [cs]. Jiaxu Lou and Yifan Sun

  8. [8]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov

    Anchoring bias in large language models: An experimental study.Preprint, arXiv:2412.06593. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov

  9. [9]

    InThe Twelfth Inter- national Conference on Learning Representations

    Circuit Component Reuse Across Tasks in Trans- former Language Models. InThe Twelfth Inter- national Conference on Learning Representations. arXiv. ArXiv:2310.08744 [cs]. 9 Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiud- ing Sun, Eric Todd, David Bau, and Y...

  10. [10]

    InFindings of ACL: NAACL 2024, pages 2006–2017

    Large language models sensitivity to the order of options in multiple-choice questions. InFindings of ACL: NAACL 2024, pages 2006–2017. Lukas Röseler, Lucia Weber, Ena P. B. Stijovi´c, Katha- rina A. K. Jaekel, J. F. (Janne)ke M. T. (Janneke) G. (Gijsbers) Gijsbers, and Nir Milstein

  11. [11]

    InProceedings of the 2023 Con- ference on Empirical Methods in Natural Language Processing, pages 7035–7052

    A mechanistic interpretation of arith- metic reasoning in language models using causal mediation analysis. InProceedings of the 2023 Con- ference on Empirical Methods in Natural Language Processing, pages 7035–7052. Aaquib Syed, Can Rager, and Arthur Conmy

  12. [12]

    Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber

    Anchors in the machine: Behavioral and attributional evidence of anchoring bias in llms.Preprint, arXiv:2511.05766. Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber

  13. [13]

    InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online

    Trans- formers: State-of-the-art natural language processing. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Gua...

  14. [14]

    Qwen2.5 Technical Report

    Qwen2.5 technical report.ArXiv, abs/2412.15115. Fred Zhang and Neel Nanda

  15. [15]

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

    Towards Best Practices of Activation Patching in Language Mod- els: Metrics and Methods. InThe Twelfth Inter- national Conference on Learning Representations. arXiv. ArXiv:2309.16042 [cs]. Appendix A MCQA Menu Size Validation Our main experiments use nine answer options per item. To justify this choice, we evaluated whether anchoring is preserved as the n...