Localizing Anchoring Pathways in Language Models

Hillary N. Owusu; Naomi H. Feldman; Sarah Wiegreffe

arxiv: 2606.12818 · v1 · pith:EMALY7ACnew · submitted 2026-06-11 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-27 07:03 UTC · model grok-4.3

keywords: anchoring effects, language models, circuit localization, attribution methods, numerical reasoning, model variants, transfer

The pith

Edge-level attribution recovers the anchor-sensitive signal in language models more faithfully than node-level methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how irrelevant numbers in prompts create anchoring effects that shift language model judgments in numerical reasoning tasks. It sets up a controlled multiple-choice format with shared answer options and defines a logit-difference metric between the correct option and the anchor option to track this behavior. Attribution-based circuit localization applied to base and instruction-tuned models shows that edge-level methods identify the relevant pathways more accurately than node-level methods. Low-anchor and high-anchor circuits transfer strongly inside the same model but only sparsely between base and instruction-tuned variants. The results give a mechanistic account of where anchoring-related decision signals are carried inside the models.

Core claim

Attribution-based circuit localization recovers the anchor-sensitive signal more faithfully with edge-level methods than with node-level methods. Low- and high-anchor circuits transfer strongly within a model, indicating shared pathway structure across anchor direction, while transfer across base and instruction-tuned variants is sparse, showing that post-training changes which pathways matter most.
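To make the edge-level versus node-level distinction concrete, the following is a minimal sketch of a first-order, attribution-patching-style score. It is not the paper's EAP-IG procedure, and it does not reproduce any result from the paper. It assumes a residual-style model in which a node's score equals the sum of its outgoing edge scores; `delta_src`, `grad_dst`, and both function names are hypothetical.

```python
import numpy as np

def edge_scores(delta_src, grad_dst):
    """First-order score for every edge (upstream node -> downstream node).

    delta_src: (n_src, d) activation difference per upstream node
               (e.g. corrupted minus clean run); hypothetical inputs.
    grad_dst:  (n_dst, d) gradient of the metric w.r.t. each downstream
               node's input.
    """
    return np.asarray(delta_src, float) @ np.asarray(grad_dst, float).T

def node_scores(delta_src, grad_dst):
    """Node score as the sum of that node's outgoing edge scores (assumption)."""
    return edge_scores(delta_src, grad_dst).sum(axis=1)

# Toy numbers: upstream node 0 feeds two readers with opposite-signed effects.
delta_src = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
grad_dst = np.array([[2.0, 0.5],
                     [-2.0, 0.5]])

edges = edge_scores(delta_src, grad_dst)   # rows: upstream, cols: downstream
nodes = node_scores(delta_src, grad_dst)
print(edges)
print(nodes)
```

In this toy case node 0 has the two largest edge scores in magnitude, yet its node score cancels to zero. This is one way the two granularities can disagree; whether it explains the paper's finding is not something the summary establishes.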

What carries the argument

The logit-difference metric comparing the correct answer option with the anchor answer option. The paper validates that this metric tracks behavioral anchoring, then combines it with attribution-based circuit localization to identify pathways.
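As a rough illustration of the metric, the sketch below computes a per-item logit difference using the sign convention given in the referee summary (correct option minus anchor option). The function name, the array layout, and the example numbers are assumptions made for illustration; the paper's actual implementation is not shown on this page.

```python
import numpy as np

def anchor_logit_diff(option_logits, correct_idx, anchor_idx):
    """Logit of the correct option minus logit of the anchor option, per item.

    option_logits: (n_items, n_options) logits over the shared answer menu.
    correct_idx, anchor_idx: (n_items,) integer positions within the menu.
    """
    option_logits = np.asarray(option_logits, dtype=float)
    rows = np.arange(option_logits.shape[0])
    return option_logits[rows, correct_idx] - option_logits[rows, anchor_idx]

# Two hypothetical items with four options each.
logits = np.array([[2.0, 0.5, 1.0, -1.0],
                   [0.0, 3.0, 1.5, 0.5]])
correct = np.array([0, 2])
anchor = np.array([2, 1])

diffs = anchor_logit_diff(logits, correct, anchor)
print(diffs)  # positive: correct option preferred; negative: anchor option preferred
```

Under this convention, a stronger pull toward the anchor shows up as a more negative value.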

If this is right

Low- and high-anchor circuits share pathway structure across anchor direction within a model.
Post-training alters which pathways carry the anchor-sensitive signal compared with base models.
Edge-level attribution methods give a more faithful recovery of the signal than node-level methods.
Anchoring-related decision signals localize to identifiable circuits inside the models.
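One simple way to quantify shared pathway structure between two circuits is the overlap of their top-scoring edges. The sketch below computes intersection-over-union on the top-k edges by absolute score. The choice of top-k selection and IoU is an assumption for illustration and may differ from the overlap measure used in the paper's Figure 3; the score matrices are invented.

```python
import numpy as np

def top_k_edges(scores, k):
    """Return the k edges (row, col) with the largest absolute score."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(np.abs(scores).ravel())[::-1][:k]
    n_cols = scores.shape[1]
    return {divmod(int(i), n_cols) for i in order}

def edge_iou(scores_a, scores_b, k):
    """Intersection-over-union of two top-k edge sets."""
    a, b = top_k_edges(scores_a, k), top_k_edges(scores_b, k)
    return len(a & b) / len(a | b)

# Hypothetical edge-score matrices for a low-anchor and a high-anchor circuit.
low_anchor = np.array([[0.9, 0.1, 0.0],
                       [0.2, 0.8, 0.05]])
high_anchor = np.array([[-0.7, 0.0, 0.3],
                        [0.1, 0.9, 0.0]])

iou = edge_iou(low_anchor, high_anchor, k=3)
print(iou)
```

Structural overlap is only half of the picture: transfer asks whether a circuit found in one condition still recovers the metric when evaluated in another, which requires running the model.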

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Causal interventions on the localized circuits would be expected to reduce observed anchoring effects.
The same localization approach could be tested on other documented biases in language model reasoning.
Sparse transfer across variants implies that circuit findings require re-localization after instruction tuning.

Load-bearing premise

The logit-difference metric between correct and anchor options faithfully isolates anchoring behavior without being confounded by the shared answer-option structure or other prompt features.

What would settle it

Ablating the localized circuits produces no measurable change in the model's preference for the anchor option over the correct option in the multiple-choice task.
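Whether an ablation produces a "measurable change" can be checked with a paired test on per-item logit differences before and after ablation. The sketch below uses a sign-flip permutation test; the test choice, the function name, and the numbers are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def paired_sign_flip_test(before, after, n_perm=10000, seed=0):
    """Mean paired change and a two-sided sign-flip permutation p-value."""
    rng = np.random.default_rng(seed)
    d = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    observed = d.mean()
    # Under the null of no effect, each paired difference is equally likely
    # to have either sign, so random sign flips give the null distribution.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical per-item logit differences (correct minus anchor).
before = np.array([-1.2, -0.8, -1.5, -0.9, -1.1, -1.3, -0.7, -1.0])
after = np.array([0.1, 0.3, -0.2, 0.4, 0.0, 0.2, 0.5, 0.1])

shift, p = paired_sign_flip_test(before, after)
no_shift, p_null = paired_sign_flip_test(before, before)
print(shift, p)
print(no_shift, p_null)
```

A fair comparison would also ablate randomly chosen components of the same size, so that the change can be attributed to the localized circuit rather than to ablation in general.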

Figures

Figures reproduced from arXiv: 2606.12818 by Hillary N. Owusu, Naomi H. Feldman, Sarah Wiegreffe.

**Figure 2.** EAP-IG attribution by source-layer relative

**Figure 3.** Structural overlap and transfer of EAP-IG edge circuits. Panel (a) compares low- and high-anchor circuits

Original abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies edge attribution to localize anchoring in LMs and reports better recovery than node methods plus within-model transfer, but the logit-difference metric may still mix in option-position effects.

The core contribution is a controlled application of attribution-based circuit finding to numerical anchoring in 7-8B Qwen and Llama models. They set up MCQs with shared answer options, define a logit difference between the correct choice and the anchor choice, claim it tracks the behavioral shift, then compare edge-level versus node-level methods and check transfer of the recovered circuits. Edge methods come out ahead and low/high-anchor circuits transfer within a model but show sparse transfer across base and instruction-tuned variants.

What stands out is the direct comparison of edge and node attribution on this signal and the within-model versus cross-variant transfer results. Those are concrete observations rather than restatements of prior circuit work.

The setup is reasonable on its face and the authors are upfront that cross-variant transfer is less reliable. The main soft spot is the logit-difference metric itself. Shared answer options across conditions leave room for the metric to pick up position biases or general option preferences instead of anchor sensitivity alone. The abstract states that the metric tracks behavioral anchoring, but the strength of that validation determines how much weight the downstream claims about faithful recovery and pathway structure can carry. If the controls are thin, the edge-versus-node and transfer findings become harder to interpret as specifically about anchoring.

This work is aimed at people already doing mechanistic interpretability on biases inside current-scale models. It is solid enough on the experimental framing and the reported patterns to merit a serious referee, mainly to press on the metric validation and any ablations that separate anchor effects from the shared-option structure. I would send it to review rather than desk reject.
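The position-bias concern raised in this letter can be probed with a simple control: score the same item under every cyclic rotation of the option order and average the logit difference, so that each option visits each position once. The sketch below uses a toy scorer with an explicit additive position bias; `toy_score_fn`, `CONTENT`, and `POSITION_BIAS` are hypothetical stand-ins for a real model call, and the additive form is itself an assumption.

```python
import numpy as np

def rotations(options):
    """All cyclic rotations of the option order."""
    return [options[i:] + options[:i] for i in range(len(options))]

def debiased_logit_diff(score_fn, options, correct, anchor):
    """Average (correct minus anchor) logit difference over all rotations."""
    diffs = []
    for order in rotations(list(options)):
        logits = score_fn(order)  # one logit per position in `order`
        diffs.append(logits[order.index(correct)] - logits[order.index(anchor)])
    return float(np.mean(diffs))

# Toy scorer: content preference plus a bias that depends only on position.
CONTENT = {"10": 0.5, "20": 2.0, "30": 1.0, "40": 0.0}
POSITION_BIAS = np.array([1.0, 0.3, 0.0, -0.5])

def toy_score_fn(order):
    return np.array([CONTENT[o] for o in order]) + POSITION_BIAS

options = ["10", "20", "30", "40"]
raw = toy_score_fn(options)
raw_diff = raw[1] - raw[3]  # single ordering: mixes content and position
debiased = debiased_logit_diff(toy_score_fn, options, "20", "40")
print(raw_diff, debiased)
```

With a purely additive bias the rotation average recovers the content-only difference exactly. Real models need not be additive, so this control reduces position effects rather than guaranteeing their removal.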

Referee Report

1 major / 0 minor

Summary. The manuscript examines anchoring effects in language models' numerical reasoning using a controlled multiple-choice setup with shared answer options. It defines a logit-difference metric (correct option minus anchor option), states that this metric tracks behavioral anchoring, and applies attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models. Main results claim that edge-level attribution recovers the anchor-sensitive signal more faithfully than node-level methods, that low- and high-anchor circuits transfer strongly within a model, and that transfer is sparse across base and instruction-tuned variants.

Significance. If the logit-difference metric is shown to isolate anchoring without confounds from shared option structure, the results would provide a mechanistic account of how anchoring-related decision signals are carried in LMs, with useful comparisons of edge vs. node attribution and effects of post-training. The work applies standard attribution methods to a new behavioral metric and reports within- vs. cross-variant transfer patterns.

major comments (1)

[Abstract and metric definition paragraph] The claim that the logit-difference metric 'tracks behavioral anchoring' is asserted without reported quantitative validation numbers, error bars, ablation details, or explicit controls separating anchor sensitivity from shared answer-option structure (e.g., position biases or general option preferences). This is load-bearing for the central claims on edge-level fidelity and circuit transfer.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger validation of the logit-difference metric. We agree this is central to the claims and will expand the manuscript with the requested quantitative details, ablations, and controls.

Referee: [Abstract and metric definition paragraph] The claim that the logit-difference metric 'tracks behavioral anchoring' is asserted without reported quantitative validation numbers, error bars, ablation details, or explicit controls separating anchor sensitivity from shared answer-option structure (e.g., position biases or general option preferences). This is load-bearing for the central claims on edge-level fidelity and circuit transfer.

Authors: We accept the critique that the abstract and metric definition section present the validation claim without sufficient supporting numbers or controls. The full paper reports behavioral correlations between the logit-difference and anchoring effects, but these lack the explicit ablations, error bars, and option-structure controls requested. In revision we will add a dedicated validation subsection with: (1) quantitative correlation coefficients and error bars across prompt variants, (2) ablations that shuffle or remove the anchor while preserving option structure, and (3) controls for position bias and general option preference by comparing against non-anchor numerical prompts. These additions will directly support the downstream claims on edge-level attribution and circuit transfer.

Revision: yes
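Item (1) of the promised validation, correlation coefficients with error bars, could look roughly like the sketch below: a Pearson correlation between the per-item logit difference and a behavioral anchoring score, with a percentile bootstrap interval. The behavioral score, the synthetic data, and the function name are assumptions for illustration; nothing here reproduces numbers from the paper.

```python
import numpy as np

def bootstrap_pearson(x, y, n_boot=2000, seed=0):
    """Pearson r with a 95% percentile bootstrap interval over items."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    # Resample items with replacement and recompute the correlation.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boots = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
    low, high = np.nanpercentile(boots, [2.5, 97.5])
    return r, low, high

# Synthetic stand-in data: with the (correct minus anchor) convention, a
# stronger behavioral shift toward the anchor should pair with a lower metric.
logit_diff = np.linspace(-2.0, 2.0, 30)
behavioral_shift = -0.8 * logit_diff + 0.3 * np.sin(5.0 * logit_diff)

r, ci_low, ci_high = bootstrap_pearson(logit_diff, behavioral_shift)
print(r, ci_low, ci_high)
```

Running the same computation separately per prompt variant, as the authors propose, would show whether the relationship is stable or driven by a few templates.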

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent metric and standard methods

The paper defines a logit-difference metric on shared answer options and validates it tracks behavioral anchoring before applying off-the-shelf attribution techniques for circuit localization. No equations, fitted parameters, or predictions reduce by construction to the same inputs; no load-bearing self-citations or uniqueness theorems from prior author work are invoked. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the logit-difference metric is presented as a definition rather than a fitted quantity.

Appendix A MCQA Menu Size Validation

Our main experiments use nine answer options per item. To justify this choice, we evaluated whether anchoring is preserved as the n...
