A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054

Alessandro Stolfo, Yonatan Belinkov, Mrinmaya Sachan · 2023 · arXiv 2305.15054

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

cs.LG · 2026-05-05 · accept · novelty 7.0 · 2 refs

Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

Proficient LLMs detect arithmetic tasks early but output correct answers only in final layers, with attention and MLP modules dividing labor in a way absent from less proficient models.

Understanding Counting Mechanisms in Large Language and Vision-Language Models

cs.CV · 2025-11-21 · unverdicted · novelty 6.0

LLMs and LVLMs encode latent positional count information in individual tokens or visual features, with an internal counter mechanism that updates per item and emerges progressively across layers, relying on structural cues like separators.

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · novelty 5.0

Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

cs.LG · 2023-09-27 · unverdicted · novelty 5.0

Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.

citing papers explorer

Showing 5 of 5 citing papers.

The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

cs.LG · 2026-05-05 · accept · none · ref 10 · 2 links

Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

cs.CL · 2026-04-17 · unverdicted · none · ref 5

Proficient LLMs detect arithmetic tasks early but output correct answers only in final layers, with attention and MLP modules dividing labor in a way absent from less proficient models.

Understanding Counting Mechanisms in Large Language and Vision-Language Models

cs.CV · 2025-11-21 · unverdicted · none · ref 18

LLMs and LVLMs encode latent positional count information in individual tokens or visual features, with an internal counter mechanism that updates per item and emerges progressively across layers, relying on structural cues like separators.

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · none · ref 25

Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

cs.LG · 2023-09-27 · unverdicted · none · ref 112

Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.
