A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054
5 Pith papers cite this work.
citing papers explorer
- The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.
- Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
Proficient LLMs detect arithmetic tasks early but output correct answers only in final layers, with attention and MLP modules dividing labor in a way absent from less proficient models.
- Understanding Counting Mechanisms in Large Language and Vision-Language Models
LLMs and LVLMs encode latent positional count information in individual tokens or visual features, with an internal counter mechanism that updates per item and emerges progressively across layers, relying on structural cues like separators.
- How to use and interpret activation patching
Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.