Mechanistic? arXiv preprint arXiv:2410.09087

Naomi Saphra, Sarah Wiegreffe · 2024 · arXiv 2410.09087

7 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Do Activation Verbalization Methods Convey Privileged Information?

cs.CL · 2025-09-16 · unverdicted · novelty 5.0

Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.

From Weights to Activations: Is Steering the Next Frontier of Adaptation?

cs.CL · 2026-04-15 · unverdicted · novelty 4.0

Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.

Mechanistic Interpretability Needs Philosophy

cs.CL · 2025-06-23 · unverdicted · novelty 4.0

The paper claims that mechanistic interpretability needs philosophy as a partner to clarify concepts, refine methods, and navigate epistemic and ethical complexities in AI systems.

citing papers explorer

Showing 7 of 7 citing papers.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior · cs.LG · 2026-05-06 · unverdicted · none · ref 114
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts · cs.AI · 2026-05-01 · unverdicted · none · ref 119
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space · cs.CL · 2026-05-12 · unverdicted · none · ref 129
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models · cs.CL · 2026-01-20 · unverdicted · none · ref 263
Do Activation Verbalization Methods Convey Privileged Information? · cs.CL · 2025-09-16 · unverdicted · none · ref 46
From Weights to Activations: Is Steering the Next Frontier of Adaptation? · cs.CL · 2026-04-15 · unverdicted · none · ref 19
Mechanistic Interpretability Needs Philosophy · cs.CL · 2025-06-23 · unverdicted · none · ref 23
