A mathematical framework for transformer circuits. Transformer Circuits Thread

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Za

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 3

citation-polarity summary: background 2, support 1

representative citing papers

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.

VLAs are Confined yet Capable of Generalizing to Novel Instructions

cs.RO · 2025-05-06 · unverdicted · novelty 7.0

Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.

Manifold-Guided Attention Steering

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 3 refs

Sparse autoencoders on EEG transformers extract clinical features, identify three steering regimes, expose age-pathology entanglements and wrecking-ball failures, and map interventions to frequency spectra.

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

The Scaling Properties of Implicit Deductive Reasoning in Transformers

cs.AI · 2026-05-05 · unverdicted · novelty 5.0

In deep Transformers using bidirectional prefix masks, implicit reasoning on Horn clauses matches explicit CoT performance across topologies and widths, but CoT is still required for depth extrapolation.

citing papers explorer

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
cs.LG · 2026-05-08 · unverdicted · none · ref 6 · 2 links

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
cs.CL · 2026-05-07 · unverdicted · none · ref 24

Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers
cs.LG · 2026-05-05 · unverdicted · none · ref 11

VLAs are Confined yet Capable of Generalizing to Novel Instructions
cs.RO · 2025-05-06 · unverdicted · none · ref 9

Manifold-Guided Attention Steering
cs.LG · 2026-05-20 · unverdicted · none · ref 5

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
cs.CL · 2026-05-18 · unverdicted · none · ref 16

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
cs.LG · 2026-05-13 · unverdicted · none · ref 7 · 3 links

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
cs.LG · 2026-05-09 · unverdicted · none · ref 7

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
cs.LG · 2026-05-06 · unverdicted · none · ref 11

Towards Effective Theory of LLMs: A Representation Learning Approach
cs.LG · 2026-05-10 · unverdicted · none · ref 25

The Scaling Properties of Implicit Deductive Reasoning in Transformers
cs.AI · 2026-05-05 · unverdicted · none · ref 76
