The Cylindrical Representation Hypothesis models concept representations in LLMs as a central axis for concept presence surrounded by a normal plane containing sensitive sectors that control activation ease, explaining steering unpredictability.
A machine-rendered reading of the paper's core claim, the
machinery that carries it, and where it could break.
Large language models store ideas in their hidden layers, and steering tries to push the model toward a desired idea by adding a direction vector to those hidden states. Earlier theories assumed these ideas sit on straight, non-overlapping lines, so adding a vector would give clean control. In practice, steering often flips or fails unpredictably. The new hypothesis says the geometry is cylindrical: a main line (the axis) turns the concept on or off, while a flat disk around it (the normal plane) contains directions that make activation easy in some sectors and hard or impossible in others. Only certain sectors of that disk strongly help the concept appear, which creates built-in uncertainty even when the main direction looks good.
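To make the geometry concrete, here is a minimal sketch, not the paper's implementation. It assumes the axis and an orthonormal basis of the normal plane are already known (in practice they would have to be estimated), and the function names `decompose_steering` and `sector_angle` are hypothetical. It splits a steering vector into the part along the axis and the part lying in the plane, then reports the angle that places the in-plane part in a sector.

```python
import numpy as np

def decompose_steering(v, axis):
    """Split v into a scalar component along the axis and a remainder orthogonal to it."""
    unit_axis = axis / np.linalg.norm(axis)
    along = float(v @ unit_axis)
    normal = v - along * unit_axis
    return along, normal

def sector_angle(normal, e1, e2):
    """Angle in radians of the in-plane component, measured from e1 toward e2.

    e1 and e2 are assumed to be orthonormal and orthogonal to the axis.
    """
    return float(np.arctan2(normal @ e2, normal @ e1))

# Toy 3-D example (an assumption for illustration): the axis is z,
# and the normal plane is the x-y plane.
axis = np.array([0.0, 0.0, 1.0])
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
steer = np.array([1.0, 1.0, 2.0])

along, normal = decompose_steering(steer, axis)
angle = sector_angle(normal, e1, e2)
print(along, normal, angle)
```

In this toy case the axis component is 2 and the in-plane part points at 45 degrees from `e1`. Under the hypothesis, two steering vectors with the same axis component but different in-plane angles could behave differently, because they fall in different sectors.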
Core claim
By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH).
Load-bearing premise
That overlapping concept contributions produce a reliably identifiable normal plane from difference vectors while the sensitive sector within that plane remains intrinsically unidentifiable, introducing unavoidable uncertainty at the sector level.
Original abstract
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
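The abstract says the normal plane can be identified from difference vectors, but it does not spell out the procedure. As a hedged illustration only, the sketch below assumes one plausible recipe: take the mean difference vector as the axis, remove the axis component from each difference, and use the top two right singular vectors of the residuals as a plane basis. The function name `estimate_axis_and_plane`, the synthetic data, and the choice of SVD are all assumptions made for this example and may differ from the authors' released code.

```python
import numpy as np

def estimate_axis_and_plane(diffs):
    """Estimate a unit axis and a 2-D orthonormal plane basis from difference vectors.

    diffs: array of shape (n_samples, dim), each row a hidden-state difference
    (concept present minus concept absent). This recipe is an assumption.
    """
    axis = diffs.mean(axis=0)
    axis = axis / np.linalg.norm(axis)
    # Remove the axis component so only the axis-orthogonal part remains.
    residuals = diffs - np.outer(diffs @ axis, axis)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(residuals, full_matrices=False)
    plane = vt[:2]
    return axis, plane

# Synthetic data (hypothetical): a shared shift along coordinate 0,
# strong spread in coordinates 1 and 2, weak noise everywhere else.
rng = np.random.default_rng(0)
n, dim = 200, 16
true_axis = np.zeros(dim)
true_axis[0] = 1.0
diffs = 3.0 * true_axis + 0.05 * rng.normal(size=(n, dim))
diffs[:, 1:3] += rng.normal(size=(n, 2))

axis_hat, plane_hat = estimate_axis_and_plane(diffs)
# In-plane angle of each sample's difference vector, relative to the recovered basis.
angles = np.arctan2(diffs @ plane_hat[1], diffs @ plane_hat[0])
```

As the abstract states, the sensitive sector cannot be identified from difference vectors. The sketch recovers only a plane, and its SVD basis is arbitrary up to rotation and sign within that plane, so the computed angles have no fixed reference to a sensitive direction.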
Editorial analysis
A structured set of objections, weighed in public.
Desk editor's note, referee report, simulated authors' rebuttal, and a
circularity audit. Tearing a paper down is the easy half of reading it; the
pith above is the substance, this is the friction.
The claim rests on relaxing the orthogonality assumption of LRH and postulating that overlapping linear contributions produce a cylindrical geometry whose normal plane is identifiable but whose sensitive sectors are not.
axioms (2)
Domain assumption: concept representations remain linear even after relaxing orthogonality. The paper preserves linearity while dropping the orthogonality requirement of LRH.
Ad hoc to paper: overlapping concept contributions naturally produce a sample-specific axis-orthogonal cylindrical structure. This is the key modeling step that generates the central axis and surrounding normal plane.
invented entities (2)
Central axis capturing concept absence versus presence (no independent evidence). Purpose: to drive concept generation in the representation space. Postulated as the load-bearing direction within the cylindrical model; no independent falsifiable prediction is supplied beyond the hypothesis.
Normal plane with sensitive sectors controlling steering sensitivity (no independent evidence). Purpose: to explain variable activation ease and intrinsic uncertainty in steering outcomes. Invented to account for observed unpredictability; the paper states the sector cannot be reliably identified from difference vectors.
pith-pipeline@v0.9.0 · 5560 in / 1414 out tokens · 54484 ms · 2026-05-09T13:43:16.319471+00:00