Conceptors for Semantic Steering
Pith reviewed 2026-05-08 17:12 UTC · model grok-4.3
The pith
Conceptors steer large language models by projecting onto the full multidimensional subspace of a concept rather than a single direction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By estimating soft projection matrices from activations pooled across both poles of a bipolar concept, conceptors preserve the concept's full multidimensional subspace. A geometric analysis demonstrates that this bipolar subspace strictly subsumes the single-vector baseline. The conceptor quota acts as a parameter-free layer-selection diagnostic, achieving Pearson correlations up to 0.96 with concept separability across three models and three semantic dimensions. Conceptors admit a closed-form Boolean algebra for composition, and evaluations across a five-axis design space show they match or outperform additive baselines at multi-dimensional layers while yielding substantially fewer degenerate outputs.
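To make "projecting onto the concept subspace" concrete, here is a minimal sketch of a conceptor-style steering update in numpy. The update rule `h + beta * (C @ h - h)` and the interpolation parameter `beta` are illustrative assumptions, not necessarily the paper's exact intervention.

```python
import numpy as np

def steer(h, C, beta=1.0):
    """Move a hidden state h toward the conceptor subspace.

    beta=1.0 applies the full soft projection h -> C @ h; smaller beta
    interpolates between the original state and its projection. This is
    a sketch of the general idea, not the paper's specific update rule.
    """
    return h + beta * (C @ h - h)
```

With `C` equal to the identity the state is untouched; with `C` near zero the state is attenuated toward the origin, which illustrates why a soft (rather than hard) projection matters in practice.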
What carries the argument
The conceptor, a soft projection matrix estimated from activations pooled across both poles of a bipolar concept, which preserves the full multidimensional subspace and admits closed-form Boolean operations.
If this is right
- The single-vector steering direction is contained within the conceptor subspace.
- The conceptor quota enables selection of effective layers without model-specific tuning or extra data.
- Concepts can be combined using exact AND, OR, and NOT operations.
- Steering performance matches or exceeds baselines, particularly when the concept occupies a multidimensional subspace.
- Fewer degenerate outputs are produced compared to single-direction methods.
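The "exact AND, OR, and NOT" above have closed forms in Jaeger's conceptor algebra, sketched below. The formulas require the conceptors' eigenvalues to lie strictly in (0, 1) so the inverses exist; whether the paper uses exactly these definitions is an assumption here.

```python
import numpy as np

def NOT(C):
    # The complement of the soft subspace.
    return np.eye(len(C)) - C

def AND(C1, C2):
    # Jaeger's closed-form conjunction; defined when C1 and C2 are
    # invertible (all eigenvalues strictly positive).
    d = len(C1)
    return np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2) - np.eye(d))

def OR(C1, C2):
    # De Morgan's law: C1 OR C2 = NOT(NOT C1 AND NOT C2).
    return NOT(AND(NOT(C1), NOT(C2)))
```

For two identical conceptors `0.5 * I`, the conjunction shrinks the shared subspace to `I / 3` and the disjunction grows it to `2I / 3`, matching the intuition that AND is more restrictive than either operand and OR less so.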
Where Pith is reading between the lines
- This method might extend naturally to concepts that are not strictly bipolar by adjusting the pooling strategy.
- The geometric subsumption could be tested on other representation engineering techniques beyond steering.
- Boolean composition opens the possibility of building hierarchical control policies for model behavior.
- Validation on additional models and tasks would confirm the reliability of the quota as a general diagnostic.
Load-bearing premise
That pooling activations across both poles of a bipolar concept yields a subspace that strictly subsumes the single-vector baseline and that the conceptor quota reliably predicts separability without model-specific tuning.
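The quota invoked in this premise has a simple parameter-free reading: for a symmetric positive-semidefinite conceptor, it is the mean singular value, i.e. the fraction of the d-dimensional activation space the concept occupies. The argmax layer-selection rule below is a hypothetical illustration of how such a diagnostic could be used; the paper's selection procedure may differ.

```python
import numpy as np

def quota(C):
    """Mean singular value of a conceptor. For a symmetric PSD C this
    equals trace(C) / d, a number in [0, 1]: the fraction of the
    activation space the concept 'occupies'."""
    return np.trace(C) / len(C)

def select_layer(conceptors_by_layer):
    """Hypothetical rule: pick the layer whose conceptor quota is
    largest. conceptors_by_layer maps layer index -> conceptor."""
    return max(conceptors_by_layer, key=lambda l: quota(conceptors_by_layer[l]))
```

Because the quota is a fixed function of the conceptor's spectrum, it involves no thresholds or per-model tuning, which is what "parameter-free" amounts to in this reading.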
What would settle it
If experiments on a new model or concept show the conceptor quota correlation falling below 0.5 with separability, or if single-direction steering produces fewer degenerate outputs than conceptors at multi-dimensional layers, the superiority claims would not hold.
read the original abstract
Activation-based steering provides control of LLM behavior at inference time, but the dominant paradigm reduces each concept to a single direction whose geometry is left largely unexamined. Rather than selecting a single steering direction, we use conceptors: soft projection matrices estimated from activations pooled across both poles of a bipolar concept, which preserve the concept's full multidimensional subspace. A geometric analysis shows the bipolar subspace strictly subsumes the single-vector baseline. We further show that the conceptor quota provides a parameter-free layer-selection diagnostic, predicting concept separability with Pearson correlations up to r=0.96 across three instruction-tuned models and three semantic dimensions. Beyond selection, conceptors admit a closed-form Boolean algebra (AND, OR, NOT): we evaluate conceptor compositionality on thematically related sub-concepts. Across a systematic five-axis design-space evaluation, conceptors match or outperform additive baselines at layers where concept subspaces are multi-dimensional while producing substantially fewer degenerate outputs. Conceptor steering is a geometrically principled, compositional, and practically safer alternative to single-direction steering from a limited number of contrastive pairs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes conceptors—soft projection matrices estimated from activations pooled across both poles of bipolar semantic concepts—as an alternative to single-direction steering vectors for controlling LLM behavior at inference time. It claims a geometric analysis demonstrates that the resulting bipolar subspace strictly subsumes the single-vector baseline, that the conceptor quota provides a parameter-free layer-selection diagnostic predicting separability (Pearson r up to 0.96 across three models and three dimensions), and that conceptors support closed-form Boolean composition (AND/OR/NOT) while matching or outperforming additive baselines with fewer degenerate outputs in a five-axis design-space evaluation.
Significance. If the geometric subsumption and empirical results hold, the work supplies a more principled, multidimensional, and compositional framework for activation steering that could reduce reliance on limited contrastive pairs and improve safety by avoiding degenerate outputs. The reported high correlations and systematic evaluation across models constitute concrete strengths that would support broader adoption if the underlying assumptions about activation subspaces are validated.
major comments (2)
- [Abstract / Geometric analysis] The claim that the bipolar conceptor subspace 'strictly subsumes' the single-vector baseline assumes that directions orthogonal to the contrastive vector in the pooled activations carry semantically relevant variance rather than noise. If concept activations are effectively one-dimensional (as the skeptic note flags), the extra dimensions add no gain and the subsumption reduces to equality; an explicit rank or variance decomposition of the pooled activations is needed to establish the central geometric claim.
- [Abstract / Evaluation] The conceptor quota is presented as parameter-free and predictive (r=0.96), yet layer selection and the choice of bipolar pooling could embed model-specific fitting. The manuscript should clarify whether the quota was computed without any post-hoc adjustment across the three models and whether the correlation holds under cross-validation or held-out semantic dimensions.
minor comments (2)
- [Methods] Notation for the conceptor matrix and quota formula should be introduced with an explicit equation number in the methods section to aid reproducibility.
- [Evaluation] The five-axis design-space evaluation would benefit from a table summarizing the axes and the exact metrics used for 'degenerate outputs' to make the comparison with additive baselines fully transparent.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below. Where the feedback identifies gaps in supporting evidence for our geometric claims and evaluation robustness, we have incorporated revisions to strengthen the manuscript without altering its core contributions.
read point-by-point responses
- Referee: [Abstract / Geometric analysis] The claim that the bipolar conceptor subspace 'strictly subsumes' the single-vector baseline assumes that directions orthogonal to the contrastive vector in the pooled activations carry semantically relevant variance rather than noise. If concept activations are effectively one-dimensional (as the skeptic note flags), the extra dimensions add no gain and the subsumption reduces to equality; an explicit rank or variance decomposition of the pooled activations is needed to establish the central geometric claim.
Authors: We agree that the strict subsumption claim requires explicit evidence that the orthogonal directions in the pooled bipolar activations contain semantically relevant variance. In the revised manuscript we have added a variance decomposition (via SVD of the pooled activation matrix) and rank analysis for each concept and layer. This shows that the effective rank of the bipolar subspace exceeds 1 in the layers where conceptors outperform single-vector baselines, with the additional singular values correlating positively with improved separability. We have updated the geometric analysis section and abstract to report these statistics, thereby grounding the subsumption claim in the data rather than assumption. revision: yes
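The "effective rank" the authors promise to report can be operationalized in several ways; one common parameter-free choice, used here as an illustrative assumption since the revision's exact definition is not stated, is the entropy-based effective rank of the singular-value spectrum.

```python
import numpy as np

def effective_rank(X):
    """Entropy-based effective rank of pooled activations X (n, d):
    exp of the Shannon entropy of the normalized singular-value
    distribution of the mean-centered data. Equals 1 for rank-1 data
    and approaches min(n, d) for an isotropic spectrum."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))
```

An effective rank materially above 1 at a given layer is exactly the evidence the referee asks for: variance beyond the single contrastive direction that a one-vector baseline cannot represent.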
- Referee: [Abstract / Evaluation] The conceptor quota is presented as parameter-free and predictive (r=0.96), yet layer selection and the choice of bipolar pooling could embed model-specific fitting. The manuscript should clarify whether the quota was computed without any post-hoc adjustment across the three models and whether the correlation holds under cross-validation or held-out semantic dimensions.
Authors: The conceptor quota is computed directly from the singular-value spectrum of the raw pooled activation matrix with no post-hoc scaling, thresholding, or model-specific adjustments; layer selection uses only the quota value itself. To address potential fitting concerns we have added a leave-one-dimension-out cross-validation: for each held-out semantic dimension the quota is recomputed on the remaining dimensions and the Pearson correlation with separability is re-evaluated. The correlations remain above 0.90 across the three models. These results are now reported in the evaluation section. revision: yes
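The correlation the authors re-evaluate is a plain Pearson r between per-layer quota values and a separability score (e.g. probe accuracy); the variable names below are illustrative, not taken from the paper.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equal-length sequences, e.g.
    per-layer quota values (x) and per-layer separability scores (y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

Under the leave-one-dimension-out protocol described above, this statistic would simply be recomputed with the held-out semantic dimension's layers excluded from the quota fit, so a value staying above 0.90 is directly comparable to the headline r=0.96.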
Circularity Check
No circularity detected; the geometric subsumption argument and the quota diagnostic are established independently of the results they predict.
full rationale
The paper's core claims rest on a geometric analysis establishing strict subsumption of the single-vector baseline by the bipolar conceptor subspace, plus empirical Pearson correlations (r up to 0.96) between the conceptor quota and separability across models. These steps do not reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The quota is presented as parameter-free and evaluated externally; the subsumption follows from the definition of conceptors as soft projections over pooled activations without circular reduction to the baseline vector. The derivation chain is self-contained against the stated assumptions and does not invoke unverified uniqueness theorems or ansatzes smuggled via prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: activations from contrastive pairs can be pooled to estimate a soft projection matrix representing the full concept subspace.