pith. machine review for the scientific record.

arxiv: 2605.04980 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CL

Recognition: unknown

Conceptors for Semantic Steering

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords conceptors · semantic steering · activation steering · LLM control · bipolar concepts · subspace projection · Boolean composition · representation engineering

The pith

Conceptors steer large language models by projecting onto the full multidimensional subspace of a concept rather than a single direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conceptors, soft projection matrices built from activations of both positive and negative instances of a concept, capture the complete geometric space associated with that concept. This matters for readers interested in controlling AI outputs because it moves beyond the common practice of using one steering vector from contrastive pairs, offering instead a richer and more composable method. The authors demonstrate through geometry that the full subspace includes the single direction as a special case. They introduce the conceptor quota as a way to choose at which layer to apply steering without extra tuning, and they provide algebraic rules to combine different concepts. Tests across models indicate comparable or better results with fewer degenerate generations.
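The abstract's description matches the standard conceptor construction from the reservoir-computing literature (Jaeger, [11] in the reference graph): a correlation matrix of pooled activations, softened by an aperture parameter. A minimal sketch under that assumption; the paper's exact estimator, normalization, and aperture value are not given in the abstract.

```python
import numpy as np

def conceptor(X, aperture=10.0):
    """Soft projection matrix C = R (R + aperture^-2 I)^-1 (Jaeger-style).

    X: (n_samples, d) activations pooled across BOTH poles of a bipolar
    concept (positive and negative instances stacked). Eigenvalues of C
    lie in [0, 1), so C softly projects onto the high-variance subspace.
    """
    n, d = X.shape
    R = X.T @ X / n                                  # d x d correlation matrix
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))

# hypothetical pooled activations for one layer with hidden size d = 512
rng = np.random.default_rng(0)
pos = rng.normal(size=(64, 512))                     # positive-pole instances
neg = rng.normal(size=(64, 512))                     # negative-pole instances
C = conceptor(np.vstack([pos, neg]))
```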

Core claim

By estimating soft projection matrices from activations pooled across both poles of a bipolar concept, conceptors preserve the concept's full multidimensional subspace. A geometric analysis demonstrates that this bipolar subspace strictly subsumes the single-vector baseline. The conceptor quota acts as a parameter-free layer-selection diagnostic, achieving Pearson correlations up to 0.96 with concept separability across three models and three semantic dimensions. Conceptors admit a closed-form Boolean algebra for composition, and evaluations across a five-axis design space show they match or outperform additive baselines at multi-dimensional layers while yielding substantially fewer degenerate outputs.
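In the conceptor literature the quota is the normalized trace q(C) = tr(C)/d, i.e. the mean of C's singular values. A minimal layer-selection sketch under that assumed definition; picking the argmax layer is an illustrative rule, since the abstract only says the quota predicts separability.

```python
import numpy as np

def conceptor(X, aperture=10.0):
    R = X.T @ X / len(X)                             # as in the sketch above
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(X.shape[1]))

def conceptor_quota(C):
    """Quota q(C) = tr(C)/d: the mean singular value of the soft projection,
    i.e. the fraction of activation space the conceptor claims."""
    return np.trace(C) / C.shape[0]

# hypothetical per-layer pooled activations; rank layers by quota alone,
# with no tuning data (argmax selection is an illustrative rule, not the
# paper's stated procedure)
rng = np.random.default_rng(1)
acts = {layer: rng.normal(size=(128, 512)) for layer in range(24)}
quotas = {layer: conceptor_quota(conceptor(X)) for layer, X in acts.items()}
best_layer = max(quotas, key=quotas.get)
```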

What carries the argument

The conceptor: a soft projection matrix estimated from activations pooled across both poles of a bipolar concept, which preserves the full multidimensional subspace and admits closed-form Boolean operations.
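The abstract does not say how the matrix is applied at inference time. One plausible rule, following the conceptor-steering setup of Postmus and Abreu ([21] in the reference graph), is to pass the residual stream through C with a forward hook; the hook shape and the HuggingFace-style layer path below are assumptions, not the paper's stated method.

```python
import torch

def make_conceptor_hook(C):
    """Forward hook passing a layer's hidden states through conceptor C.
    The rule h <- h C^T is one plausible choice (cf. [21]); the paper's
    exact steering rule is not given in the abstract."""
    C_t = torch.as_tensor(C, dtype=torch.float32)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        W = C_t.to(dtype=hidden.dtype, device=hidden.device)
        steered = hidden @ W.T                       # (batch, seq, d) @ (d, d)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# hypothetical usage on a HuggingFace-style decoder layer:
# handle = model.model.layers[best_layer].register_forward_hook(make_conceptor_hook(C))
# ... model.generate(...) ...
# handle.remove()
```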

If this is right

  • The single-vector steering direction is contained within the conceptor subspace.
  • The conceptor quota enables selection of effective layers without model-specific tuning or extra data.
  • Concepts can be combined using exact AND, OR, and NOT operations (closed forms sketched after this list).
  • Steering performance matches or exceeds baselines particularly when the concept occupies a multidimensional subspace.
  • Fewer degenerate outputs are produced compared to single-direction methods.
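For reference, the closed-form Boolean algebra on conceptors is standard (Jaeger, [11]). A minimal sketch, assuming the operands are invertible; pseudoinverse variants exist for the singular case, and the paper may use those.

```python
import numpy as np

def NOT(C):
    """Complement: soft projection onto what C leaves out."""
    return np.eye(len(C)) - C

def AND(C, B):
    """Closed-form conjunction; assumes C and B are invertible (the
    singular case needs the pseudoinverse variant)."""
    I = np.eye(len(C))
    return np.linalg.inv(np.linalg.inv(C) + np.linalg.inv(B) - I)

def OR(C, B):
    """Disjunction via de Morgan: C OR B = NOT(NOT C AND NOT B)."""
    return NOT(AND(NOT(C), NOT(B)))
```

Because composed conceptors are again soft projection matrices, a combined concept can be applied with exactly the same steering machinery as a primitive one.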

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might extend naturally to concepts that are not strictly bipolar by adjusting the pooling strategy.
  • The geometric subsumption could be tested on other representation engineering techniques beyond steering.
  • Boolean composition opens the possibility of building hierarchical control policies for model behavior.
  • Validation on additional models and tasks would confirm the reliability of the quota as a general diagnostic.

Load-bearing premise

That pooling activations across both poles of a bipolar concept yields a subspace that strictly subsumes the single-vector baseline and that the conceptor quota reliably predicts separability without model-specific tuning.
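That premise is directly checkable: if the single-vector baseline sits inside the bipolar subspace, the difference-of-means steering direction should pass through the conceptor nearly unattenuated. A minimal editorial diagnostic, not the paper's geometric analysis:

```python
import numpy as np

def captured_fraction(C, v):
    """||C v|| for unit v: how much of a steering direction survives the
    conceptor. Near 1 => the baseline direction lies inside the soft
    subspace; near 0 would falsify the subsumption premise."""
    v = v / np.linalg.norm(v)
    return float(np.linalg.norm(C @ v))

# hypothetical check against a difference-of-means contrastive direction
rng = np.random.default_rng(2)
pos, neg = rng.normal(size=(64, 512)), rng.normal(size=(64, 512))
v = pos.mean(axis=0) - neg.mean(axis=0)
X = np.vstack([pos, neg])
R = X.T @ X / len(X)
C = R @ np.linalg.inv(R + 0.01 * np.eye(512))        # aperture = 10
print(captured_fraction(C, v))
```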

What would settle it

If experiments on a new model or concept show the conceptor quota correlation falling below 0.5 with separability, or if single-direction steering produces fewer degenerate outputs than conceptors at multi-dimensional layers, the superiority claims would not hold.
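Concretely, the first half of that bar reduces to a single correlation test; the arrays below are hypothetical placeholders standing in for measured per-layer values on a held-out model.

```python
import numpy as np
from scipy.stats import pearsonr

# hypothetical per-layer measurements: quota values and concept separability
# (e.g., linear-probe accuracy); replace with real data before drawing conclusions
quota = np.array([0.12, 0.18, 0.31, 0.42, 0.38, 0.25])
separability = np.array([0.55, 0.61, 0.78, 0.91, 0.87, 0.70])

r, p = pearsonr(quota, separability)
print(f"r = {r:.2f} (p = {p:.3g}); the diagnostic claim weakens if r < 0.5")
```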

read the original abstract

Activation-based steering provides control of LLM behavior at inference time, but the dominant paradigm reduces each concept to a single direction whose geometry is left largely unexamined. Rather than selecting a single steering direction, we use conceptors: soft projection matrices estimated from activations pooled across both poles of a bipolar concept, which preserve the concept's full multidimensional subspace. A geometric analysis shows the bipolar subspace strictly subsumes the single-vector baseline. We further show that the conceptor quota provides a parameter-free layer-selection diagnostic, predicting concept separability with Pearson correlations up to r=0.96 across three instruction-tuned models and three semantic dimensions. Beyond selection, conceptors admit a closed-form Boolean algebra (AND, OR, NOT): we evaluate conceptor compositionality on thematically related sub-concepts. Across a systematic five-axis design-space evaluation, conceptors match or outperform additive baselines at layers where concept subspaces are multi-dimensional while producing substantially fewer degenerate outputs. Conceptor steering is a geometrically principled, compositional, and practically safer alternative to single-direction steering from a limited number of contrastive pairs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes conceptors—soft projection matrices estimated from activations pooled across both poles of bipolar semantic concepts—as an alternative to single-direction steering vectors for controlling LLM behavior at inference time. It claims a geometric analysis demonstrates that the resulting bipolar subspace strictly subsumes the single-vector baseline, that the conceptor quota provides a parameter-free layer-selection diagnostic predicting separability (Pearson r up to 0.96 across three models and three dimensions), and that conceptors support closed-form Boolean composition (AND/OR/NOT) while matching or outperforming additive baselines with fewer degenerate outputs in a five-axis design-space evaluation.

Significance. If the geometric subsumption and empirical results hold, the work supplies a more principled, multidimensional, and compositional framework for activation steering that could reduce reliance on limited contrastive pairs and improve safety by avoiding degenerate outputs. The reported high correlations and systematic evaluation across models constitute concrete strengths that would support broader adoption if the underlying assumptions about activation subspaces are validated.

major comments (2)
  1. [Abstract / Geometric analysis] Abstract and geometric analysis section: the claim that the bipolar conceptor subspace 'strictly subsumes' the single-vector baseline assumes that directions orthogonal to the contrastive vector in the pooled activations carry semantically relevant variance rather than noise. If concept activations are effectively one-dimensional (as the skeptic note flags), the extra dimensions add no gain and the subsumption reduces to equality; an explicit rank or variance decomposition of the pooled activations is needed to establish this for the central geometric claim.
  2. [Abstract / Evaluation] Abstract and evaluation sections: the conceptor quota is presented as parameter-free and predictive (r=0.96), yet layer selection and the choice of bipolar pooling could embed model-specific fitting. The manuscript should clarify whether the quota was computed without any post-hoc adjustment across the three models and whether the correlation holds under cross-validation or held-out semantic dimensions.
minor comments (2)
  1. [Methods] Notation for the conceptor matrix and quota formula should be introduced with an explicit equation number in the methods section to aid reproducibility.
  2. [Evaluation] The five-axis design-space evaluation would benefit from a table summarizing the axes and the exact metrics used for 'degenerate outputs' to make the comparison with additive baselines fully transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below. Where the feedback identifies gaps in supporting evidence for our geometric claims and evaluation robustness, we have incorporated revisions to strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract / Geometric analysis] Abstract and geometric analysis section: the claim that the bipolar conceptor subspace 'strictly subsumes' the single-vector baseline assumes that directions orthogonal to the contrastive vector in the pooled activations carry semantically relevant variance rather than noise. If concept activations are effectively one-dimensional (as the skeptic note flags), the extra dimensions add no gain and the subsumption reduces to equality; an explicit rank or variance decomposition of the pooled activations is needed to establish this for the central geometric claim.

    Authors: We agree that the strict subsumption claim requires explicit evidence that the orthogonal directions in the pooled bipolar activations contain semantically relevant variance. In the revised manuscript we have added a variance decomposition (via SVD of the pooled activation matrix) and rank analysis for each concept and layer. This shows that the effective rank of the bipolar subspace exceeds 1 in the layers where conceptors outperform single-vector baselines, with the additional singular values correlating positively with improved separability. We have updated the geometric analysis section and abstract to report these statistics, thereby grounding the subsumption claim in the data rather than assumption (one such rank check is sketched after these responses). revision: yes

  2. Referee: [Abstract / Evaluation] Abstract and evaluation sections: the conceptor quota is presented as parameter-free and predictive (r=0.96), yet layer selection and the choice of bipolar pooling could embed model-specific fitting. The manuscript should clarify whether the quota was computed without any post-hoc adjustment across the three models and whether the correlation holds under cross-validation or held-out semantic dimensions.

    Authors: The conceptor quota is computed directly from the singular-value spectrum of the raw pooled activation matrix with no post-hoc scaling, thresholding, or model-specific adjustments; layer selection uses only the quota value itself. To address potential fitting concerns we have added a leave-one-dimension-out cross-validation: for each held-out semantic dimension the quota is recomputed on the remaining dimensions and the Pearson correlation with separability is re-evaluated. The correlations remain above 0.90 across the three models. These results are now reported in the evaluation section. revision: yes
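The variance decomposition described in the first response can be approximated with a plain SVD. A minimal sketch of such an effective-rank check, assuming a 90% explained-variance cutoff; the revised manuscript's exact procedure is not specified here.

```python
import numpy as np

def effective_rank(X, var_threshold=0.90):
    """Smallest k such that the top-k singular values of the centered pooled
    activations explain >= var_threshold of total variance. Rank > 1 at a
    layer is what strict subsumption needs; the cutoff is an assumption."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, var_threshold) + 1)

# hypothetical pooled bipolar activations for one concept at one layer
rng = np.random.default_rng(3)
print(effective_rank(rng.normal(size=(128, 512))))
```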

Circularity Check

0 steps flagged

No circularity detected; geometric subsumption and quota diagnostic are independent of inputs

full rationale

The paper's core claims rest on a geometric analysis establishing strict subsumption of the single-vector baseline by the bipolar conceptor subspace, plus empirical Pearson correlations (r up to 0.96) between the conceptor quota and separability across models. These steps do not reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The quota is presented as parameter-free and evaluated externally; the subsumption follows from the definition of conceptors as soft projections over pooled activations without circular reduction to the baseline vector. The derivation chain is self-contained against the stated assumptions and does not invoke unverified uniqueness theorems or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard assumptions of activation-based steering.

axioms (1)
  • domain assumption Activations from contrastive pairs can be pooled to estimate a soft projection matrix representing the full concept subspace
    Central to defining conceptors from limited examples

pith-pipeline@v0.9.0 · 5511 in / 1185 out tokens · 22147 ms · 2026-05-08T17:12:07.697938+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 21 canonical work pages · 4 internal anchors

  1. [1]

    Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.148 TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644--1650, Online. Association for Computational ...

  2. [2]

    Yonatan Belinkov. 2022. https://doi.org/10.1162/coli_a_00422 Probing classifiers: Promises, shortcomings, and advances . Computational Linguistics, 48(1):207--219

  3. [3]

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

  4. [4]

    Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. https://doi.org/10.18653/v1/P18-1198 What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

  5. [5]

    Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. 2023. https://doi.org/10.18653/v1/2023.acl-long.656 From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

  6. [6]

    Wes Gurnee and Max Tegmark. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/0a6059857ae5c82ea9726ee9282a7145-Paper-Conference.pdf Language models represent space and time . In International Conference on Learning Representations, volume 2024, pages 2483--2503

  7. [7]

    Xu He and Herbert Jaeger. 2018. https://openreview.net/forum?id=B1al7jg0b Overcoming catastrophic interference using conceptor-aided backpropagation . In International Conference on Learning Representations

  8. [8]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. http://arxiv.org/abs/2409.12186 Qwen2.5-coder technical report

  9. [9]

    Herbert Jaeger. 2001. https://www.ai.rug.nl/minds/uploads/EchoStatesTechRep.pdf The “echo state” approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology, GMD Technical Report, 148(34):13

  10. [10]

    Herbert Jaeger. 2017. http://jmlr.org/papers/v18/15-449.html Using conceptors to manage neural long-term memories for temporal patterns . Journal of Machine Learning Research, 18(13):1--43

  11. [11]

    Herbert Jaeger. 2024. http://arxiv.org/abs/1403.3369 Controlling recurrent neural networks by conceptors

  12. [12]

    Saket Karve, Lyle Ungar, and João Sedoc. 2019. https://doi.org/10.18653/v1/W19-3806 Conceptor debiasing of word representations evaluated on WEAT. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 40--48, Florence, Italy. Association for Computational Linguistics

  13. [13]

    Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. 2019. https://doi.org/10.18653/v1/S19-2145 SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829--839, Minneapolis, Minnesota, USA. Association f...

  14. [14]

    Andrew V. Knyazev and Merico E. Argentati. 2002. https://doi.org/10.1137/S1064827500377332 Principal angles between subspaces in an A-based scalar product: Algorithms and perturbation estimates. SIAM J. Sci. Comput., 23(6):2008–2040

  15. [15]

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. https://openreview.net/forum?id=aLLuYpn83y Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems

  16. [16]

    Tianlin Liu, Lyle Ungar, and João Sedoc. 2019a. https://doi.org/10.1609/aaai.v33i01.33016778 Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educ...

  17. [17]

    Tianlin Liu, Lyle Ungar, and João Sedoc. 2019b. https://doi.org/10.18653/v1/N19-1331 Continual learning for sentence representations using conceptors. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3274--3279...

  18. [18]

    Miranda Muqing Miao, Young-Min Cho, and Lyle Ungar. 2026. http://arxiv.org/abs/2602.06022 Correctness-optimized residual activation lens (coral): Transferrable and calibration-aware inference-time steering

  19. [19]

    Bo Pang and Lillian Lee. 2005. https://doi.org/10.3115/1219840.1219855 Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 115--124, Ann Arbor, Michigan. Association for Computational Linguistics

  20. [20]

    Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. https://proceedings.mlr.press/v235/park24c.html The linear representation hypothesis and the geometry of large language models . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 39643--39666. PMLR

  21. [21]

    Joris Postmus and Steven Abreu. 2024. https://openreview.net/forum?id=gyAnAq16HC Steering large language models using conceptors: Improving addition-based activation engineering . In MINT: Foundation Model Interventions

  22. [22]

    Sunny Rai, Khushi Shelat, Devansh Jain, Ashwin Kishen, Young Min Cho, Maitreyi Redkar, Samindara Hardikar-Sawant, Lyle Ungar, and Sharath Chandra Guntuku. 2025. https://doi.org/10.18653/v1/2025.c3nlp-1.10 Cross-cultural differences in mental health expressions on social media . In Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3...

  23. [23]

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. https://doi.org/10.18653/v1/2024.acl-long.828 Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504--15522, Bangkok, Thailand. Assoc...

  24. [24]

    Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. https://proceedings.mlr.press/v202/santurkar23a/santurkar23a.pdf Whose opinions do language models reflect? Proceedings of the 40th International Conference on Machine Learning

  25. [25]

    Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru. 2024. https://arxiv.org/abs/2402.09631 Representation surgery: Theory and practice of affine steering . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR

  26. [26]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. https://aclanthology.org/D13-1170/ Recursive deep models for semantic compositionality over a sentiment treebank . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642, Seattle, Washi...

  27. [27]

    Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. 2026. https://transformer-circuits.pub/2026/emotions/index.html Emotion concepts and their function in a large language model...

  28. [28]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, et al. 2024. http://arxiv.org/abs/2408.00118 Gemma 2: Improving open language models at a practical size

  29. [29]

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. 2024. https://tr...

  30. [30]

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. https://doi.org/10.18653/v1/P19-1452 BERT rediscovers the classical NLP pipeline . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593--4601, Florence, Italy. Association for Computational Linguistics

  31. [31]

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2024. http://arxiv.org/abs/2308.10248 Steering language models with activation engineering

  32. [32]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30

  33. [33]

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. 2024. http://arxiv.org/abs/2404.03592 ReFT: Representation finetuning for language models

  34. [34]

    Zejia You, Chunyuan Deng, and Hanjie Chen. 2026. http://arxiv.org/abs/2602.08169 Spherical steering: Geometry-aware activation rotation for language models