Recognition: 2 theorem links
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Pith reviewed 2026-05-12 04:46 UTC · model grok-4.3
The pith
Behavioral directions for the same traits align across Llama, Qwen, Mistral, and Phi models inside a shared anchor coordinate space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Projecting each model's representations into a shared anchor coordinate space allows behavioral directions from multiple source models to be averaged into a canonical direction that reconstructs accurately in a target model's native space using only anchor activations. For the Llama-Qwen-Mistral-Phi (LQMP) cluster, same-axis directions align tightly enough that held-out targets reach 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering produces refusal-rate shifts of up to 0.46 under distribution shift. Two source models and small anchor pools already suffice for useful approximations.
What carries the argument
The anchor-projection framework, which maps each model's hidden states into a shared anchor coordinate space (ACS) via fixed anchor activations so that directions can be averaged and then reconstructed in any target model's native space.
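The machinery above can be made concrete. A minimal sketch of the projection-and-averaging step, assuming a cosine-style anchor encoding (function names and the row normalization are illustrative assumptions, not the paper's exact mapping):

```python
import numpy as np

def project_to_acs(direction, anchor_acts):
    """Express a behavioral direction in anchor coordinates.

    anchor_acts: (N, d) fixed anchor activations for one model. Rows are
    normalized so each anchor contributes a unit vector, and the direction
    is encoded by its similarity to every anchor.
    """
    A = anchor_acts / np.linalg.norm(anchor_acts, axis=1, keepdims=True)
    v = direction / np.linalg.norm(direction)
    return A @ v  # (N,) coordinates in the shared anchor space

def canonical_direction(acs_coords):
    """Average per-source ACS coordinate vectors into one unit-norm
    canonical direction in the shared space."""
    c = np.asarray(acs_coords).mean(axis=0)
    return c / np.linalg.norm(c)
```

Because every model's hidden states are compared against the same fixed anchor set, the N-dimensional coordinate vectors are directly comparable even when the models' native hidden dimensions differ.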
If this is right
- Same-axis directions can be averaged across the aligned LQMP models to produce a usable canonical direction for any member of the cluster.
- Only two source models and small anchor pools are required to approximate transferable directions with high downstream accuracy.
- Canonical steering works under distribution shift, producing refusal-rate changes up to 0.46.
- Held-out models achieve 0.83 ten-way detection accuracy and 0.95 binary AUROC when using the reconstructed directions.
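The detection claim in the last bullet has a simple operational reading: score a hidden state against each of the ten reconstructed axis directions and take the argmax. A hedged sketch (the paper's actual probe may be a trained classifier rather than a raw projection):

```python
import numpy as np

def ten_way_detect(hidden_state, axis_dirs):
    """Pick the behavioral axis whose reconstructed direction the
    hidden state projects onto most strongly.

    axis_dirs: (10, d) unit-norm reconstructed directions, one per axis.
    """
    scores = axis_dirs @ hidden_state  # (10,) projection scores
    return int(np.argmax(scores))
```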
Where Pith is reading between the lines
- A fixed public set of anchors could let practitioners adapt steering vectors to new model releases without repeating full direction extraction.
- The same projection technique might be tested on additional model families to check whether the alignment pattern generalizes beyond the current LQMP cluster.
- If the alignment persists, interpretability tools could treat behavioral axes as approximately family-agnostic within aligned groups.
Load-bearing premise
Anchor activations alone contain enough information to reconstruct a canonical behavioral direction in any target model's native hidden space without fine-tuning or target-specific direction extraction.
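Operationally, this premise says the canonical ACS coordinates must be invertible through the target's anchor matrix alone. A minimal least-squares sketch, assuming a pseudoinverse-style mapping (an illustrative choice; the paper's reconstruction rule may differ):

```python
import numpy as np

def reconstruct_in_target(canonical_coords, target_anchor_acts):
    """Recover a native-space direction v whose anchor projection matches
    the canonical ACS coordinates, using only the target model's anchor
    activations (no target-specific direction extraction).
    """
    A = target_anchor_acts / np.linalg.norm(target_anchor_acts, axis=1, keepdims=True)
    v, *_ = np.linalg.lstsq(A, canonical_coords, rcond=None)
    return v / np.linalg.norm(v)
```

Under this reading, the premise fails exactly when the anchor matrix is rank-deficient along the behavioral axis, in which case the least-squares solution is dominated by whatever variance the anchors do span.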
What would settle it
If the direction reconstructed from anchors in a held-out model from the LQMP cluster produces no better than random performance on ten-way behavioral detection or zero measurable steering effect, the cross-family transfer claim would fail.
Original abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an anchor-projection framework that maps hidden representations from different LLM families into a shared Anchor Coordinate Space (ACS). Behavioral directions extracted from source models are projected into ACS, averaged to form a canonical direction, and then reconstructed in a target model's native space using only that model's anchor activations, without fine-tuning or target-specific extraction. Evaluations on five instruction-tuned families and ten behavioral axes report tight alignment of same-axis directions within the Llama-Qwen-Mistral-Phi cluster in ACS, with held-out targets achieving 0.83 ten-way detection accuracy and 0.95 mean binary AUROC; canonical steering produces refusal-rate shifts up to +0.46 under distribution shift. Sensitivity analyses indicate that two source models and small anchor pools suffice.
Significance. If the reconstruction and transfer results hold under proper controls, the work would advance cross-family interpretability by demonstrating that behavioral axes can be made universal via an external anchor basis rather than model-specific fitting. The reported robustness to minimal sources and anchors is a concrete strength that could enable practical applications in steering and detection. The approach offers a falsifiable test of representation-level universality and supplies reproducible numerical evidence on held-out targets.
major comments (2)
- [§5] §5 (Results and Tables 1-3): the reported 0.83 ten-way accuracy and 0.95 AUROC for held-out targets are presented without baselines (e.g., random anchor projections or direct non-ACS transfer), statistical tests, or controls for confounds such as tokenizer overlap and hidden-dimension mismatch. These omissions are load-bearing because the central transfer claim cannot be evaluated without them.
- [§4.2] §4.2 (Reconstruction mapping): the procedure that reconstructs the ACS-averaged canonical direction into the target hidden space using only anchor activations assumes the anchors span the behavioral variance. No diagnostic (e.g., correlation of anchor-induced variance with the target axis or ablation of anchor subsets) is supplied; if the anchors are orthogonal or weakly correlated with the axis, the mapped vector would be dominated by noise, directly undermining the 'no target-specific extraction' guarantee.
minor comments (2)
- [§3] The exact linear map from each model's hidden space to ACS (including dimensionality and normalization) is described only at a high level; an explicit equation would improve reproducibility.
- [Figure 2] Figure 2 (ACS alignment visualization) would benefit from axis labels indicating the behavioral axes and a quantitative measure of tightness (e.g., cosine variance) rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our anchor-projection framework. We address each major comment below and will revise the manuscript to incorporate additional baselines, statistical tests, and diagnostics as suggested.
Point-by-point responses
-
Referee: [§5] §5 (Results and Tables 1-3): the reported 0.83 ten-way accuracy and 0.95 AUROC for held-out targets are presented without baselines (e.g., random anchor projections or direct non-ACS transfer), statistical tests, or controls for confounds such as tokenizer overlap and hidden-dimension mismatch. These omissions are load-bearing because the central transfer claim cannot be evaluated without them.
Authors: We agree that baselines, statistical tests, and confound controls are necessary to rigorously evaluate the transfer claims. In the revised manuscript we will add comparisons against random anchor projections and direct non-ACS transfer, include permutation or bootstrap statistical tests for the reported accuracy and AUROC values, and explicitly discuss tokenizer overlap and dimensional mismatch (noting that the shared ACS basis is intended to mitigate the latter). These additions will appear in §5 and Tables 1-3. revision: yes
-
Referee: [§4.2] §4.2 (Reconstruction mapping): the procedure that reconstructs the ACS-averaged canonical direction into the target hidden space using only anchor activations assumes the anchors span the behavioral variance. No diagnostic (e.g., correlation of anchor-induced variance with the target axis or ablation of anchor subsets) is supplied; if the anchors are orthogonal or weakly correlated with the axis, the mapped vector would be dominated by noise, directly undermining the 'no target-specific extraction' guarantee.
Authors: Our sensitivity analyses already show that small anchor pools suffice, indicating that anchors capture relevant variance. We nevertheless accept the request for explicit diagnostics and will add to §4.2 both the correlation between anchor-induced variance and each behavioral axis and ablations over anchor subsets. These will demonstrate that reconstruction performance is stable and not noise-dominated, thereby reinforcing the no-target-extraction claim. revision: yes
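The anchor-subset ablation promised in the response could look roughly like the following sketch, which subsamples the anchor pool, re-runs a least-squares reconstruction, and tracks cosine agreement with the true direction (the reconstruction rule and all names here are assumptions for illustration, not the authors' protocol):

```python
import numpy as np

def _reconstruct(coords, A):
    # Least-squares inversion of the anchor projection (illustrative choice).
    v, *_ = np.linalg.lstsq(A, coords, rcond=None)
    return v / np.linalg.norm(v)

def anchor_subset_ablation(direction, anchor_acts, sizes=(64, 32, 16),
                           trials=20, seed=0):
    """Mean cosine between the true direction and directions reconstructed
    from random anchor subsets of each size; stable values near 1 indicate
    the anchors span the behavioral axis rather than noise."""
    rng = np.random.default_rng(seed)
    d = direction / np.linalg.norm(direction)
    results = {}
    for n in sizes:
        cosines = []
        for _ in range(trials):
            idx = rng.choice(len(anchor_acts), size=n, replace=False)
            A = anchor_acts[idx]
            A = A / np.linalg.norm(A, axis=1, keepdims=True)
            v = _reconstruct(A @ d, A)
            cosines.append(float(v @ d))
        results[n] = float(np.mean(cosines))
    return results
```

A sharp drop in mean cosine as the subset shrinks would be direct evidence against the "anchors span the behavioral variance" assumption the referee questions.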
Circularity Check
No significant circularity; method uses independent anchors and source extractions
full rationale
The derivation extracts behavioral directions from source models, projects them into ACS, averages to a canonical direction, and reconstructs for targets solely via anchor activations without target behavioral data or fine-tuning. Downstream metrics (0.83 accuracy, 0.95 AUROC, steering shifts) are measured on held-out targets as empirical transfer results. No equation or step reduces the claimed transfer to a self-definition, fitted input renamed as prediction, or self-citation chain; the reconstruction is a defined linear mapping whose success is externally validated rather than tautological.
Axiom & Free-Parameter Ledger
invented entities (1)
- Anchor Coordinate Space (ACS): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
anchor projection π_m^pt(h) = A_m · norm(h − μ_m) ∈ ℝ^N ... π_m^dir(v) = A_m · norm(v)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.