Recognition: 2 theorem links
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Pith reviewed 2026-05-12 04:46 UTC · model grok-4.3
The pith
Behavioral directions for the same traits align across Llama, Qwen, Mistral, and Phi models inside a shared anchor coordinate space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Projecting each model's representations into a shared anchor coordinate space allows behavioral directions from multiple source models to be averaged into a canonical direction that reconstructs accurately in a target model's native space using only anchor activations. For the Llama-Qwen-Mistral-Phi (LQMP) cluster, same-axis directions align tightly enough that held-out targets reach 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering produces refusal-rate shifts of up to 0.46 under distribution shift. Two source models and small anchor pools already suffice for useful approximations.
What carries the argument
The anchor-projection framework, which maps each model's hidden states into a shared anchor coordinate space (ACS) via fixed anchor activations so that directions can be averaged and then reconstructed in any target model's native space.
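The machinery above can be made concrete. A minimal sketch of the projection-and-averaging step, assuming a cosine-style anchor encoding (function names and the row normalization are illustrative assumptions, not the paper's exact mapping):

```python
import numpy as np

def project_to_acs(direction, anchor_acts):
    """Express a behavioral direction in anchor coordinates.

    anchor_acts: (N, d) fixed anchor activations for one model. Rows are
    normalized so each anchor contributes a unit vector, and the direction
    is encoded by its similarity to every anchor.
    """
    A = anchor_acts / np.linalg.norm(anchor_acts, axis=1, keepdims=True)
    v = direction / np.linalg.norm(direction)
    return A @ v  # (N,) coordinates in the shared anchor space

def canonical_direction(acs_coords):
    """Average per-source ACS coordinate vectors into one unit-norm
    canonical direction in the shared space."""
    c = np.asarray(acs_coords).mean(axis=0)
    return c / np.linalg.norm(c)
```

Because every model's hidden states are compared against the same fixed anchor set, the N-dimensional coordinate vectors are directly comparable even when the models' native hidden dimensions differ.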
If this is right
- Same-axis directions can be averaged across the aligned LQMP models to produce a usable canonical direction for any member of the cluster.
- Only two source models and small anchor pools are required to approximate transferable directions with high downstream accuracy.
- Canonical steering works under distribution shift, producing refusal-rate changes up to 0.46.
- Held-out models achieve 0.83 ten-way detection accuracy and 0.95 binary AUROC when using the reconstructed directions.
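The detection claim in the last bullet has a simple operational reading: score a hidden state against each of the ten reconstructed axis directions and take the argmax. A hedged sketch (the paper's actual probe may be a trained classifier rather than a raw projection):

```python
import numpy as np

def ten_way_detect(hidden_state, axis_dirs):
    """Pick the behavioral axis whose reconstructed direction the
    hidden state projects onto most strongly.

    axis_dirs: (10, d) unit-norm reconstructed directions, one per axis.
    """
    scores = axis_dirs @ hidden_state  # (10,) projection scores
    return int(np.argmax(scores))
```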
Where Pith is reading between the lines
- A fixed public set of anchors could let practitioners adapt steering vectors to new model releases without repeating full direction extraction.
- The same projection technique might be tested on additional model families to check whether the alignment pattern generalizes beyond the current LQMP cluster.
- If the alignment persists, interpretability tools could treat behavioral axes as approximately family-agnostic within aligned groups.
Load-bearing premise
Anchor activations alone contain enough information to reconstruct a canonical behavioral direction in any target model's native hidden space without fine-tuning or target-specific direction extraction.
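Operationally, this premise says the canonical ACS coordinates must be invertible through the target's anchor matrix alone. A minimal least-squares sketch, assuming a pseudoinverse-style mapping (an illustrative choice; the paper's reconstruction rule may differ):

```python
import numpy as np

def reconstruct_in_target(canonical_coords, target_anchor_acts):
    """Recover a native-space direction v whose anchor projection matches
    the canonical ACS coordinates, using only the target model's anchor
    activations (no target-specific direction extraction).
    """
    A = target_anchor_acts / np.linalg.norm(target_anchor_acts, axis=1, keepdims=True)
    v, *_ = np.linalg.lstsq(A, canonical_coords, rcond=None)
    return v / np.linalg.norm(v)
```

Under this reading, the premise fails exactly when the anchor matrix is rank-deficient along the behavioral axis, in which case the least-squares solution is dominated by whatever variance the anchors do span.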
What would settle it
If the direction reconstructed from anchors in a held-out model from the LQMP cluster produces no better than random performance on ten-way behavioral detection or zero measurable steering effect, the cross-family transfer claim would fail.
Original abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an anchor-projection framework that maps hidden representations from different LLM families into a shared Anchor Coordinate Space (ACS). Behavioral directions extracted from source models are projected into ACS, averaged to form a canonical direction, and then reconstructed in a target model's native space using only that model's anchor activations, without fine-tuning or target-specific extraction. Evaluations on five instruction-tuned families and ten behavioral axes report tight alignment of same-axis directions within the Llama-Qwen-Mistral-Phi cluster in ACS, with held-out targets achieving 0.83 ten-way detection accuracy and 0.95 mean binary AUROC; canonical steering produces refusal-rate shifts up to +0.46 under distribution shift. Sensitivity analyses indicate that two source models and small anchor pools suffice.
Significance. If the reconstruction and transfer results hold under proper controls, the work would advance cross-family interpretability by demonstrating that behavioral axes can be made universal via an external anchor basis rather than model-specific fitting. The reported robustness to minimal sources and anchors is a concrete strength that could enable practical applications in steering and detection. The approach offers a falsifiable test of representation-level universality and supplies reproducible numerical evidence on held-out targets.
major comments (2)
- [§5] §5 (Results and Tables 1-3): the reported 0.83 ten-way accuracy and 0.95 AUROC for held-out targets are presented without baselines (e.g., random anchor projections or direct non-ACS transfer), statistical tests, or controls for confounds such as tokenizer overlap and hidden-dimension mismatch. These omissions are load-bearing because the central transfer claim cannot be evaluated without them.
- [§4.2] §4.2 (Reconstruction mapping): the procedure that reconstructs the ACS-averaged canonical direction into the target hidden space using only anchor activations assumes the anchors span the behavioral variance. No diagnostic (e.g., correlation of anchor-induced variance with the target axis or ablation of anchor subsets) is supplied; if the anchors are orthogonal or weakly correlated with the axis, the mapped vector would be dominated by noise, directly undermining the 'no target-specific extraction' guarantee.
minor comments (2)
- [§3] The exact linear map from each model's hidden space to ACS (including dimensionality and normalization) is described only at a high level; an explicit equation would improve reproducibility.
- [Figure 2] Figure 2 (ACS alignment visualization) would benefit from axis labels indicating the behavioral axes and a quantitative measure of tightness (e.g., cosine variance) rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our anchor-projection framework. We address each major comment below and will revise the manuscript to incorporate additional baselines, statistical tests, and diagnostics as suggested.
Point-by-point responses
-
Referee: [§5] §5 (Results and Tables 1-3): the reported 0.83 ten-way accuracy and 0.95 AUROC for held-out targets are presented without baselines (e.g., random anchor projections or direct non-ACS transfer), statistical tests, or controls for confounds such as tokenizer overlap and hidden-dimension mismatch. These omissions are load-bearing because the central transfer claim cannot be evaluated without them.
Authors: We agree that baselines, statistical tests, and confound controls are necessary to rigorously evaluate the transfer claims. In the revised manuscript we will add comparisons against random anchor projections and direct non-ACS transfer, include permutation or bootstrap statistical tests for the reported accuracy and AUROC values, and explicitly discuss tokenizer overlap and dimensional mismatch (noting that the shared ACS basis is intended to mitigate the latter). These additions will appear in §5 and Tables 1-3. revision: yes
-
Referee: [§4.2] §4.2 (Reconstruction mapping): the procedure that reconstructs the ACS-averaged canonical direction into the target hidden space using only anchor activations assumes the anchors span the behavioral variance. No diagnostic (e.g., correlation of anchor-induced variance with the target axis or ablation of anchor subsets) is supplied; if the anchors are orthogonal or weakly correlated with the axis, the mapped vector would be dominated by noise, directly undermining the 'no target-specific extraction' guarantee.
Authors: Our sensitivity analyses already show that small anchor pools suffice, indicating that anchors capture relevant variance. We nevertheless accept the request for explicit diagnostics and will add to §4.2 both the correlation between anchor-induced variance and each behavioral axis and ablations over anchor subsets. These will demonstrate that reconstruction performance is stable and not noise-dominated, thereby reinforcing the no-target-extraction claim. revision: yes
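The anchor-subset ablation promised in the response could look roughly like the following sketch, which subsamples the anchor pool, re-runs a least-squares reconstruction, and tracks cosine agreement with the true direction (the reconstruction rule and all names here are assumptions for illustration, not the authors' protocol):

```python
import numpy as np

def _reconstruct(coords, A):
    # Least-squares inversion of the anchor projection (illustrative choice).
    v, *_ = np.linalg.lstsq(A, coords, rcond=None)
    return v / np.linalg.norm(v)

def anchor_subset_ablation(direction, anchor_acts, sizes=(64, 32, 16),
                           trials=20, seed=0):
    """Mean cosine between the true direction and directions reconstructed
    from random anchor subsets of each size; stable values near 1 indicate
    the anchors span the behavioral axis rather than noise."""
    rng = np.random.default_rng(seed)
    d = direction / np.linalg.norm(direction)
    results = {}
    for n in sizes:
        cosines = []
        for _ in range(trials):
            idx = rng.choice(len(anchor_acts), size=n, replace=False)
            A = anchor_acts[idx]
            A = A / np.linalg.norm(A, axis=1, keepdims=True)
            v = _reconstruct(A @ d, A)
            cosines.append(float(v @ d))
        results[n] = float(np.mean(cosines))
    return results
```

A sharp drop in mean cosine as the subset shrinks would be direct evidence against the "anchors span the behavioral variance" assumption the referee questions.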
Circularity Check
No significant circularity; method uses independent anchors and source extractions
full rationale
The derivation extracts behavioral directions from source models, projects them into ACS, averages to a canonical direction, and reconstructs for targets solely via anchor activations without target behavioral data or fine-tuning. Downstream metrics (0.83 accuracy, 0.95 AUROC, steering shifts) are measured on held-out targets as empirical transfer results. No equation or step reduces the claimed transfer to a self-definition, fitted input renamed as prediction, or self-citation chain; the reconstruction is a defined linear mapping whose success is externally validated rather than tautological.
Axiom & Free-Parameter Ledger
invented entities (1)
- Anchor Coordinate Space (ACS): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
anchor projection π_m^pt(h) = A_m · norm(h − μ_m) ∈ ℝ^N ... π_m^dir(v) = A_m · norm(v)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.