Comparing Linear Probes with Mahalanobis Cosine Similarity

Nikolaus Kriegeskorte; Peter Hase; Zhuofan Josh Ying

arxiv: 2606.19603 · v1 · pith:TQK3A6DQnew · submitted 2026-06-17 · 💻 cs.LG

Pith reviewed 2026-06-26 20:37 UTC · model grok-4.3

keywords linear probes · Mahalanobis cosine similarity · out-of-distribution AUROC · signal-to-noise ratio · Gaussian projections · interpretability · cosine similarity

The pith

For balanced Gaussian class projections, Mahalanobis cosine similarity to a reference probe is linearly related to a linear probe's OOD AUROC because both are sigmoid functions of the same signal-to-noise ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that when data projections onto probe directions follow Gaussian distributions and the two classes have equal size, the out-of-distribution area under the ROC curve achieved by a linear probe stands in linear relation to its Mahalanobis cosine similarity with a reference probe trained on the OOD data. Both quantities emerge as sigmoid-shaped functions of the probe's signal-to-noise ratio computed on the test set, which supplies the closed-form reason for the near-perfect empirical correlation (R^2 = 0.98) observed across models, layers, and domains. The same derivation identifies the precise conditions under which the linear relationship must break, and the authors confirm those breakdowns in additional experiments. This supplies a task-aware, covariance-adjusted alternative to ordinary Euclidean cosine similarity when ranking or comparing linear probes.

Core claim

For balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linearly related because both are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data.
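The AUROC half of this claim can be checked numerically. The sketch below is not taken from the paper: it uses the standard binormal result that, for Gaussian class-conditional scores with mean gap Δ and standard deviations σ₊ and σ₋, AUROC = Φ(Δ / √(σ₊² + σ₋²)). The paper's exact SNR definition is not reproduced on this page, so Δ with unit variances serves as a stand-in, and the function names are hypothetical.

```python
import math
import numpy as np

def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binormal_auroc(delta: float, sigma_pos: float, sigma_neg: float) -> float:
    """Closed-form AUROC when both classes have Gaussian scores."""
    return normal_cdf(delta / math.sqrt(sigma_pos**2 + sigma_neg**2))

def empirical_auroc(pos: np.ndarray, neg: np.ndarray) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(0)
n = 2000  # same sample size for both classes (balanced)
for delta in (0.0, 0.5, 1.0, 2.0):  # stand-in SNR values, an assumption
    pos = rng.normal(delta, 1.0, size=n)
    neg = rng.normal(0.0, 1.0, size=n)
    print(f"delta={delta:.1f}  empirical={empirical_auroc(pos, neg):.3f}  "
          f"closed_form={binormal_auroc(delta, 1.0, 1.0):.3f}")
```

The empirical and closed-form columns should agree up to sampling noise, and the closed form traces a sigmoid in Δ.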

What carries the argument

Mahalanobis cosine similarity, which reweights the inner product of two probe directions by the test-data covariance matrix, as the abstract describes it.
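A minimal sketch of this quantity, assuming the form suggested by the abstract's wording (the inner product of two probe directions reweighted by the test-data covariance Σ): MCS(w₁, w₂) = w₁ᵀΣw₂ / √(w₁ᵀΣw₁ · w₂ᵀΣw₂). The function names and the toy covariance are hypothetical, and the paper may define or estimate Σ differently.

```python
import numpy as np

def mahalanobis_cosine(w1, w2, cov) -> float:
    """Cosine similarity under the inner product <a, b> = a^T cov b (assumed form)."""
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    cov = np.asarray(cov, dtype=float)
    num = w1 @ cov @ w2
    den = np.sqrt((w1 @ cov @ w1) * (w2 @ cov @ w2))
    return float(num / den)

def euclidean_cosine(w1, w2) -> float:
    """Ordinary cosine similarity, i.e. the same formula with cov = identity."""
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# Toy case: the probes differ mostly along a direction where the data barely varies.
w_a = np.array([1.0, 0.0])
w_b = np.array([1.0, 5.0])
cov = np.diag([100.0, 0.01])  # high variance on axis 0, almost none on axis 1

print("Euclidean cosine:  ", euclidean_cosine(w_a, w_b))
print("Mahalanobis cosine:", mahalanobis_cosine(w_a, w_b, cov))
```

In this toy case the two probes look dissimilar by Euclidean cosine but nearly identical by MCS, because they disagree only along an axis that contributes almost nothing to the projected scores.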

If this is right

MCS supplies a theoretically justified replacement for Euclidean cosine when ranking linear probes by expected OOD performance.
The linear relation holds across models, layers, and concept domains as long as the Gaussian and balance conditions are met.
Linearity is guaranteed only under the Gaussian and balance assumptions; whether they hold can be checked by inspecting projection histograms or class counts.
The SNR itself becomes the single sufficient statistic that governs both probe quality and inter-probe similarity under the stated model.
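The checks mentioned in this list (projection shape and class counts) can be sketched as below. This is an illustrative diagnostic, not the authors' procedure: it reports per-class sample skewness and excess kurtosis of the projections, both near zero for Gaussian data, plus the minority-class fraction. The function name and the returned keys are hypothetical, and no pass/fail thresholds are implied.

```python
import numpy as np

def projection_diagnostics(X, y, w) -> dict:
    """Per-class shape statistics of the projections X @ w, plus class balance.

    X: (n_samples, n_features) activations, y: binary labels in {0, 1},
    w: (n_features,) probe direction.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    z = X @ np.asarray(w, dtype=float)
    out = {}
    for label in (0, 1):
        zc = z[y == label]
        u = (zc - zc.mean()) / zc.std()  # standardize within the class
        out[label] = {
            "n": int(zc.size),
            "skewness": float(np.mean(u**3)),               # 0 for a Gaussian
            "excess_kurtosis": float(np.mean(u**4) - 3.0),  # 0 for a Gaussian
        }
    n0, n1 = out[0]["n"], out[1]["n"]
    out["minority_fraction"] = min(n0, n1) / (n0 + n1)  # 0.5 means balanced
    return out

# Usage on synthetic Gaussian data with balanced labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = np.repeat([0, 1], 5_000)
w = np.array([1.0, 0.0, 0.0, 0.0])
print(projection_diagnostics(X, y, w))
```

Heavy-tailed projections show up as large positive excess kurtosis, and class imbalance as a minority fraction well below 0.5.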

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

MCS could be computed on a small held-out set to rank many candidate probes without running full OOD AUROC evaluations on each.
The result suggests that covariance-adjusted similarities may improve other comparison tasks that currently rely on Euclidean inner products in high-dimensional spaces.
A direct test would apply the same SNR derivation to non-linear probes or to multi-class settings to see whether analogous closed-form relations appear.
The theory offers a way to predict, before training, how much a change in probe direction will move its OOD performance.
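The first extension above, ranking candidate probes by MCS on a small held-out set, can be sketched as follows. It assumes MCS is taken with respect to the total sample covariance of the held-out activations, in which case it equals the Pearson correlation between the two probes' projections and the covariance matrix never needs to be formed. Whether the paper uses total or within-class covariance is not settled by this page, and all names here are hypothetical.

```python
import numpy as np

def mcs_from_projections(X, w1, w2) -> float:
    """MCS under the sample covariance of X, computed as corr(X @ w1, X @ w2)."""
    return float(np.corrcoef(X @ w1, X @ w2)[0, 1])

def rank_probes(X, candidates: dict, w_ref) -> list:
    """Sort candidate probe directions by MCS to the reference probe, best first."""
    scores = {name: mcs_from_projections(X, w, w_ref) for name, w in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage on synthetic held-out activations with anisotropic variance.
rng = np.random.default_rng(0)
X_heldout = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 0.5, 0.1])
w_ref = rng.normal(size=6)  # stand-in for a reference probe trained on OOD data
candidates = {
    "close": w_ref + 0.1 * rng.normal(size=6),
    "far": rng.normal(size=6),
}
for name, score in rank_probes(X_heldout, candidates, w_ref):
    print(f"{name}: MCS = {score:.3f}")
```

Working from projections keeps the cost at two matrix-vector products per pair, which matters when the activation dimension is large.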

Load-bearing premise

The projections of the data onto the probe directions are Gaussian distributed and the two classes are balanced in size.

What would settle it

An experiment on data whose projections onto the probe directions are visibly non-Gaussian or whose class sizes are markedly unequal, showing that the linear correlation between MCS and OOD AUROC drops substantially.

Figures

Figures reproduced from arXiv: 2606.19603 by Nikolaus Kriegeskorte, Peter Hase, Zhuofan Josh Ying.

**Figure 1.** Mahalanobis cosine similarity (MCS) linearly tracks generalization performance. (a) AUROC is a near-linear function of MCS across heterogeneous tasks. (b–c) The generalization AUROC heatmap and the MCS heatmap share structure almost entry-for-entry. Reproduced from Ying et al. (2026). Adjacent table (linear-fit R^2, MCS vs. ECS): Llama-70B, L33, truth: 0.980 vs. 0.441; layers (Llama-70B, truth): layer 20: 0.990 vs. 0.628; layer 50: 0.96…

**Figure 2.** Theory predicts empirical data without free parameters. Across panels, empirical points largely lie on the theory prediction. (a) AUROC–SNR shows Lemma 2. (b) MCS–SNR shows Theorem 1. (c) Eliminating SNR, AUROC–MCS shows a near-straight line that bends only in the top-right corner, matching the empirical data. This is much weaker than joint Gaussianity of X, and plausible even for non-Gaussian distribution…

**Figure 3.** Failure modes. Each panel illustrates a violation of an assumption in §3, and the linearity breaks. diffmean probe gives a markedly lower R^2 of 0.79. This delimits the law: it predicts generalization for Fisher-style probes (LR, LDA, shrinkage variants), not for diffmean-style probes. (c) Small Fisher distance. For small z_max, the slope of the MCS formula does not saturate, so each task is in its own near…

**Figure 4.** Cross-domain generalization performance for all eight conditions across models, layers, and concept domains. We observe rich cross-domain generalization patterns across all eight conditions.

**Figure 5.** MCS and ECS against AUROC across conditions. We observe a strong linear relationship between MCS and AUROC for all eight conditions across models, layers, and concept domains, while the relationship between ECS and AUROC is much weaker.

**Figure 6.** Empirical verification of the theory. The theory predicts the empirical data well across conditions. [Adjacent plot, "Gaussianity of All Directions Across All Conditions": histograms (y-axis: count) of sample skewness (median absolute value 0.30, 73% within ±0.5) and sample excess kurtosis (median absolute value 0.40, 79% within ±1.0).]

**Figure 7.** Most directions are largely Gaussian across all conditions. Across all conditions, all probe directions, and all test data distributions, most samples are largely Gaussian. Some samples have notably high kurtosis, which is mostly attributed to the sycophantic lying dataset. [Adjacent plot: OOD AUROC against SNR (s), with panels "Gaussian (proj-Gauss holds)" and "Lapla…".]

**Figure 8.** The strong linearity between AUROC and MCS still holds for non-Gaussian distributions. On deliberately constructed distributions where the projection-Gaussianity assumption is broken, the relationship between AUROC and MCS is still linear.

**Figure 9.** Empirical slopes are near the theoretical central slope. Linear-fit slope of OOD AUROC against MCS_{Σ_tot}, with 95% bootstrap CIs, across all eight conditions. The dashed line marks the universal central slope 1/√π ≈ 0.564 predicted by the theory (App. G). Empirical estimates are close to this value, with most of them slightly below it. This is consistent with data sampling some of the saturation tail whe…

**Figure 10.** MC(w_id^LR, w_ood^LR) vs. MC(w_id^LR, w_ood^LDA). The two MCs correlate strongly across all eight conditions, explaining why substituting the LDA direction with the LR direction in our empirical experiments does not affect the observed linearity. r > 0.99, with mean |Δ(w_id)| < 0.07. Substituting w_ood^LR for w_ood^LDA in the headline regression of §2 changes the linear-fit R^2 by less than 1%. Why this…

Original abstract

Linear probes are widely used in interpretability research and often compared by cosine similarity. The Mahalanobis cosine similarity (MCS) between two directions, which reweights the inner product by test data covariance, is a natural task-aware refinement. Ying et al. (2026) report that a probe's MCS to a reference probe trained on the out-of-distribution (OOD) data near-perfectly linearly predicts the probe's OOD AUROC (R^2 = 0.98). Here, we extend this empirical finding across models, layers, and concept domains, and prove this general phenomenon in closed form: For balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linear because both are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data. The theory also predicts when this linearity fails, which we verify empirically. MCS offers a theoretically grounded and empirically effective alternative to Euclidean cosine similarity for comparing linear probes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives that OOD AUROC and MCS are both sigmoids of SNR under balanced Gaussian projections, which accounts for their linearity.

The main thing to know is that this paper supplies a closed-form derivation showing why Mahalanobis cosine similarity to a reference probe linearly tracks OOD AUROC. For balanced classes with Gaussian projections onto the probe directions, both quantities are sigmoid functions of the same signal-to-noise ratio on the test data, so linearity follows directly.

The closed-form step is the actual advance over the prior empirical R^2=0.98 result. The authors also spell out the conditions where the relationship should break and check those cases empirically, which adds value. Extending the observation across models, layers, and domains is straightforward but helpful for scope.

The assumptions are stated plainly (Gaussian projections and class balance), so the claim stays within bounds. Those conditions are strong, but the paper flags the failure regimes rather than overclaiming. A referee would still want the full derivation steps to verify that the algebra has no slips, though the stress test indicates that the logic holds without internal gaps.

This is for researchers who already use linear probes for interpretability and want a task-aware alternative to plain cosine similarity. It gives a principled reason to prefer MCS when the assumptions are reasonable. Readers outside that niche will not get much from it.

I would send it to peer review. The theoretical grounding is modest but exact, the empirical checks are targeted, and the scope is narrow enough that a serious referee can evaluate it quickly.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that for balanced classes with Gaussian projections onto probe directions, OOD AUROC and Mahalanobis cosine similarity (MCS) to a reference probe are linearly related because both quantities are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data. It extends the empirical observation from Ying et al. (2026) of near-perfect linearity (R²=0.98) across models, layers, and concept domains, supplies a closed-form derivation under the stated assumptions, identifies regimes where linearity is predicted to break, and verifies those predictions empirically. MCS is positioned as a theoretically grounded alternative to Euclidean cosine similarity for comparing linear probes.

Significance. If the result holds, the work supplies a mechanistic, parameter-free explanation for the high observed correlation between MCS and OOD performance, thereby strengthening the case for task-aware similarity measures in interpretability research. The explicit closed-form derivation, the absence of fitted parameters or invented entities, and the empirical verification of predicted failure regimes are concrete strengths that make the central claim falsifiable and reproducible.

major comments (1)

[Theoretical derivation (likely §3 or §4)] The central claim rests on the closed-form demonstration that both AUROC and MCS reduce to sigmoid functions of the same SNR quantity under Gaussian projections and class balance. The manuscript states this derivation explicitly, but the algebraic steps connecting the Gaussian assumption to the sigmoid form should be presented with equation numbers so that readers can verify the reduction without ambiguity.

minor comments (2)

[References] The abstract and introduction cite Ying et al. (2026) for the original R²=0.98 result; the reference list should contain the full bibliographic entry.
[Discussion or Limitations] While the paper states the assumptions of balanced classes and Gaussian projections, a short paragraph discussing the robustness of the linearity result to modest violations of these assumptions (beyond the already-reported empirical checks) would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the single major comment below.

Referee: [Theoretical derivation (likely §3 or §4)] The central claim rests on the closed-form demonstration that both AUROC and MCS reduce to sigmoid functions of the same SNR quantity under Gaussian projections and class balance. The manuscript states this derivation explicitly, but the algebraic steps connecting the Gaussian assumption to the sigmoid form should be presented with equation numbers so that readers can verify the reduction without ambiguity.

Authors: We agree that numbering the equations will improve verifiability. In the revised manuscript we will assign equation numbers to each algebraic step in the derivation that reduces the Gaussian class-conditional projections and class balance to the sigmoid forms of AUROC and MCS as functions of SNR. (Revision: yes.)

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained mathematical identity.

The paper states that under the explicit assumptions of balanced classes and Gaussian projections, both OOD AUROC and MCS are sigmoid functions of the same SNR quantity on test data, which algebraically implies their linear relationship. This is presented as a closed-form derivation with stated regimes of validity, without any fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claim to unverified inputs. The prior empirical report (Ying et al. 2026) is cited only as motivation; the linearity proof stands independently on the SNR dependence. No steps reduce by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions required for the closed-form result; no free parameters or new entities are introduced.

axioms (2)

Domain assumption: Projections of the data onto the probe directions are Gaussian distributed.
Explicitly required for both AUROC and MCS to be sigmoid functions of SNR.
Domain assumption: The two classes are balanced.
Required for the linearity between AUROC and MCS to hold exactly.

Reference graph

Works this paper leans on

106 extracted references · 19 canonical work pages · 8 internal anchors

[1]

First Conference on Language Modeling , year=

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets , author=. First Conference on Language Modeling , year=
[2]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

The Internal State of an LLM Knows When It’s Lying , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

2023
[3]

2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) , pages=

On the Universal Truthfulness Hyperplane Inside LLMs , author=. 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) , pages=. 2024 , organization=

2024
[4]

The Neuroscientist , volume=

How the brain shapes deception: An integrated review of the literature , author=. The Neuroscientist , volume=. 2011 , publisher=

2011
[5]

Nature , volume=

Detecting hallucinations in large language models using semantic entropy , author=. Nature , volume=. 2024 , publisher=

2024
[6]

Forty-second International Conference on Machine Learning , year=

Detecting strategic deception with linear probes , author=. Forty-second International Conference on Machine Learning , year=
[7]

arXiv preprint arXiv:2507.12691 , year=

Benchmarking deception probes via black-to-white performance boosts , author=. arXiv preprint arXiv:2507.12691 , year=

work page arXiv
[8]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

The Curious Case of Hallucinatory (Un) answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[9]

The Eleventh International Conference on Learning Representations , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. The Eleventh International Conference on Learning Representations , year=
[10]

ICLR 2024 Workshop on Large Language Model (LLM) Agents , year=

Large Language Models can Strategically Deceive their Users when Put Under Pressure , author=. ICLR 2024 Workshop on Large Language Model (LLM) Agents , year=

2024
[11]

arXiv preprint arXiv:2410.21514 , year=

Sabotage evaluations for frontier models , author=. arXiv preprint arXiv:2410.21514 , year=

work page arXiv
[12]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Mixtral of Experts

Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
[15]

Findings of the association for computational linguistics: ACL 2023 , pages=

Discovering language model behaviors with model-written evaluations , author=. Findings of the association for computational linguistics: ACL 2023 , pages=

2023
[16]

arXiv preprint arXiv:2509.21305 , year=

Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs , author=. arXiv preprint arXiv:2509.21305 , year=

work page arXiv
[17]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2009
[18]

The Twelfth International Conference on Learning Representations , year=

Towards Understanding Sycophancy in Language Models , author=. The Twelfth International Conference on Learning Representations , year=
[19]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma 2: Improving open language models at a practical size , author=. arXiv preprint arXiv:2408.00118 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the National Academy of Sciences , volume=

Deception abilities emerged in large language models , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

2024
[21]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

International conference on machine learning , pages=

Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[23]

CoRR , year=

Representation Engineering: A Top-Down Approach to AI Transparency , author=. CoRR , year=
[24]

Steering Language Models With Activation Engineering

Steering language models with activation engineering , author=. arXiv preprint arXiv:2308.10248 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

arXiv preprint arXiv:2510.14318 , year =

Marwa Abdulhai and Ryan Cheng and Aryansh Shrivastava and Natasha Jaques and Yarin Gal and Sergey Levine , title =. arXiv preprint arXiv:2510.14318 , year =

work page arXiv
[26]

Advances in Neural Information Processing Systems , volume=

Truth is universal: Robust detection of lies in llms , author=. Advances in Neural Information Processing Systems , volume=
[27]

Advances in Neural Information Processing Systems , volume=

Inference-time intervention: Eliciting truthful answers from a language model , author=. Advances in Neural Information Processing Systems , volume=
[28]

2025 , eprint=

Preference Learning with Lie Detectors can Induce Honesty or Evasion , author=. 2025 , eprint=

2025
[29]

Cerebral cortex , volume=

Neural correlates of different types of deception: an fMRI investigation , author=. Cerebral cortex , volume=. 2003 , publisher=

2003
[30]

arXiv preprint arXiv:2511.16035 , year=

Liars' Bench: Evaluating Lie Detectors for Language Models , author=. arXiv preprint arXiv:2511.16035 , year=

work page arXiv
[31]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
[32]

Simple synthetic data reduces sycophancy in large language models

Simple synthetic data reduces sycophancy in large language models , author=. arXiv preprint arXiv:2308.03958 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

CoRR , year=

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems , author=. CoRR , year=
[34]

International Conference on Machine Learning , pages=

Linear Adversarial Concept Erasure , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[35]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[36]

Advances in Neural Information Processing Systems , volume=

Leace: Perfect linear concept erasure in closed form , author=. Advances in Neural Information Processing Systems , volume=
[37]

ICLR , year=

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations , author=. ICLR , year=
[38]

Philosophical Studies , year=

Still no lie detector for language models: probing empirical and conceptual roadblocks , author=. Philosophical Studies , year=
[39]

ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

The Geometries of Truth Are Orthogonal Across Tasks , author=. ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

2025
[40]

International Conference on Learning Representations , year=

Aligning AI With Shared Human Values , author=. International Conference on Learning Representations , year=
[41]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Emergence of Linear Truth Encodings in Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[42]

Proceedings of the 41st International Conference on Machine Learning , pages=

Representation surgery: theory and practice of affine steering , author=. Proceedings of the 41st International Conference on Machine Learning , pages=
[43]

Advances in Neural Information Processing Systems , volume=

Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting , author=. Advances in Neural Information Processing Systems , volume=
[44]

arXiv preprint arXiv:2509.07968 , year=

Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge , author=. arXiv preprint arXiv:2509.07968 , year=

work page arXiv
[45]

Proceedings of the ACM on Web Conference 2025 , pages=

Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories , author=. Proceedings of the ACM on Web Conference 2025 , pages=

2025
[46]

Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations , pages=

Transformers: State-of-the-art natural language processing , author=. Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations , pages=

2020
[47]

The Thirteenth International Conference on Learning Representations , year=

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals , author=. The Thirteenth International Conference on Learning Representations , year=
[48]

Findings of the Association for Computational Linguistics: ACL 2024 , pages=

Do androids know they’re only dreaming of electric sheep? , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

2024
[49]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

When Truthful Representations Flip Under Deceptive Instructions? , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[50]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Steering llama 2 via contrastive activation addition , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[51]

arXiv preprint arXiv:2506.00823 , year=

Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks , author=. arXiv preprint arXiv:2506.00823 , year=

work page arXiv
[52]

arXiv preprint arXiv:2509.23024 , year=

Tracing the representation geometry of language models from pretraining to post-training , author=. arXiv preprint arXiv:2509.23024 , year=

work page arXiv
[53]

Transactions of the association for computational linguistics , volume=

A primer in BERTology: What we know about how BERT works , author=. Transactions of the association for computational linguistics , volume=
[54]

Computational Linguistics , volume=

Probing classifiers: Promises, shortcomings, and advances , author=. Computational Linguistics , volume=
[55]

Language models as knowledge bases? , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages=

2019
[56]

How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages=

2019
[57]

Intrinsic dimensionality explains the effectiveness of language model fine-tuning , author=. Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers) , pages=
[58]

Advances in Neural Information Processing Systems , volume=

The geometry of hidden representations of large transformer models , author=. Advances in Neural Information Processing Systems , volume=
[59]

Neurons, Behavior, Data analysis, and Theory , volume=

Comparing representational geometries using whitened unbiased-distance-matrix similarity , author=. Neurons, Behavior, Data analysis, and Theory , volume=. 2021 , publisher=

2021
[60]

International conference on machine learning , pages=

Leep: A new measure to evaluate transferability of learned representations , author=. International conference on machine learning , pages=. 2020 , organization=

2020
[61]

International conference on machine learning , pages=

Logme: Practical assessment of pre-trained models for transfer learning , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[62]

Advances in Neural Information Processing Systems , volume=

Refusal in language models is mediated by a single direction , author=. Advances in Neural Information Processing Systems , volume=
[63]

Second Conference on Language Modeling , year=

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence , author=. Second Conference on Language Modeling , year=
[64]

the Journal of machine Learning research , volume=

Scikit-learn: Machine learning in Python , author=. the Journal of machine Learning research , volume=. 2011 , publisher=

2011
[65]

arXiv preprint arXiv:2310.18512 , year=

Preventing language models from hiding their reasoning , author=. arXiv preprint arXiv:2310.18512 , year=

work page arXiv
[66]

arXiv preprint arXiv:2602.20273 , year=

The Truthfulness Spectrum Hypothesis , author=. arXiv preprint arXiv:2602.20273 , year=

work page arXiv
[67]

Understanding intermediate layers using linear classifier probes

Understanding intermediate layers using linear classifier probes , author=. arXiv preprint arXiv:1610.01644 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[68]

2004 , publisher=

Discriminant analysis and statistical pattern recognition , author=. 2004 , publisher=

2004
[69]

International Conference on Computer Vision Systems , pages=

The CSU face identification evaluation system: its purpose, features, and structure , author=. International Conference on Computer Vision Systems , pages=. 2003 , organization=

2003
[70]

Annals of eugenics , volume=

The use of multiple measurements in taxonomic problems , author=. Annals of eugenics , volume=. 1936 , publisher=

1936
[71]

1966 , publisher=

Signal detection theory and psychophysics , author=. 1966 , publisher=

1966
[72]

, author=

The meaning and use of the area under a receiver operating characteristic (ROC) curve. , author=. Radiology , volume=
[73]

2018 , publisher=

Introduction to multivariate analysis , author=. 2018 , publisher=

2018
[74]

2009 , publisher=

The elements of statistical learning: data mining, inference, and prediction , author=. 2009 , publisher=

2009
[75]

Designing and interpreting probes with control tasks , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) , pages=

2019
[76]

Metabolites , volume=

Extraction and integration of genetic networks from short-profile omic data sets , author=. Metabolites , volume=. 2020 , publisher=

2020
[77]

2019 IEEE international conference on image processing (ICIP) , pages=

An information-theoretic approach to transferability in task transfer learning , author=. 2019 IEEE international conference on image processing (ICIP) , pages=. 2019 , organization=

2019
[78]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Transferability estimation using bhattacharyya class separability , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[79] How far pre-trained models are from neural collapse on the target dataset informs their transferability. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[80] Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Showing first 80 references.
