Recognition: unknown
Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
Pith reviewed 2026-05-10 13:46 UTC · model grok-4.3
The pith
Multi-layer ensembles of linear probes detect when language models know they are wrong, and accuracy improves with model size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linear probes detect when language models produce outputs they know are wrong. Single-layer probes are fragile because the best layer varies across models and tasks, but multi-layer ensembles recover strong performance, improving AUROC by 29 percent on Insider Trading and 78 percent on Harm-Pressure Knowledge. Across twelve models from 0.5B to 176B parameters, probe accuracy improves with scale at roughly 5 percent AUROC per 10x increase in size. The deception directions rotate gradually across layers rather than appearing at one location, which explains both why single-layer probes are brittle and why multi-layer ensembles succeed.
What carries the argument
The gradual rotation of deception directions through successive layers, captured by ensembling linear probes trained on activations from multiple layers.
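As a concrete picture of this mechanism, the sketch below trains one logistic-regression probe per layer and averages the per-layer scores into an equal-weight ensemble. It is a hypothetical illustration on synthetic activations with a slowly rotating planted direction, not the paper's code; the layer count, dimensions, signal strength, and selection shortcut are placeholders.

```python
# Hypothetical sketch (not the paper's code): one linear probe per layer,
# with per-layer scores averaged into a multi-layer ensemble.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_layers, n_train, n_test, d = 24, 300, 300, 64
labels_tr = rng.integers(0, 2, size=n_train)
labels_te = rng.integers(0, 2, size=n_test)

# Synthetic activations in which the "deception direction" rotates slowly
# across layers, mimicking the geometry the paper describes.
dirs = np.empty((n_layers, d))
dirs[0] = rng.normal(size=d)
for l in range(1, n_layers):
    dirs[l] = dirs[l - 1] + 0.5 * rng.normal(size=d)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

def make_acts(labels):
    noise = rng.normal(size=(n_layers, len(labels), d))
    signal = (labels[None, :, None] * 2 - 1) * dirs[:, None, :]
    return noise + 0.6 * signal

acts_tr, acts_te = make_acts(labels_tr), make_acts(labels_te)

# Train one probe per layer, then score held-out examples per layer.
probes = [LogisticRegression(max_iter=1000).fit(acts_tr[l], labels_tr)
          for l in range(n_layers)]
scores = np.stack([p.predict_proba(acts_te[l])[:, 1]
                   for l, p in enumerate(probes)])

# Best single layer (selected on the same split for brevity) vs. the ensemble.
best_single = max(roc_auc_score(labels_te, s) for s in scores)
ensemble = roc_auc_score(labels_te, scores.mean(axis=0))
print(f"best single-layer AUROC: {best_single:.3f}  ensemble AUROC: {ensemble:.3f}")
```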
If this is right
- Larger models will support more accurate detection of internal inconsistencies with the same probe methods.
- Reliable deception monitoring will require ensembling across layers rather than selecting any single layer.
- Single-layer probing alone is insufficient for consistent performance across different deception types.
- Internal monitoring capabilities should continue to strengthen as models increase in size.
Where Pith is reading between the lines
- The same multi-layer approach could extend to monitoring other internal states such as factual errors or hallucinations.
- Layer-agnostic ensembles may simplify safety monitoring in deployed systems without retraining the base model.
- The rotation pattern suggests knowledge is distributed sequentially, which could guide where to place additional monitoring heads.
Load-bearing premise
The linear probes are measuring genuine internal knowledge of incorrect outputs rather than surface-level statistical patterns in the chosen test tasks.
What would settle it
A new deception detection task where the multi-layer ensemble produces no AUROC gain over the best single layer, or a model larger than 176B parameters where probe accuracy stops increasing with size.
Original abstract
Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B--176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that linear probes on language model activations can detect when models produce outputs they 'know' are wrong. Single-layer probes are brittle (the best layer varies by model and task, and probes fail entirely on some deception types), but ensembling probes across multiple layers recovers performance with large AUROC gains (+29% on Insider Trading, +78% on Harm-Pressure Knowledge). Across 12 models (0.5B–176B parameters), probe accuracy scales with size at ~5% AUROC per 10× parameters (R=0.81). The authors explain this geometrically via gradual rotation of deception directions across layers rather than a single location.
Significance. If the empirical results hold after controls and details are added, the work would be significant for mechanistic interpretability and AI safety: it provides a practical ensembling method to improve detection of internal model knowledge and documents a scaling trend that could inform future probe-based monitoring. The geometric rotation account offers a coherent explanation for why single-layer probes fail and ensembles succeed. These are concrete, falsifiable observations rather than parameter-free derivations.
major comments (2)
- [Abstract and Results] The reported AUROC improvements (+29% and +78%) and scaling relation (~5% per 10×, R=0.81) are central claims, yet the manuscript provides no methodological details on probe training (e.g., regularization, loss, optimization), layer selection procedure, data splits, or statistical controls. This absence is load-bearing because it prevents assessment of whether the gains are robust or driven by post-hoc choices.
- [Abstract and Results] The interpretation that probes detect genuine internal representations of incorrect knowledge (rather than surface-level lexical or statistical patterns in the Insider Trading and Harm-Pressure Knowledge benchmarks) is load-bearing for both the scaling and ensembling claims, but no negative controls are described—such as label-shuffled baselines, probes on non-deception variants of the same data, or unrelated feature probes. Without these, the geometric rotation story cannot be distinguished from task-specific correlations that vary across layers.
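A minimal sketch of one statistical control the first major comment asks for: a percentile-bootstrap confidence interval on the ensemble-minus-single-layer AUROC gain. This is an illustration under assumed inputs, not the authors' procedure; the data in the usage example is synthetic.

```python
# Hypothetical illustration, not the authors' code: bootstrap a 95% CI
# for the AUROC gain of the multi-layer ensemble over the best single layer.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_gain(labels, single_scores, ensemble_scores,
                         n_boot=2000, seed=0):
    """Percentile bootstrap CI for AUROC(ensemble) - AUROC(best single layer)."""
    rng = np.random.default_rng(seed)
    n, gains = len(labels), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample test examples
        if labels[idx].min() == labels[idx].max():   # skip resamples with one class
            continue
        gains.append(roc_auc_score(labels[idx], ensemble_scores[idx])
                     - roc_auc_score(labels[idx], single_scores[idx]))
    return np.percentile(gains, [2.5, 97.5])

# Toy usage with random scores (a real run would pass held-out test scores).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
lo, hi = bootstrap_auroc_gain(y, rng.random(300), rng.random(300))
print(f"95% CI for AUROC gain: [{lo:.3f}, {hi:.3f}]")
```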
minor comments (2)
- [Abstract] The abstract states 'improving AUROC by +29%' without explicitly naming the baseline (presumably best single-layer probe); clarify this in the results section for precision.
- [Results] Clarify whether the R=0.81 correlation is computed on log(parameters) vs. AUROC and whether it aggregates all tasks or per-task; this affects interpretation of the scaling claim.
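To make both minor points concrete, here is a small sketch of the two quantities in question: the relative AUROC gain over the presumed baseline (the best single-layer probe) and the Pearson correlation between AUROC and log10 of parameter count. The numbers are hypothetical placeholders, not the paper's data.

```python
# Hypothetical numbers for illustration only; not taken from the paper.
import numpy as np

# Relative AUROC gain over the presumed baseline (best single-layer probe).
best_single, ensemble = 0.55, 0.71
rel_gain = (ensemble - best_single) / best_single
print(f"relative AUROC gain: {rel_gain:+.0%}")          # +29% for these values

# Pearson R between log10(parameter count) and AUROC across models.
params = np.array([0.5e9, 1.5e9, 7e9, 13e9, 70e9, 176e9])   # illustrative sizes
auroc = np.array([0.62, 0.66, 0.70, 0.71, 0.76, 0.78])      # illustrative AUROCs
r = np.corrcoef(np.log10(params), auroc)[0, 1]
print(f"Pearson R on log10(params) vs AUROC: {r:.2f}")
```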
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important gaps in methodological transparency and controls. We agree that these elements are necessary for assessing the robustness of the reported AUROC gains and scaling trends. Below we address each major comment and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract and Results] The reported AUROC improvements (+29% and +78%) and scaling relation (~5% per 10×, R=0.81) are central claims, yet the manuscript provides no methodological details on probe training (e.g., regularization, loss, optimization), layer selection procedure, data splits, or statistical controls. This absence is load-bearing because it prevents assessment of whether the gains are robust or driven by post-hoc choices.
Authors: We agree that the submitted manuscript omits key implementation details required for reproducibility and evaluation of robustness. In the revised version we will expand the Methods section (and add an appendix if needed) to specify: (i) probe training uses logistic regression with L2 regularization (C=1.0 by default, tuned on validation), cross-entropy loss, and L-BFGS optimization via scikit-learn; (ii) layer selection for single-layer probes is performed by choosing the layer with highest validation AUROC on a held-out split, while the ensemble uses all layers with equal weighting or learned weights; (iii) data splits are 70/15/15 train/validation/test with no overlap between examples; (iv) statistical controls include reporting mean AUROC and standard deviation across 5 random seeds, plus bootstrap confidence intervals. These additions will allow direct assessment of whether the +29% and +78% gains and the scaling slope are stable. revision: yes
- Referee: [Abstract and Results] The interpretation that probes detect genuine internal representations of incorrect knowledge (rather than surface-level lexical or statistical patterns in the Insider Trading and Harm-Pressure Knowledge benchmarks) is load-bearing for both the scaling and ensembling claims, but no negative controls are described—such as label-shuffled baselines, probes on non-deception variants of the same data, or unrelated feature probes. Without these, the geometric rotation story cannot be distinguished from task-specific correlations that vary across layers.
Authors: The referee correctly identifies that the current manuscript does not include explicit negative controls to isolate internal-knowledge signals from surface statistics. While the geometric rotation analysis (documented via cosine similarities between deception directions across layers) provides supporting evidence for the multi-layer ensemble benefit, it does not by itself rule out layer-varying lexical confounds. In the revision we will add: (a) label-shuffled baselines (expected AUROC near 0.5) for both single-layer and ensemble probes; (b) probes trained on non-deception variants of the same prompts (e.g., factual versions of Insider Trading statements); and (c) probes on unrelated features (e.g., sentiment or length) to demonstrate specificity. These controls will be reported in a new subsection and will either corroborate or qualify the internal-representation interpretation. revision: yes
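A minimal sketch of the label-shuffle control described in (a) above, on synthetic data with a planted signal direction; it is illustrative only, not the authors' implementation. A probe trained on permuted labels should land near chance (AUROC ≈ 0.5) on held-out data, while the real probe retains an advantage only if it is tracking genuine label structure.

```python
# Illustrative label-shuffle control (synthetic data, not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(400, 64))                       # stand-in activations
# A planted signal along one direction stands in for a real deception signal.
labels = (acts[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.3,
                                          random_state=0)

# Real probe vs. probe trained on permuted (shuffled) labels.
real = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
shuffled = LogisticRegression(max_iter=1000).fit(X_tr, rng.permutation(y_tr))

print("real probe AUROC    :", roc_auc_score(y_te, real.predict_proba(X_te)[:, 1]))
print("shuffled-label AUROC:", roc_auc_score(y_te, shuffled.predict_proba(X_te)[:, 1]))
```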
Circularity Check
Empirical measurements of probe scaling and ensembling with no circular derivations
full rationale
The paper presents direct experimental results: AUROC values from linear probes on deception tasks across 12 models (0.5B–176B), measured improvements from multi-layer ensembling (+29% and +78% on specific tasks), and an observed scaling trend of ~5% AUROC per 10x parameters with R=0.81. The geometric claim that deception directions rotate across layers is an interpretive summary of the layer-wise probe weight patterns, not a first-principles derivation or prediction that reduces to fitted inputs by construction. No equations, self-citations, or ansatzes are invoked to derive the reported quantities from themselves; all central claims are falsifiable empirical observations on held-out evaluation data.
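As an illustration of how that rotation claim would be quantified, the sketch below computes cosine similarities between adjacent layers' probe directions. The directions here are synthetic stand-ins (in practice they would be the learned probe weight vectors, e.g. each layer probe's coef_), so the printed values are not the paper's.

```python
# Synthetic illustration of the rotation measurement (not the paper's analysis).
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 24, 64

# Stand-in for per-layer probe weight vectors; each layer's direction drifts
# slightly from the previous one, mimicking gradual rotation with depth.
dirs = np.empty((n_layers, d))
dirs[0] = rng.normal(size=d)
for l in range(1, n_layers):
    dirs[l] = dirs[l - 1] + 0.3 * rng.normal(size=d)

unit = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

# Cosine similarity between consecutive layers: values noticeably below 1
# indicate rotation; a fixed direction would give ~1 at every depth.
adjacent_cos = np.sum(unit[:-1] * unit[1:], axis=1)
print(np.round(adjacent_cos, 2))
```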
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Linear probes trained on model activations can detect when the model internally represents that its output is wrong.
Reference graph
Works this paper leans on
- [1] Gerard Boxo, Aman Neelappa, and Shivam Raval. Towards mitigating information leakage when evaluating safety monitors. arXiv preprint arXiv:2509.21344, 2025.
- [2] Boyuan Chen, Sitong Fang, Jiaming Ji, Yanxu Zhu, Pengcheng Wen, Jinzhou Wu, Yingshui Tan, Boren Zheng, Mengying Yuan, Wenqi Chen, et al. AI deception: Risks, dynamics, and controls. arXiv preprint arXiv:2511.22619, 2025a; Jiahao Chen, Hang Zhao, Shuang Luo, Rui Xu, and Qingshan Sun. Detecting hallucination in large language models through deep intern...
- [3] Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Austin Cohen, Andy Dau, et al. Constitutional classifiers++: Efficient production-grade defenses against universal jailbreaks. arXiv preprint arXiv:2601.04603, 2026.
- [4] Joshua Engels, Isaac Liao, Eric J Michaud, Wes Gurnee, and Max Tegmark. Are sparse autoencoders useful? A case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025.
- [5] Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. Detecting strategic deception using linear probes. arXiv preprint arXiv:2502.03407, 2025.
- [6] Wes Gurnee and Max Tegmark. Universal neurons in GPT-2 language models. arXiv preprint arXiv:2401.12181, 2024.
- [7] János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Building production-ready probes for Gemini. arXiv preprint arXiv:2601.11516, 2026.
- [8] Kieron Kretschmar, Walter Laurito, Sharan Maiya, and Samuel Marks. Liars' Bench: Evaluating lie detectors for language models. arXiv preprint arXiv:2511.16035, 2025.
- [9] Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J Ritchie, Soren Mindermann, Ethan Perez, Kevin K Troy, and Evan Hubinger. Agentic misalignment: How LLMs could be insider threats. arXiv preprint arXiv:2510.05179, 2025.
- [10] Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks. AI deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752, 2023.
- [11] Lewis Smith, Bilal Chughtai, and Neel Nanda. Difficulties with evaluating a deception detector for AIs. arXiv preprint arXiv:2511.22662, 2025a; Lewis Smith, Sen Rajamanoharan, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, and Neel Nanda. Negative results for sparse autoencoders on downstream tasks. DeepMind Safety Research, 2025b.