Fine-tuning updates frequently stale activation monitors for language model safety while quantization does not, with degradation predictable and repairable via label-free realignment.
super hub Mixed citations
Understanding intermediate layers using linear classifier probes
Mixed citation behavior. Most common role is method (53%).
abstract
Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increase monotonically along the depth of the model.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe exper
authors
co-cited works
representative citing papers
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
A reinforcement learning policy for the vertex-guard art gallery problem encodes sufficient geometric information in its encoder to allow a simple classifier to achieve high coverage feasibility out of distribution.
For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
PhiCalNet cuts object MAE from 14.54 mm to 4.46 mm on a 15,600-image synthetic long-range FPP benchmark by architecturally removing the shape-prior shortcut that baseline UNets exploit.
Linear recoverability of transformer FFN blocks varies widely across depth, is learned during training, and is independent of the activation function.
Sign patterns in the unrotated standard basis of transformer activations form independent binary feature registers that support training-free detection, prediction, and causal intervention across language, vision, and audio models.
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
TRL-Bench is a new multi-granular benchmark that releases 50 OpenML tables, linkage tasks, and a 47k-table data lake to show that tabular encoder performance is capability-specific rather than captured by one leaderboard.
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.
Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
Introduces the Perception-Physics Paradox and TC-Bench benchmark demonstrating that vision foundation models rely on visual shortcuts that fail in intense regimes rather than achieving scientific alignment via structural isomorphism.
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.
citing papers explorer
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
-
When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
-
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
A Mechanistic View of Authority Hierarchy in LLM Sycophancy
Authority sycophancy in LLMs is a layer-localized erasure of correct answer representations that scales with authority level and resists simple interventions.
-
Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
-
From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models
LLM representations encode essay quality in a linearly decodable form that emerges across layers and includes identifiable scoring neurons whose distribution shifts with essay length.
-
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
RLHF provides shallow alignment by inactivating partisan features and severing causal pathways in LLMs without erasing partisan geometry, as evidenced by sparse autoencoder analysis and steering experiments.
-
The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model
A linear probe trained on 190k congressional tweets identifies a partisan direction in Llama 3.1 8B layer 18 that can be causally ablated or amplified to reverse or shift the model's political output.
-
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
-
Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
LLMs detect CoT reasoning errors in hidden states with 0.95 AUROC but cannot use this awareness to correct them via steering, patching, or self-correction, indicating the signal is diagnostic not causal.
-
Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
-
Contextual Linear Activation Steering of Language Models
CLAS dynamically adapts linear activation steering strengths to context, outperforming fixed-strength steering and matching or exceeding ReFT and LoRA on eleven benchmarks across four model families with limited labeled data.
-
Lost in State Space: Probing Frozen Mamba Representations
Frozen Mamba patch-boundary readouts do not outperform mean pooling for sentence representations on SST-2, CoLA, MRPC, STS-B, and IMDb due to anisotropy (cosine similarity ~0.9999) and representational collapse (MCC=0 on CoLA).
-
ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference
ProbScale finds layer subsets in SLMs like RoBERTa-Large and T5-Base that cut parameters 5-10x while retaining 95-98% of original task performance by maximizing aggregated probe scores under a budget.
-
Probing for Reading Times
Early layers of language models predict early-pass human reading times better than surprisal, with surprisal superior for late-pass measures and strong variation by language.
-
Understanding Large Language Models
The paper reviews Transformer architecture, emergent LLM capabilities resembling cognition, explainable AI methods, and argues against both anthropomorphism and overly reductive views of LLM behavior as mere memorization.