Understanding intermediate layers using linear classifier probes

Guillaume Alain, Yoshua Bengio · 2016 · stat.ML · arXiv 1610.01644

Mixed citation behavior. The most common citation role is method (53% of classified citations).

141 Pith papers citing it

abstract

Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increases monotonically along the depth of the model.
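The probing idea in the abstract can be illustrated with a minimal sketch. It assumes a toy frozen network with fixed random weights rather than Inception v3 or Resnet-50, and the names `train_linear_probe` and `probe_accuracy` are hypothetical helpers introduced here, not code from the paper. Each probe is a softmax-regression classifier fitted on one layer's features while the network itself is left untouched.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.5, steps=300):
    """Fit a softmax-regression probe on frozen features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    """Fraction of examples the probe classifies correctly."""
    return float(np.mean((features @ W + b).argmax(axis=1) == labels))

# Toy frozen "model" (an assumption for illustration): two random ReLU layers.
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 2))
y = (x[:, 0] * x[:, 1] > 0).astype(int)  # XOR-like labels
W1 = rng.normal(size=(2, 32))
W2 = rng.normal(size=(32, 32))
h1 = np.maximum(x @ W1, 0.0)
h2 = np.maximum(h1 @ W2, 0.0)

# One independent probe per layer; the model weights are never updated.
accs = {}
for name, feats in {"input": x, "layer1": h1, "layer2": h2}.items():
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    W, b = train_linear_probe(feats, y, num_classes=2)
    accs[name] = probe_accuracy(feats, y, W, b)

for name, acc in accs.items():
    print(f"{name}: probe accuracy = {acc:.3f}")
```

Comparing the per-layer accuracies gives a rough picture of how linearly separable the classes are at each depth. This sketch measures accuracy on the training data only; a held-out split would be needed for a fair comparison.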

citation-role summary

method: 10 · background: 9

citation-polarity summary

use method: 10 · background: 8 · unclear: 1

authors

Guillaume Alain, Yoshua Bengio

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and repairable via label-free realignment.

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · novelty 8.0

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

Dissecting Jet-Tagger Through Mechanistic Interpretability

hep-ph · 2026-05-11 · accept · novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

Do Audio-Visual Large Language Models Really See and Hear?

cs.AI · 2026-04-03 · unverdicted · novelty 8.0

AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

What learning algorithm is in-context learning? Investigations with linear models

cs.LG · 2022-11-28 · accept · novelty 8.0

Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0 · 2 refs

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

Learning to Place Guards by Reinforcement: A Geo-Free Neural Policy for the Vertex-Guard Art Gallery Problem

cs.LG · 2026-06-19 · unverdicted · novelty 7.0

A reinforcement learning policy for the vertex-guard art gallery problem encodes sufficient geometric information in its encoder to allow a simple classifier to achieve high coverage feasibility out of distribution.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry

cs.LG · 2026-06-13 · conditional · novelty 7.0

PhiCalNet cuts object MAE from 14.54 mm to 4.46 mm on a 15,600-image synthetic long-range FPP benchmark by architecturally removing the shape-prior shortcut that baseline UNets exploit.

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

cs.LG · 2026-06-12 · unverdicted · novelty 7.0

Linear recoverability of transformer FFN blocks varies widely across depth, is learned during training, and is independent of the activation function.

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Sign patterns in the unrotated standard basis of transformer activations form independent binary feature registers that support training-free detection, prediction, and causal intervention across language, vision, and audio models.

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

TRL-Bench is a new multi-granular benchmark that releases 50 OpenML tables, linkage tasks, and a 47k-table data lake to show that tabular encoder performance is capability-specific rather than captured by one leaderboard.

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · novelty 7.0

SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.

Probing Spatial Structure in Pretrained Audio Representations

cs.SD · 2026-06-04 · unverdicted · novelty 7.0

Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

UWM-JEPA: Predictive World Models That Imagine in Belief Space

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.

The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Introduces the Perception-Physics Paradox and TC-Bench benchmark demonstrating that vision foundation models rely on visual shortcuts that fail in intense regimes rather than achieving scientific alignment via structural isomorphism.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

Markovian Circuit Tracing for Transformer State Dynamic

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.

citing papers explorer

Showing 21 of 21 citing papers after filters.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

cs.CV · 2026-07-02 · unverdicted · none · ref 2 · internal anchor

The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

cs.CV · 2026-06-04 · unverdicted · none · ref 1 · 2 links · internal anchor

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

cs.CV · 2026-05-19 · unverdicted · none · ref 1 · internal anchor

MAPS provides 2618 validated 3D meshes and a controllable rendering pipeline to attribute vision model recognition failures to specific scene parameters, finding camera distance and elevation as the dominant failure factors across 20 tested models.

Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

cs.CV · 2026-05-09 · unverdicted · none · ref 21 · internal anchor

A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.

Synthetic Designed Experiments for Diagnosing Vision Model Failure

cs.CV · 2026-03-30 · unverdicted · none · ref 1 · internal anchor

SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · none · ref 18

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

cs.CV · 2026-04-09 · unverdicted · none · ref 2

Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing on 14 models across 3 datasets.

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

cs.CV · 2026-06-02 · unverdicted · none · ref 2 · internal anchor

Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.

Detect Before You Leap: Mirage Detection in Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · none · ref 1 · internal anchor

TC-LIA detects mirage in VLMs via layer-wise image patch to question alignment in CLIP encoders, reaching 94.6-94.7% three-class accuracy and under 3% mirage rate across five domains and twelve backbones.

Do Vision Models Truly Forget? New Findings from Representation-Level Certification of Visual Unlearning in Vertical Federated Learning

cs.CV · 2026-05-19 · unverdicted · none · ref 1 · 2 links · internal anchor

Mirage auditing framework reveals that VFL unlearning methods passing output-level certification retain substantial class structure in representations, with no method achieving high utility plus both output and representation forgetting, plus class-sample asymmetry in residual traces.

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

cs.CV · 2026-05-14 · unverdicted · none · ref 47 · internal anchor

Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.

Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations

cs.CV · 2026-05-13 · unverdicted · none · ref 11 · internal anchor

VH-CBM uses a Gaussian process in VLM embedding space to propagate sparse human annotations and improve concept accuracy and calibration over pure VLM-guided concept bottleneck models.

AttriBE: Quantifying Attribute Expressivity in Body Embeddings for Recognition and Identification

cs.CV · 2026-04-29 · unverdicted · none · ref 30 · internal anchor

Transformer-based ReID embeddings encode BMI most strongly in deeper layers, followed by pitch, gender, and yaw, with pose peaking in middle layers and BMI increasing with depth; cross-spectral settings shift reliance toward structural cues.

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

cs.CV · 2026-04-16 · unverdicted · none · ref 1 · 2 links · internal anchor

DAMP performs one-shot class unlearning by depth-aware projection removal of forget-specific directions, producing forgetting behavior closer to retraining from scratch than prior methods on image classification tasks.

Intermediate Representations are Strong AI-Generated Image Detectors

cs.CV · 2026-05-05 · unverdicted · none · ref 1

Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.

GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

cs.CV · 2026-05-03 · unverdicted · none · ref 3

GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.

Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies

cs.CV · 2026-04-08 · unverdicted · none · ref 4

A method learns synthetic-to-real parameter corrections from source languages and transfers them to target languages without any real target data, improving HTR across five languages and six models.

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

cs.CV · 2026-06-08 · unverdicted · none · ref 2 · internal anchor

Video foundation models encode intuitive physics knowledge that is strongest in V-JEPA at intermediate-to-late layers and depends on pretraining type and probe design.

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

cs.CV · 2026-06-01 · unverdicted · none · ref 2 · internal anchor

S^3 extracts dominant shortcut directions from a linear forgery-method classifier using SVD and attenuates them in feature space to improve cross-method generalization in deepfake detection.

Information-Regularized Attention for Visual-Centric Reasoning

cs.CV · 2026-07-01 · unverdicted · none · ref 1 · internal anchor

IRA is a stochastic attention mechanism that regulates visual information injection in VLMs to yield smoother embedding trajectories and reduced attention sinks.

Early Warning Signals for OpenVLA Failure under Visual Distribution Shift

cs.CV · 2026-06-29 · conditional · none · ref 14 · internal anchor

OpenVLA layer-16 activations allow a logistic probe to predict failure within 15 steps under occlusion (AUROC 0.972) better than baselines, with some transfer to camera jitter.
