LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
super hub Mixed citations
Understanding intermediate layers using linear classifier probes
Mixed citation behavior. Most common role is method (53%).
abstract
Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increase monotonically along the depth of the model.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe exper
authors
co-cited works
representative citing papers
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
Sign patterns in the unrotated standard basis of transformer activations form independent binary feature registers that support training-free detection, prediction, and causal intervention across language, vision, and audio models.
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
TRL-Bench is a new multi-granular benchmark that releases 50 OpenML tables, linkage tasks, and a 47k-table data lake to show that tabular encoder performance is capability-specific rather than captured by one leaderboard.
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.
Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
Introduces the Perception-Physics Paradox and TC-Bench benchmark demonstrating that vision foundation models rely on visual shortcuts that fail in intense regimes rather than achieving scientific alignment via structural isomorphism.
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.
MAPS provides 2618 validated 3D meshes and a controllable rendering pipeline to attribute vision model recognition failures to specific scene parameters, finding camera distance and elevation as the dominant failure factors across 20 tested models.
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
A formal construction and synthetic experiment demonstrate that proxy rankings of pretraining datasets can reverse OOD accuracy rankings.
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
citing papers explorer
-
When Does LeJEPA Learn a World Model?
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
-
Dissecting Jet-Tagger Through Mechanistic Interpretability
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
-
Do Audio-Visual Large Language Models Really See and Hear?
AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
-
What learning algorithm is in-context learning? Investigations with linear models
Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
-
Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
Sign patterns in the unrotated standard basis of transformer activations form independent binary feature registers that support training-free detection, prediction, and causal intervention across language, vision, and audio models.
-
When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
-
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
-
TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders
TRL-Bench is a new multi-granular benchmark that releases 50 OpenML tables, linkage tasks, and a 47k-table data lake to show that tabular encoder performance is capability-specific rather than captured by one leaderboard.
-
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
-
Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.
-
Probing Spatial Structure in Pretrained Audio Representations
Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.
-
Toward Calibrated, Fair, and accurate Deepfake Detection
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
-
UWM-JEPA: Predictive World Models That Imagine in Belief Space
UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
-
The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench
Introduces the Perception-Physics Paradox and TC-Bench benchmark demonstrating that vision foundation models rely on visual shortcuts that fail in intense regimes rather than achieving scientific alignment via structural isomorphism.
-
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
-
Markovian Circuit Tracing for Transformer State Dynamic
This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.
-
MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space
MAPS provides 2618 validated 3D meshes and a controllable rendering pipeline to attribute vision model recognition failures to specific scene parameters, finding camera distance and elevation as the dominant failure factors across 20 tested models.
-
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
-
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
-
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
-
A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup
A formal construction and synthetic experiment demonstrate that proxy rankings of pretraining datasets can reverse OOD accuracy rankings.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
From Mechanistic to Compositional Interpretability
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.
-
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
-
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
Synthetic Designed Experiments for Diagnosing Vision Model Failure
SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.
-
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Understanding intermediate layers using linear classifier probes
Linear probes demonstrate that feature separability for classification increases monotonically with network depth in Inception v3 and ResNet-50.
-
Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
LUMINA: A Grid Foundation Model for Benchmarking AC Optimal Power Flow Surrogate Learning
LUMINA-Bench is a standardized evaluation framework for ACOPF surrogate models that tests generalization across multiple grid topologies using accuracy and physics-constraint metrics.
-
Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations
Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.
-
Knowing when to trust machine-learned interatomic potentials
PROBE recasts MLIP uncertainty quantification as selective classification by training a compact discriminative classifier on frozen per-atom backbone embeddings, yielding a reliability probability that tracks actual error better than ensemble disagreement.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing on 14 models across 3 datasets.
-
Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism
LLMs show three distinct non-sycophantic responses to science skepticism, with robustness in some cases being accidental because the model does not represent the skepticism signal, as determined by linear probes on three models in three domains.
-
Aionoscope: Debugging Latent-State Accessibility in Time-Series Representations
Aionoscope shows that time-series representations recover coarse signal types reliably but expose dense latent states like phase and amplitude much less reliably, with best dense-probe R² at 0.689 versus oracle 0.999.
-
Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination
Graph-PRefLexOR fine-tunes graph-native models with GRPO to organize reasoning into phases, yielding 40-65% gains in traceable hypothesis generation and 2-3x semantic diversity on 100 materials science questions.
-
A Mechanistic View of Authority Hierarchy in LLM Sycophancy
Authority sycophancy in LLMs is a layer-localized erasure of correct answer representations that scales with authority level and resists simple interventions.
-
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
RLHF provides shallow alignment by inactivating partisan features and severing causal pathways in LLMs without erasing partisan geometry, as evidenced by sparse autoencoder analysis and steering experiments.
-
Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs
Adversarial fine-tuning evades activation-based steganography detection in five LLMs while preserving secret recovery, but a recontextualization dataset restores both ridge and MLP probe detectability.
-
One Lens, Many Worlds : A Capability-Typed Interface for World-Model Interpretability
WorldModelLens defines a typed adapter with four core methods and a capability descriptor to unify interpretability tooling across diverse world model architectures.
-
The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model
A linear probe trained on 190k congressional tweets identifies a partisan direction in Llama 3.1 8B layer 18 that can be causally ablated or amplified to reverse or shift the model's political output.
-
AEGIS: A Backup Reflex for Physical AI
AEGIS uses activation probes for early-warning detection of high-risk steps in weak policies and selectively escalates to stronger policies, recovering 10.1% of lost trajectories on LIBERO-Spatial while activating the strong policy on only 38% of steps.
-
When Does Structure Help? The Information Bonus of AlphaFold2 Representations over Protein Language Models
AlphaFold2 representations outperform ESM-2 only for allosteric site classification while ESM-2 wins on binding affinity and flexibility, with a residue-level leakage artifact that can reverse rankings.