Probing Classifiers: Promises, Shortcomings, and Advances

Yonatan Belinkov · 2022 · cs.CL · DOI 10.1162/coli_a_00422 · arXiv 2102.12452

Mixed citation behavior. The most common role is background (62% of classified citations).

28 Pith papers citing it

130 external citations (Crossref)

abstract

Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple -- a classifier is trained to predict some linguistic property from a model's representations -- and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This article critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.
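The basic idea in the abstract (train a classifier to predict a property from a model's representations) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the "representations" are synthetic vectors rather than activations from a real model, the property is a hypothetical binary label encoded along one dimension, and the probe is a plain logistic regression. The function names are made up for this example and are not taken from the article.

```python
import math
import random


def make_synthetic_representations(n, dim, seed=0):
    """Hypothetical stand-in for frozen model representations.

    A binary property is encoded (noisily) along dimension 0;
    all other dimensions are pure Gaussian noise.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 2.0 if label == 1 else -2.0
        data.append((vec, label))
    return data


def train_linear_probe(data, dim, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain SGD.

    The representations are treated as fixed inputs; only the
    probe's own weights and bias are updated.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss w.r.t. z
            for i in range(dim):
                w[i] -= lr * grad * vec[i]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose property the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        if (z > 0) == (label == 1):
            correct += 1
    return correct / len(data)


# Train on one synthetic sample and evaluate on a held-out one.
train_set = make_synthetic_representations(400, 8, seed=0)
test_set = make_synthetic_representations(200, 8, seed=1)
weights, bias = train_linear_probe(train_set, dim=8)
print("held-out probe accuracy:", probe_accuracy(weights, bias, test_set))
```

High held-out accuracy in a sketch like this shows only that the property is decodable from the vectors. As the abstract notes, the approach has methodological limitations, so such a number should not be read on its own as evidence about what a model uses.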

citation-role summary

background: 8 · method: 3 · baseline: 1 · dataset: 1

citation-polarity summary

background: 8 · use method: 3 · baseline: 1 · use dataset: 1

representative citing papers

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

Locating and Editing Factual Associations in GPT

cs.CL · 2022-02-10 · accept · novelty 8.0

Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

Scoring mechanism determines the layer at which encoder-based models consolidate authorship signals, with mean pooling acting early and late interaction deferring to later layers.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

KamonBench is a grammar-based dataset of 20,000 synthetic Japanese crests with multi-format annotations that enables direct evaluation of factor recovery beyond caption accuracy in vision-language models.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

What Do EEG Foundation Models Capture from Human Brain Signals?

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

EEG foundation models encode 68.6% of a 63-feature clinical lexicon in a representation-causal way, with frequency-domain features dominant; these recover 79.3% of the models' advantage over random baselines on average.

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Transformer represents but does not causally transmit staged algorithmic intermediates for base-digit extraction, diverging from probe predictions.

Language-Switching Triggers Take a Latent Detour Through Language Models

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Researchers identify and decompose a language-switching backdoor circuit in an autoregressive LM into early attention composition, mid-layer orthogonal propagation, and final MLP conversion.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and 14% lower error at K=16 interactions.

Instructions Shape Production of Language, not Processing

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.

Conceptors for Semantic Steering

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional subspaces.

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.

Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.

Architecture Determines Observability of Transformers

cs.LG · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.

Prophecy: Inferring Formal Properties from Neuron Activations

cs.LG · 2025-09-25 · unverdicted · novelty 6.0

Prophecy infers formal properties of feed-forward neural networks by extracting rules from neuron activation patterns that imply desirable output behaviors.

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

cs.AI · 2023-10-10 · unverdicted · novelty 6.0

At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

citing papers explorer

Showing 28 of 28 citing papers.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
Locating and Editing Factual Associations in GPT cs.CL · 2022-02-10 · accept · none · ref 4 · internal anchor
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
Where Does Authorship Signal Emerge in Encoder-Based Language Models? cs.CL · 2026-05-19 · unverdicted · none · ref 10 · internal anchor
Scoring mechanism determines the layer at which encoder-based models consolidate authorship signals, with mean pooling acting early and late interaction deferring to later layers.
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition cs.LG · 2026-05-14 · unverdicted · none · ref 26 · internal anchor
QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.
KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 2 · 2 links · internal anchor
KamonBench is a grammar-based dataset of 20,000 synthetic Japanese crests with multi-format annotations that enables direct evaluation of factor recovery beyond caption accuracy in vision-language models.
Deep Minds and Shallow Probes cs.LG · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
What Do EEG Foundation Models Capture from Human Brain Signals? cs.AI · 2026-05-12 · unverdicted · none · ref 30 · 2 links · internal anchor
EEG foundation models encode 68.6% of a 63-feature clinical lexicon in a representation-causal way, with frequency-domain features dominant; these recover 79.3% of the models' advantage over random baselines on average.
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations cs.AI · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models cs.LG · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.
When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment cs.AI · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.
Latent Space Probing for Adult Content Detection in Video Generative Models cs.CV · 2026-04-25 · unverdicted · none · ref 36 · internal anchor
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
Transformer represents but does not causally transmit staged algorithmic intermediates for base-digit extraction, diverging from probe predictions.
Language-Switching Triggers Take a Latent Detour Through Language Models cs.CL · 2026-05-18 · unverdicted · none · ref 23 · internal anchor
Researchers identify and decompose a language-switching backdoor circuit in an autoregressive LM into early attention composition, mid-layer orthogonal propagation, and final MLP conversion.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 148 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling cs.LG · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and 14% lower error at K=16 interactions.
Instructions Shape Production of Language, not Processing cs.CL · 2026-05-11 · unverdicted · none · ref 5 · 2 links · internal anchor
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces cs.LG · 2026-05-07 · unverdicted · none · ref 33 · internal anchor
Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.
Conceptors for Semantic Steering cs.LG · 2026-05-06 · unverdicted · none · ref 2 · internal anchor
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional subspaces.
Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe cs.CL · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.
Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions cs.CL · 2026-04-30 · unverdicted · none · ref 30 · internal anchor
LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.
Architecture Determines Observability of Transformers cs.LG · 2026-04-27 · unverdicted · none · ref 5 · 2 links · internal anchor
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
Prophecy: Inferring Formal Properties from Neuron Activations cs.LG · 2025-09-25 · unverdicted · none · ref 6 · internal anchor
Prophecy infers formal properties of feed-forward neural networks by extracting rules from neuron activation patterns that imply desirable output behaviors.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets cs.AI · 2023-10-10 · unverdicted · none · ref 31 · internal anchor
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 199 · internal anchor
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Do Activation Verbalization Methods Convey Privileged Information? cs.CL · 2025-09-16 · unverdicted · none · ref 7 · internal anchor
Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
Functional Emotions or Situational Contexts? A Discriminating Test from the Mythos Preview System Card cs.HC · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
The note proposes applying emotion probes to SAE-analyzed strategic concealment episodes to test if emotion vectors capture causal emotions or situational projections in AI models.
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders cs.LG · 2026-05-13 · unreviewed · ref 26 · 2 links · internal anchor
