hub Mixed citations

Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219

Yonatan Belinkov · 2022 · cs.CL · DOI 10.1162/coli_a_00422 · arXiv 2102.12452

Mixed citation behavior. Most common role is background (62%).

51 Pith papers citing it

130 external citations · Crossref



abstract

Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple -- a classifier is trained to predict some linguistic property from a model's representations -- and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This article critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.
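The basic idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the "representations" are synthetic vectors rather than activations from a real model, the property is a binary label shifted along one coordinate, the probe is a small logistic regression trained by plain SGD, and the helper names (`make_data`, `train_probe`, `accuracy`) are hypothetical. The comparison against representations that do not encode the property is one simple sanity check, not the specific methodology the article reviews.

```python
import math
import random


def make_data(n, dim, shift, rng):
    """Synthetic stand-in for model representations (an assumption).

    A binary property is encoded by shifting coordinate 0 by +/- shift.
    With shift=0.0 the vectors carry no information about the label.
    """
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift if y == 1 else -shift
        xs.append(x)
        ys.append(y)
    return xs, ys


def score(w, b, x):
    """Linear score of the probe for one representation."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(xs, ys, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with stochastic gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Clamp the score so math.exp cannot overflow.
            z = max(-30.0, min(30.0, score(w, b, x)))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. the score
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples where the probe predicts the right label."""
    hits = sum(int((score(w, b, x) > 0) == (y == 1)) for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)

# Representations that encode the property.
train_x, train_y = make_data(400, 8, 3.0, rng)
test_x, test_y = make_data(200, 8, 3.0, rng)
w, b = train_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)

# Baseline: representations that do not encode the property.
base_train_x, base_train_y = make_data(400, 8, 0.0, rng)
base_test_x, base_test_y = make_data(200, 8, 0.0, rng)
cw, cb = train_probe(base_train_x, base_train_y)
control_acc = accuracy(cw, cb, base_test_x, base_test_y)

print(f"probe accuracy:    {probe_acc:.3f}")
print(f"baseline accuracy: {control_acc:.3f}")
```

In this toy setting, a large gap between the two accuracies suggests the property is linearly decodable from the first set of vectors. As the abstract notes, high probe accuracy on its own has known methodological limitations as evidence about what a model represents or uses.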


citation-role summary

background: 8 · method: 3 · baseline: 1 · dataset: 1

citation-polarity summary

background: 8 · use method: 3 · baseline: 1 · use dataset: 1

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, whereas quantization does not; the degradation is predictable and repairable via label-free realignment.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

Locating and Editing Factual Associations in GPT

cs.CL · 2022-02-10 · accept · novelty 8.0

Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0 · 2 refs

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

Evaluation of two latent reasoning models against controls shows observable latent patterns appear without the proposed mechanisms, have graded causal effects on behavior, and concentrate in structured low-rank directions, arguing that patterns are insufficient evidence for reasoning.

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

cs.CV · 2026-06-04 · conditional · novelty 7.0

Empirical study of five LVR variants finds cosine alignment negatively correlates with accuracy (r=-0.94), supervised latents are bypassed under corruption (max 4-point shift), and answers are decodable downstream but not at the latent.

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

Ablation of a stack-depth direction extracted via linear probes from transformer hidden states causes performance on counter languages to drop to near zero, showing causal necessity of the representation.

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

MLLMs exhibit spatial lexical bias on multiple-choice spatial questions, traced via mechanistic tools to language-side channels rather than vision, and largely mitigated by LLM-only DPO on synthetic data.

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.

A Fiber Criterion for Representation Identifiability in Supervised Learning

cs.LG · 2026-05-31 · conditional · novelty 7.0

A representation property is identifiable from the induced predictor iff it is constant on the fibers of the map from admissible (representation, head) pairs to the composite predictor.

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

cs.CL · 2026-05-19 · conditional · novelty 7.0

Different scoring mechanisms cause encoder-based authorship attribution models to consolidate authorship signals at different layers, as shown by causal interventions and gradient analysis.

Language-Switching Triggers Take a Latent Detour Through Language Models

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

An 8B autoregressive LM implements a language-switching backdoor via a three-phase circuit with early trigger composition, orthogonal mid-layer propagation, and final-layer MLP conversion, routed through a single-position serial bottleneck.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

KamonBench is a grammar-based dataset of 20,000 synthetic Japanese crests with multi-format annotations that enables direct evaluation of factor recovery beyond caption accuracy in vision-language models.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

What Do EEG Foundation Models Capture from Human Brain Signals?

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

EEG foundation models encode 68.6% of a 63-feature clinical lexicon in a representation-causal way, with frequency-domain features dominant; these recover 79.3% of the models' advantage over random baselines on average.

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

The strength of clinical evidence is recoverable from language model representations but not from their stated grades

cs.CL · 2026-06-27 · unverdicted · novelty 6.0

Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models, but the models' stated grades perform at chance level, and the signal is largely lexical.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

cs.CL · 2026-06-18 · unverdicted · novelty 6.0

LLM representations encode essay quality in a linearly decodable form that emerges across layers and includes identifiable scoring neurons whose distribution shifts with essay length.

citing papers explorer


Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · none · ref 36 · internal anchor

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
