arXiv preprint arXiv:2308.09124, 2023

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau · 2023 · arXiv 2308.09124

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2

citation-polarity summary

background: 1 · unclear: 1

representative citing papers

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.

Cell-Based Representation of Relational Binding in Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

Architecture, Not Scale: Circuit Localization in Large Language Models

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

How Do Language Models Compose Functions?

cs.CL · 2025-10-02 · conditional · novelty 6.0

LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.

How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models

cs.CL · 2025-09-29 · unverdicted · novelty 6.0

Balanced parametric and in-context knowledge use in LLMs is an emergent property requiring intra-document repetition, moderate inconsistency, and skewed distributions in training data.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

citing papers explorer

Showing 11 of 11 citing papers.

The Linear Representation Hypothesis and the Geometry of Large Language Models · cs.CL · 2023-11-07 · conditional · none · ref 9
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender · cs.CL · 2026-05-12 · unverdicted · none · ref 39
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction · cs.AI · 2026-05-04 · unverdicted · none · ref 12
Cell-Based Representation of Relational Binding in Language Models · cs.CL · 2026-04-21 · unverdicted · none · ref 22
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models · cs.CL · 2026-05-10 · unverdicted · none · ref 11
Architecture, Not Scale: Circuit Localization in Large Language Models · cs.CL · 2026-05-09 · unverdicted · none · ref 7
LLM Safety From Within: Detecting Harmful Content with Internal Representations · cs.AI · 2026-04-20 · unverdicted · none · ref 36
How Do Language Models Compose Functions? · cs.CL · 2025-10-02 · conditional · none · ref 14
How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models · cs.CL · 2025-09-29 · unverdicted · none · ref 5
Towards Effective Theory of LLMs: A Representation Learning Approach · cs.LG · 2026-05-10 · unverdicted · none · ref 32
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes · cs.AI · 2026-05-07 · unverdicted · none · ref 28
