Null it out: Guarding protected attributes by iterative nullspace projection

URL https://arxiv · 2004 · arXiv 2004.07667

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Inference Time Causal Probing in LLMs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Eliciting Latent Predictions from Transformers with the Tuned Lens

cs.LG · 2023-03-14 · accept · novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

cs.LG · 2025-06-16 · conditional · novelty 5.0

Attribution-guided pruning with contrastive relevance identifies behavior-specific circuits in small LLMs and removes as little as 0.03-0.3% of components to reduce toxicity or repetition while preserving general performance.

citing papers explorer

Showing 6 of 6 citing papers.

Inference Time Causal Probing in LLMs · cs.AI · 2026-05-08 · unverdicted · none · ref 15
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment · cs.CL · 2026-04-12 · unverdicted · none · ref 64
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
Refusal in Language Models Is Mediated by a Single Direction · cs.LG · 2024-06-17 · accept · none · ref 176
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Eliciting Latent Predictions from Transformers with the Tuned Lens · cs.LG · 2023-03-14 · accept · none · ref 74
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
Probing Persona-Dependent Preferences in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 28
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs · cs.LG · 2025-06-16 · conditional · none · ref 26
Attribution-guided pruning with contrastive relevance identifies behavior-specific circuits in small LLMs and removes as little as 0.03-0.3% of components to reduce toxicity or repetition while preserving general performance.
