Gold doesn't always glitter: Spectral removal of linear and nonlinear guarded attribute information

Shun Shao, Yftah Ziser, Shay B Cohen · 2022 · arXiv 2203.07893

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

cs.CV · 2026-04-07 · conditional · novelty 7.0

Visual encoders leak identity information; a one-shot linear subspace removal method (ISP) reduces leakage to near-chance levels while retaining high non-biometric utility across datasets.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

citing papers explorer

Showing 2 of 2 citing papers.

From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

cs.CV · 2026-04-07 · conditional · none · ref 24

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · none · ref 182
