C., Sedghi, H., Lipton, Z
2 Pith papers cite this work. Polarity classification is still indexing.
2 Pith papers citing it
fields: cs.LG (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers:
LM-DP-SGD estimates layer-specific MIA risks from shadow models and reweights gradients to give stronger protection to vulnerable layers, improving the privacy-utility trade-off over uniform DP-SGD.
citing papers explorer
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- Mitigating Membership Inference in Intermediate Representations with Differentially Private Training
LM-DP-SGD estimates layer-specific MIA risks from shadow models and reweights gradients to give stronger protection to vulnerable layers, improving the privacy-utility trade-off over uniform DP-SGD.