A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
Decoder-based VLMs hallucinate because visual embeddings are over-aligned to a text manifold; projecting out the top principal components of a universal linguistic subspace reduces this bias and improves benchmark performance.
GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.
A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.
citing papers explorer
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
-
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
Decoder-based VLMs hallucinate because visual embeddings are over-aligned to a text manifold; projecting out the top principal components of a universal linguistic subspace reduces this bias and improves benchmark performance.
-
GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.
-
Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach
A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.