Disentangling by Factorising
5 Pith papers cite this work.
abstract
We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon $\beta$-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.
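The abstract's idea of "encouraging the distribution of representations to be factorial" can be sketched in symbols. This is the objective as it is usually stated for FactorVAE, not text taken from this page; the symbols $q(z)$, $\bar{q}(z)$, $\gamma$, $d$, and $N$ are introduced here for illustration.

```latex
% Sketch of the FactorVAE objective (to be maximised).
% Assumed notation: x^{(i)} is the i-th of N data points, q(z|x) the encoder,
% p(x|z) the decoder, p(z) the prior, q(z) the aggregate posterior over the
% data, and d the number of latent dimensions.
\[
\frac{1}{N}\sum_{i=1}^{N}\Big[
  \mathbb{E}_{q(z\mid x^{(i)})}\big[\log p(x^{(i)}\mid z)\big]
  - \mathrm{KL}\big(q(z\mid x^{(i)})\,\|\,p(z)\big)
\Big]
\;-\; \gamma\,\mathrm{KL}\big(q(z)\,\|\,\bar{q}(z)\big),
\qquad
\bar{q}(z) = \prod_{j=1}^{d} q(z_j).
\]
```

The first bracket is the standard VAE evidence lower bound. The last term is the total correlation of the latent code, which is zero exactly when $q(z)$ factorises across dimensions, so a larger $\gamma$ pushes harder toward independent latent dimensions. In contrast, $\beta$-VAE scales the per-sample KL term against the prior, which is the trade-off with reconstruction quality that the abstract refers to.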
citing papers explorer
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
  A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- Distributional Autoencoders Know the Score
  DPA provides closed-form relation from level-set geometry to data score and proves extra latent components are conditionally independent, revealing intrinsic dimension.
- What's in the latent space? Exploring coupled tropical Pacific variability within a Multi-branch $\beta$-Variational Autoencoder
  A multi-branch β-VAE on tropical Pacific SST, OHC, and OLR fields yields a latent space that reconstructs data well and aligns with physical ENSO and longer-term coupled variability modes.
- Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations
  DAR replaces GAP with an attention-based aggregation module retrained jointly with the classifier head to disentangle core from spurious features and outperforms DFR on multiple datasets.
- Learning to Theorize the World from Observation