Ravel: Evaluating interpretability methods on disentangling language model representations

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger · 2024 · arXiv 2402.17700

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

A framework for analyzing concept representations in neural models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.

The Generalization Ridge: Information Flow in Natural Language Generation

cs.CL · 2025-07-07 · unverdicted · novelty 6.0

InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final layers.

citing papers explorer

Showing 2 of 2 citing papers.

A framework for analyzing concept representations in neural models · cs.CL · 2026-05-02 · unverdicted · none · ref 102
The Generalization Ridge: Information Flow in Natural Language Generation · cs.CL · 2025-07-07 · unverdicted · none · ref 30
