Enhancing neural network interpretability with feature-aligned sparse autoencoders

5 Pith papers cite this work. Polarity classification is still indexing.

fields: cs.LG (4), cs.CE (1)

years: 2026 (5)

verdicts: unverdicted (5)

representative citing papers

A Unifying Framework for Concept-Based Representational Similarity

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

A unifying framework decomposes concept alignment into instance-wise and distributional translation and concept consistency, introduces the InterVenchA benchmark, and shows that joint optimization via CoSAE recovers strong alignment even with 0.1% paired data.

Improving Sparse Autoencoder with Dynamic Attention

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

Perplexity Can Miss SAE Feature Damage Under Quantization

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Quantization of LLMs can degrade many SAE features even when perplexity improves or stays similar, as shown by correlation measurements on frozen SAEs for Pythia-70M and Gemma-2-2B models across INT8 to INT4.
