This survey organizes intrinsic interpretability approaches for LLMs into five categories—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—while discussing challenges and future directions.
Preprint, arXiv:2402.06155
Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures