Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures
Pith reviewed 2026-05-10 08:25 UTC · model grok-4.3
The pith
This survey organizes intrinsic interpretability approaches for LLMs into five categories—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—while discussing challenges and future directions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative to post-hoc methods, with existing approaches categorizable into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
Load-bearing premise
That the five proposed design paradigms provide a comprehensive, non-overlapping, and useful taxonomy that accurately captures the current landscape of intrinsic interpretability methods for LLMs without major omissions.
Original abstract
While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.