Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

Liangming Pan; Qinglin Meng; Yuan Zhou; Yutong Gao

arXiv: 2604.16042 · v2 · submitted 2026-04-17 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-10 08:25 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords interpretability · intrinsic · models · architectures · design · existing · language · large

The pith

This survey organizes intrinsic interpretability approaches for LLMs into five categories—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—while discussing challenges and future directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models work well but their internal workings are hard to see, which makes them risky to trust in important decisions. Most current explanation tools look at a finished model from the outside and guess what it is doing. Intrinsic interpretability instead changes how the model is built so its parts are clearer by design. The survey sorts these built-in approaches into five groups: making the model's functions easy to follow, linking its internal concepts to human-understandable ideas, breaking its representations into understandable pieces, using separate modules that can be inspected, and forcing the model to use only a few important internal features at a time. It also lists open problems that still need work.
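The last of the five groups, using only a few internal features at a time, can be illustrated with a top-k activation rule. This is a minimal sketch under assumptions, not a method taken from the survey: the function name `top_k_sparsify` and the example vector are hypothetical, and real architectures apply such a rule to learned hidden states during training.

```python
def top_k_sparsify(activations, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    if k <= 0:
        return [0.0] * len(activations)
    # Indices of the k entries with the largest absolute value.
    keep = set(
        sorted(
            range(len(activations)),
            key=lambda i: abs(activations[i]),
            reverse=True,
        )[:k]
    )
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


# Hypothetical hidden vector with five features.
hidden = [0.1, -2.0, 0.05, 1.5, -0.3]
sparse = top_k_sparsify(hidden, k=2)

# Only the surviving features need to be inspected to explain this step.
active = [i for i, a in enumerate(sparse) if a != 0.0]
print(sparse)  # [0.0, -2.0, 0.0, 1.5, 0.0]
print(active)  # [1, 3]
```

With only two features active, a reader of the model has two units to examine for this input rather than five, which is the intuition behind sparsity as an interpretability aid.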

Core claim

Intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative to post-hoc methods, with existing approaches categorizable into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
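To make "transparency built into the computation" concrete, the sketch below shows a toy concept bottleneck, where the label depends on the input only through named concept scores. Treating this as an instance of the concept alignment paradigm is an assumption made here, and the concept names, weights, and offsets are all hypothetical values chosen for illustration, not anything reported in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical concepts and hand-picked weights, for illustration only.
CONCEPTS = ["mentions_refund", "negative_tone"]
CONCEPT_WEIGHTS = {
    "mentions_refund": {"refund": 3.0, "money": 1.0},
    "negative_tone": {"terrible": 3.0, "broken": 2.0},
}
LABEL_WEIGHTS = {"mentions_refund": 2.0, "negative_tone": 2.5}
LABEL_BIAS = -2.0


def predict_concepts(tokens):
    """Stage 1: map input tokens to one probability per named concept."""
    scores = {}
    for c in CONCEPTS:
        z = sum(CONCEPT_WEIGHTS[c].get(t, 0.0) for t in tokens) - 1.5
        scores[c] = sigmoid(z)
    return scores


def predict_label(concepts):
    """Stage 2: the label sees the input only through the concept scores."""
    z = LABEL_BIAS + sum(LABEL_WEIGHTS[c] * concepts[c] for c in CONCEPTS)
    return sigmoid(z)


tokens = "the product arrived broken and terrible".split()
concepts = predict_concepts(tokens)
p_complaint = predict_label(concepts)

# Intervention: a person overrides one concept and the prediction follows.
edited = dict(concepts, negative_tone=0.0)
p_after = predict_label(edited)
```

Because every prediction passes through the concept scores, a reader can inspect `concepts` to see why the label was produced and edit a single entry to test how the output responds.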

Load-bearing premise

That the five proposed design paradigms provide a comprehensive, non-overlapping, and useful taxonomy that accurately captures the current landscape of intrinsic interpretability methods for LLMs without major omissions.

Figures

Figures reproduced from arXiv: 2604.16042 by Liangming Pan, Qinglin Meng, Yuan Zhou, Yutong Gao.

**Figure 2.** A taxonomy of intrinsic architectural designs for interpretable LLMs. We categorize existing approaches into five primary families based on their core mechanism for transparency. [Figure labels recovered from extraction: Pre-defined Model; Structural Flexibility (Rigid / Human Defined, Flexible / Data Driven); Model Capacity (Low Capacity, High Capacity); GAM (1986), GA2M (2013), CBM (2020), Hybrid CBM (2022), CEM (2022), Codebook (2023), MoE-X (2025), MxD (2025)…]

**Figure 3.** Evolution of intrinsic interpretability. The field has shifted from rigid, human-defined structures (e.g., GAMs) to scalable, data-driven sparse architectures (e.g., Specialized MoEs) that balance interpretability with performance. [Adjacent body text, truncated in extraction: "…press redundant channels and form task-specific subcircuits. Representative techniques following this principle are discussed in Section 4.5. A key trade-off is that strong sp…"]

**Figure 4.** Architectural strategies for intrinsically interpretable MoEs. We distinguish between methods enforcing intra-expert sparsity, fine-grained decomposition, and semantically aligned routing. [Adjacent body text, truncated in extraction: "…anisms are typically optimized for load balancing rather than semantic transparency (Fedus et al., 2022). Recent work revisits MoE design with interpretability as a central goal. We organize these methods into three arc…"]
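For readers unfamiliar with mixture-of-experts routing, the following is a minimal sketch of generic top-k gating, the mechanism that the interpretable MoE designs in Figure 4 build on. It is not the routing scheme of any specific method in the survey; the scalar "experts", the logits, and the function names `route` and `moe_forward` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=1):
    """Return (expert_index, renormalized_gate) pairs for the top-k experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=1):
    """Gate-weighted sum of the selected experts' outputs."""
    chosen = route(router_logits, k)
    y = sum(gate * experts[i](x) for i, gate in chosen)
    return y, chosen


# Hypothetical scalar experts standing in for feed-forward blocks.
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]

# With k=1, only expert 1 (the largest logit) processes the input.
y, chosen = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0], experts=experts, k=1)
```

The returned `chosen` list records which experts handled the input. That record is only informative if experts correspond to recognizable functions, which is why a router trained purely for load balancing offers little semantic transparency.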

Original abstract

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, it does not introduce new free parameters, axioms, or invented entities; it reviews and organizes prior work in explainable AI without adding novel foundational assumptions or constructs.

pith-pipeline@v0.9.0 · 5449 in / 1132 out tokens · 58874 ms · 2026-05-10T08:25:03.901936+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings, pages 315–323. JMLR.org. Robert M. Gray. 1984. Vector quantization. IEEE ASSP Magazine, 1:4–29. Hongcan Guo, Haolang Lu, Guoshun ...

work page arXiv 2011
[2]

Addressing leakage in concept bottleneck models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Dan Hendrycks and Kevin Gimpel. 2023. Gaussian error linear units (gelus). Preprint, arXiv:1606.08415. John Hewitt...

work page Pith review arXiv 2022
[3]

Preprint, arXiv:2402.06155

Model editing with canonical examples. Preprint, arXiv:2402.06155. John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang. 2023. Backpack language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9103–9125. Associat...

work page arXiv 2023
[4]

Promises and pitfalls of black-box concept learning models

Learning sparse neural networks through l_0 regularization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Proce...

work page arXiv 2018
[5]

Preprint, https://doi.org/10.48550/arXiv.1909.09223, 2019

Generalized linear models. Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384. Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. 2019. Interpretml: A unified framework for machine learning interpretability. Preprint, arXiv:1909.09223. nostalgebraist. 2020. Interpreting gpt: The logit lens. LessWrong blog pos...

work page arXiv 2019
[6]

GLU Variants Improve Transformer

ACM. Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1(5):206–215. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Bide...

work page internal anchor Pith review arXiv 2019
