Prototype Transformer: Towards Language Model Architectures Interpretable by Design

Amine M'Charrak; Bayar Menzat; Chang Qi; Markus Kaltenberger; Matteo Forasassi; Ruizhi Wang; Thomas Lukasiewicz; Tommaso Salvatori; Yordan Yordanov

arxiv: 2602.11852 · v2 · pith:HJQ65BA3new · submitted 2026-02-12 · 💻 cs.AI · cs.CL · cs.LG

keywords: model, protot, language, prototypes, transformer, autoregressive, design, interpretable

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.
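The abstract does not spell out the mechanics of the prototype module, so the following is only an illustrative sketch of how a linear-cost, prototype-based replacement for self-attention could be wired up. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function name `prototype_mixer`, the softmax assignment of tokens to prototypes, the per-prototype exponential decay used to obtain different time scales, and the projection matrices `w_value` and `w_out`. Each token writes to and reads from K running "channel" states, so the cost grows as O(T·K·d) rather than O(T²·d).

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def prototype_mixer(x, prototypes, decays, w_value, w_out):
    """Hypothetical causal token mixer built around K prototype channels.

    x          : (T, d) token representations
    prototypes : (K, d) learned prototype vectors (assumed form)
    decays     : (K,)   per-channel decay in (0, 1); values near 1 keep
                        context longer, giving a slower time scale (assumption)
    w_value    : (d, d) value projection (assumption)
    w_out      : (d, d) output projection (assumption)
    """
    T, d = x.shape
    K = prototypes.shape[0]

    # Soft assignment of every token to every prototype: (T, K).
    assign = softmax(x @ prototypes.T / np.sqrt(d), axis=-1)
    values = x @ w_value  # (T, d)

    state = np.zeros((K, d))  # one running summary per prototype channel
    out = np.empty_like(values)
    for t in range(T):
        # Write: decay each channel, then add the token's value weighted
        # by how strongly the token matches that prototype.
        state = decays[:, None] * state + assign[t][:, None] * values[t][None, :]
        # Read: the token collects from the channels it matches.
        out[t] = (assign[t] @ state) @ w_out
    return out, assign
```

Under these assumptions, the rows of `assign` give a per-token distribution over prototypes. That is the kind of quantity one would inspect to check whether a given prototype responds to a nameable concept, and the state update only looks backward, so the mixer is causal by construction.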

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Collapse-Free Prototype Readout Layer for Transformer Encoders
cs.LG 2026-04 unverdicted novelty 7.0

DDCL-Attention introduces a collapse-free prototype readout for transformers that decomposes the training loss exactly into reconstruction and diversity terms while providing stability guarantees via singular perturba...

Graph Memory Transformer (GMT)
cs.LG 2026-04 unverdicted novelty 5.0

Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding an 82M-parameter decoder-only LM that trains stably but trails a 103M de...