Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models
3 Pith papers cite this work.
abstract
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. We systematically probe 25 models from BERT Base to Qwen2.5-7B focusing on two linguistic properties: lexical identity and inflectional features across 6 diverse languages. We find a consistent pattern: inflectional features are linearly decodable throughout the model, while lexical identity is prominent early but increasingly weakens with depth. Further analysis of the representation geometry reveals that models with aggressive mid-layer dimensionality compression show reduced steering effectiveness in those layers, despite probe accuracy remaining high. Pretraining analysis shows that inflectional structure stabilizes early while lexical identity representations continue evolving. Taken together, our findings suggest that transformers maintain inflectional features across layers, while trading off lexical identity for compact, predictive representations. Our code is available at https://github.com/ml5885/model_internal_sleuthing
years
2026: 3
representative citing papers
citing papers explorer
- Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
Interpretability-based selection of vocabulary items plus FragMend initialization reduces token over-fragmentation and improves performance for non-Latin script languages by roughly 20 points over baselines.
- Inference-Time Machine Unlearning via Gated Activation Redirection
GUARD-IT performs machine unlearning in LLMs via input-dependent activation steering at inference time, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.
- Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models