Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

2025 · cs.CL · arXiv 2506.02132

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. We systematically probe 25 models, from BERT Base to Qwen2.5-7B, for two linguistic properties, lexical identity and inflectional features, across 6 diverse languages. We find a consistent pattern: inflectional features are linearly decodable throughout the model, while lexical identity is prominent early but weakens with depth. Further analysis of the representation geometry reveals that models with aggressive mid-layer dimensionality compression show reduced steering effectiveness in those layers, even though probe accuracy remains high. Pretraining analysis shows that inflectional structure stabilizes early while lexical identity representations continue evolving. Taken together, our findings suggest that transformers maintain inflectional features across layers while trading off lexical identity for compact, predictive representations. Our code is available at https://github.com/ml5885/model_internal_sleuthing
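
As an illustration of what "linearly decodable" means here, the following is a minimal sketch of a linear probe. It is not the authors' code (that lives in the repository linked above). It assumes synthetic stand-ins for per-layer activations, and the function names (fit_linear_probe, probe_accuracy, synthetic_layer) and the "signal" parameter are hypothetical. A ridge-regression classifier is fit on frozen features, and held-out accuracy is compared across three simulated layers in which the encoded property gets progressively weaker.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, l2=1e-2):
    """Fit a linear probe: closed-form ridge regression onto one-hot labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def probe_accuracy(W, X, y):
    """Held-out accuracy of the probe (argmax over class scores)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))

def synthetic_layer(rng, y, n_classes, dim, signal):
    """Hypothetical stand-in for one layer's activations:
    a class-specific direction scaled by `signal`, plus Gaussian noise."""
    directions = rng.normal(size=(n_classes, dim))
    return signal * directions[y] + rng.normal(size=(y.shape[0], dim))

rng = np.random.default_rng(0)
n, n_classes, dim = 2000, 4, 32
y = rng.integers(0, n_classes, size=n)
train, test = slice(0, 1500), slice(1500, n)

# Three simulated "layers": the property is encoded strongly, weakly, not at all.
accuracies = []
for signal in (2.0, 1.0, 0.0):
    X = synthetic_layer(rng, y, n_classes, dim, signal)
    W = fit_linear_probe(X[train], y[train], n_classes)
    accuracies.append(probe_accuracy(W, X[test], y[test]))

print(accuracies)
```

With real models, the synthetic features would be replaced by hidden states extracted at each layer, and the labels by lemma or inflection tags. High probe accuracy at a layer indicates that the property is linearly recoverable there. As the abstract notes, it does not by itself indicate that steering along that direction is effective.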

representative citing papers

Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Interpretability-based selection of vocabulary items plus FragMend initialization reduces token over-fragmentation and improves performance for non-Latin script languages by roughly 20 points over baselines.

Inference-Time Machine Unlearning via Gated Activation Redirection

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

GUARD-IT performs machine unlearning in LLMs via input-dependent activation steering at inference time, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06

citing papers explorer

Showing 3 of 3 citing papers.

Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion

cs.CL · 2026-04-17 · unverdicted · none · ref 3 · internal anchor

Interpretability-based selection of vocabulary items plus FragMend initialization reduces token over-fragmentation and improves performance for non-Latin script languages by roughly 20 points over baselines.

Inference-Time Machine Unlearning via Gated Activation Redirection

cs.LG · 2026-05-12 · unverdicted · none · ref 31 · 2 links · internal anchor

GUARD-IT performs machine unlearning in LLMs via input-dependent activation steering at inference time, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unreviewed · ref 17 · internal anchor
