Lesioning parameters in large language models produces aphasia-like symptoms whose distributions vary by attention versus feed-forward components and by layer depth, but differ qualitatively from human clinical profiles.
Advances in Neural Information Processing Systems , volume =
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
ICL task identity is encoded as distributed output format templates across demonstration tokens rather than localized at single positions.
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
citing papers explorer
-
Artificial Aphasias in Lesioned Language Models
Lesioning parameters in large language models produces aphasia-like symptoms whose distributions vary by attention versus feed-forward components and by layer depth, but differ qualitatively from human clinical profiles.
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning
ICL task identity is encoded as distributed output format templates across demonstration tokens rather than localized at single positions.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.