Exploratory study maps stereotype encodings to individual neurons and attention heads in two LLMs via contrastive activation analysis.
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11328–11345, Singapore
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Can We Locate and Prevent Stereotypes in LLMs?
Exploratory study maps stereotype encodings to individual neurons and attention heads in two LLMs via contrastive activation analysis.