RepIt creates semantic backdoors in frontier language models by steering refusal vectors for specific concepts, allowing targeted unsafe responses while preserving safe scores on standard benchmarks.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
RepIt: Steering Language Models with Concept-Specific Refusal Vectors