Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
InferAligner: Inference-time alignment for harmlessness through cross-model guidance
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1roles
background 1polarities
background 1representative citing papers
citing papers explorer
-
Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.