AMBS is a 1-to-N Transformer steering framework that shares a base representation across HHH objectives and restricts divergence during inference to produce consistent multi-objective responses in one forward pass.
Too helpful, too harmless, too honest or just right? arXiv preprint arXiv:2509.08486, 2025
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.CL (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong