MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
Steerable chatbots: Personalizing llms with preference-based activation steering
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
method 1
citation-polarity summary
fields
cs.AI 2years
2026 2verdicts
UNVERDICTED 2roles
method 1polarities
use method 1representative citing papers
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
citing papers explorer
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
-
Alignment has a Fantasia Problem
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.