Steerable chatbots: Personalizing llms with preference-based activation steering

Jessica Y Bo, Tianyu Xu, Ishan Chatterjee, Katrina Passarella-Ward, Achin Kulshrestha, D Shin · 2025 · arXiv 2505.04260

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

Alignment has a Fantasia Problem

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

citing papers explorer

Showing 2 of 2 citing papers.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unverdicted · none · ref 70 · 2 links
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
Alignment has a Fantasia Problem cs.AI · 2026-04-23 · unverdicted · none · ref 59
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

Steerable chatbots: Personalizing llms with preference-based activation steering

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer