Steerable chatbots: Personalizing LLMs with preference-based activation steering

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: method 1

citation-polarity summary

fields: cs.AI 2

years: 2026 2

verdicts: unverdicted 2

roles: method 1

polarities: use method 1

representative citing papers

Alignment has a Fantasia Problem

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

citing papers explorer

Showing 2 of 2 citing papers.

  • Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion · cs.AI · 2026-05-12 · unverdicted · none · ref 70 · 2 links

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

  • Alignment has a Fantasia Problem · cs.AI · 2026-04-23 · unverdicted · none · ref 59

    AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.