arXiv preprint arXiv:2506.08266

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Implicit Safety Alignment from Crowd Preferences

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

A hierarchical framework extracts implicit safety criteria from crowd preferences and composes them via a high-level policy to reduce safety violations in downstream RL tasks without explicit safety rewards.

citing papers explorer

  • Implicit Safety Alignment from Crowd Preferences · cs.AI · 2026-05-20 · unverdicted · none · ref 27

    A hierarchical framework extracts implicit safety criteria from crowd preferences and composes them via a high-level policy to reduce safety violations in downstream RL tasks without explicit safety rewards.

  • Reinforcement Learning from Human Feedback: A Statistical Perspective · stat.ML · 2026-04-02 · accept · none · ref 14

    A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.