arXiv preprint arXiv:2506.08266 , year=

Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints , author= · 2025 · arXiv 2506.08266

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

Implicit Safety Alignment from Crowd Preferences

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

A hierarchical framework extracts implicit safety criteria from crowd preferences and composes them via high-level policy to reduce safety violations in downstream RL tasks without explicit safety rewards.

Reinforcement Learning from Human Feedback: A Statistical Perspective

stat.ML · 2026-04-02 · accept · novelty 2.0

A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.

citing papers explorer

Showing 2 of 2 citing papers.

Implicit Safety Alignment from Crowd Preferences cs.AI · 2026-05-20 · unverdicted · none · ref 27
A hierarchical framework extracts implicit safety criteria from crowd preferences and composes them via high-level policy to reduce safety violations in downstream RL tasks without explicit safety rewards.
Reinforcement Learning from Human Feedback: A Statistical Perspective stat.ML · 2026-04-02 · accept · none · ref 14
A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.

arXiv preprint arXiv:2506.08266 , year=

fields

years

verdicts

representative citing papers

citing papers explorer