Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfi · 2023

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types

cs.LG · 2024-08-27 · unverdicted · novelty 6.0

UNA unifies binary, pairwise, and score-based feedback for LLM alignment via a generalized implicit reward function shown optimal by the log sum inequality.

citing papers explorer

Showing 1 of 1 citing paper after filters.

UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types

cs.LG · 2024-08-27 · unverdicted · none · ref 2

