Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback

Jordan Boyd-Graber · 2017 · DOI 10.18653/v1/d17-1153

2 Pith papers cite this work. Polarity classification is still being indexed.



representative citing papers

FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms

cs.LG · 2019-06-28 · unverdicted · novelty 7.0

FIESTA uses bandit algorithms to adaptively decide how many seeds and splits to run for each candidate model, focusing effort on promising ones while providing guarantees on selecting the optimal model.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

citing papers explorer

Showing 2 of 2 citing papers.

FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms · cs.LG · 2019-06-28 · unverdicted · none · ref 30
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs · cs.LG · 2024-02-22 · conditional · none · ref 111
