arXiv preprint arXiv:2306.14111

Is RLHF More Difficult than Standard RL? · 2023 · arXiv 2306.14111

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Convex Optimization for Alignment and Preference Learning on a Single GPU

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.

OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

cs.LG · 2026-02-19 · unverdicted · novelty 6.0

OPRIDE improves query efficiency in offline PbRL via a principled in-dataset exploration strategy and discount scheduling, outperforming prior methods with fewer queries and providing theoretical guarantees.

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

cs.LG · 2025-10-09 · unverdicted · novelty 6.0

The paper defines a Gradient Gap for RLVR policy gradients and proves a sharp step-size threshold below which training converges and above which it collapses, with predictions for length and success-rate scaling validated in simulations and on Qwen2.5-Math-7B.

citing papers explorer

Showing 3 of 3 citing papers.

Convex Optimization for Alignment and Preference Learning on a Single GPU · cs.LG · 2026-05-22 · unverdicted · none · ref 46
OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration · cs.LG · 2026-02-19 · unverdicted · none · ref 47
On the optimization dynamics of RLVR: Gradient gap and step size thresholds · cs.LG · 2025-10-09 · unverdicted · none · ref 23
