HelpSteer3-preference: Open human-annotated preference data across diverse tasks and languages

· 2025 · arXiv 2505.11475

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

cs.CL · 2025-09-25 · unverdicted · novelty 6.0

RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Users as Annotators: LLM Preference Learning from Comparison Mode

cs.CL · 2025-10-10 · unverdicted · novelty 5.0

Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unverdicted · novelty 2.0

The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

citing papers explorer

Showing 4 of 4 citing papers.

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards cs.CL · 2025-09-25 · unverdicted · none · ref 43
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 182
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Users as Annotators: LLM Preference Learning from Comparison Mode cs.CL · 2025-10-10 · unverdicted · none · ref 26
Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unverdicted · none · ref 289
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

HelpSteer3-preference: Open human-annotated preference data across diverse tasks and languages

fields

years

verdicts

representative citing papers

citing papers explorer