Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu · 2025

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

cs.LG · 2025-09-03 · unverdicted · novelty 6.0

PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.

Self-Aligned Reward: Towards Effective and Efficient Reasoners

cs.LG · 2025-09-05 · unverdicted · novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

citing papers explorer

Showing 2 of 2 citing papers.

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training cs.LG · 2025-09-03 · unverdicted · none · ref 24
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
Self-Aligned Reward: Towards Effective and Efficient Reasoners cs.LG · 2025-09-05 · unverdicted · none · ref 39
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Hybridflow: A flexible and efficient rlhf framework

fields

years

verdicts

representative citing papers

citing papers explorer