HybridFlow: A Flexible and Efficient RLHF Framework
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: method 1
citation-polarity summary: use method 1
fields: cs.CL 2
representative citing papers
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
citing papers explorer
- Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
  OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
  Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.