An empirical analysis of compute-optimal large language model training
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
years: 2026 (2)
roles: background 1
polarities: background 1
representative citing papers
BenchHAR finds that hybrid reconstruction-plus-contrastive SSL with CNN encoders generalizes best for sensor HAR but overall performance on unseen distributions remains unsatisfactory.
citing papers explorer
- SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
- BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition
BenchHAR finds that hybrid reconstruction-plus-contrastive SSL with CNN encoders generalizes best for sensor HAR but overall performance on unseen distributions remains unsatisfactory.