AceMath: Advancing frontier math reasoning with post-training and reward modeling
5 Pith papers cite this work.
citing papers explorer
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
- ZAYA1-VL-8B Technical Report
  ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
- Reinforcement Learning from Human Feedback
  The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.