Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
citing papers explorer
- Training Language Models to Self-Correct via Reinforcement Learning