Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
citing papers explorer
-
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
- Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models