Scaling reasoning, losing control: Evaluating instruction following in large reasoning models. arXiv preprint arXiv:2505.14810,
5 Pith papers cite this work.
citing papers explorer
- ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following
  ImpRIF improves LLM complex instruction following by synthesizing data from reasoning graphs and training models to reason explicitly along those graphs.
- Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
  Retrieval-of-Thought organizes prior reasoning into a thought graph for retrieval and reward-guided recombination, reducing output tokens by up to 40% and latency by 82% while preserving accuracy on reasoning benchmarks.
- Learning to Reason under Off-Policy Guidance
  LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
- A Survey of Reinforcement Learning for Large Reasoning Models
  A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
- Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards