LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
Bleu: a method for automatic evaluation of machine translation
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2025 2verdicts
UNVERDICTED 2representative citing papers
TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.
citing papers explorer
-
Learning to Reason under Off-Policy Guidance
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
-
TableMaster: A Recipe to Advance Table Understanding with Language Models
TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.