IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
In Proceedings of the 61st Annual Meeting of the As- sociation for Computational Linguistics (Volume 5: Industry Track), pages 134–148
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6representative citing papers
Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
citing papers explorer
-
IRIS: Interpolative R\'enyi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
-
How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR
Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
-
TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.
-
LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.
-
Sample-efficient LLM Optimization with Reset Replay
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.