In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 134–148

· 2025 · arXiv 2503.12854

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

cs.IR · 2026-05-22 · unverdicted · novelty 5.0

TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.

LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

cs.CL · 2026-04-26 · unverdicted · novelty 5.0

LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.

Sample-efficient LLM Optimization with Reset Replay

cs.LG · 2025-08-08 · unverdicted · novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.

citing papers explorer

Showing 6 of 6 citing papers.

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · none · ref 70

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

cs.LG · 2026-05-20 · unverdicted · none · ref 13

Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · none · ref 42

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

cs.IR · 2026-05-22 · unverdicted · none · ref 17

TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.

LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

cs.CL · 2026-04-26 · unverdicted · none · ref 7

LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.

Sample-efficient LLM Optimization with Reset Replay

cs.LG · 2025-08-08 · unverdicted · none · ref 15

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
