SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
SLICE applies gradient surgery via projection and truncated SVD to initialize LoRA adapters, yielding better stability-plasticity trade-offs on continual learning benchmarks including adversarial task sequences.
citing papers explorer
-
Self-Supervised On-Policy Distillation for Reasoning Language Models
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
-
Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
SLICE applies gradient surgery via projection and truncated SVD to initialize LoRA adapters, yielding better stability-plasticity trade-offs on continual learning benchmarks including adversarial task sequences.