COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
years
2026 5verdicts
UNVERDICTED 5representative citing papers
RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
Defines Entropy-Gradient Inversion as a geometric fingerprint of LRM reasoning and introduces CorR-PO to embed it in RL reward regularization, reporting improved benchmark performance.
ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.
citing papers explorer
-
Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
-
RAG over Thinking Traces Can Improve Reasoning Tasks
RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.
-
Hybrid Policy Distillation for LLMs
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
-
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
Defines Entropy-Gradient Inversion as a geometric fingerprint of LRM reasoning and introduces CorR-PO to embed it in RL reward regularization, reporting improved benchmark performance.
-
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.