hub Mixed citations

arXiv preprint arXiv:2505.12346

Mixed citation behavior. Most common role is background (50%).

19 Pith papers citing it
Background 50% of classified citations

citation-role summary

background: 4 · method: 2

years

2026: 17 · 2025: 2

representative citing papers

Self-Distilled RLVR

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

Hölder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

Self-Distilled Policy Gradient

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

SDPG combines group-relative verifier advantages, normalized standard deviation, full-vocabulary on-policy self-distillation, and reference-policy KL regularization to improve stability and performance over RLVR and self-distillation baselines in language model RL.
