arXiv preprint arXiv:2504.00891

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

years

2026: 9 · 2025: 4

representative citing papers

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

Unsupervised Process Reward Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

OpenClaw-RL: Train Any Agent Simply by Talking

cs.CL · 2026-03-10 · unverdicted · novelty 6.0

OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

citing papers explorer

Showing 5 of 5 citing papers after filters.

  • The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning · cs.LG · 2026-06-08 · unverdicted · none · ref 32

    PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

  • Not only where, But when: Temporal Scheduling for RLVR · cs.LG · 2026-05-25 · unverdicted · none · ref 4

    Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

  • Unsupervised Process Reward Models · cs.LG · 2026-05-11 · unverdicted · none · ref 33

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  • Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning · cs.LG · 2026-04-08 · unverdicted · none · ref 180

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  • Trust Region On-Policy Distillation · cs.LG · 2026-05-31 · unverdicted · none · ref 196

    TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.