pith. sign in

hub

arXiv preprint arXiv:2504.00891 , year=

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

hub tools

citation-role summary

background 2

citation-polarity summary

years

2026 9 2025 4

roles

background 2

polarities

background 2

representative citing papers

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

Unsupervised Process Reward Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

OpenClaw-RL: Train Any Agent Simply by Talking

cs.CL · 2026-03-10 · unverdicted · novelty 6.0

OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

citing papers explorer

Showing 13 of 13 citing papers.