Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
Defining and characterizing reward gaming
3 Pith papers cite this work. Polarity classification is still indexing.
years
2025 3verdicts
UNVERDICTED 3representative citing papers
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
citing papers explorer
-
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
-
Learning to Reason under Off-Policy Guidance
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
-
Supervising the search process produces reliable and generalizable information-seeking agents
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.