Defining and characterizing reward gaming

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger · 2022

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

representative citing papers

Accelerated Learning with Linear Temporal Logic using Differentiable Simulation

cs.LG · 2025-06-01 · unverdicted · novelty 7.0

Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.

Learning to Reason under Off-Policy Guidance

cs.LG · 2025-04-21 · unverdicted · novelty 6.0

LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.

Supervising the search process produces reliable and generalizable information-seeking agents

cs.CL · 2025-02-19 · unverdicted · novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.

citing papers explorer

Showing 3 of 3 citing papers.

Accelerated Learning with Linear Temporal Logic using Differentiable Simulation cs.LG · 2025-06-01 · unverdicted · none · ref 52
Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
Learning to Reason under Off-Policy Guidance cs.LG · 2025-04-21 · unverdicted · none · ref 18
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
Supervising the search process produces reliable and generalizable information-seeking agents cs.CL · 2025-02-19 · unverdicted · none · ref 67
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.

Defining and characterizing reward gaming

fields

years

verdicts

representative citing papers

citing papers explorer