Spontaneous reward hacking in iterative self-refinement

Jane Pan, He He, Samuel R · 2024 · arXiv 2407.04549

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting

cs.SE · 2026-06-28 · unverdicted · novelty 6.0

Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.

Exploring the Secondary Risks of Large Language Models

cs.LG · 2025-06-14 · unverdicted · novelty 6.0

Introduces secondary risks as a new class of LLM failures arising from benign prompts, defines two primitives, proposes the SecLens search framework, and releases SecRiskBench, showing that these risks are widespread across 16 models.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

cs.CV · 2025-04-10 · unverdicted · novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

citing papers explorer

Showing 1 of 1 citing paper after filters.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

cs.CV · 2025-04-10 · unverdicted · none · ref 41

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
