
arxiv: 2606.01066 · v1 · pith:L5IHLUCBnew · submitted 2026-05-31 · 💻 cs.AI

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

keywords reward · rlvr · verifiers · adversarial · answer · artifact · before · buggy

Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.
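The loop the abstract describes (generate adversarial completions, run a buggy and a stricter reference verifier, log paired decisions, count disagreements) can be illustrated with a minimal sketch. Everything below is hypothetical and is not taken from the paper: the function names, the substring bug, the `Answer:` output format, and the five hand-written completions are assumptions chosen only to make the idea concrete, and the disagreement rate is computed here simply as disagreements over total cases.

```python
# Illustrative sketch only: all names, the bug, and the test cases are assumptions.

def buggy_verifier(completion: str, gold: str) -> bool:
    # Hypothetical bug: accepts whenever the gold string appears anywhere.
    return gold in completion


def reference_verifier(completion: str, gold: str) -> bool:
    # Stricter check: the last line must be "Answer: <number>" and the
    # number must equal the gold answer numerically.
    lines = completion.strip().splitlines()
    if not lines or not lines[-1].startswith("Answer:"):
        return False
    try:
        return float(lines[-1][len("Answer:"):]) == float(gold)
    except ValueError:
        return False


def adversarial_completions(gold: str) -> list[str]:
    # Hand-written probes; a real fuzzer would generate many more.
    wrong = str(int(gold) + 1)
    return [
        f"Answer: {gold}",                                  # plainly correct
        f"Answer: {wrong}",                                 # plainly wrong
        f"It could be {gold} or {wrong}.\nAnswer: {wrong}",  # mentions gold, answers wrong
        f"Answer: 1{gold}",                                 # gold as a substring of a wrong number
        "Answer: 4.2e1",                                    # equals 42, written differently
    ]


def fuzz(gold: str):
    # Log paired decisions, then count disagreements against the reference.
    log = [
        {
            "completion": c,
            "buggy": buggy_verifier(c, gold),
            "reference": reference_verifier(c, gold),
        }
        for c in adversarial_completions(gold)
    ]
    fp = sum(r["buggy"] and not r["reference"] for r in log)
    fn = sum(r["reference"] and not r["buggy"] for r in log)
    report = {
        "false_positives": fp,
        "false_negatives": fn,
        "disagreement_rate": (fp + fn) / len(log),
    }
    return log, report


log, report = fuzz("42")
print(report)
```

In this toy setup the two false positives are the completions a policy could exploit for unearned reward, while the false negative is a correct answer the buggy checker would penalize.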
