pith. sign in

arxiv: 2601.11061 · v2 · pith:5BBE4O53new · submitted 2026-01-16 · 💻 cs.LG · cs.CL

Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

classification 💻 cs.LG cs.CL
keywords rewardsrlvrspuriouscircuitlayersmemorizationmodelsparadox
0
0 comments X
read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering-artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 7.0

    Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

  2. When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR

    cs.AI 2026-06 unverdicted novelty 6.0

    Visual shortcut reliance in multimodal RLVR emerges abruptly, shows monotone response to penalty strength lambda, exhibits hysteresis in reversal, and has a critical early intervention window on an out-of-distribution...

  3. VeriGate: Verifier-Gated Step-Level Supervision for GRPO

    cs.LG 2026-05 unverdicted novelty 6.0

    VeriGate adds verifier-gated step-level supervision to GRPO via cumulated PRM rewards and group-normalized token advantages, raising accuracy 20% and 12% on 1.5B and 7B models on MATH and six benchmarks.