arXiv preprint arXiv:2507.10532 , year=

Reasoning or memorization? unreliable results of reinforcement learning due to data contamination , author= · 2024 · arXiv 2507.10532

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

cs.AI · 2025-12-21 · unverdicted · novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-11-08 · unverdicted · novelty 6.0

Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

cs.LG · 2025-09-26

citing papers explorer

Showing 8 of 8 citing papers.

Inducing Artificial Uncertainty in Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 39
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL cs.CL · 2026-04-22 · unverdicted · none · ref 71
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning cs.AI · 2025-12-21 · unverdicted · none · ref 22
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 45
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 81
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Revisiting Entropy in Reinforcement Learning for Large Reasoning Models cs.CL · 2025-11-08 · unverdicted · none · ref 4
Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.
The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation cs.LG · 2026-05-21 · unverdicted · none · ref 4
ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.
Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards cs.LG · 2025-09-26 · unreviewed · ref 34

arXiv preprint arXiv:2507.10532 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer