pith.

Recoverable Identifier

arXiv:2605.06188 · detector doi_compliance · incontrovertible · 2026-05-19 12:54:18.608593+00:00

advisory doi_compliance recoverable_identifier

DOI in the printed bibliography is fragmented by whitespace or line breaks. A longer candidate (10.48550/arXiv.2509.07430.A) was visible in the surrounding text but could not be confirmed against doi.org as printed.
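
As an illustration of the kind of repair this finding describes, the sketch below rejoins a DOI that has been split by stray whitespace. It is not the detector's actual implementation: the pattern `FRAGMENTED_DOI` and the function `rejoin_fragmented_dois` are hypothetical names, and the rule that a break is only bridged when the next fragment starts with a period is an assumption chosen to fit this excerpt. Candidates are lowercased because DOI names are case-insensitive.

```python
import re

# Hypothetical pattern (an assumption, not the detector's rule): a DOI prefix
# "10.<registrant>/" that may contain stray whitespace, followed by a suffix.
# A whitespace break inside the suffix is bridged only when the next fragment
# begins with ".", so ordinary following words are not absorbed.
FRAGMENTED_DOI = re.compile(
    r"(?<![\d.])10\.\s*\d{4,9}\s*/\s*\S+(?:\s+\.\S+)*"
)


def rejoin_fragmented_dois(text: str) -> list[str]:
    """Return whitespace-free, lowercased DOI candidates found in text."""
    candidates = []
    for match in FRAGMENTED_DOI.finditer(text):
        joined = re.sub(r"\s+", "", match.group(0))  # drop the stray breaks
        joined = joined.rstrip(".,;")                # drop sentence punctuation
        candidates.append(joined.lower())            # DOIs are case-insensitive
    return candidates


# The fragment of the printed reference quoted in the evidence text.
printed = (
    "doi: 10.48550/ARXIV .2509.07430. URL https://doi.org/10. "
    "48550/arXiv.2509.07430. A Matched DeepSeek-R1-Distill-Qwen-7B"
)
print(rejoin_fragmented_dois(printed))
```

Under these assumptions, both printed fragments collapse to the same candidate, and the "A" of the following appendix heading is not absorbed into it. Any candidate produced this way would still need to be confirmed against doi.org.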

Evidence text

Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. CoRR, abs/2509.07430, 2025. doi: 10.48550/ARXIV .2509.07430. URL https://doi.org/10. 48550/arXiv.2509.07430.

A Matched DeepSeek-R1-Distill-Qwen-7B Pre-RL Reference

As a matched lineage check, we also ran Correct-only KL and Incorrect-only KL on DeepSeek-R1-Distill-Qwen-7B, the pre-RL model underlying AceReason-Nemotron-7B (Table 6). Correct-only KL remains the safer branch, while Incorrect-only KL remains negative on average accuracy.

Table 6: Matched DeepSeek-R1-Distill-Qwen-7B pre-RL reference, evaluated on MATH500, AIME24, and AIME25. The same branch ordering holds before RLVR, with Correct-only safer and Incorrect-only negative on average accuracy.

                  Accuracy                                  Length
Method            MATH500  AIME24  AIME25  Avg. ∆ (pp)      MATH500  AIME24  AIME25  Avg. ∆ (%)
Baseline          65.5     38.8    26.2    —                2,886    10,268  11,724  —
Correct-only      65.8     40.4    30.0    +1.9             2,256    8,572   9,112   −20.2
Incorrect-only    60.4     32.9    25.4    −3.9             2,429    8,714   9,578   −16.4

B Multi-Seed Robustness of the Correct-only vs Incorrect-only Contrast

As a complementary check, we reran Correct-only KL and Incorrect-only KL on both main models with two additional training seeds (Table 7). Correct-only KL is better than Incorrect-only KL in all matched model-seed comparisons, so the branch ordering is stable across seeds. Table

Evidence payload

{
  "printed_excerpt": "Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learnin",
  "reconstructed_doi": "10.48550/arXiv.2509.07430.A",
  "ref_index": 38,
  "resolved_title": null,
  "verdict_class": "incontrovertible"
}
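
One way a consumer of this payload might derive a shorter candidate from `reconstructed_doi` is sketched below. This is a minimal sketch, not part of the report's tooling: `ARXIV_DOI` and `trim_to_arxiv_doi` are hypothetical names, the pattern assumes a new-style arXiv DOI of the form `10.48550/arXiv.YYMM.NNNNN`, and the `printed_excerpt` field is omitted for brevity.

```python
import json
import re

# The payload shown above, without the long "printed_excerpt" field.
payload = json.loads("""
{
  "reconstructed_doi": "10.48550/arXiv.2509.07430.A",
  "ref_index": 38,
  "resolved_title": null,
  "verdict_class": "incontrovertible"
}
""")

# Assumption: new-style arXiv DOIs have the shape 10.48550/arXiv.YYMM.NNNNN.
# Old-style identifiers (e.g. with a subject-class prefix) are not handled.
ARXIV_DOI = re.compile(r"10\.48550/arxiv\.\d{4}\.\d{4,5}", re.IGNORECASE)


def trim_to_arxiv_doi(candidate):
    """Return the leading arXiv-shaped DOI in candidate, or None if absent."""
    match = ARXIV_DOI.match(candidate)
    return match.group(0) if match else None


shorter = trim_to_arxiv_doi(payload["reconstructed_doi"])
print(shorter)
```

The trimmed value drops the trailing ".A", which in the evidence text is the label of the appendix heading that follows the reference. It remains only a candidate until it resolves at doi.org.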