{"total":21,"items":[{"citing_arxiv_id":"2605.20473","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code Generation by Differential Test Time Scaling","primary_cat":"cs.SE","submitted_at":"2026-05-19T20:39:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15464","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero","primary_cat":"cs.LG","submitted_at":"2026-05-14T23:05:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14186","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:09:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, 
code, and multimodal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14098","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning","primary_cat":"stat.ML","submitted_at":"2026-05-13T20:33:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% of cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12667","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-12T19:17:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08472","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-08T20:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD 
tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"We construct multiple mid-training datasets with n ∈ {1, 2, 4, ..., 64} solution variants per question and train a separate model on each mid-training dataset. Models and Reward: We use Llama 3.2-3B-Instruct [10] as the primary base model for all experiments, including both baselines and mid-training, and report additional results with Qwen2.5-7B-Instruct in § A.5. We use Skywork-Reward-V2-Llama-3.2-3B [29] as the reward model (Rϕ) to score responses during data generation. Evaluation Details: We evaluate on six mathematical reasoning benchmarks: Math-500 [17], AIME 2024 [63], AIME 2025 [64], AMC 2023 [35], HMMT 2025 [2], and OlympiadBench [15], covering a wide range of difficulties and reasoning types. We use Math-Verify [19] to verify the correctness of the models' generated solutions automatically."},{"citing_arxiv_id":"2605.07461","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-08T09:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"internal consistency of responses respectively. 1 Introduction Reinforcement learning (RL) has driven substantial progress in LLMs on verifiable tasks [3, 11], where reliable reward signals can be derived from ground-truth labels. However, extending RL to open-ended, unverifiable domains remains challenging, as such tasks lack ground truth. 
In these settings, scalar rewards [5, 19] and monolithic generative reward models [21, 22] often provide supervision that is too coarse and underspecified to capture response quality. To address this, recent research has increasingly focused on rubric-as-reward [2, 6, 14, 17] paradigm: using structured sets of criteria to decompose each open-ended task into interpretable objectives."},{"citing_arxiv_id":"2605.16339","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:48:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26644","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling","primary_cat":"cs.AI","submitted_at":"2026-04-29T13:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent work [1, 29, 42, 32] also studies adaptive test-time compute allocation and compute-accuracy trade-offs, which are complementary to our focus on routing among inference strategies. 
2.2 Rewriting Rewriting has been widely studied in both information systems and language model reasoning. In database systems, query rewriting is used to improve retrieval efficiency [25]. Recent work extends this idea to LLMs. RewriteLM [28] combines instruction tuning and reinforcement learning to optimize rewrite quality, while LLM-R2 [22] learns to distinguish semantically equivalent but more effective rewrites via contrastive training. Other studies explore prompt optimization and preference-based rewriting [17, 5]. In mathematical reasoning, Zhou et al."},{"citing_arxiv_id":"2604.25872","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient","primary_cat":"cs.LG","submitted_at":"2026-04-28T17:10:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023. [42] Max Qiushi Lin, Jincheng Mei, Matin Aghaei, Michael Lu, Bo Dai, Alekh Agarwal, Dale Schuurmans, Csaba Szepesvari, and Sharan Vaswani. Rethinking the global convergence of softmax policy gradient with linear function approximation. arXiv preprint arXiv:2505.03155, 2025. [43] Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025. [44] Jiacai Liu, Wenye Li, and Ke Wei. 
Elementary analysis of policy gradient methods."},{"citing_arxiv_id":"2604.19544","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling","primary_cat":"cs.AI","submitted_at":"2026-04-21T15:02:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and scalable alternative to manual annotation for constructing multimodal preference data. For RL, MRMs are particularly important in domains lacking verifiable answers [15, 65]. Compared to rule-based rewards [12, 33], MRMs are better suited for modeling complex human preferences required for general preference learning [24]. At inference time, MRMs can be employed in various test-time scaling strategies (e.g., best-of-N) to identify the most optimal response among multiple candidates [43, 48]. Training MRMs requires preference data to exhibit three essential characteristics: unbiasedness, diversity, and reliability. 
Existing multimodal preference datasets face sev-"},{"citing_arxiv_id":"2604.18176","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-20T12:33:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17501","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoAct: Co-Active LLM Preference Learning with Human-AI Synergy","primary_cat":"cs.CL","submitted_at":"2026-04-19T15:43:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16004","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AgentV-RL: Scaling Reward Modeling with Agentic Verifier","primary_cat":"cs.CL","submitted_at":"2026-04-17T12:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 
25.2%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15602","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GroupDPO: Memory efficient Group-wise Direct Preference Optimization","primary_cat":"cs.CL","submitted_at":"2026-04-17T00:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02766","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-03T06:24:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02686","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Semantic Manipulation: Token-Space Attacks on Reward Models","primary_cat":"cs.LG","submitted_at":"2026-04-03T03:30:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing 
gibberish text.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.12125","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation","primary_cat":"cs.LG","submitted_at":"2026-02-12T16:14:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.02535","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation","primary_cat":"cs.CL","submitted_at":"2026-01-05T20:16:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13564","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memory in the Age of AI Agents","primary_cat":"cs.CL","submitted_at":"2025-12-15T17:22:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future 
directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23868","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA","primary_cat":"cs.LG","submitted_at":"2025-10-27T21:18:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}