{"total":20,"items":[{"citing_arxiv_id":"2605.18721","ref_index":1,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"General Preference Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T17:50:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09608","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-10T15:40:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Across domain-continual and capability-continual settings, GCWM improves retention and final performance over data-free baselines without replay data. 10 References [1] Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1-42, 2025. 
[2] Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025. [3] Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty."},{"citing_arxiv_id":"2605.09584","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics","primary_cat":"cs.CL","submitted_at":"2026-05-10T14:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08817","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors","primary_cat":"cs.AI","submitted_at":"2026-05-09T09:10:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[18] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. [19] Aayush Karan and Yilun Du. 
Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901, 2025. [20] Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025. [21] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient"},{"citing_arxiv_id":"2605.08378","ref_index":199,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning for Scalable and Trustworthy Intelligent Systems","primary_cat":"cs.LG","submitted_at":"2026-05-08T18:36:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05742","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)","primary_cat":"cs.LG","submitted_at":"2026-05-07T06:33:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04344","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perturbation is All You Need for Extrapolating Language 
Models","primary_cat":"stat.ML","submitted_at":"2026-05-05T23:03:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03295","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy","primary_cat":"cs.CY","submitted_at":"2026-05-05T02:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The first two authors then repeat this process for the remaining transcripts, finalizing the codebook. Before generating themes, the authors revisit previously coded X posts and podcasts, adding or merging subcodes where appropriate. Finally, the authors hold a meeting to generate themes that answer the study's RQs, using an affinity diagramming approach [82] to group observations within each deductive code (corresponding to a research question), extracting representative quotes from podcasts and identifying informative X posts. The authors generate a total of nine themes, including three themes each to answer the AI Expertise RQ, the Human Expertise RQ, and the Institutional Expertise RQ. 
These themes accordingly form the content of the"},{"citing_arxiv_id":"2604.17614","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Characterizing Model-Native Skills","primary_cat":"cs.AI","submitted_at":"2026-04-19T20:58:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16557","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"S-GRPO: Unified Post-Training for Large Vision-Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-17T08:39:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Post-training for LVLMs Post-training techniques, primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), have been central to adapting LVLMs [9, 19, 20, 23]. SFT enables direct task-specific learning but often leads to catastrophic forgetting of pre-trained knowledge [5, 12, 13]. While parameter-efficient approaches like LoRA [17] and Prefix Tuning [22] alleviate computational burdens, they remain prone to overfitting and exhibit limited transferability. In contrast, RL-based methods enhance adaptability by optimizing sequential decision-making. 
Unlike traditional Proximal Policy Optimization (PPO) [34], which imposes high computational costs, recent alternatives like Direct Preference Optimization (DPO) [33]"},{"citing_arxiv_id":"2604.15705","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-17T05:24:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the tendency of policies to exploit the inaccuracies of reward models, which leads to a systemic deviation from true human intent [7]-[9]. Finally, investigations into policy drift and entropy collapse scrutinize the training dynamics where the policy deviates excessively from the reference model or suffers from a collapse in distributional diversity [10]-[14]. Although these paradigms provide robust mechanisms for stabilizing RFT against external perturbations, they fundamentally treat drift as a phenomenon induced by exogenous environmental or data-centric factors. 
Consequently, the internal instability of the thinking process remains unexplored, leaving a critical gap in understanding the endogenous reasoning drift that arises"},{"citing_arxiv_id":"2604.10436","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units","primary_cat":"cs.CV","submitted_at":"2026-04-12T03:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07941","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","primary_cat":"cs.CL","submitted_at":"2026-04-09T08:00:37+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In:arXiv preprint arXiv:2510.08049(2025). [31] G. Tie et al. \"A Survey on Post-training of Large Language Models\". In:arXiv preprint arXiv:2503.06072 (2025). [32] H. Lai et al. \"A Survey of Post-Training Scaling in Large Language Models\". In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, pp. 2771- 2791. [33] K. Kumar et al. 
\"LLM Post-Training: A Deep Dive into Reasoning Large Language Models\". In:arXiv preprint arXiv:2502.21321(2025). [34] G. I. Winata et al. \"Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey\". In:Journal of Artificial Intelligence Research82 (2025), pp. 2595-2661. [35] M. Pternea et al. \"The RL/LLM Taxonomy Tree: Reviewing Synergies between Reinforcement Learning"},{"citing_arxiv_id":"2605.20201","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning","primary_cat":"cs.CL","submitted_at":"2026-04-06T16:44:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03231","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","primary_cat":"cs.CV","submitted_at":"2026-04-03T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[29] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models?Neural Information Processing Systems(2024). 
doi:10.48550/arXiv.2405.02246 [30] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. Otter: A multi-modal model with in-context instruction tuning.arXiv preprint arXiv:2305.03726(2023). [31] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.arXiv preprint arXiv:2306.00890(2023). [32] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language"},{"citing_arxiv_id":"2601.21484","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment","primary_cat":"cs.LG","submitted_at":"2026-01-29T10:06:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.06412","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sample-efficient LLM Optimization with Reset Replay","primary_cat":"cs.LG","submitted_at":"2025-08-08T15:56:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.05015","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning","primary_cat":"cs.LG","submitted_at":"2025-08-07T03:50:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.13958","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ToolRL: Reward is All Tool Learning Needs","primary_cat":"cs.LG","submitted_at":"2025-04-16T21:45:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.02181","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Scaling in Large Language Model Reasoning","primary_cat":"cs.AI","submitted_at":"2025-04-02T23:51:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect 
performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"extending traditional domains and opening entirely new research avenues. This section explores how scaling has influenced three critical areas: LLM-as-a-Judge, fact-checking, and dialogue systems. LLM-as-a-Judge. Using LLMs to evaluate model outputs or other models has emerged as a pivotal research direction, enabling evaluation at scale beyond traditional approaches and human assessment [88]. Notably, larger models demonstrate a significantly higher correlation with human preferences compared to their smaller counterparts [238]. To further improve evaluation quality, recent work has explored multi-step reasoning processes [151], where scaling the number of reasoning steps enhances evaluation capabilities [29]. Additionally, scaling across multiple judge models has emerged as"}],"limit":50,"offset":0}