{"total":11,"items":[{"citing_arxiv_id":"2604.19144","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation","primary_cat":"cs.CL","submitted_at":"2026-04-21T06:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03144","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InCoder-32B-Thinking: Industrial Code World Model for Thinking","primary_cat":"cs.AR","submitted_at":"2026-04-03T16:06:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"achieves competitive performance on general code benchmarks while establishing strong baselines across industrial domains. Building upon it, InCoder-32B-Thinking is a Thinking variant that integrates chain-of-thought reasoning to further enhance the model's capacity for complex industrial code generation. 5.2. Thinking in Large Language Models OpenAI o1 [49] demonstrated that training models to produce long internal chains of thought via reinforcement learning (RL) can dramatically improve performance on complex reasoning tasks, establishing the concept of thinking models. 
This line was advanced by o3 [50] and Gemini 3 [22]. On the open-source side, DeepSeek-R1 [26] showed that pure RL can incentivize emergent"},{"citing_arxiv_id":"2602.07906","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering","primary_cat":"cs.LG","submitted_at":"2026-02-08T10:55:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.11989","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thought Graph Traversal for Test-time Scaling in Chest X-ray VLLMs","primary_cat":"cs.CV","submitted_at":"2025-06-13T17:46:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new prompting framework called Thought Graph Traversal combined with reasoning budget forcing improves test-time performance of frozen chest X-ray VLLMs on report generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06856","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-07T16:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning 
ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01990","ref_index":160,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems","primary_cat":"cs.AI","submitted_at":"2025-03-31T18:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"demonstrated the effectiveness of reward-guided optimization in specific reasoning tasks. Building on these foundational frameworks, more recent research has pushed the boundaries of LLM reasoning enhancement by exploring how to cultivate reasoning capabilities in more challenging low-resource settings and self-supervised manner. Shafayat et al. [160] explains that the success of self-training highly depends on the model's initial capabilities and the nature of the task, motivating further exploration of effective self-improvement strategies. To address the heavy reliance on large-scale annotated data, innovative reinforcement learning paradigms have been proposed. 
For instance, Absolute Zero [161] leverages a self-play"},{"citing_arxiv_id":"2503.01785","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual-RFT: Visual Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-03-03T18:16:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"models with self generated and verified source code. arXiv preprint arXiv:2410.05605, 2024. 4 [47] Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024. 3 [48] Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024. 4 [49] Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. 
Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference op-"},{"citing_arxiv_id":"2502.17419","ref_index":122,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-02-24T18:50:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tential knowledge conflicts arise. In medical scenarios [119], complex problems, such as those requiring test-time scaling techniques, demonstrate significant improvements [52]. Overly Cautious & Simple Problem Trap: Currently, reasoning LLMs have demonstrated strong performance in domains such as competitive-level mathematics [31], [54], [120], [121], complex coding [122], medical question answering [52], [113], and multilingual translation [112], [123]. These scenarios require the model to perform fine-grained analysis of the problem and execute careful logical reasoning based on the given conditions. 
Interestingly, even for straightforward problems like \"2+3=?\", reasoning LLMs can exhibit overconfidence or uncertainty."},{"citing_arxiv_id":"2501.05366","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Search-o1: Agentic Search-Enhanced Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2025-01-09T16:48:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding, and QA tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Other studies incorporate deliberate errors in reasoning paths during training to partially internalize these abilities [49, 71]. Additionally, distilling training data has been shown to enhance models' o1-like reasoning skills [45]. The o1-like reasoning paradigm has demonstrated strong performance across diverse domains, including vision-language reasoning [65, 11, 48, 69], code generation [81, 32], healthcare [3], and machine translation [57]. However, these approaches are limited by their reliance on static, parameterized models, which cannot leverage external world knowledge when internal knowledge is insufficient. Retrieval-Augmented Generation. 
Retrieval-augmented generation (RAG) introduces retrieval mechanisms to address the limitations of static parameters in generative models, allowing access"},{"citing_arxiv_id":"2412.18925","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","primary_cat":"cs.CL","submitted_at":"2024-12-25T15:12:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.09413","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems","primary_cat":"cs.AI","submitted_at":"2024-12-12T16:20:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"STILL-2 uses imitation of distilled long-form thoughts, multi-rollout exploration on difficult problems, and iterative self-improvement of the dataset to train reasoning models that reach competitive performance on three challenging benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}