{"total":14,"items":[{"citing_arxiv_id":"2606.26918","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diagnosing Task Insensitivity in Language Agents","primary_cat":"cs.AI","submitted_at":"2026-06-25T11:53:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper diagnoses task insensitivity in LLM agents as a cause of weak OOD generalization, links it to attention drift, and proposes Task-Perturbed NLL Optimization as a contrastive regularizer to improve task dependence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21943","ref_index":279,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning","primary_cat":"cs.LG","submitted_at":"2026-06-20T08:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05885","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training","primary_cat":"cs.LG","submitted_at":"2026-06-04T08:54:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ECPO improves GiGPO by shrinking low-count action advantages and suppressing noisy anchor states, yielding +5.2/+7.3 success gains on ALFWorld/WebShop with Qwen2.5-1.5B models at negligible extra 
cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02355","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training","primary_cat":"cs.AI","submitted_at":"2026-06-01T15:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01249","ref_index":146,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trust Region On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-31T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24517","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ECHO: Terminal Agents Learn World Models for Free","primary_cat":"cs.LG","submitted_at":"2026-05-23T11:08:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ECHO is a hybrid RL objective that trains agents to predict environment observation tokens from their actions, doubling GRPO pass@1 on TerminalBench-2.0 while improving dynamics prediction on held-out 
trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22240","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unlocking Proactivity in Task-Oriented Dialogue","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:46:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20061","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents","primary_cat":"cs.CL","submitted_at":"2026-05-19T16:19:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07725","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language 
model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and FireAct [45] enable tool use but rely on demonstrations rather than online optimization. Recent work extends RL to agent interaction trajectories across code generation [46], tool use [47], GUI interaction [48], and web navigation [49]. A central challenge is credit assignment under sparse, delayed feedback, addressed via trajectory-level updates and value-free formulations [50, 51]. KL-regularized policy optimization further introduces bias and instability concerns [44], amplified in agentic settings by distribution shift and compounding errors. Broader frameworks scaling agentic RL across environments [52-54] still rely on trajectory-level rewards without dense supervision. On-policy Distillation. On-Policy Distillation (OPD) [36, 55-58, 41, 59] introduces token-level"},{"citing_arxiv_id":"2605.06642","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:51:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05413","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From History to State: Constant-Context Skill Learning for LLM 
Agents","primary_cat":"cs.AI","submitted_at":"2026-05-06T20:13:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.13727","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails","primary_cat":"cs.AI","submitted_at":"2025-10-15T16:30:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.09686","ref_index":200,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-01-16T17:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[184],Gao et al. 
[39] Qian et al.[108] Reflective Learning Shinn et al.[129],Sun et al.[138], Sun et al.[190] Concept Learning Zhang et al.[188],Gao et al.[40], Guan et al.[44] Agentic System Search Prompt Level Madaan et al.[90], Fernando et al.[38] Yang et al.[169] Module Level Shang et al.[125], Zhang et al.[186] Agent Level Huot et al.[54],Zhuge et al.[200] 7.1 Verbal Reinforcement Search Verbal Reinforcement Search (VRS) leverages the pre-trained reasoning and semantic capabilities of LLMs to explore and optimize solution spaces. Unlike traditional reinforcement learning or training-intensive approaches, VRS operates purely through test-time inference, using iterative feedback loops to refine solutions without requiring additional training."},{"citing_arxiv_id":"2409.12917","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Language Models to Self-Correct via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2024-09-19T17:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}