{"total":14,"items":[{"citing_arxiv_id":"2606.29745","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit","primary_cat":"cs.MA","submitted_at":"2026-06-29T03:42:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23640","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Process Rewards via Success Visitation Matching for Efficient RL","primary_cat":"cs.LG","submitted_at":"2026-06-22T17:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21399","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Calibration Is Not Control: Why LLM-Agent Oversight Needs Intervention","primary_cat":"cs.AI","submitted_at":"2026-06-19T13:08:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Action-conditioned estimation of intervention advantage via prefix branching reduces control regret over calibrated scalar risk scores in LLM agent oversight across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21262","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents","primary_cat":"cs.AI","submitted_at":"2026-06-19T09:38:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARCO introduces a co-evolving rubric model with generation and scoring heads plus a trajectory decomposition constraint that improves exact-match scores on multi-hop QA tasks over outcome, rubric, and process reward baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07367","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-evolving LLM agents with in-distribution Optimization","primary_cat":"cs.LG","submitted_at":"2026-06-05T15:09:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Q-Evolve unifies automatic process-reward labeling via advantage estimation and behavior-proximal policy optimization inside an in-distribution RL loop to enable self-evolving LLM agents on interactive tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07027","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents","primary_cat":"cs.AI","submitted_at":"2026-06-05T08:17:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StainFlow proposes global entity stain tracking and local stain evidence linking modules to improve process rewards for GUI agents, reporting 3.2% relative gain in online RL success and 1.8% in judgment accuracy on AndroidWorld and OGRBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11235","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:50:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[22] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456, 2025. [23] Sanjiban Choudhury. Process reward models for llm agents: Practical framework and directions.arXiv preprint arXiv:2502.10325, 2025. [24] Ruiqi Zhang, Daman Arora, Song Mei, and Andrea Zanette. Speed-rl: Faster training of reasoning models via online curriculum learning.arXiv preprint arXiv:2506.09016, 2025. [25] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind"},{"citing_arxiv_id":"2605.09934","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:32:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to static source documents, TRACER links multimodal tool-grounded answers to tool turns, evidence units, and semantic support relations in dynamic interaction trajectories. It also converts verified provenance into local credit for tool use, which is not addressed by text-only provenance generation. Process supervision and agent-alignment methods provide denser feedback for intermediate behav- ior [21, 22, 23, 24, 25]. Recent methods score progress, compare steps, propagate delayed feedback, or refine tool-integrated reinforcement learning [26, 27, 28, 29, 30, 31, 32]. These methods supervise actions, progress, or outcomes, but they do not directly model claim-level provenance over concrete multimodal tool observations. TRACER complements process supervision by making the evidential"},{"citing_arxiv_id":"2605.06200","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping","primary_cat":"cs.CL","submitted_at":"2026-05-07T13:09:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"03300. [25] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=A6Y7AqlzLW. [26] Sanjiban Choudhury. Process reward models for llm agents: Practical framework and directions.arXiv preprint arXiv:2502.10325, 2025. [27] Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning.arXiv preprint arXiv:2405.00451, 2024."},{"citing_arxiv_id":"2604.07851","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning","primary_cat":"cs.IR","submitted_at":"2026-04-09T06:07:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Sample Filtering and Sorting.We apply a diffi- culty threshold τ to filter out \"easy\" samples where dt−1 < τ , as consistent high performance across rollouts suggests minimal learning benefit. The remaining samples are sorted by dt−1 in ascending order to form the new datasetD t: Dt = n\u0010 q(k), dt−1 (k) \u0011om k=1 whereτ≤d t−1 (1) ≤ · · · ≤d t−1 (m). (8) This prioritizes easier samples early in epoch t, fostering stable learning and gradual progression. Iterative Curriculum Update.The sorted Dt is used for training in epoch t, and the process repeats for epoch t+ 1 with updated difficulty dt n based on the previous rollouts. This dynamic process adapts to the model's evolving abilities while staying ef-"},{"citing_arxiv_id":"2509.02547","ref_index":268,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2-3B-Instruct /githubGitHub Search-R1 [264] External Qwen2.5-3B/7B-Base/Instruct /githubGitHub R1-Searcher [265] External Qwen2.5-7B, Llama3.1-8B-Instruct /githubGitHub R1-Searcher++ [266] External Qwen2.5-7B-Instruct /githubGitHub ReSearch [108] External Qwen2.5-7B/32B-Instruct /githubGitHub StepSearch [267] External Qwen2.5-3B/7B-Base/Instruct /githubGitHub DeepResearcher [268] External Qwen2.5-7B-Instruct /githubGitHub WebDancer [106] External Qwen2.5-7B/32B, QWQ-32B /githubGitHub WebThinker [269] External QwQ-32B, DeepSeek-R1-Distilled-Qwen, Qwen2.5-32B/githubGitHub WebSailor [105] External Qwen2.5-3B/7B/32B/72B /githubGitHub WebWatcher [270] External Qwen2.5-VL-7B/32B /githubGitHub WebShaper [271] External Qwen-2.5-32B/72B, QwQ-32B /githubGitHub"},{"citing_arxiv_id":"2507.21046","ref_index":235,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.14200","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement","primary_cat":"cs.CL","submitted_at":"2025-07-14T16:17:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SMCS coordinates 15 open-source LLMs via retrieval-based prior selection and exploration-exploitation posterior enhancement, outperforming GPT-4.1 by 5.36% and GPT-o3-mini by 5.28% on eight benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":137,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":",PPO [648], GRPO [658], REINFORCE++ [265], OREO [773], DAPO [476], LIMR [420] DAPO[476], LIMR [420], TRPO [648], DVPO [277], RPO [699],PRIME [143], DivPO [369],COS(M+O)S[542],CPL [801], Focused-DPO [1043],RFTT [1046],OREO [773], DeepSeekMath [658],TPO [942],etc. Reward Strategiese.g.,DeepSeek-R1 [227],Kimi-k1.5 [722], T1 [264],ReST-EM [674],SWE-RL [841], DeepScaleR [518],ReST-MCTS* [1032], rSTaR-Math [222], Logic-RL [886], OREAL [522], StepCoder [161], RLSP [962],Verifier [141], TS-LLM [755], STeCa [768], OREO[773], Chu et al. [137], Shen et al. [661],etc. External Exploration(§6.3) Human-drivenExploration e.g.,SPaR [118], Forest-of-thought [54],Scattered ForestSearch [448],Kang et al. [339],AlphaLLM [737],PATHFINDER [213],Least-to-Most [1117], ToT [955], TreeBoN [625], CodeTree [400], Tree-of-Code[565] TouT [556], GoT [48], GraphReason [64], Besta et al. [49], AoT [733], Chen et al."}],"limit":50,"offset":0}