{"total":14,"items":[{"citing_arxiv_id":"2606.22844","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RaMem: Contextual Reinstatement for Long-term Agentic Memory","primary_cat":"cs.AI","submitted_at":"2026-06-22T04:41:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RaMem improves LLM agent memory by grounding fragments in original conditions like time and participants, then using validity-aware retrieval, yielding >10% average F1 gains over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22385","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MetaPS: Adaptive Programmatic Strategy Selection for Market Agents","primary_cat":"cs.AI","submitted_at":"2026-06-21T08:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12191","ref_index":300,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application","primary_cat":"cs.CL","submitted_at":"2026-06-10T15:15:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":",RAGEN [287], ZeroSearch [288], EvolveSearch [289], GiGPO [290], ARPO [291],MobileGUI-RL [292], SPEAR [293], VAGEN [294], SeeUPO [295],etc. EnvironmentEvolution (§7) Neural-DrivenEvolution (§7.1) Self-Play (§7.1.1) e.g.,Absolute Zero [296], Self-Challenging [297], Active Zero [298], Vision-zero [299],etc. World Model (§7.1.2)e.g.,WebDreamer [192], UI-Simulator [300], Code2World [190], WebWorld [301] ,etc. Difficulty-DrivenEvolution (§7.2) Explicit CurriculumSignals (§7.2.1) e.g.,POET [302], AgentGen [303], Environment Tuning [304], SEC [305], Reasoning Core [306],DreamGym [307] ,etc. Implicit CurriculumMechanisms (§7.2.2)e.g.,PAIRED [308], DCD [309], ACCEL [310], MAESTRO [311], ReMiDi [312], EnvGen [163],DataEnvGym [313], Eurekaverse [314], RLVE [315], CuES [316], GenEnv [317], SCALER [318] ,etc."},{"citing_arxiv_id":"2606.09138","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-08T07:35:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Claw-R1 provides a Gateway Server and Data Pool to manage step-level agent interaction traces as structured data assets for agentic RL training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00472","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space","primary_cat":"cs.CV","submitted_at":"2026-05-30T01:37:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CodeCytos is a code-augmented reasoning agent framework for dynamic, programmable exploration of custom spatial cellular features in molecular imaging data across four tissue types.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27960","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-27T04:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mags-RL uses agentic RL and a super-resolution agent for two-round reasoning in MLLMs, claiming gains on VSR, TallyQA, and GQA with a curriculum needing only 40 samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14133","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T21:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07339","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tools as Continuous Flow for Evolving Agentic Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-08T06:44:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-level benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, et al. Routine: A structural planning framework for llm agent system in enterprise.arXiv preprint arXiv:2507.14447, 2025. [15] Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. Agile: A novel reinforcement learning framework of llm agents.Advances in Neural Information Processing Systems, 37:5244-5284, 2024. [16] Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, and Enhong Chen. Agent-r1: Training powerful llm agents with end-to-end reinforcement learning.arXiv preprint arXiv:2511.14460, 2025. [17] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and"},{"citing_arxiv_id":"2604.18401","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-20T15:22:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"extension","top_context_polarity":"extend","context_text":"LightningRL [17] Step-level Trajectory-level StepPO Step-level Step-level Once optimization targets this regime, the problem is no longer only how to score a final answer, but also how to model the decision process that produces sustained behavior across many interaction rounds. This is why Agentic Reinforcement Learning (RL) is becoming a central post-training paradigm for LLM agents [7, 34, 38]. Earlier recipes such as Reinforcement Learning from Human Feedback (RLHF) [20] and Reinforcement Learning with Verifiable Rewards (RLVR) [26] were largely developed around single responses or short reasoning traces. By contrast, Agentic RL targets multi-turn interaction with tools and environments, where the training objective must directly shape decision making, action selection, and adaptation under"},{"citing_arxiv_id":"2604.10674","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents","primary_cat":"cs.LG","submitted_at":"2026-04-12T14:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"We do not store full trajectories as skills. Multi-turn tasks admit many valid action sequences, and a single canonical trace would overconstrain exploration. Given task x, the student acts on the plain prompt, while the teacher sees the prompt augmented with retrieved skills: πstu θ (· |x,y <t) =π θ(· |x,y <t), (6) πtea ¯θ (· |x,S(x),y <t) =π ¯θ(· |x⊕S(x),y <t). (7) Here S(x) denotes the retrieved skills and ¯θ is the teacher parameter state. In thedynamic setting, ¯θ is synchronized from the latest student checkpoint at each iteration; in thefrozen setting, ¯θis fixed throughout training. Skill retrieval is lightweight. For each task, we select the single highest-scoring skill using a UCB criterion: score(e) = ¯r(e) +c"},{"citing_arxiv_id":"2604.07927","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools","primary_cat":"cs.AI","submitted_at":"2026-04-09T07:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06734","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving","primary_cat":"cs.CL","submitted_at":"2026-04-08T06:57:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26).ACM, New York, NY, USA, 9 pages. https://doi.org/XXXXXXX.XXXXXXX 1 Introduction a Trial-and-error is a fundamental mechanism of natural selection: organisms that relentlessly try and learn from errors may survive; those that do not become extinct [ 7, 27]. This iterative process demands two dimensions of intelligence. Thetrialaspect requires problem understanding, strategy selection, and effective tool use to explore candidate solutions [23, 30]. Theerroraspect requires self- evaluation and reflection, recognizingwhya trial failed and deciding whatto change next [ 18, 19]. This principle is equally important"},{"citing_arxiv_id":"2603.24935","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-03-26T01:56:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.00520","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toward a Safe Internet of Agents","primary_cat":"cs.MA","submitted_at":"2025-11-29T15:31:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper proposes a bottom-up framework for safe agentic AI systems that treats each component as a dual-use interface where added capabilities also expand attack surfaces across single agents, multi-agent systems, and interoperable ecosystems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}