{"total":18,"items":[{"citing_arxiv_id":"2606.12191","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application","primary_cat":"cs.CL","submitted_at":"2026-06-10T15:15:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24426","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SEAL: Synergistic Co-Evolution of Agents and Learning Environments","primary_cat":"cs.CL","submitted_at":"2026-05-23T06:41:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEAL co-evolves LLM agents and environments via shared turn-level failure diagnoses, yielding +8.25 to +26.25 point gains on tool-use tasks with only 400 samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14133","ref_index":100,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T21:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 
45.3% strict accuracy and that state inspection drives most performance gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13037","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T05:46:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12004","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Agentic Policy from Action Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-12T11:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"To evaluate the effectiveness of our proposed ACTGUIDE-RL in LLM agentic RL, we conduct experiments in the search-agent setting, which is stateless and facilitates the collection of action data. Our evaluation covers two categories of benchmarks. 
The first category is in-domain search-agent benchmarks, including four representative datasets, GAIA [39], WebWalkerQA [63], XBench [5], and BrowseComp-ZH (BC-ZH) [83], which span diverse difficulty levels, multiple languages, and real-world multi-step reasoning scenarios. The second category is out-of-domain benchmarks, including GPQA [47], TruthfulQA [34], and IFEval [82], which are used to evaluate the out-of-domain generalization ability of models beyond the search-agent setting."},{"citing_arxiv_id":"2605.10698","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions","primary_cat":"cs.MA","submitted_at":"2026-05-11T15:13:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent LLM interactions induce cognitive loafing via a formalized Interaction Depth Limit and Sovereignty Gap, where models subjugate correct derivations to social compliance, with lead agent identity disproportionately affecting outcomes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09879","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T02:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This motivates our parameter-level approach, which integrates mathematical reasoning updates 
through behavior-preserving model merging rather than joint training. 2.2 Model Merging Model merging aims to combine multiple fine-tuned models into a single model without additional training, reducing storage and computational costs [ 5, 21, 26, 43]. Early methods such as Model Soups [39] and Task Arithmetic [13] showed that weight averaging or task-vector composition can be effective when source models are sufficiently aligned. Later approaches, TIES-Merging [40] alleviates parameter interference via pruning, symbol election, and merging, DARE [ 48] enhances merged model robustness and base capabilities by randomly dropping and rescaling fine-tuning weights."},{"citing_arxiv_id":"2605.06130","ref_index":53,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:33:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02801","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces","primary_cat":"cs.CL","submitted_at":"2026-05-04T16:42:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit 
stopping-decision training in its paper pool, and releases a tagged corpus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"planner/workforce optimization [22], and zero-supervision MAS design [26]. A May 2026 refresh added actor-critic decentralized collaboration [38], width-scaling search teams [68], communication/topology learning [23], language-space credit assignment [71], multi-agent self-search for code [61], GUI role orchestration [62], attacker-defender safety training [65], and self-play / hierarchical interaction entries from OpenReview submissions and proceedings [34, 75, 21, 1]. These are not isolated tricks; they collectively formalize LLM collaboration as cooperative MARL with new credit- and signal-bearing units. Figure 2 visualizes the corpus across an 18-month window. Existing surveys cover pairwise intersections but not the triple."},{"citing_arxiv_id":"2605.00200","ref_index":34,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Confidence Estimation in Automatic Short Answer Grading with LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-30T20:26:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27351","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Heterogeneous Scientific Foundation Model Collaboration","primary_cat":"cs.AI","submitted_at":"2026-04-30T03:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Eywa enables language-based 
agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"08774. URLhttps://doi.org/10.48550/arXiv.2303.08774. [2] Gemma Team. Gemma 3 technical report.CoRR, abs/2503.19786, 2025. doi: 10.48550/ARXIV.2503. 19786. URLhttps://doi.org/10.48550/arXiv.2503.19786. [3] Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407. 21783. URLhttps://doi.org/10.48550/arXiv.2407.21783. [4] Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, and Jingrui He. Agentic reasoning for large"},{"citing_arxiv_id":"2604.26615","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TDD Governance for Multi-Agent Code Generation via Prompt Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-29T12:43:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21027","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question 
Answering","primary_cat":"cs.AI","submitted_at":"2026-04-22T19:18:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17821","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent","primary_cat":"cs.AI","submitted_at":"2026-04-20T05:19:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic uncertainty for better decisions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16646","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Frameworks for Reasoning Tasks: An Empirical Study","primary_cat":"cs.AI","submitted_at":"2026-04-17T19:02:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"autonomous or semi-autonomous agents. 
The growing adoption of these frameworks in both academia and industry demonstrates their increasing importance in agent-based system development [5]. Reasoning is a core capability of intelligent agents, enabling them to perform logical inference, solve problems, and make decisions in dynamic and interactive environments [6]. As a result, agentic frameworks have been explored for applications that require advanced reasoning capabilities [5, 7]. However, despite their widespread use, there is still a lack of comprehensive empirical studies that systematically evaluate and compare agentic frameworks in terms of reasoning performance, efficiency, and practical effectiveness in software engineering contexts [4, 6]."},{"citing_arxiv_id":"2604.05719","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing","primary_cat":"cs.CR","submitted_at":"2026-04-07T11:19:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ally designs a dedicated evaluator agent to process this information and offer modification recommendations to the planner. 3.1.3 Multi-Agent Collaboration After designing the agent roles in the AutoPT frameworks based on functionality, these roles must collaborate to achieve the automated penetration goals. Unlike the Agent plan that formulates high-level attack plans and pending tasks, and referring to prior studies [114, 51], we define multi-agent collaboration as the interaction patterns and execution sequences among agent roles. 
The core lies in organizing and coordinating these agents to jointly complete complex tasks that are difficult for a single agent to handle through division of labor and cooperation. Based on the execution paths among multiple agents, this paper"},{"citing_arxiv_id":"2604.03512","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ActionNex: A Virtual Outage Manager for Cloud Computing","primary_cat":"cs.AI","submitted_at":"2026-04-03T23:19:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20858","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mixture of Sequence: Theme-Aware Mixture-of-Experts for Long-Sequence Recommendation","primary_cat":"cs.IR","submitted_at":"2026-03-01T23:20:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoS applies theme-aware routing to extract multi-scale theme-specific subsequences from noisy long user sequences, achieving state-of-the-art recommendation performance with fewer FLOPs than comparable MoE models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[115] Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 1259-1273. [116] Chengfeng Xu, Pengpeng Zhao, Yanchi Liu, et al. 2019. 
Graph contextualized self-attention network for session-based recommendation. In IJCAI, Vol. 19. 3940-3946. [117] Chengfeng Xu, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Fuzhen Zhuang, Junhua Fang, and Xiaofang Zhou. 2019. Graph contextualized self-attention network for session-based recommendation. In IJCAI, Vol. 19. 3940-3946. [118] Haobo Xu, Yuchen Yan, Dingsu Wang, Zhe Xu, Zhichen Zeng, Tarek F Abdelzaher, Jiawei Han, and Hanghang Tong. 2024."}],"limit":50,"offset":0}