{"total":144,"items":[{"citing_arxiv_id":"2606.10106","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What makes a harness a harness: necessary and sufficient conditions for an agent harness","primary_cat":"cs.SE","submitted_at":"2026-06-08T19:35:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05525","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization","primary_cat":"cs.AI","submitted_at":"2026-06-04T00:14:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SciVisAgentSkills provides reusable agent skills that raise mean task scores on a 108-task SciVis benchmark when paired with Codex and Claude Code agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31408","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study","primary_cat":"cs.CL","submitted_at":"2026-05-29T15:12:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"In a 30-task SkillsBench study, skill availability boosts GPT-5.5 and DeepSeek V4-Flash agent pass rates substantially while presentation-granularity variations yield small uncertain 
effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23574","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23414","ref_index":82,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-22T09:24:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces EPC-AW to mitigate epistemic miscalibration in LLM multi-agent planning via consistency-based selection and refinement, reporting 9.75% average success improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23311","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DART: Semantic Recoverability for Structured Tool Agents","primary_cat":"cs.AI","submitted_at":"2026-05-22T07:30:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DART is a modular runtime that certifies semantically recoverable boundaries for failed tool-agent instances and selects admissible restore points that preserve downstream 
commitments or blocks recovery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23262","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Design and Report Benchmarks for Knowledge Work","primary_cat":"cs.AI","submitted_at":"2026-05-22T06:03:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":50,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22564","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:45:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent 
testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22238","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:41:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemini-3.1-pro-preview won 20 of 32 Risk games through superior objective tracking and execution conversion, while a hybrid test with fixed execution showed near-equal planner performance across providers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22855","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations","primary_cat":"cs.GT","submitted_at":"2026-05-19T04:10:05+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PrefBench benchmark shows zero-shot LLMs achieve deal rates above 0.99 but seller profits only slightly above random and far below a simple concession heuristic across 7,500 episodes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and 
multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"RWML [129] Trace-based Self-supervised RL Aligns simulated next states with realized environment states AWM [130] Trace-based World-modeling Aligns multiple executable world models across tasks WorldMind [131] Trace-based Model fusion Coordinates executable world models from knowledge sources SWE-bench [5] Evaluation Repo-level testing Uses unit tests as objective world states AgentBench [12] Evaluation Multi-env interaction Benchmarks across OS, databases, and games CRUXEval [132] Evaluation Execution tasks Benchmarks functional input and output prediction End Terms. [39] Evaluation Procedural RL envs Automates generation of terminal-use evaluation tasks InterCode [11] Evaluation Interactive execution Frames coding tasks as actions with sandbox feedback"},{"citing_arxiv_id":"2605.18630","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science","primary_cat":"cs.AI","submitted_at":"2026-05-18T16:34:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18597","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Action Reparameterization for Efficient Agent 
Inference","primary_cat":"cs.AI","submitted_at":"2026-05-18T16:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18535","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Scaling: Agents Are Heading to the Edge","primary_cat":"cs.LG","submitted_at":"2026-05-18T15:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17548","ref_index":231,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review","primary_cat":"cs.SE","submitted_at":"2026-05-17T17:04:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Evaluations should quantify task success rate, human agreement, and convergence efficiency while focusing on the code review ultimately. Also evaluations should balance cost and accuracy. Kapoor et al. [ 230] caution that optimizing only for accuracy creates needlessly expensive systems. Prioritizing cost-performance trade-offs alongside dynamic frameworks like AgentBench by Liu et al. 
[231] ensures workflows remain economically viable. 5.1.5 Transparency of The System The inherent architectural opacity of LLMs acts as a fundamental constraint on reviewer trust, necessitating the explicit internal reasoning and tool execution traces within our framework. As defined by Lipton in [232], deep neural networks lack internal decomposability, meaning human PR reviewers cannot manually audit the inferential steps"},{"citing_arxiv_id":"2605.16895","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence","primary_cat":"cs.CE","submitted_at":"2026-05-16T09:14:35+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reported alpha from end-to-end LLM trading agents does not constitute deployment evidence until it passes structural tests for temporal integrity, frictions, robustness, calibration, execution, and disaggregation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16508","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Scaling Laws of Skills in LLM Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-15T18:05:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15766","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BioXArena: 
Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks","primary_cat":"cs.CE","submitted_at":"2026-05-15T09:24:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We release the task capsules, benchmark framework, evaluation protocol, leaderboard, and full agent traces to support reproducible progress on biomedical ML coding agents. References [1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022. [2] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023. [3] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. 
In Forty-first International Conference on"},{"citing_arxiv_id":"2605.15425","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Runtime-Structured Task Decomposition for Agentic Coding Systems","primary_cat":"cs.SE","submitted_at":"2026-05-14T21:16:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14678","ref_index":21,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-14T10:47:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14460","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Exploiting LLM Agent Supply Chains via Payload-less Skills","primary_cat":"cs.CR","submitted_at":"2026-05-14T06:55:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evading all tested 
scanners.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14133","ref_index":18,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T21:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14102","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation","primary_cat":"cs.AI","submitted_at":"2026-05-13T20:40:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ChromaFlow reports a negative ablation in which expanded orchestration on GAIA Level-1 tasks reduced accuracy and increased tracebacks, timeouts, and token costs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18856","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context 
Inference","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:48:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15230","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data","primary_cat":"econ.EM","submitted_at":"2026-05-13T18:03:51+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13391","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T11:49:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12925","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentLens: Revealing The Lucky Pass Problem in SWE-Agent 
Evaluation","primary_cat":"cs.SE","submitted_at":"2026-05-13T03:00:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22842","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems","primary_cat":"cs.CR","submitted_at":"2026-05-12T20:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and proposes a new testing method plus a defense.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12673","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack","primary_cat":"cs.AI","submitted_at":"2026-05-12T19:22:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"benchmark scoring mechanisms themselves are exploitable under optimization pressure. Preventing reward hacking. Zhu et al. 
[63] introduces the Agentic Benchmark Checklist, requiring task validity and outcome validity and finding performance overestimates of up to 100% through manual inspection. Other efforts include monitoring pipelines that mitigate reward hacking during the training process [32, 6, 4, 54, 27, 21, 52]. However, a growing body of work further suggests that monitoring-based defenses are insufficient in isolation due to phenomena like unfaithful reasoning traces [12, 6, 30, 56, 21]. Stein et al. [47] show that failures often only become detectable when analyzing collections of traces rather than individual trajectories. TRACE [17] finds that reward-hack"},{"citing_arxiv_id":"2605.12015","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces","primary_cat":"cs.CR","submitted_at":"2026-05-12T12:03:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11920","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Domain Restriction via Multi SAE Layer Transitions","primary_cat":"cs.AI","submitted_at":"2026-05-12T10:36:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11882","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On-Policy Self-Evolution via Failure Trajectories for Agentic Safety 
Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026. [22] Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. Agentdog: A diagnostic guardrail framework for ai agent safety and security. arXiv preprint arXiv:2601.18491, 2026. [23] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023. [24] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. 
In 33rd USENIX Security Symposium"},{"citing_arxiv_id":"2605.18805","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents","primary_cat":"cs.IR","submitted_at":"2026-05-11T18:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10912","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-11T17:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"memory and state, and adapt to intermediate results across coding assistance, scientific research workflows, and everyday computer use tasks [14, 15, 23, 35, 47, 49, 56]. As capabilities and deployment scale grow, evaluation must assess not only final task success but also whether it was reached through reliable, auditable, and safe interaction with the underlying runtime. Recent agent benchmarks [22, 25, 46] cover real deployment conditions unevenly along four recurring axes (Fig. 
1 (a)): (1) synthetic sandboxes rather than open-world runtimes [37, 45, 48, 59], (2) short-horizon tasks that finish in under a minute, (3) a handful of mock-service API calls in place of compound real-tool use, and (4) final-answer checks [21, 31, 39] without trajectory- and artifact-level auditing [2]."},{"citing_arxiv_id":"2605.09942","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:41:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09544","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-10T13:56:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09423","ref_index":48,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent 
Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Closed-loop learning of video world model and vla policy. arXiv preprint arXiv:2602.06508, 2026. URL https://arxiv.org/abs/2602.06508. [47] Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu, Ka-Hei Hui, Haoran Xie, Bo Dai, and Zhengzhe Liu. Imagine a city: Citygenagent for procedural 3d city generation, 2026. URL https://arxiv.org/abs/2602.05362. [48] Calvin Luo, Zilai Zeng, Mingxi Jia, Yilun Du, and Chen Sun. Self-adapting improvement loops for robotic learning. arXiv preprint arXiv:2506.06658, 2025. URL https://arxiv.org/abs/2506.06658. [49] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 
Eureka: Human-level reward"},{"citing_arxiv_id":"2605.08904","ref_index":136,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10990","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries","primary_cat":"cs.SE","submitted_at":"2026-05-09T11:41:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This changes the monitoring objective. Probing every observed value creates false alarms, while probing only declared manifests misses assumptions expressed in prose. SKILLGUARD addresses this gap by validating role-bearing environment contracts rather than all observed values. Agent and software-maintenance benchmarks. Agent benchmarks such as AgentBench [17], SWE-bench [14], and SWE-Skills-Bench [12] evaluate task-solving, software engineering, and skill quality. 
These benchmarks are valuable, but they do not isolate the maintenance failure studied here: a skill that was once valid can degrade when the external environment changes. DRIFTBENCH complements them by pairing previously valid skills with generated drifts, LLM-free real-world drifts,"},{"citing_arxiv_id":"2605.08761","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows","primary_cat":"cs.MA","submitted_at":"2026-05-09T07:47:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ering both operational workflow execution and policy-based approval decisions with objective verification. We evaluate representative LLM agents on ENTCOLLABBENCH and identify key bottlenecks in delegation, parameter grounding, workflow closure, decision commitment, and coordination cost. 2 Related Works Single-Agent Enterprise Benchmarks. In recent years, agent evaluation benchmarks targeting enterprise scenarios have advanced rapidly. AgentBench [3] was the first to systematically evaluate LLMs as agents across diverse environments. WorkArena [4] and WorkArena++ [5] built web-based task suites for knowledge workers on the ServiceNow platform. TheAgentCompany [6] simulated a corporate environment equipped with tools such as GitLab and RocketChat to assess agents on realistic enterprise tasks. 
EntWorld [11]"},{"citing_arxiv_id":"2605.08647","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators","primary_cat":"cs.CL","submitted_at":"2026-05-09T03:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"How language models use long contexts. Transactions of the Association for Computational Linguistics, 2024. doi: 10.1162/tacl_a_00638. [24] X. Liu, H. Yu, et al. Agentbench: Evaluating llms as agents. In International Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2308.03688. [25] LMSYS Org. Chatbot arena leaderboard, 2026. Accessed May 2026. [26] J. Luo and Y. Shao. Cayley graph optimization for scalable multi-agent communication topologies, 2026. [27] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50-60, 1947. [28] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. 
Scialom."},{"citing_arxiv_id":"2605.08621","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair","primary_cat":"cs.SE","submitted_at":"2026-05-09T02:29:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLM-based AI agents are what's next. https://research.ibm.com/blog/what-are-ai-agents-llm. Accessed: 2024-09-13. [24] Kitware. 2025. CMake: Cross-Platform Make. https://cmake.org [25] Naveen Krishnan. 2025. Advancing Multi-Agent Systems Through Model Context Protocol: Architecture, Implementation, and Applications. https://arxiv.org/html/2504.21030v1. Accessed: 2025-09-01. [26] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688 (2023). 
https://arxiv.org/abs/2308.03688 [27] Minghua Ma, Yinfang Chen, Huaibing Xie, Xuchao Zhang, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Hao Fan, Ming Wen,"},{"citing_arxiv_id":"2605.07926","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-08T15:59:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp performance drops with increasing depth.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Agent Benchmarks with Domain-Specific Tasks. A large body of agent benchmarks focuses on tasks drawn from well-defined, real-world domains where models carry substantial prior knowledge. Representative examples include software engineering [6], travel planning [15], retail and telecom services [2], smartphone applications [3], tool-call accuracy [12], code and web navigation [9], real-world applications [4], and long-horizon task completion [10]. While these benchmarks measure important capabilities, their domain-specific nature means that a model can succeed largely through prior knowledge of the task workflow. 
AgentEscapeBench intentionally provides no domain prior: the agent must discover solution paths through evidence-based inference in an unfamiliar environment,"},{"citing_arxiv_id":"2605.07830","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios","primary_cat":"cs.CR","submitted_at":"2026-05-08T14:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a tendency that varies significantly by model lineage [24, 28]. They further exhibit entrenched cognitive biases-such as anchoring, framing effects, and position bias-that remain robust to prompt variations [4, 9], and domain-specific stable preferences that escalate into confirmation bias under contradictory evidence [10]. The agent evaluation literature, such as AgentBench [12], WebArena [35], and AgentBoard [13], demonstrates that binary success metrics are insufficient to capture the complexities of agentic behavior, as fine-grained measurements reveal performance disparities obscured by aggregate scores. 
At the intersection of behavioral bias and security, we bridge these research trajectories by benchmarking attack-selection bias in LLM agents."},{"citing_arxiv_id":"2605.07462","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment","primary_cat":"cs.CL","submitted_at":"2026-05-08T09:10:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07069","ref_index":63,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems","primary_cat":"cs.MA","submitted_at":"2026-05-08T00:30:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"study of moltbook. arXiv preprint arXiv:2602.14299, 2026. [62] Wenkai Li, Lynnette Hui Xian Ng, Andy Liu, and Daniel Fried. Measuring fine-grained negotiation tactics of humans and llms in diplomacy. arXiv preprint arXiv:2512.18292, 2025. [63] Bonnie Litowitz. Individual and shared meanings. Research on Language & Social Interaction, 10(3-4):341-373, 1977. [64] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 
Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023. [65] Peter V Marsden and Noah E Friedkin. Network studies of social influence. Sociological Methods & Research, 22(1):127-151, 1993. [66] Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao."},{"citing_arxiv_id":"2605.06992","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Does Agentic Safety Fail to Generalize Across Tasks?","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:16:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Reinforcement learning in robust markov decision processes. Advances in neural information processing systems, 26, 2013. [66] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025. [67] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023. [68] Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, and Na Zhang. 
Paying less generalization tax: A cross-domain generalization study of rl training for llm"},{"citing_arxiv_id":"2605.08258","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Designing Intelligent Enterprise Agents: A Capability-Aligned Multi-Agent Architecture","primary_cat":"cs.MA","submitted_at":"2026-05-07T21:42:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CEAD architecture for intelligent enterprise agents achieves 70.6% safe success rate on 10,000 tasks by making agent design the primary abstraction rather than governance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Generative Agents combined memory, reflection, and planning to simulate believable social behavior [7]. AutoGen demonstrated multi-agent LLM applications built through conversational interactions among configurable agents, humans, and tools [8]. Benchmarks show both promise and fragility. AgentBench evaluates LLMs as agents across multiple interactive environments [9]. WebArena reports that a best GPT-4-based web agent achieved 14.41% end-to-end success on realistic web tasks versus 78.24% for humans [10]. GAIA emphasizes real-world questions requiring reasoning, multimodality, browsing, and tool use, with human performance far exceeding early GPT-4-with-tools performance [11]. 
SWE-bench evaluates whether models can resolve real GitHub issues by producing patches"},{"citing_arxiv_id":"2605.06457","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems","primary_cat":"cs.AI","submitted_at":"2026-05-07T15:50:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percentage points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}