{"total":30,"items":[{"citing_arxiv_id":"2606.29537","ref_index":93,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks","primary_cat":"cs.AI","submitted_at":"2026-06-28T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20629","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams","primary_cat":"cs.MA","submitted_at":"2026-05-28T17:12:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26321","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Anchor: Mitigating Artifact Drift in Agent Benchmark Generation","primary_cat":"cs.AI","submitted_at":"2026-05-25T20:44:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anchor generates consistent long-horizon agent tasks from parametric constraint programs, yielding ERP-Bench of 300 ERP tasks where frontier models reach optimal solutions in 17.4% of 
trials.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20506","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcing Human Behavior Simulation via Verbal Feedback","primary_cat":"cs.LG","submitted_at":"2026-05-19T21:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16679","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?","primary_cat":"cs.CL","submitted_at":"2026-05-15T22:34:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15777","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional 
Workflows?","primary_cat":"cs.AI","submitted_at":"2026-05-15T09:35:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14322","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-14T03:34:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic judgment but inadequate for professional teaching standards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12015","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces","primary_cat":"cs.CR","submitted_at":"2026-05-12T12:03:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10912","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-11T17:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic 
long-horizon CLI tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"trol (WebArena [59], WebShop [48], VisualWebArena [20]), OS and mobile control (OSWorld [45], Windows Agent Arena [5], AndroidWorld [33]), enterprise knowledge work (WorkArena [11], OdysseyBench [38]), interactive coding (AppWorld [37]), browsing-centric research (BrowseComp [40]), and tool orchestration (ToolBench [31], τ-bench [50]). Broader suites such as GAIA [25] and TheAgentCompany [46] widen task coverage, but most prior benchmarks remain restricted along one or more of the axes summarized in Tab. 1. SWE-bench [17] and Terminal-Bench [24] are fully reproducible with executable checks but text-only and"},{"citing_arxiv_id":"2605.10500","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SkillEvolver: Skill Learning as a Meta-Skill","primary_cat":"cs.AI","submitted_at":"2026-05-11T12:58:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10448","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Can Agent Benchmarks Support Their Scores? 
Evidence-Supported Bounds for Interactive-Agent Evaluation","primary_cat":"cs.AI","submitted_at":"2026-05-11T12:20:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"WOB [10], WebShop [28], Mind2Web [3], VisualWebArena [9], WebArena [31], and BrowserGym [1]. Tool-use and stateful-interaction benchmarks include ToolBench-style tool-use benchmarks [19], τ-bench [29], τ 3-bench retail [22], and ToolSandbox [11]. Broader agent environments include AGENTDOJO [2], ANDROIDWORLD [20], WORKARENA [4], OSWORLD-VERIFIED [25], and long-horizon workplace settings such as TheAgentCompany [26]. These benchmarks make interactive evaluation realistic, but realism alone does not guarantee that a stored trace decides the outcome claim being reported. 
Verified evaluators and benchmark validity. A recent line of work strengthens outcome checks through state-based or verified evaluation, including the verified WEBARENA-VERIFIED release [7], APPWORLD database-state unit tests [23], and SWE-bench Verified [16]."},{"citing_arxiv_id":"2605.08828","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-09T09:32:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that interleave reasoning and acting [20], while API-Bank and ToolLLM evaluate whether models can select and invoke external APIs in multi-turn tasks [9, 14]. AgentBench extends evaluation across multiple environments [10], OSWorld tests multimodal agents in real computer environments [18], τ-bench evaluates agent-user interaction in real-world domains [21], OpenHands studies software-development agents [17], and TheAgentCompany evaluates agents on consequential workplace tasks [19]. These benchmarks show that agent evaluation must move beyond single-turn answers. EnvTrustBench targets task-correct grounding under misleading observations. 
Agent Security Benchmarks. Agent security benchmarks evaluate several distinct risk classes."},{"citing_arxiv_id":"2605.08761","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows","primary_cat":"cs.MA","submitted_at":"2026-05-09T07:47:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Single-Agent Enterprise Benchmarks. In recent years, agent evaluation benchmarks targeting enterprise scenarios have advanced rapidly. AgentBench [3] was the first to systematically evaluate LLMs as agents across diverse environments. WorkArena [4] and WorkArena++ [5] built web-based task suites for knowledge workers on the ServiceNow platform. TheAgentCompany [6] simulated a corporate environment equipped with tools such as GitLab and RocketChat to assess agents on realistic enterprise tasks. EntWorld [11] further scaled to 1,756 GUI tasks spanning six enterprise domains including CRM and ITSM. 
In addition, Finch [12] and CRMArena [13] constructed domain-specific benchmarks for finance & accounting and CRM, respectively."},{"citing_arxiv_id":"2605.08334","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators","primary_cat":"cs.CL","submitted_at":"2026-05-08T17:59:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"org/10.1609/aaai.v39i24.34806. [42] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents. ArXiv, abs/2310.11667, 2023. URL https://api.semanticscholar.org/CorpusID:264289186. [43] Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, and Xian Li. Sweet-rl: Training multi-turn llm agents on collaborative reasoning tasks, 2025. URL https://arxiv.org/abs/2503.15478. 
A Dataset Construction Details We provide an overview of the data sources and enrichment procedures used to develop our SALESSIM"},{"citing_arxiv_id":"2605.02661","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AcademiClaw: When Students Set Challenges for AI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-04T14:40:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02572","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length","primary_cat":"cs.AI","submitted_at":"2026-05-04T13:25:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27955","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GUI Agents with Reinforcement Learning: Toward Digital Inhabitants","primary_cat":"cs.AI","submitted_at":"2026-04-30T14:51:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future 
roadmap.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23781","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23455","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend","primary_cat":"cs.SE","submitted_at":"2026-04-25T22:10:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19657","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"An AI Agent Execution Environment to Safeguard User Data","primary_cat":"cs.CR","submitted_at":"2026-04-21T16:45:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, 
without trusting the agent or requiring attack-free models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and Graham Neubig. 2024. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. arXiv:2412.14161 [cs.CL] https://arxiv.org/abs/2412.14161 [79] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629 (2023). [80] Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2023. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv (2023). arXiv:2312.14197 [81] Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, and David Mazières. 2006. Making Information Flow Explicit in HiStar."},{"citing_arxiv_id":"2604.15597","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLMs Corrupt Your Documents When You Delegate","primary_cat":"cs.CL","submitted_at":"2026-04-17T00:33:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12162","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AlphaEval: Evaluating Agents in Production","primary_cat":"cs.CL","submitted_at":"2026-04-14T00:43:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple 
judgment methods, plus a framework to build similar benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10015","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks","primary_cat":"cs.AI","submitted_at":"2026-04-11T03:58:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09408","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?","primary_cat":"cs.AI","submitted_at":"2026-04-10T15:21:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13564","ref_index":298,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memory in the Age of AI Agents","primary_cat":"cs.CL","submitted_at":"2025-12-15T17:22:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper maps agent memory research via three forms (token-level, parametric, latent), three functions 
(factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.05307","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks","primary_cat":"cs.HC","submitted_at":"2025-10-06T19:18:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versus confirm-at-end.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21046","ref_index":120,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.22598","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RExBench: Can coding agents autonomously implement AI research 
extensions?","primary_cat":"cs.CL","submitted_at":"2025-06-27T19:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13585","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention","primary_cat":"cs.CL","submitted_at":"2025-06-16T15:08:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21460","ref_index":151,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model Agent: A Survey on Methodology, Applications and Challenges","primary_cat":"cs.CL","submitted_at":"2025-03-27T12:50:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As agency systems evolve toward organizational complexity, evaluation frameworks must quantify emergent coordination patterns and collective intelligence. 
Recent approaches shift evaluation from isolated agent proficiency to system-level cognitive collaboration, revealing scalability challenges in multi-agent workflows. Multi-Agent System Benchmarking. TheAgentCompany [151] pioneered enterprise-level assessments using simulated software company environments to test web interaction and code collaboration capabilities. Comparative analysis like AutoGen and CrewAI [152] establishes methodological standards through ML code generation challenges. Large Visual Language Model Survey [153] systematizes over 200 multimodal benchmarks."}],"limit":50,"offset":0}