{"total":14,"items":[{"citing_arxiv_id":"2606.09669","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks","primary_cat":"cs.AI","submitted_at":"2026-06-08T15:51:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15128","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:37:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MemEye benchmark evaluates multimodal memory on visual granularity and evidence synthesis, finding that 13 methods across 4 VLMs struggle with fine details and temporal state changes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13213","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T09:06:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17052","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-18T16:22:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"SIS against two categories of state-of-the-art (SOTA) models. First, we compare against SOTA Offline MLLMs to establish a performance ceiling based on general visual reasoning. This includes top-tier open-source models: LongVA [53], LLaMA-VID [16], LLaVA-Hound [54], LLaVA-Video-7B [55], InternVL2 [5], LLaVA-OneVision-7B [11], VideoLLaMA2 [49], MiniCPM-V 2.6 [44], VILA-1.5 [18], InternLM-XCP2.5 [52], MovieChat [31], FreeVA [38], Qwen2.5-VL [1], Qwen2-VL [34]. As well as powerful closed-source models like Gemini-1.5 Pro [32] and GPT-4o [25]. Second, we also conduct a head-to-head comparison with SOTA Online MLLMs. 
This includes Flash-VStream [50], VideoLLM-online [4], Dispider [27], and"},{"citing_arxiv_id":"2603.01455","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents","primary_cat":"cs.CV","submitted_at":"2026-03-02T05:12:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22683","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses","primary_cat":"cs.CV","submitted_at":"2026-02-26T06:55:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08392","ref_index":119,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which 
MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024. [118] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097-11107, 2020. [119] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024. [120] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023. 3 [121] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng"},{"citing_arxiv_id":"2510.14133","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems","primary_cat":"cs.AI","submitted_at":"2025-10-15T22:02:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces host agent and task lifecycle models plus 30 temporal logic properties to enable formal verification of liveness, safety, completeness, and fairness in agentic AI systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19662","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FieldWorkArena: Agentic AI Benchmark for Real Field Work 
Tasks","primary_cat":"cs.AI","submitted_at":"2025-05-26T08:21:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16120","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Powered AI Agent Systems and Their Applications in Industry","primary_cat":"cs.AI","submitted_at":"2025-05-22T01:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"deployment of LLM-powered agent systems is the lack of standardized benchmarks and evaluation metrics [105]. Although there are numerous studies and metrics designed to evaluate the performance of LLMs themselves, these metrics often fall short when applied to complex agent systems that involve decision-making, multi-modal processing, and human-AI interactions [4]. LLM-based agents go beyond generating coherent text-they perform tasks such as planning, interacting with other systems, and adapting to dynamic environments. 
As a result, evaluating their performance requires a comprehensive approach that considers not only linguistic accuracy but also task success rate, adaptability, context awareness, and human"},{"citing_arxiv_id":"2411.18279","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"as valuable references for a foundational understanding of LLM-driven agents, laying the groundwork for further exploration into LLM-based GUI agents. Xie et al. , [59] provide an extensive overview of multimodal agents, which can process images, videos, and audio in addition to text. This multimodal capability significantly broadens the scope beyond traditional text-based agents [60]. Notably, most GUI agents fall under this category, as they rely on image inputs, such as screenshots, to interpret and interact with graphical interfaces effectively. Multi-agent frameworks are frequently employed in the design of GUI agents to enhance their capabilities and scalability. Surveys by Guo et al., [48] and Han et al. 
, [49] provide comprehensive overviews of the"},{"citing_arxiv_id":"2407.13193","ref_index":184,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval-Augmented Generation for Natural Language Processing: A Survey","primary_cat":"cs.CL","submitted_at":"2024-07-18T06:06:53+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An Open Bilingual Pre-trained Model. In The Eleventh International Conference on Learning Representations (ICLR). OpenReview.net. [183] Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. In Advances in Neural Information Processing Systems 32 (NeurIPS). 12360-12371. [184] Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. 2023. HuatuoGPT, Towards Taming Language Model to Be a Doctor. In Findings of the Association for Computational Linguistics (EMNLP). 
Association for Computational Linguistics, 10859-10885."},{"citing_arxiv_id":"2407.01284","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","primary_cat":"cs.AI","submitted_at":"2024-07-01T13:39:08+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In parallel, the intersection of large language models (LLMs) and Large Multimodal Models (LMMs) has surged, extending the applicability of LMMs evaluations across diverse modalities including 2D images [88, 89, 90], 3D point clouds [91, 92, 93], audio [94, 95, 96, 97], and video [98, 99, 100]. Moreover, a series of works have positioned LMMs as agents with various tools, such as APIs [101, 102, 103], retrievers [104, 105], thereby broadening the development avenues for the model evaluation community [106, 107, 108, 109]. 11 5 Conclusion In this paper, we propose WE-MATH, a comprehensive benchmark for in-depth analysis of LMMs in visual mathematical reasoning. 
WE-MATH encompasses 6.5K visual math problems, covering 5 layers and 67 knowledge concepts."},{"citing_arxiv_id":"2404.14294","ref_index":298,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLMs' output content, thereby creating opportunities for data-level and system-level optimizations such as output organization techniques [52]. Furthermore, these frameworks naturally introduce a new optimization level, i.e., pipeline-level, which holds potential for efficiency enhancements at this level [58]. In addition, there is a growing research trend [298] focused on extending AI agents into the multimodal domain, which often utilize Large Multimodal Models (LMMs) as the core of these agent systems. To enhance the efficiency of these emerging LMM-based agents, designing optimization techniques for LMMs is a promising research direction. Long-Context LLMs. Currently, LLMs face the challenge of handling increasingly longer input contexts."}],"limit":50,"offset":0}