CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Ruoxuan Cao; Yiming Zhang; Zhihang Zhong

arxiv: 2606.10401 · v2 · pith:ADORAVZRnew · submitted 2026-06-09 · 💻 cs.CV

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Yiming Zhang , Ruoxuan Cao , Zhihang Zhong This is my paper

Pith reviewed 2026-06-27 14:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords cognitive mapsmulti-agent frameworkspatial intelligencemultimodal LLMstraining-freespatial reasoningcollaborative constructionplug-and-play

0 comments

The pith

A multi-agent system lets any pretrained MLLM maintain spatial coherence across long visual sequences by building a shared cognitive map through atomic commits and verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a plug-and-play multi-agent framework that constructs grid-based cognitive maps from multi-frame inputs to serve as structured spatial memory. Local and global agents coordinate to add information via atomic commits while cross-agent verification checks consistency, all without altering the underlying model or requiring training. This setup targets the limit of native context windows in off-the-shelf MLLMs, which otherwise lose spatial relations over extended inputs. If the mechanism works as described, existing models gain the ability to reason about object positions, layouts, and movements across longer time spans than direct prompting allows. The approach stays fully model-agnostic and avoids external memory hardware or fine-tuning steps.

Core claim

The central claim is that collaborative construction of a cognitive map, using local-global agent coordination, atomic commits for updates, and cross-agent verification, enables reliable storage and retrieval of spatial information that exceeds the context window of any unmodified pretrained MLLM, producing better results on spatial understanding tasks while remaining training-free.

What carries the argument

The collaborative cognitive map built through local-global agent coordination, atomic commits, and cross-agent verification.

If this is right

The method applies to arbitrary pretrained MLLMs without architectural changes.
It produces superior performance on spatial understanding tasks compared to prior approaches.
Spatial representations remain coherent over extended multi-frame inputs.
No finetuning or specialized memory modules are required.
The framework operates as a lightweight plug-and-play addition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The verification step might reduce spatial hallucinations that single models exhibit on long inputs.
The same coordination pattern could support other memory-intensive tasks such as long-horizon planning from visual streams.
If atomic commits prove stable, the approach might scale to multi-robot teams sharing a common spatial reference.
It suggests a general route for augmenting fixed models with external structured memory instead of expanding context windows.

Load-bearing premise

A multi-agent collaborative process with atomic commits and cross-verification can reliably preserve and retrieve spatial information beyond the native context window of an unmodified pretrained MLLM.

What would settle it

A controlled test on video sequences longer than the model's context length where queries about relative object positions yield the same or lower accuracy than a single-pass baseline that simply truncates the input.

Figures

Figures reproduced from arXiv: 2606.10401 by Ruoxuan Cao, Yiming Zhang, Zhihang Zhong.

**Figure 2.** Figure 2: Overview of our multi-agent framework for video spatial understanding. Our method [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy trends with respect to video length across three VLM backbones. Videos are [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison between single-agent and multi-agent cognitive map construction. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A training-free multi-agent cognitive map setup for MLLMs is a reasonable engineering idea but the abstract supplies no evidence that the map actually preserves spatial relations at scale.

read the letter

The paper's core contribution is a plug-and-play multi-agent system that builds structured cognitive maps from visual sequences using local-global coordination, atomic commits, and cross-agent verification. This is positioned as a lightweight alternative to long-context extensions or added memory modules, and it stays fully training-free for any pretrained MLLM.

What is actually new is the specific engineering combination of atomic commits inside the map construction step plus explicit cross-agent verification to catch inconsistencies. The approach is model-agnostic by design, which is a practical strength if the full paper shows it works on multiple backbones without modification.

The main limitation is that the abstract asserts superior performance on spatial tasks yet gives no numbers, baselines, datasets, or ablations. The central assumption—that the external map plus verification lets the model handle queries whose evidence no longer fits the native context—rests on three unshown conditions: lossless encoding of metric relations, precise sub-graph retrieval, and verification that actually repairs spatial errors rather than just syntax. The stress-test concern holds here because nothing in the provided text demonstrates these conditions are met when the map grows.

This is the kind of paper that would interest people building practical spatial-reasoning pipelines on top of existing MLLMs. A reader already working on context-window workarounds might extract the coordination pattern and try it, but only if the experiments section contains reproducible results.

If the full manuscript includes quantitative comparisons and ablations that address the preservation question, it is worth sending to peer review. The idea is narrow enough that a referee could evaluate it quickly; without those results the claims stay too thin to engage seriously.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CoCoSI, a plug-and-play multi-agent framework for collaboratively constructing structured cognitive maps as external spatial memory. The approach uses local-global agent coordination, atomic commits during map construction, and cross-agent verification to enable spatial reasoning in unmodified pretrained MLLMs beyond their native context windows, without any training or architectural changes. It claims superior performance on spatial understanding tasks while remaining fully training-free.

Significance. If the central claims are substantiated with rigorous experiments, the work would offer a lightweight, model-agnostic solution to context-length limitations for spatial tasks in MLLMs, with potential value for embodied AI and navigation. The training-free and plug-and-play design is a notable strength if it demonstrably works with arbitrary off-the-shelf models.

major comments (2)

[Abstract] Abstract: the claim that the external cognitive map plus cross-agent verification enables queries whose visual evidence exceeds the native context window is load-bearing, yet the manuscript provides no mechanism, encoding scheme, or ablation demonstrating that (a) the map encoding is lossless for metric relations, (b) retrieval selects the needed sub-graph without reintroducing context pressure, and (c) verification detects and repairs spatial inconsistencies rather than merely confirming syntactic well-formedness.
[Abstract] Abstract: the assertion of 'superior performance' and 'extensive experiments' cannot be evaluated because the manuscript supplies no quantitative results, baselines, error bars, dataset details, or ablation studies, leaving the performance gains unsupported.

minor comments (1)

[Abstract] The abstract states that code will be released but provides no link or repository information in the current version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional detail and evidence are needed to support the central claims. We address each point below and will revise the manuscript to incorporate the requested clarifications and supporting material.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the external cognitive map plus cross-agent verification enables queries whose visual evidence exceeds the native context window is load-bearing, yet the manuscript provides no mechanism, encoding scheme, or ablation demonstrating that (a) the map encoding is lossless for metric relations, (b) retrieval selects the needed sub-graph without reintroducing context pressure, and (c) verification detects and repairs spatial inconsistencies rather than merely confirming syntactic well-formedness.

Authors: We agree that the abstract is high-level and that the load-bearing claim requires explicit support. The manuscript describes local-global coordination, atomic commits for incremental map updates, and cross-agent verification, but does not include the requested ablations or encoding details. In the revision we will add: (1) a precise description of the map encoding (graph with nodes as object instances and edges as directed metric relations with coordinate tuples), (2) an ablation isolating retrieval that measures context length before/after sub-graph selection, and (3) concrete verification traces showing detection and repair of metric inconsistencies (e.g., contradictory distance or angle relations) rather than only syntactic checks. These additions will be placed in a new subsection and appendix. revision: yes
Referee: [Abstract] Abstract: the assertion of 'superior performance' and 'extensive experiments' cannot be evaluated because the manuscript supplies no quantitative results, baselines, error bars, dataset details, or ablation studies, leaving the performance gains unsupported.

Authors: The referee is correct that the current manuscript version does not contain the quantitative results, baselines, error bars, dataset specifications, or ablation studies referenced in the abstract. We will add a complete experimental section (including tables with means and standard deviations, baseline comparisons, dataset descriptions, and targeted ablations on coordination, atomic commits, and verification) in the revised manuscript so that the performance claims can be properly evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity; engineering framework with no derivations or self-referential reductions

full rationale

The paper describes a plug-and-play multi-agent system for cognitive map construction using local-global coordination, atomic commits, and cross-agent verification. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claims rest on the independent design of the framework and its experimental outcomes rather than any self-definition, fitted-input renaming, or self-citation chain. The method is presented as model-agnostic and training-free without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies insufficient detail for a complete ledger; the central premise rests on the domain assumption that grid-based cognitive maps can serve as effective external spatial memory for MLLMs.

axioms (1)

domain assumption Human spatial cognition can be effectively modeled with grid-based cognitive maps constructed from visual input
Paper states it is inspired by human spatial cognition and adopts grid-based maps as the representation.

pith-pipeline@v0.9.1-grok · 5710 in / 1056 out tokens · 27456 ms · 2026-06-27T14:15:36.164931+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 13 canonical work pages · 9 internal anchors

[1]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

2024
[2]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

2025
[3]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

2025
[5]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

2024
[6]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13504–13514, 2024

2024
[7]

Videobert: A joint model for video and language representation learning

Chen Sun, Austin Myers, Carl V ondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. InProceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019

2019
[8]

Hero: Hierarchi- cal encoder for video+ language omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchi- cal encoder for video+ language omni-representation pre-training. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 2046–2065, 2020

2020
[9]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

2022
[10]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

2024
[11]

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. InEuropean Conference on Computer Vision, pages 58–76. Springer, 2024. 10

2024
[14]

Ziqi Pang and Yu-Xiong Wang. Mr. video: Mapreduce as an effective principle for long video understanding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[15]

Neural Map: Structured Memory for Deep Reinforcement Learning

Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforce- ment learning.arXiv preprint arXiv:1702.08360, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

Learning to explore using active neural slam.arXiv preprint arXiv:2004.05155, 2020

Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhut- dinov. Learning to explore using active neural slam.arXiv preprint arXiv:2004.05155, 2020

work page arXiv 2004
[17]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024

2024
[19]

3d-mem: 3d scene memory for embodied exploration and reasoning

Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, and Chuang Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17294–17303, 2025

2025
[20]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

2022
[22]

Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

2023
[23]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

2024
[24]

Improv- ing factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning, 2024

2024
[25]

Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20237– 20246, 2025

2025
[26]

Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, and Qifeng Chen. Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

work page arXiv 2025
[27]

arXiv preprint arXiv:2512.10863 , year=

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

work page arXiv 2025
[28]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Traveler: A modular multi-lmm agent framework for video question-answering.arXiv preprint arXiv:2404.01476,

Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. Trav- eler: A modular multi-lmm agent framework for video question-answering.arXiv preprint arXiv:2404.01476, 2024

work page arXiv 2024
[32]

Vca: Video curious agent for long video understanding

Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. Vca: Video curious agent for long video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20168–20179, 2025. 12

2025

[1] [1]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

2024

[2] [2]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

2025

[3] [3]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

2025

[5] [5]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

2024

[6] [6]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13504–13514, 2024

2024

[7] [7]

Videobert: A joint model for video and language representation learning

Chen Sun, Austin Myers, Carl V ondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. InProceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019

2019

[8] [8]

Hero: Hierarchi- cal encoder for video+ language omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchi- cal encoder for video+ language omni-representation pre-training. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 2046–2065, 2020

2020

[9] [9]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

2022

[10] [10]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

2024

[11] [11]

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. InEuropean Conference on Computer Vision, pages 58–76. Springer, 2024. 10

2024

[14] [14]

Ziqi Pang and Yu-Xiong Wang. Mr. video: Mapreduce as an effective principle for long video understanding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

[15] [15]

Neural Map: Structured Memory for Deep Reinforcement Learning

Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforce- ment learning.arXiv preprint arXiv:1702.08360, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[16] [16]

Learning to explore using active neural slam.arXiv preprint arXiv:2004.05155, 2020

Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhut- dinov. Learning to explore using active neural slam.arXiv preprint arXiv:2004.05155, 2020

work page arXiv 2004

[17] [17]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024

2024

[19] [19]

3d-mem: 3d scene memory for embodied exploration and reasoning

Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, and Chuang Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17294–17303, 2025

2025

[20] [20]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

2022

[22] [22]

Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

2023

[23] [23]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

2024

[24] [24]

Improv- ing factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning, 2024

2024

[25] [25]

Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20237– 20246, 2025

2025

[26] [26]

Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, and Qifeng Chen. Longvideoagent: Multi-agent reasoning with long videos.arXiv preprint arXiv:2512.20618, 2025

work page arXiv 2025

[27] [27]

arXiv preprint arXiv:2512.10863 , year=

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

work page arXiv 2025

[28] [28]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Traveler: A modular multi-lmm agent framework for video question-answering.arXiv preprint arXiv:2404.01476,

Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. Trav- eler: A modular multi-lmm agent framework for video question-answering.arXiv preprint arXiv:2404.01476, 2024

work page arXiv 2024

[32] [32]

Vca: Video curious agent for long video understanding

Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. Vca: Video curious agent for long video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20168–20179, 2025. 12

2025