pith. machine review for the scientific record.

arxiv: 2511.04670 · v1 · pith:HZZT5D5Cnew · submitted 2025-11-06 · 💻 cs.CV

Cambrian-S: Towards Spatial Supersensing in Video

Pith reviewed 2026-05-18 03:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial supersensing · predictive world modeling · video spatial recall · event segmentation · self-supervised prediction · multimodal intelligence · visual spatial counting

The pith

A surprise-leveraging next-latent-frame predictor outperforms proprietary baselines on spatial supersensing video tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that progress in multimodal intelligence requires a shift to spatial supersensing, which spans semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. Existing benchmarks cover mainly the early stages, so the authors introduce VSI-SUPER, whose VSR and VSC tasks demand arbitrarily long video inputs yet resist brute-force context expansion. Scaling data (training Cambrian-S on VSI-590K) improves VSI-Bench but leaves VSI-SUPER performance limited, whereas a self-supervised predictor that uses prediction error to drive memory and event segmentation substantially outperforms leading proprietary baselines.

Core claim

The central claim is that spatial supersensing in video requires predictive world modeling, demonstrated by a self-supervised next-latent-frame predictor that leverages surprise (prediction error) to drive memory and event segmentation, which substantially outperforms leading proprietary baselines on the VSI-SUPER benchmark.

What carries the argument

Self-supervised next-latent-frame predictor using surprise (prediction error) to drive memory and event segmentation.
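
To make the mechanism concrete, here is a minimal sketch of surprise-driven event segmentation. It is an illustration under stated assumptions, not the authors' implementation: the identity predictor, the Euclidean error, the fixed threshold, and the function names `surprise_scores` and `segment_events` are all hypothetical stand-ins for a learned next-latent-frame predictor and whatever boundary rule the paper actually uses.

```python
import numpy as np

def surprise_scores(latents, predict_next):
    """Per-step prediction error between predicted and observed next latents."""
    scores = []
    for t in range(len(latents) - 1):
        predicted = predict_next(latents[t])
        scores.append(float(np.linalg.norm(predicted - latents[t + 1])))
    return np.array(scores)

def segment_events(scores, threshold):
    """Return frame indices that start a new event (surprise above threshold)."""
    # scores[t] measures how badly frame t+1 was predicted from frame t,
    # so a high scores[t] marks frame t+1 as an event boundary.
    return [t + 1 for t, s in enumerate(scores) if s > threshold]

# Toy stream: three latents near the origin, then three near (5, 5),
# which mimics an abrupt scene change.
latents = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Hypothetical stand-in for a learned predictor: "the next latent equals this one".
identity_predictor = lambda z: z

scores = surprise_scores(latents, identity_predictor)
boundaries = segment_events(scores, threshold=1.0)
print(boundaries)  # [3]
```

In this toy stream only the jump into frame 3 exceeds the threshold, so the stream splits into two events. A real system would presumably replace the identity predictor with a trained model and tune or adapt the threshold.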

If this is right

  • Spatial supersensing cannot be achieved by data scaling alone.
  • Predictive sensing enables models to filter and organize information in continuous video experiences.
  • Event segmentation benefits from internal prediction errors rather than external supervision.
  • Models must anticipate future states to handle arbitrarily long spatial tasks effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This predictive approach could be extended to improve performance in related areas like robotic perception or autonomous driving.
  • The emphasis on surprise suggests new ways to handle memory in transformer-based video models.
  • Future work might test whether similar mechanisms apply to non-spatial modalities like audio or text sequences.

Load-bearing premise

The VSI-SUPER tasks specifically test for predictive world modeling and are not addressable through other means like enhanced feature extraction or standard memory techniques.

What would settle it

An ablation would settle it. If removing the surprise component from the predictor leaves the VSI-SUPER gains intact, or if a non-predictive model matches the results, the central claim is falsified; if the gains disappear once surprise is removed, the claim is supported.

Original abstract

We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues for shifting from reactive task-driven systems to 'spatial supersensing' in video AI, defined as four stages: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. It introduces the VSI-SUPER benchmark (VSR for long-horizon visual spatial recall and VSC for continual visual spatial counting) that resists brute-force context scaling, curates the VSI-590K dataset, trains Cambrian-S achieving +30% absolute improvement on VSI-Bench, and presents a proof-of-concept self-supervised next-latent-frame predictor that uses prediction error (surprise) to drive memory and event segmentation, claiming this substantially outperforms proprietary baselines on VSI-SUPER.

Significance. If the attribution of gains to the surprise-driven predictive mechanism holds after proper controls, the work would be significant for demonstrating that scale alone is insufficient for spatial cognition and for providing a benchmark that tests anticipation and organization over long video horizons. The introduction of VSI-SUPER and the predictive sensing proof-of-concept could help steer the field toward internal world models.

major comments (2)
  1. Abstract: The claims of '+30% absolute improvement on VSI-Bench' and that the approach 'substantially outperforms leading proprietary baselines' on VSI-SUPER are stated without quantitative tables, error bars, ablation controls, or exact task definitions for VSR/VSC, leaving the central empirical claims without verifiable support in the provided text.
  2. Section on predictive sensing / proof-of-concept: No ablation is reported that preserves memory capacity while removing the surprise (prediction error) signal from the next-latent-frame predictor. This control is load-bearing for the claim that predictive world modeling (rather than generic long-horizon feature retention) is required, because the skeptic's concern that the gains may arise from improved temporal memory alone remains unaddressed.
minor comments (2)
  1. The distinction between 'spatial supersensing' and prior concepts in predictive coding or streaming video understanding could be clarified with additional references in the introduction.
  2. Notation for the four stages of supersensing and the VSI-SUPER tasks would benefit from explicit formal definitions or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of empirical results and controls. We address each point below and have revised the manuscript accordingly to improve verifiability and rigor.

Point-by-point responses
  1. Referee: Abstract: The claims of '+30% absolute improvement on VSI-Bench' and that the approach 'substantially outperforms leading proprietary baselines' on VSI-SUPER are stated without quantitative tables, error bars, ablation controls, or exact task definitions for VSR/VSC, leaving the central empirical claims without verifiable support in the provided text.

    Authors: We agree that abstracts benefit from greater specificity to support high-level claims. The quantitative results, including the +30% absolute gains on VSI-Bench, error bars from multiple runs, ablation studies, and precise definitions of the VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting) tasks, are fully detailed in Sections 3, 4, and 5 of the manuscript, with supporting tables. To address the concern directly, we have revised the abstract to incorporate brief quantitative highlights (e.g., specific percentage improvements with references to Table 2 and Table 4) and explicit pointers to task definitions and experimental controls, while preserving its concise nature.

    revision: yes

  2. Referee: Section on predictive sensing / proof-of-concept: No ablation is reported that preserves memory capacity while removing the surprise (prediction error) signal from the next-latent-frame predictor. This control is load-bearing for the claim that predictive world modeling (rather than generic long-horizon feature retention) is required, because the skeptic's concern that the gains may arise from improved temporal memory alone remains unaddressed.

    Authors: This is a valid concern, as isolating the contribution of the surprise (prediction error) signal is central to our argument for predictive world modeling. The original proof-of-concept demonstrated overall performance gains but did not include this specific control. In the revised manuscript, we have added a new ablation study (Section 5.3) that holds memory capacity fixed (identical buffer sizes and update frequency) while comparing the full surprise-driven next-latent predictor against a variant using non-predictive memory management (e.g., FIFO or random eviction). Results show that the surprise signal yields additional gains on VSI-SUPER beyond those attributable to temporal memory retention alone, directly addressing the skeptic's concern.

    revision: yes
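
The control described in this exchange, fixed memory capacity with only the eviction rule varied, can be sketched as follows. This is a hedged illustration rather than the ablation from the manuscript: the function `run_memory`, the policy names, the toy surprise values, and the choice to evict the least surprising item are all assumptions made for the example.

```python
import random

def run_memory(frames, surprises, capacity, policy, seed=0):
    """Stream frames into a fixed-size buffer and return the retained frame ids.

    policy: "fifo" evicts the oldest item, "random" evicts a random item,
    "surprise" evicts the item with the lowest surprise score.
    """
    rng = random.Random(seed)
    buffer = []  # list of (frame_id, surprise) pairs
    for frame, s in zip(frames, surprises):
        buffer.append((frame, s))
        if len(buffer) > capacity:  # same capacity and update rate for every policy
            if policy == "fifo":
                buffer.pop(0)
            elif policy == "random":
                buffer.pop(rng.randrange(len(buffer)))
            elif policy == "surprise":
                idx = min(range(len(buffer)), key=lambda i: buffer[i][1])
                buffer.pop(idx)
            else:
                raise ValueError(f"unknown policy: {policy}")
    return [frame for frame, _ in buffer]

# Hypothetical stream of 8 frames; frames 1 and 4 are the surprising ones.
frames = list(range(8))
surprises = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.2, 0.1]

fifo_kept = run_memory(frames, surprises, capacity=3, policy="fifo")
surprise_kept = run_memory(frames, surprises, capacity=3, policy="surprise")
print(fifo_kept)      # [5, 6, 7]
print(surprise_kept)  # [1, 4, 6]
```

With capacity held at three items, FIFO keeps only the most recent frames, while the surprise policy retains the two high-surprise frames. Whether that retention translates into downstream task gains is exactly what such an ablation would need to measure.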

Circularity Check

0 steps flagged

No circularity: self-supervised surprise signal is independent of target metric

Full rationale

The paper's central mechanism is a next-latent-frame predictor whose internal prediction error (surprise) is used to modulate memory and event segmentation. This is a standard self-supervised construction in which the supervisory signal is derived from the model's own forward pass on unlabeled video frames, not fitted to VSI-SUPER labels or defined in terms of the recall/counting tasks. The subsequent claim of outperformance on VSI-SUPER is an external empirical comparison against proprietary baselines and does not reduce to a tautology, self-citation chain, or renaming of the input. No equations or definitions in the abstract or described derivation exhibit the patterns of self-definitional closure, fitted-input-as-prediction, or load-bearing self-citation. The derivation remains self-contained against the stated benchmarks.
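
The independence argument can be written out in symbols. The notation below is assumed for illustration and is not taken from the paper: $x_t$ is frame $t$, $E$ is a visual encoder, $f_\theta$ is the next-latent-frame predictor, and the squared Euclidean distance is one plausible choice of error measure among several.

```latex
% Assumed notation: x_t is frame t, E an encoder, f_\theta the predictor.
z_t = E(x_t), \qquad
\hat{z}_{t+1} = f_\theta(z_1, \dots, z_t), \qquad
s_t = \lVert \hat{z}_{t+1} - z_{t+1} \rVert_2^2
```

Under this formulation the surprise $s_t$ is a function of the video frames and the predictor parameters only. No VSR or VSC answer appears in it, which is why the evaluation on VSI-SUPER counts as an external comparison rather than a restatement of the training signal.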

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that existing benchmarks cover only early stages of spatial cognition and that prediction error can be directly repurposed for memory and segmentation without additional supervision.

axioms (1)
  • domain assumption: Current benchmarks largely test only the early stages of spatial cognition
    Stated directly in the abstract as motivation for VSI-SUPER.
invented entities (1)
  • spatial supersensing (no independent evidence)
    purpose: Broader paradigm encompassing semantic perception, streaming event cognition, implicit 3D cognition, and predictive world modeling
    New framing introduced to organize the four stages beyond linguistic understanding.

pith-pipeline@v0.9.0 · 5856 in / 1337 out tokens · 33647 ms · 2026-05-18T03:41:55.461774+00:00 · methodology



Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  2. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  3. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  4. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  5. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

  6. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.

  7. PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

    cs.CV 2026-05 unverdicted novelty 6.0

    PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

  8. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  9. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  10. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.

  11. World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...

  12. PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.

  13. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  14. Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

  15. SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

    cs.CV 2026-03 unverdicted novelty 6.0

    SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

  16. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  17. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  18. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  19. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

Reference graph

Works this paper leans on

168 extracted references · 168 canonical work pages · cited by 16 Pith papers · 26 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Ht-step: Aligning instructional articles with how-to videos

    Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. InNeurIPS, 2023

  3. [3]

    Introducing claude 3.5 sonnet

    Anthropic. Introducing claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5 -sonnet, 2024

  4. [4]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InCVPR, 2016

  5. [5]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InCVPR, 2023

  6. [6]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  7. [7]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  8. [8]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  9. [9]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  10. [10]

    Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

    Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

  11. [11]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InCVPR, 2025

  12. [12]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS, 2021

  13. [13]

    SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint, 2025

    Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint, 2025

  14. [14]

    train on the test set

    Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts.arXiv preprint, 2025

  15. [15]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, 2020

  16. [16]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. InIROS, 2025

  17. [17]

    Nonverbal expectancy violations: Model elaboration and application to immediacy behaviors.Communications Monographs, 55(1):58–79, 1988

    Judee K Burgoon and Jerold L Hale. Nonverbal expectancy violations: Model elaboration and application to immediacy behaviors.Communications Monographs, 55(1):58–79, 1988

  18. [18]

    Spatialbot: Precise spatial understanding with vision language models

    Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InICRA, 2025. 23

  19. [19]

    Auroracap: Efficient, performant video detailed captioning and a new benchmark

    Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. InICLR, 2025

  20. [20]

    Hourvideo: 1-hour video- language understanding

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video- language understanding. InNeurIPS, 2024

  21. [21]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024

  22. [22]

    Simple hierarchical planning with diffusion

    Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, and Sungjin Ahn. Simple hierarchical planning with diffusion. InICLR, 2024

  23. [23]

    Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. InICLR, 2025

  24. [24]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024

  25. [25]

    Scaling rl to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. InNeurIPS, 2025

  26. [26]

    Longvila: Scaling long-context visual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. InICLR, 2025

  27. [27]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

  28. [28]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InNeurIPS, 2024

  29. [29]

    Whatever next? predictive brains, situated agents, and the future of cognitive science

    Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences, 2013

  30. [30]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  31. [31]

    CUP Archive, 1967

    Kenneth James Williams Craik.The nature of explanation. CUP Archive, 1967

  32. [32]

    Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024

    Erfei Cui, Yinan He, Zheng Ma, Zhe Chen, Hao Tian, Weiyun Wang, Kunchang Li, Yi Wang, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yali Wang, Limin Wang, Yu Qiao, and Jifeng Dai. Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024

  33. [33]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InCVPR, 2017

  34. [34]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InNeurIPS, 2022

  35. [35]

    Language modeling with gated convolutional networks

    Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. InICML, 2017

  36. [36]

    Procthor: Large-scale embodied ai using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. InNeurIPS, 2022. 24

  37. [37]

    Narrative event segmentation in the cortical reservoir.PLOS Computational Biology, 17(10):e1008993, 2021

    Peter Ford Dominey. Narrative event segmentation in the cortical reservoir.PLOS Computational Biology, 17(10):e1008993, 2021

  38. [38]

    Embspatial-bench: Bench- marking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Bench- marking spatial understanding for embodied tasks with large vision-language models. InACL, 2024

  39. [39]

    Scaling language-free visual representation learning

    David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning. InICCV, 2025

  40. [40]

    What do we perceive in a glance of a real-world scene?Journal of vision, 2007

    Li Fei-Fei, Asha Iyer, Christof Koch, and Pietro Perona. What do we perceive in a glance of a real-world scene?Journal of vision, 2007

  41. [41]

    The free-energy principle: a unified brain theory?Nature reviews neuroscience, 2010

    Karl Friston. The free-energy principle: a unified brain theory?Nature reviews neuroscience, 2010

  42. [42]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025

  43. [43]

    Model predictive control: Theory and practice—a survey.Automatica, 25(3):335–348, 1989

    Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey.Automatica, 25(3):335–348, 1989

  44. [44]

    Intuitive physics understanding emerges from self-supervised pretraining on natural videos.arXiv preprint arXiv:2502.11831, 2025

    Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos.arXiv preprint arXiv:2502.11831, 2025

  45. [45]

    The computational nature of memory modification.Elife, 2017

    Samuel J Gershman, Marie-H Monfils, Kenneth A Norman, and Yael Niv. The computational nature of memory modification.Elife, 2017

  46. [46]

    The ecological approach to visual perception: classic edition

    James J Gibson. The ecological approach to visual perception: classic edition. Psychology press, 2014

  47. [47]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022

  48. [48]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In COLM, 2024

  49. [49]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  50. [50]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

  51. [51]

    Gaussian Error Linear Units (GELUs)

    D Hendrycks. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  52. [52]

    The predictive mind

    Jakob Hohwy. The predictive mind. OUP Oxford, 2013

  53. [53]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  54. [54]

    Nemo: Needle in a montage for video-language understanding

    Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, et al. Nemo: Needle in a montage for video-language understanding. arXiv preprint arXiv:2509.24563, 2025

  55. [55]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019

  56. [56]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  57. [57]

    Token-efficient long video understanding for multimodal llms

    Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Token-efficient long video understanding for multimodal llms. arXiv preprint arXiv:2503.04130, 2025

  58. [58]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020

  59. [59]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016

  60. [60]

    Prediction error determines how memories are organized in the brain

    Nicholas GW Kennedy, Jessica C Lee, Simon Killcross, R Fred Westbrook, and Nathan M Holmes. Prediction error determines how memories are organized in the brain. eLife, 2024

  61. [61]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  62. [62]

    How much the eye tells the brain

    Kristin Koch, Judith McLean, Ronen Segev, Michael A Freed, Michael J Berry, Vijay Balasubramanian, and Peter Sterling. How much the eye tells the brain. Current biology, 2006

  63. [63]

    Text-conditioned resampler for long form video understanding

    Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, and Federico Tombari. Text-conditioned resampler for long form video understanding. In ECCV, 2024

  64. [64]

    Segmentation in the perception and memory of events

    Christopher A Kurby and Jeffrey M Zacks. Segmentation in the perception and memory of events. Trends in cognitive sciences, 12(2):72–79, 2008

  65. [65]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. TMLR, 2025

  66. [66]

    Seed-bench: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In CVPR, 2024

  67. [67]

    Topviewrs: Vision-language models as top-view spatial reasoners

    Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners. In EMNLP, 2024

  68. [68]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

  69. [69]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  70. [70]

    Videomamba: State space model for efficient video understanding

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In ECCV, 2024

  71. [71]

    MVbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024

  72. [72]

    Lion-fs: Fast & slow video-language thinker as online video assistant

    Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. Lion-fs: Fast & slow video-language thinker as online video assistant. In CVPR, 2025

  73. [73]

    VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024

  74. [74]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024

  75. [75]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? In ICCV, 2025

  76. [76]

    Coarse correspondences boost spatial-temporal reasoning in multimodal language model

    Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In CVPR, 2025

  77. [77]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024

  78. [78]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  79. [79]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. In ACL, 2024

  80. [80]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024

Showing first 80 references.