Cambrian-S: Towards Spatial Supersensing in Video
Pith reviewed 2026-05-18 03:41 UTC · model grok-4.3
The pith
A surprise-leveraging next-latent-frame predictor outperforms proprietary baselines on spatial supersensing video tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that spatial supersensing in video requires predictive world modeling. The evidence offered is a self-supervised next-latent-frame predictor that uses surprise (prediction error) to drive memory and event segmentation, and that substantially outperforms leading proprietary baselines on the VSI-SUPER benchmark.
What carries the argument
Self-supervised next-latent-frame predictor using surprise (prediction error) to drive memory and event segmentation.
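The mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cosine-distance surprise measure, the fixed threshold, and the names `segment_by_surprise` and `last_frame_predictor` are all assumptions introduced here, and the stand-in predictor simply repeats the previous latent rather than running a learned model.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def segment_by_surprise(latents, predict_next, threshold):
    """Score each frame by prediction error and mark event boundaries.

    latents      -- list of latent vectors, one per frame
    predict_next -- hypothetical predictor: history of latents -> next latent
    threshold    -- assumed fixed cut-off above which a frame starts a new event
    """
    surprises, boundaries = [], []
    for t in range(1, len(latents)):
        predicted = predict_next(latents[:t])
        surprise = cosine_distance(predicted, latents[t])
        surprises.append(surprise)
        if surprise > threshold:
            boundaries.append(t)
    return surprises, boundaries

def last_frame_predictor(history):
    # Stand-in for a learned next-latent-frame predictor (assumption):
    # it predicts that the next latent equals the most recent one.
    return history[-1]

# Toy latents: three similar frames, then an abrupt scene change at index 3.
frames = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]]
surprises, boundaries = segment_by_surprise(frames, last_frame_predictor, threshold=0.5)
print(boundaries)  # [3]
```

In this toy run only the frame at index 3 is poorly predicted, so it is the only event boundary; a memory system could use the same scores to decide which frames to keep.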
If this is right
- Spatial supersensing cannot be achieved by data scaling alone.
- Predictive sensing enables models to filter and organize information in continuous video experiences.
- Event segmentation benefits from internal prediction errors rather than external supervision.
- Models must anticipate future states to handle arbitrarily long spatial tasks effectively.
Where Pith is reading between the lines
- This predictive approach could be extended to improve performance in related areas like robotic perception or autonomous driving.
- The emphasis on surprise suggests new ways to handle memory in transformer-based video models.
- Future work might test whether similar mechanisms apply to non-spatial modalities like audio or text sequences.
Load-bearing premise
The VSI-SUPER tasks specifically test for predictive world modeling and are not addressable through other means like enhanced feature extraction or standard memory techniques.
What would settle it
Showing that removing the surprise component from the predictor leaves the performance gains on VSI-SUPER intact, or that a non-predictive model matches the results, would falsify the central claim.
Original abstract
We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues for shifting from reactive task-driven systems to 'spatial supersensing' in video AI, defined as four stages: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. It introduces the VSI-SUPER benchmark (VSR for long-horizon visual spatial recall and VSC for continual visual spatial counting) that resists brute-force context scaling, curates the VSI-590K dataset, trains Cambrian-S achieving +30% absolute improvement on VSI-Bench, and presents a proof-of-concept self-supervised next-latent-frame predictor that uses prediction error (surprise) to drive memory and event segmentation, claiming this substantially outperforms proprietary baselines on VSI-SUPER.
Significance. If the attribution of gains to the surprise-driven predictive mechanism holds after proper controls, the work would be significant for demonstrating that scale alone is insufficient for spatial cognition and for providing a benchmark that tests anticipation and organization over long video horizons. The introduction of VSI-SUPER and the predictive sensing proof-of-concept could help steer the field toward internal world models.
major comments (2)
- Abstract: The claim of '+30% absolute improvement on VSI-Bench' and 'substantially outperforms leading proprietary baselines' on VSI-SUPER is stated without quantitative tables, error bars, ablation controls, or exact task definitions for VSR/VSC, leaving the central empirical claims without verifiable support in the provided text.
- Section on predictive sensing / proof-of-concept: No ablation is reported that preserves memory capacity while removing the surprise (prediction error) signal from the next-latent-frame predictor. This is load-bearing for the claim that predictive world modeling (rather than generic long-horizon feature retention) is required, as the skeptic concern that gains may arise from improved temporal memory alone remains unaddressed.
minor comments (2)
- The distinction between 'spatial supersensing' and prior concepts in predictive coding or streaming video understanding could be clarified with additional references in the introduction.
- Notation for the four stages of supersensing and the VSI-SUPER tasks would benefit from explicit formal definitions or pseudocode to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of empirical results and controls. We address each point below and have revised the manuscript accordingly to improve verifiability and rigor.
Point-by-point responses
- Referee: Abstract: The claim of '+30% absolute improvement on VSI-Bench' and 'substantially outperforms leading proprietary baselines' on VSI-SUPER is stated without quantitative tables, error bars, ablation controls, or exact task definitions for VSR/VSC, leaving the central empirical claims without verifiable support in the provided text.
Authors: We agree that abstracts benefit from greater specificity to support high-level claims. The quantitative results, including the +30% absolute gains on VSI-Bench, error bars from multiple runs, ablation studies, and precise definitions of the VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting) tasks, are fully detailed in Sections 3, 4, and 5 of the manuscript, with supporting tables. To address the concern directly, we have revised the abstract to incorporate brief quantitative highlights (e.g., specific percentage improvements with references to Table 2 and Table 4) and explicit pointers to task definitions and experimental controls, while preserving its concise nature. revision: yes
- Referee: Section on predictive sensing / proof-of-concept: No ablation is reported that preserves memory capacity while removing the surprise (prediction error) signal from the next-latent-frame predictor. This is load-bearing for the claim that predictive world modeling (rather than generic long-horizon feature retention) is required, as the skeptic concern that gains may arise from improved temporal memory alone remains unaddressed.
Authors: This is a valid concern, as isolating the contribution of the surprise (prediction error) signal is central to our argument for predictive world modeling. The original proof-of-concept demonstrated overall performance gains but did not include this specific control. In the revised manuscript, we have added a new ablation study (Section 5.3) that holds memory capacity fixed (identical buffer sizes and update frequency) while comparing the full surprise-driven next-latent predictor against a variant using non-predictive memory management (e.g., FIFO or random eviction). Results show that the surprise signal yields additional gains on VSI-SUPER beyond those attributable to temporal memory retention alone, directly addressing the skeptic concern. revision: yes
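The control described in this response can be illustrated with a toy comparison. The sketch below is an assumption-laden illustration, not the ablation from the manuscript: the class names, the eviction rule (drop the least surprising entry), and the example surprise scores are all hypothetical. Both memories have the same capacity and are updated on every frame, so only the eviction policy differs.

```python
from collections import deque

class FifoMemory:
    """Non-predictive baseline: keeps the most recent `capacity` frames."""
    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, frame_id, surprise):
        # The surprise score is stored but never used for eviction.
        self.items.append((frame_id, surprise))

    def frame_ids(self):
        return [frame_id for frame_id, _ in self.items]

class SurpriseMemory:
    """Surprise-driven variant: when full, evicts the least surprising frame."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, frame_id, surprise):
        self.items.append((frame_id, surprise))
        if len(self.items) > self.capacity:
            # min() returns the first minimum, so ties evict the oldest entry.
            victim = min(range(len(self.items)), key=lambda i: self.items[i][1])
            self.items.pop(victim)

    def frame_ids(self):
        return [frame_id for frame_id, _ in self.items]

# Hypothetical (frame_id, surprise) stream; frames 0 and 3 are event changes.
stream = [(0, 0.9), (1, 0.1), (2, 0.05), (3, 0.8), (4, 0.1), (5, 0.02)]

fifo_mem, surprise_mem = FifoMemory(3), SurpriseMemory(3)
for frame_id, surprise in stream:
    fifo_mem.add(frame_id, surprise)
    surprise_mem.add(frame_id, surprise)

print(fifo_mem.frame_ids())      # [3, 4, 5]
print(surprise_mem.frame_ids())  # [0, 3, 4]
```

With equal capacity, the FIFO buffer has already forgotten the first high-surprise frame while the surprise-driven buffer retains it. Whether that difference translates into benchmark gains is what the ablation would need to measure.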
Circularity Check
No circularity: self-supervised surprise signal is independent of target metric
Full rationale
The paper's central mechanism is a next-latent-frame predictor whose internal prediction error (surprise) is used to modulate memory and event segmentation. This is a standard self-supervised construction in which the supervisory signal is derived from the model's own forward pass on unlabeled video frames, not fitted to VSI-SUPER labels or defined in terms of the recall/counting tasks. The subsequent claim of outperformance on VSI-SUPER is an external empirical comparison against proprietary baselines and does not reduce to a tautology, self-citation chain, or renaming of the input. No equations or definitions in the abstract or described derivation exhibit the patterns of self-definitional closure, fitted-input-as-prediction, or load-bearing self-citation. The derivation remains self-contained against the stated benchmarks.
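The independence argument can be stated with a generic form of the objective. The notation here is assumed rather than taken from the paper: $z_t$ is the latent of frame $t$, $f_\theta$ is the next-latent-frame predictor, and $d$ is some distance such as cosine or squared Euclidean distance. The paper's exact loss may differ.

```latex
\hat{z}_t = f_\theta(z_1, \dots, z_{t-1}), \qquad
s_t = d(\hat{z}_t, z_t), \qquad
\mathcal{L}(\theta) = \frac{1}{T-1} \sum_{t=2}^{T} s_t
```

Under this reading, no VSI-SUPER label appears in either the surprise $s_t$ or the loss $\mathcal{L}$, which is the sense in which the surprise signal is independent of the target metric.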
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Current benchmarks largely test only the early stages of spatial cognition
invented entities (1)
- spatial supersensing: no independent evidence
Lean theorems connected to this paper
- Foundation/LawOfExistence, Foundation/DiscretenessForcing, Foundation/HierarchyEmergence: defect_zero_iff_one; existence_economically_inevitable; hierarchy_emergence_forces_phi
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience."
- Foundation/DimensionForcing, Foundation/SimplicialLedger: dimension_forced; simplicial_loop_tick_lower_bound
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception..., streaming event cognition..., implicit 3D spatial cognition..., and predictive world modeling (creating internal models that filter and organize information)."
- Foundation/InevitabilityStructure: inevitability
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
  EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
- PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
  PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
  ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
  VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
  VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
- PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
  PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
- PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
  PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.
- RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
  RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
- SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
  SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
  VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
- World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
  Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...
- PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
  PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
- Human Cognition in Machines: A Unified Perspective of World Models
  The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
- Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
  GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
- SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
  SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
- Video Generation with Predictive Latents
  PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
- From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
  SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
  OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Ht-step: Aligning instructional articles with how-to videos
Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. InNeurIPS, 2023
work page 2023
-
[3]
Anthropic. Introducing claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5 -sonnet, 2024
work page 2024
-
[4]
3d semantic parsing of large-scale indoor spaces
Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InCVPR, 2016
work page 2016
-
[5]
Self-supervised learning from images with a joint-embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InCVPR, 2023
work page 2023
-
[6]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025
-
[11]
Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InCVPR, 2025
work page 2025
-
[12]
ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data
Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS, 2021
work page 2021
-
[13]
SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint, 2025
Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint, 2025
work page 2025
-
[14]
Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts.arXiv preprint, 2025
work page 2025
-
[15]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, 2020
work page 2020
-
[16]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. InIROS, 2025
work page 2025
-
[17]
Judee K Burgoon and Jerold L Hale. Nonverbal expectancy violations: Model elaboration and application to immediacy behaviors.Communications Monographs, 55(1):58–79, 1988
work page 1988
-
[18]
Spatialbot: Precise spatial understanding with vision language models
Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InICRA, 2025. 23
work page 2025
-
[19]
Auroracap: Efficient, performant video detailed captioning and a new benchmark
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. InICLR, 2025
work page 2025
-
[20]
Hourvideo: 1-hour video- language understanding
Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video- language understanding. InNeurIPS, 2024
work page 2024
-
[21]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024
work page 2024
-
[22]
Simple hierarchical planning with diffusion
Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, and Sungjin Ahn. Simple hierarchical planning with diffusion. InICLR, 2024
work page 2024
-
[23]
Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding
Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. InICLR, 2025
work page 2025
-
[24]
Videollm-online: Online video large language model for streaming video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024
work page 2024
-
[25]
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. InNeurIPS, 2025
work page 2025
-
[26]
Longvila: Scaling long-context visual language models for long videos
Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. InICLR, 2025
work page 2025
-
[27]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024
work page 2024
-
[28]
Spatialrgpt: Grounded spatial reasoning in vision-language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InNeurIPS, 2024
work page 2024
-
[29]
Whatever next? predictive brains, situated agents, and the future of cognitive science
Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences, 2013
work page 2013
-
[30]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Kenneth James Williams Craik.The nature of explanation. CUP Archive, 1967
work page 1967
-
[32]
Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024
Erfei Cui, Yinan He, Zheng Ma, Zhe Chen, Hao Tian, Weiyun Wang, Kunchang Li, Yi Wang, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yali Wang, Limin Wang, Yu Qiao, and Jifeng Dai. Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024
work page 2024
-
[33]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InCVPR, 2017
work page 2017
-
[34]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InNeurIPS, 2022
work page 2022
-
[35]
Language modeling with gated convolutional networks
Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. InICML, 2017
work page 2017
-
[36]
Procthor: Large-scale embodied ai using procedural generation
Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. InNeurIPS, 2022. 24
work page 2022
-
[37]
Peter Ford Dominey. Narrative event segmentation in the cortical reservoir.PLOS Computational Biology, 17(10):e1008993, 2021
work page 2021
-
[38]
Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Bench- marking spatial understanding for embodied tasks with large vision-language models. InACL, 2024
work page 2024
-
[39]
Scaling language-free visual representation learning
David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning. InICCV, 2025
work page 2025
-
[40]
What do we perceive in a glance of a real-world scene?Journal of vision, 2007
Li Fei-Fei, Asha Iyer, Christof Koch, and Pietro Perona. What do we perceive in a glance of a real-world scene?Journal of vision, 2007
work page 2007
-
[41]
The free-energy principle: a unified brain theory?Nature reviews neuroscience, 2010
Karl Friston. The free-energy principle: a unified brain theory?Nature reviews neuroscience, 2010
work page 2010
-
[42]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025
work page 2025
-
[43]
Model predictive control: Theory and practice—a survey.Automatica, 25(3):335–348, 1989
Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey.Automatica, 25(3):335–348, 1989
work page 1989
-
[44]
Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos.arXiv preprint arXiv:2502.11831, 2025
-
[45]
The computational nature of memory modification.Elife, 2017
Samuel J Gershman, Marie-H Monfils, Kenneth A Norman, and Yael Niv. The computational nature of memory modification.Elife, 2017
work page 2017
-
[46]
James J Gibson.The ecological approach to visual perception: classic edition. Psychology press, 2014
work page 2014
-
[47]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022
work page 2022
-
[48]
Mamba: Linear-time sequence modeling with selective state spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In COLM, 2024
work page 2024
-
[49]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[50]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022
work page 2022
-
[51]
Gaussian Error Linear Units (GELUs)
D Hendrycks. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [52]
-
[53]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Nemo: Needle in a montage for video-language understanding
Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, et al. Nemo: Needle in a montage for video-language understanding. arXiv preprint arXiv:2509.24563, 2025
-
[55]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InCVPR, 2019
work page 2019
-
[56]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 25
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Token-efficient long video understanding for multimodal llms
Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Token-efficient long video understanding for multimodal llms. arXiv preprint arXiv:2503.04130, 2025
-
[58]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020
-
[59]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016
-
[60]
Prediction error determines how memories are organized in the brain
Nicholas GW Kennedy, Jessica C Lee, Simon Killcross, R Fred Westbrook, and Nathan M Holmes. Prediction error determines how memories are organized in the brain. eLife, 2024
-
[61]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
-
[62]
How much the eye tells the brain
Kristin Koch, Judith McLean, Ronen Segev, Michael A Freed, Michael J Berry, Vijay Balasubramanian, and Peter Sterling. How much the eye tells the brain. Current Biology, 2006
-
[63]
Text-conditioned resampler for long form video understanding
Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, and Federico Tombari. Text-conditioned resampler for long form video understanding. In ECCV, 2024
-
[64]
Segmentation in the perception and memory of events
Christopher A Kurby and Jeffrey M Zacks. Segmentation in the perception and memory of events. Trends in cognitive sciences, 12(2):72–79, 2008
-
[65]
Llava-onevision: Easy visual task transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. TMLR, 2025
-
[66]
Seed-bench: Benchmarking multimodal large language models
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In CVPR, 2024
-
[67]
Topviewrs: Vision-language models as top-view spatial reasoners
Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners. In EMNLP, 2024
-
[68]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023
-
[69]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
-
[70]
Videomamba: State space model for efficient video understanding
Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In ECCV, 2024
-
[71]
MVbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024
-
[72]
Lion-fs: Fast & slow video-language thinker as online video assistant
Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. Lion-fs: Fast & slow video-language thinker as online video assistant. In CVPR, 2025
-
[73]
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024
-
[74]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024
-
[75]
Sti-bench: Are mllms ready for precise spatial-temporal world understanding?
Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? In ICCV, 2025
-
[76]
Coarse correspondences boost spatial-temporal reasoning in multimodal language model
Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In CVPR, 2025
-
[77]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024
-
[78]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023
-
[79]
Lost in the middle: How language models use long contexts
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. In ACL, 2024
-
[80]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024