arxiv: 2511.04670 · v1 · pith:HZZT5D5Cnew · submitted 2025-11-06 · 💻 cs.CV

Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

Pith reviewed 2026-05-18 03:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords spatial supersensing, predictive world modeling, video spatial recall, event segmentation, self-supervised prediction, multimodal intelligence, visual spatial counting

The pith

A surprise-leveraging next-latent-frame predictor outperforms proprietary baselines on spatial supersensing video tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that progress in multimodal intelligence requires a shift to spatial supersensing, which spans semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. Existing benchmarks cover only the early stages, so the authors introduce VSI-SUPER, whose VSR and VSC tasks demand arbitrarily long video inputs yet resist brute-force context expansion. Training Cambrian-S on the curated VSI-590K data yields a +30% absolute improvement on VSI-Bench, but performance on VSI-SUPER remains limited. A self-supervised predictor that uses prediction error to drive memory and event segmentation substantially outperforms leading proprietary baselines on VSI-SUPER.

Core claim

The central claim is that spatial supersensing in video requires predictive world modeling, demonstrated by a self-supervised next-latent-frame predictor that leverages surprise (prediction error) to drive memory and event segmentation, which substantially outperforms leading proprietary baselines on the VSI-SUPER benchmark.

What carries the argument

Self-supervised next-latent-frame predictor using surprise (prediction error) to drive memory and event segmentation.
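To make the mechanism concrete, here is a minimal sketch of surprise-gated memory: surprise is taken as the distance between a predicted and an observed latent frame, and a frame is stored only when that distance exceeds a threshold. Everything named here is an assumption for illustration, not the paper's implementation: the `persistence_predictor` stand-in (which just predicts that the next latent equals the previous one), the Euclidean distance, the `threshold` value, and the toy latents.

```python
import math
from typing import Callable, List, Sequence

Latent = Sequence[float]


def surprise(predicted: Latent, observed: Latent) -> float:
    """Prediction error as the Euclidean distance between two latent vectors."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))


def persistence_predictor(history: List[Latent]) -> Latent:
    """Hypothetical stand-in predictor: assume the next latent equals the last one."""
    return history[-1]


def select_surprising_frames(
    latents: List[Latent],
    predictor: Callable[[List[Latent]], Latent],
    threshold: float,
) -> List[int]:
    """Return indices of frames whose surprise exceeds the threshold."""
    kept = [0]  # the first frame has no prediction, so it is always stored
    history = [latents[0]]
    for t in range(1, len(latents)):
        if surprise(predictor(history), latents[t]) > threshold:
            kept.append(t)
        history.append(latents[t])
    return kept


# Toy latents: a slowly drifting scene with one abrupt change at frame 3.
latents = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [3.0, 3.0], [3.1, 3.0]]
kept = select_surprising_frames(latents, persistence_predictor, threshold=1.0)
print(kept)
```

In this toy run only the first frame and the abrupt change are stored; a learned next-latent-frame predictor would replace the persistence stand-in.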

If this is right

Spatial supersensing cannot be achieved by data scaling alone.
Predictive sensing enables models to filter and organize information in continuous video experiences.
Event segmentation benefits from internal prediction errors rather than external supervision.
Models must anticipate future states to handle arbitrarily long spatial tasks effectively.
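The event-segmentation implication above can be sketched as a simple thresholding rule: a new event begins wherever the per-frame surprise spikes. This is an assumed, simplified rule for illustration only; the function name `segment_events`, the fixed `threshold`, and the surprise values are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def segment_events(surprises: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Split frame indices into half-open [start, end) segments.

    A new segment starts at every frame (after the first) whose surprise
    exceeds the threshold. No external boundary labels are used.
    """
    if not surprises:
        return []
    segments = []
    start = 0
    for t in range(1, len(surprises)):
        if surprises[t] > threshold:
            segments.append((start, t))
            start = t
    segments.append((start, len(surprises)))
    return segments


# Hypothetical per-frame surprise scores with spikes at frames 3 and 6.
surprises = [0.0, 0.1, 0.2, 2.5, 0.1, 0.3, 3.0, 0.2]
segments = segment_events(surprises, threshold=1.0)
print(segments)
```

The boundaries come entirely from the model's own prediction error, which is the sense in which segmentation here needs no external supervision.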

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This predictive approach could be extended to improve performance in related areas like robotic perception or autonomous driving.
The emphasis on surprise suggests new ways to handle memory in transformer-based video models.
Future work might test whether similar mechanisms apply to non-spatial modalities like audio or text sequences.

Load-bearing premise

The VSI-SUPER tasks specifically test for predictive world modeling and are not addressable through other means like enhanced feature extraction or standard memory techniques.

What would settle it

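The central claim would be falsified if removing the surprise component from the predictor left the performance gains on VSI-SUPER intact, or if a non-predictive model matched the results. Conversely, an ablation in which the gains vanish once the surprise signal is removed would support the claim.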

Original abstract

We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a new long-video spatial benchmark and a surprise-driven predictor POC that claims to beat baselines, but the abstract leaves the numbers and controls too thin to judge if prediction is doing the work.

The main things here are the VSI-SUPER benchmark pair and the proof-of-concept that uses next-latent-frame prediction error to handle memory and segmentation on long videos. The four-stage framing of spatial supersensing is mostly organizational, and the scaling run on VSI-590K shows the usual pattern that more data helps general tasks but not these spatial ones enough. That part is useful for steering people toward anticipation rather than just longer context or bigger models.

The benchmark tasks look like they could actually force models to maintain spatial structure over time instead of just retrieving recent frames, which is a step past most current video evals. The surprise signal idea is straightforward self-supervision and avoids needing extra labels, which is clean on paper.

The claim that it substantially beats proprietary baselines on VSI-SUPER is the part that needs the numbers. The abstract gives no absolute scores, no error bars, and no ablation that keeps memory capacity but removes the prediction-error drive. Without that, it is hard to tell whether the gains come from the predictive loop or simply from having a recurrent-style memory that happens to be trained this way. The circularity worry is real but probably minor if the predictor is frozen or trained separately; the bigger issue is whether VSR and VSC truly isolate world modeling or just reward any mechanism that keeps track of object positions across minutes.

This is the kind of work that belongs in a reading group for people working on video agents or embodied models. It is worth a serious referee pass once the full results and controls are in the manuscript, because the benchmark itself could become a useful stress test even if the specific method needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper argues for shifting from reactive task-driven systems to 'spatial supersensing' in video AI, defined as four stages: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. It introduces the VSI-SUPER benchmark (VSR for long-horizon visual spatial recall and VSC for continual visual spatial counting) that resists brute-force context scaling, curates the VSI-590K dataset, trains Cambrian-S achieving +30% absolute improvement on VSI-Bench, and presents a proof-of-concept self-supervised next-latent-frame predictor that uses prediction error (surprise) to drive memory and event segmentation, claiming this substantially outperforms proprietary baselines on VSI-SUPER.

Significance. If the attribution of gains to the surprise-driven predictive mechanism holds after proper controls, the work would be significant for demonstrating that scale alone is insufficient for spatial cognition and for providing a benchmark that tests anticipation and organization over long video horizons. The introduction of VSI-SUPER and the predictive sensing proof-of-concept could help steer the field toward internal world models.

major comments (2)

Abstract: The claim of '+30% absolute improvement on VSI-Bench' and 'substantially outperforms leading proprietary baselines' on VSI-SUPER is stated without quantitative tables, error bars, ablation controls, or exact task definitions for VSR/VSC, leaving the central empirical claims without verifiable support in the provided text.
Section on predictive sensing / proof-of-concept: No ablation is reported that preserves memory capacity while removing the surprise (prediction error) signal from the next-latent-frame predictor. This is load-bearing for the claim that predictive world modeling (rather than generic long-horizon feature retention) is required, as the skeptic concern that gains may arise from improved temporal memory alone remains unaddressed.

minor comments (2)

The distinction between 'spatial supersensing' and prior concepts in predictive coding or streaming video understanding could be clarified with additional references in the introduction.
Notation for the four stages of supersensing and the VSI-SUPER tasks would benefit from explicit formal definitions or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of empirical results and controls. We address each point below and have revised the manuscript accordingly to improve verifiability and rigor.

Referee: Abstract: The claim of '+30% absolute improvement on VSI-Bench' and 'substantially outperforms leading proprietary baselines' on VSI-SUPER is stated without quantitative tables, error bars, ablation controls, or exact task definitions for VSR/VSC, leaving the central empirical claims without verifiable support in the provided text.

Authors: We agree that abstracts benefit from greater specificity to support high-level claims. The quantitative results, including the +30% absolute gains on VSI-Bench, error bars from multiple runs, ablation studies, and precise definitions of the VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting) tasks, are fully detailed in Sections 3, 4, and 5 of the manuscript, with supporting tables. To address the concern directly, we have revised the abstract to incorporate brief quantitative highlights (e.g., specific percentage improvements with references to Table 2 and Table 4) and explicit pointers to task definitions and experimental controls, while preserving its concise nature.

revision: yes

Referee: Section on predictive sensing / proof-of-concept: No ablation is reported that preserves memory capacity while removing the surprise (prediction error) signal from the next-latent-frame predictor. This is load-bearing for the claim that predictive world modeling (rather than generic long-horizon feature retention) is required, as the skeptic concern that gains may arise from improved temporal memory alone remains unaddressed.

Authors: This is a valid concern, as isolating the contribution of the surprise (prediction error) signal is central to our argument for predictive world modeling. The original proof-of-concept demonstrated overall performance gains but did not include this specific control. In the revised manuscript, we have added a new ablation study (Section 5.3) that holds memory capacity fixed (identical buffer sizes and update frequency) while comparing the full surprise-driven next-latent predictor against a variant using non-predictive memory management (e.g., FIFO or random eviction). Results show that the surprise signal yields additional gains on VSI-SUPER beyond those attributable to temporal memory retention alone, directly addressing the skeptic concern.

revision: yes
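The control described in this exchange (same buffer size, different eviction rule) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' ablation code: the function names, the rule "evict the least surprising item", and the surprise scores in `stream` are all hypothetical.

```python
from typing import List, Tuple

Item = Tuple[int, float]  # (frame index, surprise score)


def fifo_buffer(stream: List[Item], capacity: int) -> List[int]:
    """Non-predictive baseline: when full, evict the oldest frame."""
    buf: List[Item] = []
    for item in stream:
        buf.append(item)
        if len(buf) > capacity:
            buf.pop(0)
    return [idx for idx, _ in buf]


def surprise_buffer(stream: List[Item], capacity: int) -> List[int]:
    """Surprise-driven variant: when full, evict the least surprising frame.

    Ties are broken in favor of evicting the oldest item, because min()
    returns the first minimal position.
    """
    buf: List[Item] = []
    for item in stream:
        buf.append(item)
        if len(buf) > capacity:
            victim = min(range(len(buf)), key=lambda k: buf[k][1])
            buf.pop(victim)
    return [idx for idx, _ in buf]


# Hypothetical stream; both policies get the same capacity.
stream = [(0, 0.9), (1, 0.1), (2, 0.2), (3, 2.5), (4, 0.1), (5, 3.0)]
fifo_kept = fifo_buffer(stream, capacity=3)
surprise_kept = surprise_buffer(stream, capacity=3)
print(fifo_kept, surprise_kept)
```

With capacity held equal, any downstream difference between the two retained sets is attributable to the eviction rule rather than to memory size, which is the comparison the referee asks for.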

Circularity Check

0 steps flagged

No circularity: self-supervised surprise signal is independent of target metric

The paper's central mechanism is a next-latent-frame predictor whose internal prediction error (surprise) is used to modulate memory and event segmentation. This is a standard self-supervised construction in which the supervisory signal is derived from the model's own forward pass on unlabeled video frames, not fitted to VSI-SUPER labels or defined in terms of the recall/counting tasks. The subsequent claim of outperformance on VSI-SUPER is an external empirical comparison against proprietary baselines and does not reduce to a tautology, self-citation chain, or renaming of the input. No equations or definitions in the abstract or described derivation exhibit the patterns of self-definitional closure, fitted-input-as-prediction, or load-bearing self-citation. The derivation remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that existing benchmarks cover only early stages of spatial cognition and that prediction error can be directly repurposed for memory and segmentation without additional supervision.

axioms (1)

domain assumption: Current benchmarks largely test only the early stages of spatial cognition
Stated directly in the abstract as motivation for VSI-SUPER.

invented entities (1)

spatial supersensing (no independent evidence)
purpose: Broader paradigm encompassing semantic perception, streaming event cognition, implicit 3D cognition, and predictive world modeling
New framing introduced to organize the four stages beyond linguistic understanding.

pith-pipeline@v0.9.0 · 5856 in / 1337 out tokens · 33647 ms · 2026-05-18T03:41:55.461774+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/LawOfExistence, Foundation/DiscretenessForcing, Foundation/HierarchyEmergence defect_zero_iff_one; existence_economically_inevitable; hierarchy_emergence_forces_phi echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
Foundation/DimensionForcing, Foundation/SimplicialLedger dimension_forced; simplicial_loop_tick_lower_bound echoes

We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception..., streaming event cognition..., implicit 3D spatial cognition..., and predictive world modeling (creating internal models that filter and organize information).
Foundation/InevitabilityStructure inevitability echoes

Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
cs.CV 2026-04 unverdicted novelty 8.0

PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
cs.CV 2026-05 unverdicted novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
cs.CV 2026-04 unverdicted novelty 7.0

PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
cs.CV 2026-05 unverdicted novelty 6.0

PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
cs.RO 2026-05 unverdicted novelty 6.0

RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 6.0

VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
cs.AI 2026-04 unverdicted novelty 6.0

PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
cs.CV 2026-04 unverdicted novelty 6.0

GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
cs.CV 2026-03 unverdicted novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
Video Generation with Predictive Latents
cs.CV 2026-05 unverdicted novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

Reference graph

Works this paper leans on

168 extracted references · 168 canonical work pages · cited by 16 Pith papers · 26 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Ht-step: Aligning instructional articles with how-to videos

Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. InNeurIPS, 2023

work page 2023
[3]

Introducing claude 3.5 sonnet

Anthropic. Introducing claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5 -sonnet, 2024

work page 2024
[4]

3d semantic parsing of large-scale indoor spaces

Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InCVPR, 2016

work page 2016
[5]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InCVPR, 2023

work page 2023
[6]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole-body conditioned egocentric video prediction.arXiv preprint arXiv:2506.21552, 2025

work page arXiv 2025
[11]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InCVPR, 2025

work page 2025
[12]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS, 2021

work page 2021
[13]

SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint, 2025

Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint, 2025

work page 2025
[14]

train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts.arXiv preprint, 2025

work page 2025
[15]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, 2020

work page 2020
[16]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. InIROS, 2025

work page 2025
[17]

Nonverbal expectancy violations: Model elaboration and application to immediacy behaviors.Communications Monographs, 55(1):58–79, 1988

Judee K Burgoon and Jerold L Hale. Nonverbal expectancy violations: Model elaboration and application to immediacy behaviors.Communications Monographs, 55(1):58–79, 1988

work page 1988
[18]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InICRA, 2025. 23

work page 2025
[19]

Auroracap: Efficient, performant video detailed captioning and a new benchmark

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. InICLR, 2025

work page 2025
[20]

Hourvideo: 1-hour video- language understanding

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video- language understanding. InNeurIPS, 2024

work page 2024
[21]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024

work page 2024
[22]

Simple hierarchical planning with diffusion

Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, and Sungjin Ahn. Simple hierarchical planning with diffusion. InICLR, 2024

work page 2024
[23]

Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding

Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. InICLR, 2025

work page 2025
[24]

Videollm-online: Online video large language model for streaming video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024

work page 2024
[25]

Scaling rl to long videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. InNeurIPS, 2025

work page 2025
[26]

Longvila: Scaling long-context visual language models for long videos

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. InICLR, 2025

work page 2025
[27]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

work page 2024
[28]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InNeurIPS, 2024

work page 2024
[29]

Whatever next? predictive brains, situated agents, and the future of cognitive science

Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences, 2013

work page 2013
[30]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

CUP Archive, 1967

Kenneth James Williams Craik.The nature of explanation. CUP Archive, 1967

work page 1967
[32]

Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024

Erfei Cui, Yinan He, Zheng Ma, Zhe Chen, Hao Tian, Weiyun Wang, Kunchang Li, Yi Wang, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yali Wang, Limin Wang, Yu Qiao, and Jifeng Dai. Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024

[33]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017

[34]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022

[35]

Language modeling with gated convolutional networks

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In ICML, 2017

[36]

Procthor: Large-scale embodied ai using procedural generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. In NeurIPS, 2022

[37]

Narrative event segmentation in the cortical reservoir

Peter Ford Dominey. Narrative event segmentation in the cortical reservoir. PLOS Computational Biology, 17(10):e1008993, 2021

[38]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In ACL, 2024

[39]

Scaling language-free visual representation learning

David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning. In ICCV, 2025

[40]

What do we perceive in a glance of a real-world scene?

Li Fei-Fei, Asha Iyer, Christof Koch, and Pietro Perona. What do we perceive in a glance of a real-world scene? Journal of vision, 2007

[41]

The free-energy principle: a unified brain theory?

Karl Friston. The free-energy principle: a unified brain theory? Nature reviews neuroscience, 2010

[42]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025

[43]

Model predictive control: Theory and practice—a survey

Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989

[44]

Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831, 2025

[45]

The computational nature of memory modification

Samuel J Gershman, Marie-H Monfils, Kenneth A Norman, and Yael Niv. The computational nature of memory modification. Elife, 2017

[46]

The ecological approach to visual perception: classic edition

James J Gibson. The ecological approach to visual perception: classic edition. Psychology press, 2014

[47]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022

[48]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In COLM, 2024

[49]

World Models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

[50]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

[51]

Gaussian Error Linear Units (GELUs)

D Hendrycks. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

[52]

The predictive mind

Jakob Hohwy. The predictive mind. OUP Oxford, 2013

[53]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

[54]

Nemo: Needle in a montage for video-language understanding

Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, et al. Nemo: Needle in a montage for video-language understanding. arXiv preprint arXiv:2509.24563, 2025

[55]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019

[56]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

[57]

Token-efficient long video understanding for multimodal llms

Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Token-efficient long video understanding for multimodal llms. arXiv preprint arXiv:2503.04130, 2025

[58]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020

[59]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016

[60]

Prediction error determines how memories are organized in the brain

Nicholas GW Kennedy, Jessica C Lee, Simon Killcross, R Fred Westbrook, and Nathan M Holmes. Prediction error determines how memories are organized in the brain. Elife, 2024

[61]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

[62]

How much the eye tells the brain

Kristin Koch, Judith McLean, Ronen Segev, Michael A Freed, Michael J Berry, Vijay Balasubramanian, and Peter Sterling. How much the eye tells the brain. Current biology, 2006

[63]

Text-conditioned resampler for long form video understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, and Federico Tombari. Text-conditioned resampler for long form video understanding. In ECCV, 2024

[64]

Segmentation in the perception and memory of events

Christopher A Kurby and Jeffrey M Zacks. Segmentation in the perception and memory of events. Trends in cognitive sciences, 12(2):72–79, 2008

[65]

Llava-onevision: Easy visual task transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. TMLR, 2025

[66]

Seed-bench: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In CVPR, 2024

[67]

Topviewrs: Vision-language models as top-view spatial reasoners

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners. In EMNLP, 2024

[68]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

[69]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

[70]

Videomamba: State space model for efficient video understanding

Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In ECCV, 2024

[71]

MVbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024

[72]

Lion-fs: Fast & slow video-language thinker as online video assistant

Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. Lion-fs: Fast & slow video-language thinker as online video assistant. In CVPR, 2025

[73]

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024

[74]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024

[75]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? In ICCV, 2025

[76]

Coarse correspondences boost spatial-temporal reasoning in multimodal language model

Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In CVPR, 2025

[77]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024

[78]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

[79]

Lost in the middle: How language models use long contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. In ACL, 2024

[80]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024


Showing first 80 references.