
arxiv: 2606.03890 · v1 · pith:IGMYQJMQnew · submitted 2026-06-02 · 💻 cs.CV

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Pith reviewed 2026-06-28 10:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming spatial intelligence · multimodal LLMs · egocentric video · allocentric mapping · spatial reasoning benchmark · prefix evaluation · MLLM evaluation

The pith

A benchmark shows top multimodal models trail humans by 27 points on streaming spatial tasks from video prefixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new evaluation for how multimodal large language models handle spatial reasoning over continuous egocentric video streams rather than complete offline videos. It creates questions tied to specific timestamps where the model receives only the preceding video segment and must draw on prior evidence for spatial structure. Testing across 38 models reveals consistent shortfalls relative to human performance, concentrated at the highest level of abstraction. This setup matters for applications that require ongoing layout awareness from partial views. The hierarchy isolates whether failures stem from basic perception, context tracking, simulation, or global mapping.

Core claim

OVO-S-Bench consists of 1,680 human-annotated questions over 348 videos, each with a query timestamp and evidence interval, evaluated under prefix-only input. Questions are organized into four increasing levels of abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 models the strongest result reaches 59.2 while humans reach 86.6, with allocentric mapping the dominant gap; streaming and spatially fine-tuned models score below their base versions, and ungrounded chain-of-thought increases spatial errors.
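To make the reported numbers concrete, here is a minimal sketch of how per-level accuracy and the model-human gap could be tallied. The four level names come from the paper; the short keys, the record layout, and the function names `accuracy_by_level` and `gap_in_points` are assumptions made for illustration, not the paper's evaluation code, and the toy results are invented.

```python
from collections import defaultdict

# Level names as listed in the paper; the short keys are hypothetical.
LEVELS = {
    "L1": "instantaneous egocentric perception",
    "L2": "spatiotemporal context tracking",
    "L3": "spatial simulation and reasoning",
    "L4": "allocentric mapping",
}

def accuracy_by_level(results):
    """results: iterable of (level_key, is_correct) pairs.

    Returns percent accuracy for each level that appears in the input.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, correct in results:
        totals[level] += 1
        hits[level] += int(correct)
    return {level: 100.0 * hits[level] / totals[level] for level in totals}

def gap_in_points(human_score, model_score):
    """Difference between two percentage scores, rounded to one decimal."""
    return round(human_score - model_score, 1)

# Invented toy outcomes, only to exercise the function.
toy_results = [("L1", True), ("L1", True), ("L4", False), ("L4", True)]
per_level = accuracy_by_level(toy_results)

# Scores quoted in the paper: humans 86.6, strongest model 59.2.
headline_gap = gap_in_points(86.6, 59.2)
```

With the two quoted scores, `headline_gap` comes out at 27.4, which the paper reports as a 27-point gap.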

What carries the argument

The four-level hierarchy of spatial abstraction combined with prefix-only evaluation at query timestamps, where each question specifies the exact evidence interval needed from the stream.
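The prefix-only setup can be sketched as a small data structure plus a frame filter. Everything named here (`StreamQuestion`, its fields, `prefix_frames`, `evidence_is_in_prefix`) is hypothetical, and two choices are assumptions rather than details confirmed by the source: frames stamped exactly at the query time are treated as visible, and the evidence interval is expected to end no later than the query.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamQuestion:
    """Hypothetical record for one benchmark question."""
    video_id: str
    question: str
    query_time_s: float      # moment at which the question is asked
    evidence_start_s: float  # start of the annotated evidence interval
    evidence_end_s: float    # end of the annotated evidence interval

def prefix_frames(frame_times_s, query_time_s):
    """Indices of frames the model may see: those at or before the query.

    Assumption: the boundary frame is included.
    """
    return [i for i, t in enumerate(frame_times_s) if t <= query_time_s]

def evidence_is_in_prefix(q):
    """Sanity check that the evidence interval is well formed and lies in the prefix."""
    return q.evidence_start_s <= q.evidence_end_s <= q.query_time_s

# Toy example: one frame per second, question asked at 2.5 s.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0]
q = StreamQuestion(
    video_id="toy_video",
    question="Where is the door relative to the table seen earlier?",
    query_time_s=2.5,
    evidence_start_s=0.5,
    evidence_end_s=1.5,
)
visible = prefix_frames(frame_times, q.query_time_s)
```

In this toy case the model would receive frames 0 through 2 and never see frames 3 and 4, which is the property that separates streaming evaluation from offline evaluation over the full video.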

If this is right

  • Future multimodal models must improve allocentric mapping from partial egocentric streams.
  • Spatially fine-tuned or streaming-adapted models do not automatically gain the required capabilities.
  • Ungrounded chain-of-thought reasoning increases rather than reduces spatial errors in this setting.
  • Robotics and AR systems will require new architectures tested against prefix-only streaming benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap may reflect a broader limitation in maintaining consistent spatial representations across time rather than isolated reasoning failures.
  • Similar hierarchical benchmarks could be applied to test whether the same bottlenecks appear in non-video modalities such as audio or sensor streams.
  • Developers of autonomous agents might prioritize explicit memory mechanisms for allocentric coordinates over general scaling.

Load-bearing premise

The multi-round annotation and blind cross-review process by 12 annotators produces questions whose difficulty and evidence requirements isolate streaming spatial intelligence rather than other skills such as object recognition or language.

What would settle it

A model that closes the 27-point overall gap to human experts, with the gain coming from allocentric mapping questions and no loss on the lower levels, or a re-annotation showing that many questions can be solved without reference to the specified evidence intervals.

read the original abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces OVO-S-Bench, a human-annotated benchmark of 1,680 questions over 348 videos for evaluating streaming spatial intelligence in MLLMs. Each question includes a query timestamp and evidence interval, with models receiving only the video prefix up to the query. Questions are organized into four hierarchical levels of increasing abstraction (instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, allocentric mapping). Evaluation across 38 proprietary and open-source MLLMs shows Gemini-3.1-Pro at 59.2 versus human experts at 86.6, with allocentric mapping as the dominant bottleneck; streaming and spatially fine-tuned models underperform their backbones, and ungrounded chain-of-thought amplifies errors.

Significance. If the questions validly isolate streaming spatial intelligence, the benchmark would fill an important gap by targeting continuous egocentric streams rather than offline full-video or event-based evaluation, directly relevant to robotics, AR, and autonomous driving. The scale of annotation (804 person-hours with blind cross-review) and the concrete model-human gaps plus the counterintuitive underperformance of fine-tuned models would provide actionable guidance for future MLLM development.

major comments (1)
  1. [Annotation Process] Annotation Process section: the multi-round blind cross-review by 12 annotators is described in detail, but the manuscript provides no control experiments (such as text-only question variants or evidence-interval ablations) to test whether questions remain solvable via language priors, object recognition, or other non-spatial cues. This validation step is load-bearing for the central claims, as the reported 27-point gap and level-specific bottlenecks cannot be attributed to streaming spatial intelligence without evidence that the questions isolate the intended factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of OVO-S-Bench in addressing a gap in streaming spatial evaluation. We respond to the single major comment below.

read point-by-point responses
  1. Referee: [Annotation Process] Annotation Process section: the multi-round blind cross-review by 12 annotators is described in detail, but the manuscript provides no control experiments (such as text-only question variants or evidence-interval ablations) to test whether questions remain solvable via language priors, object recognition, or other non-spatial cues. This validation step is load-bearing for the central claims, as the reported 27-point gap and level-specific bottlenecks cannot be attributed to streaming spatial intelligence without evidence that the questions isolate the intended factors.

    Authors: We agree that the absence of explicit control experiments leaves open the possibility that some questions could be solved via language priors or non-spatial cues, which would weaken attribution of the observed gaps and bottlenecks specifically to streaming spatial intelligence. The manuscript's design choices (providing only the video prefix up to the query timestamp, specifying evidence intervals, and organizing questions into four levels of increasing spatial abstraction) were intended to focus evaluation on the target capabilities, and the multi-round blind cross-review was used to ensure question quality. Nevertheless, these measures do not constitute the requested controls. In the revised manuscript we will add text-only question variants (to measure language-prior solvability) and evidence-interval ablations (to measure reliance on the designated visual evidence) and report the resulting performance drops.

    Revision: yes
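The two controls named in this exchange could be built roughly as follows. This is a sketch under stated assumptions, not the authors' planned protocol: the function names `text_only_input` and `ablate_evidence` are hypothetical, the input is modeled as a plain dictionary, and the ablation simply drops prefix frames whose timestamps fall inside the evidence interval.

```python
def text_only_input(question):
    """Text-only variant: the question with no frames at all.

    If accuracy stays high on these inputs, language priors may be
    carrying the answer.
    """
    return {"question": question, "frames": []}

def ablate_evidence(frame_times_s, query_time_s, evidence_start_s, evidence_end_s):
    """Evidence-interval ablation: prefix frame indices minus the evidence span.

    Assumption: both interval endpoints are inclusive.
    """
    return [
        i
        for i, t in enumerate(frame_times_s)
        if t <= query_time_s and not (evidence_start_s <= t <= evidence_end_s)
    ]

# Toy example: one frame per second, query at 3.5 s, evidence from 1.0 s to 2.0 s.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0]
control = text_only_input("Which side of the hallway was the staircase on?")
kept = ablate_evidence(frame_times, 3.5, 1.0, 2.0)
```

Comparing accuracy on the full prefix against accuracy on these two variants would show how much each question depends on the designated visual evidence.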

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

full rationale

The paper constructs and evaluates an empirical benchmark via human annotation (12 annotators, multi-round review) and direct model testing on 1,680 held-out questions. No equations, parameters, or predictions are derived; performance gaps (e.g., 59.2 vs. 86.6) are raw evaluation outputs. No load-bearing self-citations, uniqueness theorems, or ansatzes appear. The work is self-contained against external human annotations and model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quality and representativeness of the human annotations and the assumption that the chosen videos and questions isolate spatial streaming ability. No free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: Human annotators, with roughly 804 person-hours of multi-round review, produce reliable ground truth for spatial reasoning questions.
    Invoked in the description of the 12-annotator process with blind cross-review.
  • Domain assumption: The four abstraction levels (instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, allocentric mapping) form a meaningful increasing hierarchy for streaming spatial intelligence.
    Stated directly in the abstract as the structure of the benchmark.

pith-pipeline@v0.9.1-grok · 5796 in / 1454 out tokens · 20117 ms · 2026-06-28T10:59:18.011657+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 31 canonical work pages · 11 internal anchors

  1. [1]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 4.1

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 4.1

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Ke qin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.A...

  4. [4]

    ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. 3.2, 10

  5. [5]

    Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 2, 4.1, 4.2

  6. [6]

    Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 2

  7. [7]

    Streamingtom: Streaming token compression for efficient video understanding.arXiv preprint arXiv:2510.18269, 2025

    Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding.arXiv preprint arXiv:2510.18269, 2025. 4.1

  8. [8]

    Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

    Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026. 4.1

  9. [9]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355,

  10. [10]

    Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026. 4.2

  11. [11]

    Holi-spatial: Evolving video streams into holistic 3d spatial intelligence

    Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, et al. Holi-spatial: Evolving video streams into holistic 3d spatial intelligence. arXiv preprint arXiv:2603.07660, 2026. 2

  12. [12]

    Gemini 3.1 flash-lite

    Google DeepMind. Gemini 3.1 flash-lite. https://deepmind.google/models/gemini/ flash-lite/, March 2026. 4.1

  13. [13]

    Gemini 3.1 pro.https://blog.google/products/gemini/gemini-3-pro/, February 2026

    Google DeepMind. Gemini 3.1 pro.https://blog.google/products/gemini/gemini-3-pro/, February 2026. 4.1

  14. [14]

    Gemma 4.https://ai.google.dev/gemma/docs/core/model_card_4, April

    Google DeepMind. Gemma 4.https://ai.google.dev/gemma/docs/core/model_card_4, April

  15. [15]

    Model card; technical report forthcoming. 4.1

  16. [16]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022. 1, 3.2, 10 10 OVO-S-Bench: A Hi...

  17. [17]

    RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation

    Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, and Ivan Laptev. RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3.2, 10

  18. [18]

    ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

    Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, and Yejin Choi. ESI-Bench: Towards embodied spatial intelligence that closes the perception-action loop.arXiv preprint arXiv:2605.18746, 2026. 5

  19. [19]

    Online video understanding: Ovbench and videochat-online

    Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3328–3338, 2025. 1, 2, F.3

  20. [20]

    Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv preprint arXiv:2506.03135, 2025. 1, 2

  21. [21]

    Infinipot-v: Memory-constrained kv cache compression for streaming video understanding.Advances in Neural Information Processing Systems, 38:138983–139013, 2026

    Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding.Advances in Neural Information Processing Systems, 38:138983–139013, 2026. 4.1

  22. [22]

    Richard Landis and Gary G

    J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977. E.5.2

  23. [23]

    TopViewRS: Vision- language models as top-view spatial reasoners

    Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. TopViewRS: Vision- language models as top-view spatial reasoners. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1786–1807, 2024. 1, 2, F.1

  24. [24]

    ViewSpatial- Bench: Evaluating multi-perspective spatial under- standing of vision-language models.arXiv preprint arXiv:2505.21500, 2025

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025. 12

  25. [25]

    CODA: A real-world road corner case dataset for object detection in autonomous driving

    Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, Xiaodan Liang, Zhenguo Li, and Hang Xu. CODA: A real-world road corner case dataset for object detection in autonomous driving. InProceedings of the European Conference on Computer Vision, 2022. 3.2, 10

  26. [26]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025. 1, 2, F.2, 12

  27. [27]

    Sekai: A video dataset towards world exploration, 2025

    Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration, 2025. 3.2, 10

  28. [28]

    Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

    Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025. 1, F.2, 12

  29. [29]

    Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.Advances in Neural Information Processing Systems, 38, 2026

    Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.Advances in Neural Information Processing Systems, 38, 2026. 1, 2, 4.1, 5, F.3, 12

  30. [30]

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

    Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026. 1, 2, F.3

  31. [31]

    Spatial-ttt: Streaming visual-based spatial intelligence with test-time training.arXiv preprint arXiv:2603.12255, 2026

    Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, and Yueqi Duan. Spatial-ttt: Streaming visual-based spatial intelligence with test-time training.arXiv preprint arXiv:2603.12255, 2026. 2, 4.1 11 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

  32. [32]

    Vcbench: A streaming counting benchmark for spatial-temporal state maintenance in long videos, 2026

    Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, and Si Liu. Vcbench: A streaming counting benchmark for spatial-temporal state maintenance in long videos, 2026. 2

  33. [33]

    Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025

    Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025. 4.1

  34. [34]

    Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain.arXiv preprint arXiv:2510.17801, 2025

    Yulin Luo, Chun-Kai Fan, Menghang Dong, Jiayu Shi, Mengdi Zhao, Bo-Wen Zhang, Cheng Chi, Jiaming Liu, Gaole Dai, Rongyu Zhang, et al. Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain.arXiv preprint arXiv:2510.17801, 2025. 12

  35. [35]

    Aria everyday activities dataset.arXiv preprint arXiv:2402.13349, 2024

    Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria everyday activities dataset.arXiv preprint arXiv:2402.13349, 2024. 1

  36. [36]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488–16498, 2024. 1, 2

  37. [37]

    Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 2

  38. [38]

    Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 1, 2, 4.1, F.3

  39. [39]

    Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

    OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026. 4.1

  40. [40]

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025. 2

  41. [41]

    Streaming long video understanding with large language models.Advances in Neural Information Processing Systems, 37:119336–119360, 2024

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.Advances in Neural Information Processing Systems, 37:119336–119360, 2024. 2

  42. [42]

    Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026. Alibaba Cloud. 4.1

  43. [43]

    Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning, 2018

    Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, and Kate Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning, 2018. 3.2, 10

  44. [44]

    Out of sight, not out of context? egocentric spatial reasoning in vlms across disjoint frames

    Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D Wilson, and Balasaravanan Thoravi Kumaravel. Out of sight, not out of context? egocentric spatial reasoning in vlms across disjoint frames. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16146– 16161, 2025. 1, 2, F.2

  45. [45]

    A simple baseline for streaming video under- standing.arXiv preprint arXiv:2604.02317, 2026

    Yujiao Shen, Shulin Tian, Jingkang Yang, and Ziwei Liu. A simple baseline for streaming video under- standing.arXiv preprint arXiv:2604.02317, 2026. 2, A.4

  46. [46]

    Robobrain2.5: Depthinsight, timeinmind.arXivpreprintarXiv:2601.14352,

    Huajie Tan, Enshen Zhou, Zhiyu Li, Yijie Xu, Yuheng Ji, Xiansheng Chen, Cheng Chi, Pengwei Wang, HuizhuJia, YulongAo, etal. Robobrain2.5: Depthinsight, timeinmind.arXivpreprintarXiv:2601.14352,

  47. [47]

    4.1 12 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

  48. [48]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025. 2

  49. [49]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 4.1

  50. [50]

    Streameqa: Towards streaming video understanding for embodied scenarios.arXiv preprint arXiv:2512.04451, 2025

    Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, and Xiaoling Wang. Streameqa: Towards streaming video understanding for embodied scenarios.arXiv preprint arXiv:2512.04451, 2025. 1, F.3

  51. [51]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information processing systems, 38:13569–13597,

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information processing systems, 38:13569–13597,

  52. [52]

    Spatialscore: Towards unified evaluation for multimodal spatial understanding.arXiv e-prints, pages arXiv–2505, 2025

    Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding.arXiv e-prints, pages arXiv–2505, 2025. 12

  53. [53]

    Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37:28828– 28857, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37:28828– 28857, 2024. 2

  54. [54]

    Grok 4.1 fast and the Agent Tools API.https://x.ai/news/grok-4-1-fast, November 2025

    xAI. Grok 4.1 fast and the Agent Tools API.https://x.ai/news/grok-4-1-fast, November 2025. 4.1

  55. [55]

    Spatialtree: How spatial abilities branch out in mllms

    Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, and Bingyi Kang. Spatialtree: How spatial abilities branch out in mllms. InThe First Workshop on Efficient Spatial Reasoning,

  56. [56]

    Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026

    Yiweng Xie, Bo He, JunkeWang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026. 2, 4.1

  57. [57]

    SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

    Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang. Spatialbench: Benchmarking multimodal large language models for spatial cognition.arXiv preprint arXiv:2511.21471, 2025. 2

  58. [58]

    StreamingVLM: Real-Time Understanding for Infinite Video Streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams.arXiv preprint arXiv:2510.09608, 2025. 4.1

  59. [59]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1, 2, 3.2, 10, F.2, 12

  60. [60]

    Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025. 2, 4.1

  61. [61]

    Cambrian-s: Towards spatial supersensing in video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video. InThe Fourteenth International Conference on Learning Representations, 2025. 1, 2, 4.1, F.2

  62. [62]

    MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025. 1, 2, F.1, 12

  63. [63]

    Timechat-online: 80% visual tokens are naturally redundant in streaming videos

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025. 2

  64. [64]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025. 12

  65. [65]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025. 4.1

  66. [66]

    Streamforest: Efficient online video understanding with persistent event memory

    Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, et al. Streamforest: Efficient online video understanding with persistent event memory. Advances in Neural Information Processing Systems, 38:75804–75835, 2026. 1, 2, 4.1, F.3

  67. [67]

    Flash-vstream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024. 4.1

  68. [68]

    Flash-vstream: Efficient real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-vstream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF international conference on computer vision, pages 21059–21069, 2025. 2

  69. [69]

    HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

    Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, and Xipeng Qiu. Hermes: Kv cache as hierarchical memory for efficient streaming video understanding. arXiv preprint arXiv:2601.14724, 2026. 4.1

  70. [70]

    Open3d-vqa: A benchmark for embodied spatial concept reasoning with multimodal large language model in open space

    Weichen Zhang, Zile Zhou, Xin Zeng, Liu Xuchen, Jianjie Fang, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for embodied spatial concept reasoning with multimodal large language model in open space. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12784–12791, 2025. 2

  71. [71]

    Dsi-bench: A benchmark for dynamic spatial intelligence

    Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence. arXiv preprint arXiv:2510.18873, 2025. 2

  72. [72]

    Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models

    Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025. 2

  73. [73]

    OmniWorld: A multi-domain and multi-modal dataset for 4d world modeling

    Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. OmniWorld: A multi-domain and multi-modal dataset for 4d world modeling, 2025. 3.2, 10

  74. [74]

    About how far is the player from the bottom edge of the lowest stair step?

    The seventieth frame shows the same road and surroundings. 71. The seventy-first frame is another instance of the same scene. 72. The seventy-second frame continues the same view. 73. The seventy-third frame shows the same scene again. 74. The seventy-fourth frame is another continuation. 7 Answer: D To determine how many times this scene has appeared, I...

  75. [75]

    Within the last 300 characters: a tail Answer: X (including final answer, final)

  76. [76]

    Within the last 300 characters: a Cosmos-style <answer> X tag

  77. [77]

    therefore B

    Within the last 300 characters: a bare single letter at the very end (e.g. “...therefore B”)

  78. [78]

    Within the last 300 characters: a GLM-style <|begin_of_box|> X marker

  79. [79]

    A single letter at the start of the stripped response

  80. [80]

    Anywhere in the response: answer / choice / option(s): X with a true separator
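
The answer-extraction rules listed in entries 75 to 80 can be read as a fallback chain. The following is a minimal Python sketch of that reading, not the benchmark's actual parser. It assumes four options labeled A to D, that the rules apply in the order they are listed, that a "true separator" means a colon or equals sign, and that the start-of-response letter must be followed by punctuation or the end of the text. The name `extract_choice` and all regular expressions are hypothetical.

```python
import re
from typing import Optional

TAIL_CHARS = 300  # tail window named in the rules

# Rules checked only inside the last TAIL_CHARS characters, in listed order.
# Assumption: options are the letters A-D; widen the class for more options.
TAIL_PATTERNS = [
    r"(?i:final\s+answer|answer|final)\s*:\s*\(?([A-D])\b",  # tail "Answer: X"
    r"<answer>\s*\(?([A-D])\b",                               # <answer> X tag
    r"(?<![A-Za-z])([A-D])[\s.)\]]*$",                        # bare letter at the very end
    r"<\|begin_of_box\|>\s*\(?([A-D])\b",                     # <|begin_of_box|> X marker
]
# Assumption: a leading letter counts only if punctuation or end-of-text follows,
# so that a sentence such as "A chair is ..." is not read as option A.
START_PATTERN = r"^\(?([A-D])(?:[.):,]|\s*$)"
# Assumption: "true separator" is interpreted as ':' or '='.
ANYWHERE_PATTERN = r"(?i:answer|choice|options?)\s*[:=]\s*\(?([A-D])\b"


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if no rule matches."""
    text = response.strip()
    tail = text[-TAIL_CHARS:]
    for pattern in TAIL_PATTERNS:
        hits = re.findall(pattern, tail)
        if hits:
            return hits[-1]  # prefer the last occurrence inside the tail
    start = re.match(START_PATTERN, text)
    if start:
        return start.group(1)
    hits = re.findall(ANYWHERE_PATTERN, text)
    return hits[-1] if hits else None
```

Checking the tail rules first reflects the observation that long chain-of-thought responses usually state the final choice at the end; the start-of-response and anywhere rules act only as fallbacks.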

Showing first 80 references.