OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Hongye Hao; Jiwen Lu; Pengyiang Liu; Qi Fu; Yifei Li; Yuhang Zang; Zhongyue Shi

arxiv: 2606.03890 · v1 · pith:IGMYQJMQnew · submitted 2026-06-02 · 💻 cs.CV

Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

Pith reviewed 2026-06-28 10:59 UTC · model grok-4.3

keywords: streaming spatial intelligence · multimodal LLMs · egocentric video · allocentric mapping · spatial reasoning benchmark · prefix evaluation · MLLM evaluation

The pith

A benchmark shows top multimodal models trail humans by 27 points on streaming spatial tasks from video prefixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new evaluation for how multimodal large language models handle spatial reasoning over continuous egocentric video streams rather than complete offline videos. It creates questions tied to specific timestamps where the model receives only the preceding video segment and must draw on prior evidence for spatial structure. Testing across 38 models reveals consistent shortfalls relative to human performance, concentrated at the highest level of abstraction. This setup matters for applications that require ongoing layout awareness from partial views. The hierarchy isolates whether failures stem from basic perception, context tracking, simulation, or global mapping.

Core claim

OVO-S-Bench consists of 1,680 human-annotated questions over 348 videos, each with a query timestamp and evidence interval, evaluated under prefix-only input. Questions are organized into four increasing levels of abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 models the strongest result reaches 59.2 while humans reach 86.6, with allocentric mapping the dominant gap; streaming and spatially fine-tuned models score below their base versions, and ungrounded chain-of-thought increases spatial errors.
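
To make the item structure concrete, here is a minimal Python sketch of one question record and the prefix-only clipping rule. The field names, the level identifiers, and the choice to include the frame exactly at the query timestamp are illustrative assumptions; the source does not specify the benchmark's actual schema or loader.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical identifiers for the four levels named in the text.
LEVELS = (
    "instantaneous_egocentric_perception",
    "spatiotemporal_context_tracking",
    "spatial_simulation_and_reasoning",
    "allocentric_mapping",
)


@dataclass(frozen=True)
class StreamingQuestion:
    """One benchmark item; every field name here is an assumption."""
    video_id: str
    level: str
    question: str
    answer: str
    query_time_s: float                       # moment the question is asked
    evidence_interval_s: Tuple[float, float]  # (start, end) of the needed evidence

    def __post_init__(self) -> None:
        start, end = self.evidence_interval_s
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")
        # Assumption: the evidence must already be visible when the query arrives.
        if not (0.0 <= start <= end <= self.query_time_s):
            raise ValueError("evidence interval must lie inside the prefix")


def prefix_frame_times(frame_times_s: List[float], query_time_s: float) -> List[float]:
    """Keep frame timestamps up to the query; later frames are never shown."""
    return [t for t in frame_times_s if t <= query_time_s]
```

Validating the evidence interval against the query timestamp at construction time is one way to catch items whose answer would require frames the model never sees.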

What carries the argument

The four-level hierarchy of spatial abstraction combined with prefix-only evaluation at query timestamps, where each question specifies the exact evidence interval needed from the stream.
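
The hierarchy is what lets a score be broken down by level. A rough sketch of that breakdown follows, assuming scores are percent-correct over per-question outcomes (the source reports numbers such as 59.2 and 86.6 but does not state the metric). The function names are hypothetical and the toy records exist only to exercise the code; they are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_level(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each level to percent-correct from (level, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: 100.0 * hits[level] / totals[level] for level in totals}


def largest_gap_level(model_acc: Dict[str, float], human_acc: Dict[str, float]) -> str:
    """Return the level where the human-minus-model gap is largest."""
    return max(model_acc, key=lambda lvl: human_acc[lvl] - model_acc[lvl])


# Toy records, NOT data from the paper.
toy_model = [("perception", True), ("perception", True),
             ("mapping", False), ("mapping", True)]
toy_human = [("perception", True), ("perception", True),
             ("mapping", True), ("mapping", True)]

model_acc = accuracy_by_level(toy_model)
human_acc = accuracy_by_level(toy_human)
bottleneck = largest_gap_level(model_acc, human_acc)
```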

If this is right

Future multimodal models must improve allocentric mapping from partial egocentric streams.
Spatially fine-tuned or streaming-adapted models do not automatically gain the required capabilities.
Ungrounded chain-of-thought reasoning increases rather than reduces spatial errors in this setting.
Robotics and AR systems will require new architectures tested against prefix-only streaming benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gap may reflect a broader limitation in maintaining consistent spatial representations across time rather than isolated reasoning failures.
Similar hierarchical benchmarks could be applied to test whether the same bottlenecks appear in non-video modalities such as audio or sensor streams.
Developers of autonomous agents might prioritize explicit memory mechanisms for allocentric coordinates over general scaling.

Load-bearing premise

The multi-round annotation and blind cross-review process by 12 annotators produces questions whose difficulty and evidence requirements isolate streaming spatial intelligence rather than other skills such as object recognition or language.

What would settle it

A model that closes the 27-point overall gap to human experts, including on allocentric mapping questions, while maintaining performance on the lower levels, or a re-annotation showing that many questions can be solved without reference to the specified evidence intervals.

Original abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OVO-S-Bench brings a prefix-only streaming protocol and explicit four-level spatial hierarchy to MLLM testing, with reported gaps that look real but rest on unverified isolation of spatial factors.

The paper's core contribution is a benchmark of 1,680 questions across 348 videos that requires models to answer from video prefixes only, with each question tagged to an evidence interval and placed in one of four levels: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. This setup is new relative to existing offline or event-focused benchmarks, and the annotation effort (12 annotators, each also serving as a blind cross-reviewer, over roughly 804 person-hours) is substantial. The results on 38 models are concrete: Gemini-3.1-Pro reaches 59.2 against a human baseline of 86.6, allocentric mapping is the clearest weak point, and both streaming and spatially fine-tuned models lag their base versions. Chain-of-thought also appears to hurt when not grounded in the stream.

Those numbers and the hierarchy are the parts that could matter for robotics and AR work. The evaluation protocol itself is straightforward to implement and the human-model gap is large enough to be useful.

The soft spot is whether the questions actually isolate streaming spatial intelligence. The abstract gives no evidence that the items remain hard when the video prefix is swapped for text descriptions or when the evidence intervals are altered. If language priors or basic recognition cues suffice for many items, then the 27-point gap and the level-specific bottlenecks cannot be cleanly attributed to spatial streaming. The paper would be stronger with at least a small ablation showing that prefix removal or interval alteration drops performance as expected. Question balance across the four levels is also not detailed here, so it is hard to judge whether allocentric items are simply harder for other reasons.
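
One way such an ablation could be set up is sketched below: for each question, build the frame list under three input conditions. The condition names and the function are hypothetical; the paper does not describe these controls, and a real text-only variant would also need a textual description of the scene, which this sketch leaves out.

```python
from typing import List, Tuple

# Hypothetical condition names for the controls discussed above.
CONDITIONS = ("full_prefix", "no_evidence", "text_only")


def frames_for_condition(
    frame_times_s: List[float],
    query_time_s: float,
    evidence_interval_s: Tuple[float, float],
    condition: str,
) -> List[float]:
    """Return the frame timestamps a model would see under one condition."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "text_only":
        return []  # question text only, no visual input
    # Prefix-only rule: nothing after the query timestamp is visible.
    prefix = [t for t in frame_times_s if t <= query_time_s]
    if condition == "full_prefix":
        return prefix
    # "no_evidence": remove the frames inside the annotated evidence interval.
    start, end = evidence_interval_s
    return [t for t in prefix if not (start <= t <= end)]
```

If accuracy barely moves between `full_prefix` and `no_evidence`, the annotated interval is probably not what the model relies on for that item.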

This is a benchmark paper aimed at groups building or evaluating spatial MLLMs for continuous streams. It deserves peer review because the protocol and scale are new and the reported gaps are presented as observations rather than overclaims; a referee can check the missing validation steps and the raw question distribution. I would bring it to a reading group for the evaluation design discussion but would not cite it in my own work unless I started using the benchmark directly.

Referee Report

1 major / 0 minor

Summary. The paper introduces OVO-S-Bench, a human-annotated benchmark of 1,680 questions over 348 videos for evaluating streaming spatial intelligence in MLLMs. Each question includes a query timestamp and evidence interval, with models receiving only the video prefix up to the query. Questions are organized into four hierarchical levels of increasing abstraction (instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, allocentric mapping). Evaluation across 38 proprietary and open-source MLLMs shows Gemini-3.1-Pro at 59.2 versus human experts at 86.6, with allocentric mapping as the dominant bottleneck; streaming and spatially fine-tuned models underperform their backbones, and ungrounded chain-of-thought amplifies errors.

Significance. If the questions validly isolate streaming spatial intelligence, the benchmark would fill an important gap by targeting continuous egocentric streams rather than offline full-video or event-based evaluation, directly relevant to robotics, AR, and autonomous driving. The scale of annotation (804 person-hours with blind cross-review) and the concrete model-human gaps plus the counterintuitive underperformance of fine-tuned models would provide actionable guidance for future MLLM development.

major comments (1)

[Annotation Process] Annotation Process section: the multi-round blind cross-review by 12 annotators is described in detail, but the manuscript provides no control experiments (such as text-only question variants or evidence-interval ablations) to test whether questions remain solvable via language priors, object recognition, or other non-spatial cues. This validation step is load-bearing for the central claims, as the reported 27-point gap and level-specific bottlenecks cannot be attributed to streaming spatial intelligence without evidence that the questions isolate the intended factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of OVO-S-Bench in addressing a gap in streaming spatial evaluation. We respond to the single major comment below.

Referee: [Annotation Process] Annotation Process section: the multi-round blind cross-review by 12 annotators is described in detail, but the manuscript provides no control experiments (such as text-only question variants or evidence-interval ablations) to test whether questions remain solvable via language priors, object recognition, or other non-spatial cues. This validation step is load-bearing for the central claims, as the reported 27-point gap and level-specific bottlenecks cannot be attributed to streaming spatial intelligence without evidence that the questions isolate the intended factors.

Authors: We agree that the absence of explicit control experiments leaves open the possibility that some questions could be solved via language priors or non-spatial cues, which would weaken attribution of the observed gaps and bottlenecks specifically to streaming spatial intelligence. The manuscript's design choices—providing only the video prefix up to the query timestamp, specifying evidence intervals, and organizing questions into four levels of increasing spatial abstraction—were intended to focus evaluation on the target capabilities, and the multi-round blind cross-review was used to ensure question quality. Nevertheless, these measures do not constitute the requested controls. In the revised manuscript we will add text-only question variants (to measure language-prior solvability) and evidence-interval ablations (to measure reliance on the designated visual evidence) and report the resulting performance drops. revision: yes
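
The promised performance drops could be reported along the following lines. This is a sketch under the assumption that each question yields a correct/incorrect outcome in both the full-prefix and the ablated condition; the function names are hypothetical and the flags are toy values, not results from the paper.

```python
from typing import List, Sequence


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Percent of items answered correctly."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def ablation_drop(full_prefix: Sequence[bool], ablated: Sequence[bool]) -> float:
    """Accuracy lost when moving from the full prefix to an ablated input."""
    if len(full_prefix) != len(ablated):
        raise ValueError("both conditions must cover the same questions")
    return accuracy(full_prefix) - accuracy(ablated)


def still_solved(ablated: Sequence[bool]) -> List[int]:
    """Indices of items answered correctly even without the intended input."""
    return [i for i, ok in enumerate(ablated) if ok]


# Toy per-question outcomes, NOT data from the paper.
toy_full = [True, True, True, False]
toy_text_only = [True, False, False, False]

drop = ablation_drop(toy_full, toy_text_only)
leak_candidates = still_solved(toy_text_only)
```

A large drop supports the claim that the visual stream is needed; items that stay correct under ablation are candidates for language-prior leakage and may warrant re-annotation.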

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

The paper constructs and evaluates an empirical benchmark via human annotation (12 annotators, multi-round review) and direct model testing on 1,680 questions. No equations, parameters, or predictions are derived; the reported scores (e.g., 59.2 vs. 86.6) are raw evaluation outputs. No load-bearing self-citations, uniqueness theorems, or ansatzes appear. The work is self-contained, resting on external human annotations and model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quality and representativeness of the human annotations and the assumption that the chosen videos and questions isolate spatial streaming ability. No free parameters or invented entities are introduced.

axioms (2)

domain assumption: Human annotators, over roughly 804 person-hours of multi-round review, produce reliable ground truth for spatial reasoning questions.
  Invoked in the description of the 12-annotator process with blind cross-review.
domain assumption: The four abstraction levels (instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, allocentric mapping) form a meaningful increasing hierarchy for streaming spatial intelligence.
  Stated directly in the abstract as the structure of the benchmark.


Reference graph

Works this paper leans on

81 extracted references · 31 canonical work pages · 11 internal anchors

[1]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Ke qin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.A...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. 3.2, 10

2021
[5]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 2, 4.1, 4.2

work page arXiv 2025
[6]

Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 2

2024
[7]

Streamingtom: Streaming token compression for efficient video understanding.arXiv preprint arXiv:2510.18269, 2025

Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding.arXiv preprint arXiv:2510.18269, 2025. 4.1

work page arXiv 2025
[8]

Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026. 4.1

work page arXiv 2026
[9]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355,
[10]

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026. 4.2

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Holi-spatial: Evolving video streams into holistic 3d spatial intelligence

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, et al. Holi-spatial: Evolving video streams into holistic 3d spatial intelligence. arXiv preprint arXiv:2603.07660, 2026. 2

work page arXiv 2026
[12]

Gemini 3.1 flash-lite

Google DeepMind. Gemini 3.1 flash-lite. https://deepmind.google/models/gemini/ flash-lite/, March 2026. 4.1

2026
[13]

Gemini 3.1 pro.https://blog.google/products/gemini/gemini-3-pro/, February 2026

Google DeepMind. Gemini 3.1 pro.https://blog.google/products/gemini/gemini-3-pro/, February 2026. 4.1

2026
[14]

Gemma 4.https://ai.google.dev/gemma/docs/core/model_card_4, April

Google DeepMind. Gemma 4.https://ai.google.dev/gemma/docs/core/model_card_4, April
[15]

Model card; technical report forthcoming. 4.1
[16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022. 1, 3.2, 10 10 OVO-S-Bench: A Hi...

2022
[17]

RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation

Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, and Ivan Laptev. RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3.2, 10

2025
[18]

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, and Yejin Choi. ESI-Bench: Towards embodied spatial intelligence that closes the perception-action loop.arXiv preprint arXiv:2605.18746, 2026. 5

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Online video understanding: Ovbench and videochat-online

Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3328–3338, 2025. 1, 2, F.3

2025
[20]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv preprint arXiv:2506.03135, 2025. 1, 2

work page arXiv 2025
[21]

Infinipot-v: Memory-constrained kv cache compression for streaming video understanding.Advances in Neural Information Processing Systems, 38:138983–139013, 2026

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding.Advances in Neural Information Processing Systems, 38:138983–139013, 2026. 4.1

2026
[22]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977. E.5.2

1977
[23]

TopViewRS: Vision- language models as top-view spatial reasoners

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. TopViewRS: Vision- language models as top-view spatial reasoners. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1786–1807, 2024. 1, 2, F.1

2024
[24]

ViewSpatial- Bench: Evaluating multi-perspective spatial under- standing of vision-language models.arXiv preprint arXiv:2505.21500, 2025

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025. 12

work page arXiv 2025
[25]

CODA: A real-world road corner case dataset for object detection in autonomous driving

Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, Xiaodan Liang, Zhenguo Li, and Hang Xu. CODA: A real-world road corner case dataset for object detection in autonomous driving. InProceedings of the European Conference on Computer Vision, 2022. 3.2, 10

2022
[26]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025. 1, 2, F.2, 12

2025
[27]

Sekai: A video dataset towards world exploration, 2025

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration, 2025. 3.2, 10

2025
[28]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025. 1, F.2, 12

work page arXiv 2025
[29]

Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.Advances in Neural Information Processing Systems, 38, 2026

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.Advances in Neural Information Processing Systems, 38, 2026. 1, 2, 4.1, 5, F.3, 12

2026
[30]

Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026. 1, 2, F.3

2026
[31]

Spatial-ttt: Streaming visual-based spatial intelligence with test-time training.arXiv preprint arXiv:2603.12255, 2026

Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, and Yueqi Duan. Spatial-ttt: Streaming visual-based spatial intelligence with test-time training.arXiv preprint arXiv:2603.12255, 2026. 2, 4.1 11 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

work page arXiv 2026
[32]

Vcbench: A streaming counting benchmark for spatial-temporal state maintenance in long videos, 2026

Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, and Si Liu. Vcbench: A streaming counting benchmark for spatial-temporal state maintenance in long videos, 2026. 2

2026
[33]

Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025

Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025. 4.1

work page arXiv 2025
[34]

Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain.arXiv preprint arXiv:2510.17801, 2025

Yulin Luo, Chun-Kai Fan, Menghang Dong, Jiayu Shi, Mengdi Zhao, Bo-Wen Zhang, Cheng Chi, Jiaming Liu, Gaole Dai, Rongyu Zhang, et al. Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain.arXiv preprint arXiv:2510.17801, 2025. 12

work page arXiv 2025
[35]

Aria everyday activities dataset.arXiv preprint arXiv:2402.13349, 2024

Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria everyday activities dataset.arXiv preprint arXiv:2402.13349, 2024. 1

work page arXiv 2024
[36]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488–16498, 2024. 1, 2

2024
[37]

Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 2

2023
[38]

Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 1, 2, 4.1, F.3

2025
[39]

Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026. 4.1

2026
[40]

Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025. 2

2025
[41]

Streaming long video understanding with large language models.Advances in Neural Information Processing Systems, 37:119336–119360, 2024

Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.Advances in Neural Information Processing Systems, 37:119336–119360, 2024. 2

2024
[42]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026. Alibaba Cloud. 4.1

2026
[43]

Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning, 2018

Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, and Kate Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning, 2018. 3.2, 10

2018
[44]

Out of sight, not out of context? egocentric spatial reasoning in vlms across disjoint frames

Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D Wilson, and Balasaravanan Thoravi Kumaravel. Out of sight, not out of context? egocentric spatial reasoning in vlms across disjoint frames. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16146– 16161, 2025. 1, 2, F.2

2025
[45]

A simple baseline for streaming video under- standing.arXiv preprint arXiv:2604.02317, 2026

Yujiao Shen, Shulin Tian, Jingkang Yang, and Ziwei Liu. A simple baseline for streaming video under- standing.arXiv preprint arXiv:2604.02317, 2026. 2, A.4

work page arXiv 2026
[46]

Robobrain2.5: Depthinsight, timeinmind.arXivpreprintarXiv:2601.14352,

Huajie Tan, Enshen Zhou, Zhiyu Li, Yijie Xu, Yuheng Ji, Xiansheng Chen, Cheng Chi, Pengwei Wang, HuizhuJia, YulongAo, etal. Robobrain2.5: Depthinsight, timeinmind.arXivpreprintarXiv:2601.14352,

work page arXiv
[47]

4.1 12 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
[48]

Lvbench: An extreme long video understanding benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025. 2

2025
[49]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Streameqa: Towards streaming video understanding for embodied scenarios.arXiv preprint arXiv:2512.04451, 2025

Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, and Xiaoling Wang. Streameqa: Towards streaming video understanding for embodied scenarios.arXiv preprint arXiv:2512.04451, 2025. 1, F.3

work page arXiv 2025
[51]

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information processing systems, 38:13569–13597,

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information processing systems, 38:13569–13597,
[52]

Spatialscore: Towards unified evaluation for multimodal spatial understanding.arXiv e-prints, pages arXiv–2505, 2025

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding.arXiv e-prints, pages arXiv–2505, 2025. 12

2025
[53]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 2
[54]

Grok 4.1 fast and the Agent Tools API

xAI. Grok 4.1 fast and the Agent Tools API. https://x.ai/news/grok-4-1-fast, November 2025. 4.1
[55]

Spatialtree: How spatial abilities branch out in mllms

Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, and Bingyi Kang. Spatialtree: How spatial abilities branch out in mllms. In The First Workshop on Efficient Spatial Reasoning,
[56]

Fluxmem: Adaptive hierarchical memory for streaming video understanding

Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding. arXiv preprint arXiv:2603.02096, 2026. 2, 4.1
[57]

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang. Spatialbench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025. 2
[58]

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025. 4.1
[59]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1, 2, 3.2, 10, F.2, 12
[60]

Visual spatial tuning

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning. arXiv preprint arXiv:2511.05491, 2025. 2, 4.1
[61]

Cambrian-s: Towards spatial supersensing in video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video. In The Fourteenth International Conference on Learning Representations, 2025. 1, 2, 4.1, F.2
[62]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025. 1, 2, F.1, 12
[63]

Timechat-online: 80% visual tokens are naturally redundant in streaming videos

Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025. 2
[64]

Spatial mental modeling from limited views

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025. 12
[65]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025. 4.1
[66]

Streamforest: Efficient online video understanding with persistent event memory

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, et al. Streamforest: Efficient online video understanding with persistent event memory. Advances in Neural Information Processing Systems, 38:75804–75835, 2026. 1, 2, 4.1, F.3

[67]

Flash-vstream: Memory-based real-time understanding for long video streams

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024. 4.1
[68]

Flash-vstream: Efficient real-time understanding for long video streams

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-vstream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21059–21069, 2025. 2
[69]

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, and Xipeng Qiu. Hermes: Kv cache as hierarchical memory for efficient streaming video understanding. arXiv preprint arXiv:2601.14724, 2026. 4.1
[70]

Open3d-vqa: A benchmark for embodied spatial concept reasoning with multimodal large language model in open space

Weichen Zhang, Zile Zhou, Xin Zeng, Liu Xuchen, Jianjie Fang, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for embodied spatial concept reasoning with multimodal large language model in open space. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12784–12791, 2025. 2
[71]

Dsi-bench: A benchmark for dynamic spatial intelligence

Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence. arXiv preprint arXiv:2510.18873, 2025. 2
[72]

Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models

Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025. 2
[73]

OmniWorld: A multi-domain and multi-modal dataset for 4d world modeling

Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. OmniWorld: A multi-domain and multi-modal dataset for 4d world modeling, 2025. 3.2, 10

Appendix Content of Appendices Section A. Frame-...
[74]

About how far is the player from the bottom edge of the lowest stair step?

The seventieth frame shows the same road and surroundings. 71. The seventy-first frame is another instance of the same scene. 72. The seventy-second frame continues the same view. 73. The seventy-third frame shows the same scene again. 74. The seventy-fourth frame is another continuation. 7 Answer: D To determine how many times this scene has appeared, I...
[75]

Within the last 300 characters: a tail Answer: X (including final answer, final).
[76]

Within the last 300 characters: a Cosmos-style <answer> X tag
[77]

Within the last 300 characters: a bare single letter at the very end (e.g. “...therefore B”)
[78]

Within the last 300 characters: a GLM-style <|begin_of_box|> X marker
[79]

A single letter at the start of the stripped response
[80]

Anywhere in the response: answer / choice / option(s): X with a true separator
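
The extraction rules listed in entries [75] to [80] above read as a fallback cascade for pulling a multiple-choice letter out of free-form model output. A minimal Python sketch follows. Several details are assumptions that do not come from the source: the regular expressions themselves, the name extract_choice, the A to D option range, the reading of "true separator" as a colon or equals sign, and the choice to try the rules in the order they are listed. It is an illustration, not the benchmark's actual parser.

```python
import re
from typing import Optional

TAIL_WINDOW = 300  # "within the last 300 characters", as stated in the list
# Assumption: options are labelled A-D; an optional pair of parentheses is allowed.
LETTER = r"\(?([A-D])\)?(?![A-Za-z])"

# Hypothetical patterns, tried on the tail of the response in listed order.
TAIL_PATTERNS = [
    # tail "Answer: X", also "final answer: X" and "final: X"
    re.compile(r"(?i:final\s+answer|answer|final)\s*[:=]\s*" + LETTER),
    # Cosmos-style <answer> X tag
    re.compile(r"<answer>\s*" + LETTER),
    # bare single letter at the very end, e.g. "...therefore B"
    re.compile(r"(?<![A-Za-z])([A-D])[\s.)\]]*$"),
    # GLM-style <|begin_of_box|> X marker
    re.compile(r"<\|begin_of_box\|>\s*" + LETTER),
]
# A single letter at the start of the stripped response ("A", "A.", "(A)").
START_PATTERN = re.compile(r"\(?([A-D])\)?(?:[.):]|\s*$)")
# Anywhere: answer / choice / option(s) followed by a separator (assumed ":" or "=").
ANYWHERE_PATTERN = re.compile(r"(?i:answer|choice|options?)\s*[:=]\s*" + LETTER)


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if no rule matches."""
    tail = response[-TAIL_WINDOW:]
    for pattern in TAIL_PATTERNS:
        hits = pattern.findall(tail)
        if hits:
            return hits[-1]  # keep the occurrence closest to the end
    start = START_PATTERN.match(response.strip())
    if start:
        return start.group(1)
    hits = ANYWHERE_PATTERN.findall(response)
    return hits[-1] if hits else None


if __name__ == "__main__":
    print(extract_choice("The lamp is behind the sofa, therefore B"))
```

When several matches fall inside the tail window, the sketch keeps the last one, on the assumption that a final answer usually follows the reasoning; the source does not say which occurrence wins.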

Showing first 80 references.

[1] [1]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen2.5-VL Technical Report

Shuai Bai, Ke qin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.A...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. 3.2, 10

2021

[5] [5]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. 2, 4.1, 4.2

work page arXiv 2025

[6] [6]

Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 2

2024

[7] [7]

Streamingtom: Streaming token compression for efficient video understanding.arXiv preprint arXiv:2510.18269, 2025

Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding.arXiv preprint arXiv:2510.18269, 2025. 4.1

work page arXiv 2025

[8] [8]

Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026. 4.1

work page arXiv 2026

[9] [9]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355,

[10] [10]

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026. 4.2

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

Holi-spatial: Evolving video streams into holistic 3d spatial intelligence

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, et al. Holi-spatial: Evolving video streams into holistic 3d spatial intelligence. arXiv preprint arXiv:2603.07660, 2026. 2

work page arXiv 2026

[12] [12]

Gemini 3.1 flash-lite

Google DeepMind. Gemini 3.1 flash-lite. https://deepmind.google/models/gemini/ flash-lite/, March 2026. 4.1

2026

[13] [13]

Gemini 3.1 pro.https://blog.google/products/gemini/gemini-3-pro/, February 2026

Google DeepMind. Gemini 3.1 pro.https://blog.google/products/gemini/gemini-3-pro/, February 2026. 4.1

2026

[14] [14]

Gemma 4.https://ai.google.dev/gemma/docs/core/model_card_4, April

Google DeepMind. Gemma 4.https://ai.google.dev/gemma/docs/core/model_card_4, April

[15] [15]

Model card; technical report forthcoming. 4.1

[16] [16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022. 1, 3.2, 10 10 OVO-S-Bench: A Hi...

2022

[17] [17]

RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation

Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, and Ivan Laptev. RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3.2, 10

2025

[18] [18]

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, and Yejin Choi. ESI-Bench: Towards embodied spatial intelligence that closes the perception-action loop.arXiv preprint arXiv:2605.18746, 2026. 5

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Online video understanding: Ovbench and videochat-online

Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3328–3338, 2025. 1, 2, F.3

2025

[20] [20]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv preprint arXiv:2506.03135, 2025. 1, 2

work page arXiv 2025

[21] [21]

Infinipot-v: Memory-constrained kv cache compression for streaming video understanding.Advances in Neural Information Processing Systems, 38:138983–139013, 2026

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding.Advances in Neural Information Processing Systems, 38:138983–139013, 2026. 4.1

2026

[22] [22]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977. E.5.2

1977

[23] [23]

TopViewRS: Vision- language models as top-view spatial reasoners

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. TopViewRS: Vision- language models as top-view spatial reasoners. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1786–1807, 2024. 1, 2, F.1

2024

[24] [24]

ViewSpatial- Bench: Evaluating multi-perspective spatial under- standing of vision-language models.arXiv preprint arXiv:2505.21500, 2025

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025. 12

work page arXiv 2025

[25] [25]

CODA: A real-world road corner case dataset for object detection in autonomous driving

Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, Xiaodan Liang, Zhenguo Li, and Hang Xu. CODA: A real-world road corner case dataset for object detection in autonomous driving. InProceedings of the European Conference on Computer Vision, 2022. 3.2, 10

2022

[26] [26]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025. 1, 2, F.2, 12

2025

[27] [27]

Sekai: A video dataset towards world exploration, 2025

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration, 2025. 3.2, 10

2025

[28] [28]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025. 1, F.2, 12

work page arXiv 2025

[29] [29]

Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.Advances in Neural Information Processing Systems, 38, 2026

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.Advances in Neural Information Processing Systems, 38, 2026. 1, 2, 4.1, 5, F.3, 12

2026

[30] [30]

Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026. 1, 2, F.3

2026

[31] [31]

Spatial-ttt: Streaming visual-based spatial intelligence with test-time training.arXiv preprint arXiv:2603.12255, 2026

Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, and Yueqi Duan. Spatial-ttt: Streaming visual-based spatial intelligence with test-time training.arXiv preprint arXiv:2603.12255, 2026. 2, 4.1 11 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

work page arXiv 2026

[32] [32]

Vcbench: A streaming counting benchmark for spatial-temporal state maintenance in long videos, 2026

Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, and Si Liu. Vcbench: A streaming counting benchmark for spatial-temporal state maintenance in long videos, 2026. 2

2026

[33] [33]

Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025

Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025. 4.1

work page arXiv 2025

[34] [34]

Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain.arXiv preprint arXiv:2510.17801, 2025

Yulin Luo, Chun-Kai Fan, Menghang Dong, Jiayu Shi, Mengdi Zhao, Bo-Wen Zhang, Cheng Chi, Jiaming Liu, Gaole Dai, Rongyu Zhang, et al. Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain.arXiv preprint arXiv:2510.17801, 2025. 12

work page arXiv 2025

[35] [35]

Aria everyday activities dataset.arXiv preprint arXiv:2402.13349, 2024

Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria everyday activities dataset.arXiv preprint arXiv:2402.13349, 2024. 1

work page arXiv 2024

[36] [36]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488–16498, 2024. 1, 2

2024

[37] [37]

Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 2

2023

[38] [38]

Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 1, 2, 4.1, F.3

2025

[39] [39]

Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026. 4.1

2026

[40] [40]

Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025. 2

2025

[41] [41]

Streaming long video understanding with large language models.Advances in Neural Information Processing Systems, 37:119336–119360, 2024

Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.Advances in Neural Information Processing Systems, 37:119336–119360, 2024. 2

2024

[42] [42]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026. Alibaba Cloud. 4.1

2026

[43] [43]

Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning, 2018

Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, and Kate Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning, 2018. 3.2, 10

2018

[44] [44]

Out of sight, not out of context? egocentric spatial reasoning in vlms across disjoint frames

Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D Wilson, and Balasaravanan Thoravi Kumaravel. Out of sight, not out of context? egocentric spatial reasoning in vlms across disjoint frames. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16146– 16161, 2025. 1, 2, F.2

2025

[45] [45]

A simple baseline for streaming video under- standing.arXiv preprint arXiv:2604.02317, 2026

Yujiao Shen, Shulin Tian, Jingkang Yang, and Ziwei Liu. A simple baseline for streaming video under- standing.arXiv preprint arXiv:2604.02317, 2026. 2, A.4

work page arXiv 2026

[46] [46]

Robobrain2.5: Depthinsight, timeinmind.arXivpreprintarXiv:2601.14352,

Huajie Tan, Enshen Zhou, Zhiyu Li, Yijie Xu, Yuheng Ji, Xiansheng Chen, Cheng Chi, Pengwei Wang, HuizhuJia, YulongAo, etal. Robobrain2.5: Depthinsight, timeinmind.arXivpreprintarXiv:2601.14352,

work page arXiv

[47] [47]

4.1 12 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

[48] [48]

Lvbench: An extreme long video understanding benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025. 2

2025

[49] [49]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Streameqa: Towards streaming video understanding for embodied scenarios.arXiv preprint arXiv:2512.04451, 2025

Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, and Xiaoling Wang. Streameqa: Towards streaming video understanding for embodied scenarios.arXiv preprint arXiv:2512.04451, 2025. 1, F.3

work page arXiv 2025

[51] [51]

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information processing systems, 38:13569–13597,

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information processing systems, 38:13569–13597,

[52] [52]

Spatialscore: Towards unified evaluation for multimodal spatial understanding.arXiv e-prints, pages arXiv–2505, 2025

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding.arXiv e-prints, pages arXiv–2505, 2025. 12

2025

[53] [53]

Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37:28828– 28857, 2024

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37:28828– 28857, 2024. 2

2024

[54] [54]

Grok 4.1 fast and the Agent Tools API.https://x.ai/news/grok-4-1-fast, November 2025

xAI. Grok 4.1 fast and the Agent Tools API.https://x.ai/news/grok-4-1-fast, November 2025. 4.1

2025

[55] [55]

Spatialtree: How spatial abilities branch out in mllms

Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, and Bingyi Kang. Spatialtree: How spatial abilities branch out in mllms. InThe First Workshop on Efficient Spatial Reasoning,

[56] [56]

Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026

Yiweng Xie, Bo He, JunkeWang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026. 2, 4.1

work page arXiv 2026

[57] [57]

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang. Spatialbench: Benchmarking multimodal large language models for spatial cognition.arXiv preprint arXiv:2511.21471, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams.arXiv preprint arXiv:2510.09608, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1, 2, 3.2, 10, F.2, 12

2025

[60] [60]

Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025. 2, 4.1

work page arXiv 2025

[61] [61]

Cambrian-s: Towards spatial supersensing in video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video. InThe Fourteenth International Conference on Learning Representations, 2025. 1, 2, 4.1, F.2

2025

[62] [62]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence.arXiv preprint arXiv:2505.23764, 2025. 1, 2, F.1, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

Timechat-online: 80% visual tokens are naturally redundant in streaming videos

Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025. 2

2025

[64] [64]

Spatial mental modeling from limited views

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. InStructural Priors for Vision Workshop at ICCV’25, 2025. 12 13 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

2025

[65] [65]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025. 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[66] [66]

Streamforest: Efficient online video understanding with persistent event memory

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, et al. Streamforest: Efficient online video understanding with persistent event memory. Advances in Neural Information Processing Systems, 38:75804–75835, 2026. 1, 2, 4.1, F.3

2026

[67] [67]

arXiv preprint arXiv:2406.08085 , year=

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams.arXiv preprint arXiv:2406.08085, 2024. 4.1

work page arXiv 2024

[68] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-vstream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21059–21069, 2025.

[69] Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, and Xipeng Qiu. HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding. arXiv preprint arXiv:2601.14724, 2026.

[70] Weichen Zhang, Zile Zhou, Xin Zeng, Liu Xuchen, Jianjie Fang, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for embodied spatial concept reasoning with multimodal large language model in open space. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12784–12791, 2025.

[71] Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence. arXiv preprint arXiv:2510.18873, 2025.

[72] Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025.

[73] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. OmniWorld: A multi-domain and multi-modal dataset for 4d world modeling, 2025.

[74] About how far is the player from the bottom edge of the lowest stair step?

The seventieth frame shows the same road and surroundings. 71. The seventy-first frame is another instance of the same scene. 72. The seventy-second frame continues the same view. 73. The seventy-third frame shows the same scene again. 74. The seventy-fourth frame is another continuation. 7 Answer: D To determine how many times this scene has appeared, I...

[75] Within the last 300 characters: a tail “Answer: X” (including “final answer”, “final”).

[76] Within the last 300 characters: a Cosmos-style “<answer> X” tag.

[77] Within the last 300 characters: a bare single letter at the very end (e.g. “...therefore B”).

[78] Within the last 300 characters: a GLM-style “<|begin_of_box|> X” marker.

[79] A single letter at the start of the stripped response.

[80] Anywhere in the response: “answer / choice / option(s): X” with a true separator.
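The answer-extraction patterns listed in the entries above can be sketched as a small rule cascade. This is a minimal illustration and not the authors' implementation. The regular expressions, the name `extract_choice`, the assumption of four options labelled A to D, the reading of "true separator" as a colon, an equals sign, or the word "is", and the choice to apply the rules in the order they are listed are all assumptions made here.

```python
import re
from typing import Optional

TAIL_CHARS = 300      # window size stated in the rules
LETTER = r"([A-D])"   # assumption: four options labelled A-D

# Rules applied to the last TAIL_CHARS characters, in the listed order.
# All patterns are hypothetical; the paper's own regexes are not shown here.
TAIL_RULES = [
    # tail "Answer: X", also "Final answer: X" and "Final: X"
    re.compile(r"(?i:final\s+answer|answer|final)\s*:\s*\(?" + LETTER + r"\b"),
    # Cosmos-style <answer> X
    re.compile(r"(?i:<answer>)\s*\(?" + LETTER + r"\b"),
    # bare single letter at the very end, e.g. "...therefore B"
    re.compile(r"(?<![A-Za-z])" + LETTER + r"[\s.)]*$"),
    # GLM-style <|begin_of_box|> X
    re.compile(r"<\|begin_of_box\|>\s*\(?" + LETTER + r"\b"),
]

# A single option letter at the start of the stripped response.
START_RULE = re.compile(LETTER + r"(?![A-Za-z])")

# "answer / choice / option(s)" followed by an assumed separator and a letter.
ANYWHERE_RULE = re.compile(
    r"(?i:answer|choice|options?)\s*(?:[:=]|(?i:is))\s*\(?" + LETTER + r"\b"
)


def extract_choice(response: str) -> Optional[str]:
    """Return the predicted option letter, or None if no rule fires."""
    text = response.strip()
    tail = text[-TAIL_CHARS:]
    for rule in TAIL_RULES:
        hits = rule.findall(tail)
        if hits:
            return hits[-1]  # prefer the last occurrence in the tail
    start = START_RULE.match(text)
    if start:
        return start.group(1)
    hits = ANYWHERE_RULE.findall(text)
    return hits[-1] if hits else None
```

Restricting the first four rules to the tail keeps a letter mentioned early in a long chain of thought from overriding the final answer. The start-of-response rule in this sketch will also fire on prose that begins with the article "A", so a stricter variant might require trailing punctuation.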