
arxiv: 2606.02482 · v3 · pith:LORQE3HTnew · submitted 2026-06-01 · 💻 cs.CV

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Pith reviewed 2026-06-30 10:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-stream understanding · multimodal large language models · video benchmark · concurrent streams · signal multiplexing · cross-stream reasoning · X-Stream dataset

The pith

State-of-the-art multimodal large language models score only around 50 percent when required to reason across concurrent video streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces X-Stream as the first benchmark built specifically to test multi-stream video understanding in multimodal large language models. It assembles 4,220 question-answer pairs from 932 videos spanning multi-window, multi-view, and multi-device settings, with a dual-verification construction step designed to force models to draw information from more than one stream at once. By framing the models as naive multiplexers and measuring them against Signal Multiplexing Theory, the work shows that leading systems reach roughly 50 percent accuracy and display weak ability to initiate useful cross-stream actions. This gap matters because applications such as live broadcasting, autonomous driving, and multi-screen work all require continuous handling of several streams without falling back to any single one.

Core claim

X-Stream supplies 4,220 QA pairs across 932 videos and 11 subtasks that evaluate online cross-stream reasoning in multi-window, multi-view, and multi-device scenarios. A dual-verification pipeline ensures each pair depends on multiple streams rather than any one alone. When current MLLMs are evaluated as naive multiplexers under Signal Multiplexing Theory, they reach only about 50 percent overall and show limited proactive behavior, revealing a concrete performance ceiling for existing multiplexing schemes.

What carries the argument

Two components: the X-Stream benchmark with its dual-verification pipeline, which produces QA pairs that require simultaneous attention to several video streams, and the explicit treatment of MLLMs as naive multiplexers evaluated through Signal Multiplexing Theory.
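To make the "naive multiplexer" framing concrete, here is a minimal sketch of one possible scheme, time-division interleaving, under stated assumptions. This page does not name the paper's three multiplexing strategies, so `interleave_streams`, `SlidingMemory`, and the window size are hypothetical names and values chosen for illustration; they are not the authors' implementation. The 1-second step and the sliding memory window follow the Figure 5 caption.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# A chunk is (stream_id, second_index, payload). The payload is a stand-in
# for the visual tokens of one 1-second interval (assumption).
Chunk = Tuple[str, int, str]


def interleave_streams(streams: Dict[str, List[str]]) -> List[Chunk]:
    """Time-division multiplexing: emit one chunk per stream per second."""
    horizon = max((len(chunks) for chunks in streams.values()), default=0)
    merged: List[Chunk] = []
    for t in range(horizon):
        for stream_id in sorted(streams):  # fixed slot order within a second
            chunks = streams[stream_id]
            if t < len(chunks):  # streams may have unequal lengths
                merged.append((stream_id, t, chunks[t]))
    return merged


class SlidingMemory:
    """Keeps only the most recent `max_chunks` chunks as model context."""

    def __init__(self, max_chunks: int) -> None:
        self._buffer: Deque[Chunk] = deque(maxlen=max_chunks)

    def push(self, chunk: Chunk) -> None:
        self._buffer.append(chunk)  # the oldest chunk is dropped when full

    def context(self) -> List[Chunk]:
        return list(self._buffer)


# Toy usage: two streams of unequal length, memory window of 3 chunks.
streams = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1"]}
merged = interleave_streams(streams)
memory = SlidingMemory(max_chunks=3)
for chunk in merged:
    memory.push(chunk)
```

The sketch shows the trade-off in miniature: every added stream consumes slots in the single merged sequence, so a fixed memory window covers fewer seconds of each stream.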

If this is right

  • Existing single-stream benchmarks are insufficient for predicting performance in live multi-stream settings.
  • Multiplexing schemes in current MLLMs must be redesigned to support proactive cross-stream selection.
  • The dual-verification construction method can be reused to generate additional multi-stream datasets.
  • Real-world systems for autonomous driving or multi-screen collaboration will need new evaluation protocols that include concurrent inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that expose models to synchronized multi-stream data during pre-training could raise the observed 50 percent ceiling.
  • The same benchmark structure might be extended to audio or sensor streams to test whether the multiplexing limitation is modality-specific.
  • If the performance gap persists across model scales, architectural changes such as explicit stream routers may become necessary rather than relying on scale alone.

Load-bearing premise

The dual-verification pipeline actually succeeds in making each QA pair depend on more than one stream rather than allowing models to answer from a single dominant stream.
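One way to read this premise is as a two-sided filter. The sketch below is an assumption about what such a filter could look like, not the paper's pipeline: the page does not describe the actual checks, so `QA`, `Checker`, and `dual_verify` are hypothetical names. A candidate is kept only if it is answerable with all streams visible and not answerable from any single stream.

```python
from typing import Callable, FrozenSet, Iterable, List, NamedTuple


class QA(NamedTuple):
    question: str
    answer: str
    streams: FrozenSet[str]  # ids of the streams shown with the question


# Hypothetical checker: returns True if the question is answered correctly
# when only the `visible` streams are shown. In practice this could be a
# model or a human annotator; here it is an injected callable.
Checker = Callable[[QA, FrozenSet[str]], bool]


def dual_verify(candidates: Iterable[QA], answerable: Checker) -> List[QA]:
    kept: List[QA] = []
    for qa in candidates:
        # Check 1: the question must be solvable with every stream visible.
        if not answerable(qa, qa.streams):
            continue
        # Check 2: no single stream alone may be enough to solve it.
        if any(answerable(qa, frozenset({s})) for s in qa.streams):
            continue
        kept.append(qa)
    return kept


# Toy usage: q1 needs both streams, q2 is solvable from stream A alone.
required = {"q1": frozenset({"A", "B"}), "q2": frozenset({"A"})}


def toy_checker(qa: QA, visible: FrozenSet[str]) -> bool:
    return required[qa.question] <= visible


candidates = [
    QA("q1", "yes", frozenset({"A", "B"})),
    QA("q2", "no", frozenset({"A", "B"})),
]
kept = dual_verify(candidates, toy_checker)
```

In the toy run only `q1` survives, because `q2` passes the first check but fails the second.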

What would settle it

An experiment in which an unmodified state-of-the-art MLLM scores well above 50 percent on the full set of 11 subtasks while still operating on the same 932 videos would directly contradict the reported performance ceiling.

Figures

Figures reproduced from arXiv: 2606.02482 by Dongming Wu, Fang Liu, Huadai Liu, Huankang Guan, Jinpeng Chen, Minghong Cai, Peiwen Sun, Rui Liu, Shuhan Li, Xiangyu Yue, Xintong Guo, Xudong Lu, Yang Bo.

Figure 1
Our X-Stream, as the first multi-stream streaming benchmark, encompasses a diverse range of scenarios featuring multi-angle, multi-view, and multi-device capabilities. Icons in the figure (lost in extraction) mark balanced and imbalanced streams, same-domain and different-domain streams, and real-world and synthesized pairs.
Figure 2
The illustration of the multi-streaming task. Fig. (a) and (b) showcase practical examples in daily life. Essentially, the multi-streaming task involves multiple videos with temporal constraints and alignment, requiring the synchronization of video timestamps, as shown in Fig. (c). However, compared to multi-view and multi-angle, it also necessitates important streaming properties to fit the online appli…
Figure 3
The illustration of the 4 multi-stream abilities. To evaluate these abilities, our X-Streaming Benchmark includes 3 progressive dimensions and 11 subtasks.
Adjacent body text captured with the caption: … necessitates cross-stream reference alignment to accurately map abstract references in one stream to their corresponding concrete entities or timestamps in another. Finally, the framework culminates in cross-stream cooperation, which demands synthesizi…
Figure 4
Diversity analysis.
Algorithm 1: X-streams Benchmark Pipeline
Input: RawVideo
Output: Multi-Stream QA Benchmark
    MultiStreams = Preprocess(RawVideo);
    AllCandQA = EmptySet;
    FinalQA = EmptySet;
    for Video in MultiStreams do
        CandQA = GenerateQA(Video);
        Append(AllCandQA, CandQA);
    end
    for QA in AllCandQA do
        Clip = TrimVideo(MultiStreams, QA.Timestamp);
        if Check(Clip, QA.Question) == Correct t…
Figure 5
MLLMs can only handle one token stream at a time, making a multiplexer essential for integrating multiple video streams into one token stream. To address this, we investigate three multiplexing strategies and uncover their inherent trade-offs. During evaluation, the model sequentially processes continuous video streams in 1-second intervals while maintaining a sliding memory window for context management.…
Figure 6
The case study in our X-Stream Benchmark. We choose a 4-stream, proactive, free-form QA (yellow) and a 2-stream, proactive, multi-choice QA (green) as examples. Human verification of LLM-as-judge: To validate the effectiveness, we request both the LLM-as-judge and human experts to evaluate 200 QAs. The Spearman correlation of 0.62 (p<0.05) confirms that LLM-as-judge reliably mirrors human evaluation. See t…
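The human-verification check quoted with Figure 6 reports a Spearman correlation between LLM-as-judge and human scores. As a minimal sketch of how such a statistic is computed, the code below ranks both score lists (ties share their average rank) and takes the Pearson correlation of the ranks. The function names and the five toy scores are illustrative assumptions; they are not the paper's data and do not reproduce its 0.62 figure.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (hypothetical), one pair per evaluated QA.
human_scores = [5, 3, 4, 1, 2]
judge_scores = [4, 3, 5, 1, 2]
rho = spearman(human_scores, judge_scores)
```

This covers only the correlation coefficient; the reported p-value would need a separate significance test.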
Abstract

While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, achieving only about 50% score and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-off of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces X-Stream, the first benchmark for multi-stream streaming understanding in MLLMs. It comprises 4,220 QA pairs from 932 videos spanning 11 subtasks in multi-window, multi-view, and multi-device scenarios. A novel dual-verification pipeline is used during dataset construction to ensure questions require cross-stream reasoning rather than single-stream answers. The work frames MLLMs as naive multiplexers under Signal Multiplexing Theory and reports that state-of-the-art models achieve only ~50% accuracy with poor proactive ability on concurrent streams, exposing limitations in current multiplexing schemes and providing guidance for future multi-stream agents.

Significance. If the central empirical result holds and the benchmark questions are shown to require genuine multi-stream reasoning, the work would be significant: it fills a documented gap between existing single-stream video benchmarks and real-world applications (live sports, autonomous driving, multi-screen collaboration) that demand continuous cross-stream interaction. The framing via multiplexing theory and the online inference protocol could supply a reusable evaluation standard and concrete failure modes for next-generation MLLM agents.

major comments (2)
  1. [dataset construction paragraph] Dataset-construction paragraph: the assertion that the dual-verification pipeline 'prevents over-reliance on a single stream' is load-bearing for the headline claim that the observed ~50% score reflects multiplexing failure rather than ordinary video-understanding difficulty. No quantitative check (single-stream oracle accuracy, ablation removing one stream, or inter-annotator agreement conditioned on stream count) is described, so retained questions may still be answerable from a single stream.
  2. [results / online inference experiments] Results section (online inference experiments): the claim that SOTA MLLMs 'struggle significantly with concurrent streams' and exhibit 'poor proactive ability' rests on the 4,220 QA pairs being information-theoretically dependent on multiple streams. Without the missing single-stream controls or ablation, the performance gap cannot be attributed specifically to multiplexing rather than task hardness.
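The controls requested in these two comments can be summarized as a small report. The sketch below assumes per-question correctness flags are available for one all-streams condition and one condition per single stream; the function name, dictionary keys, and toy flags are hypothetical and are not results from the paper.

```python
from typing import Dict, List


def accuracy(flags: List[int]) -> float:
    return sum(flags) / len(flags)


def single_stream_report(
    results: Dict[str, List[int]], full_key: str = "all_streams"
) -> Dict[str, float]:
    """Compare all-stream accuracy against single-stream conditions.

    `results` maps a condition name to 0/1 correctness flags, aligned by
    question index across conditions (assumed input layout).
    """
    singles = {k: v for k, v in results.items() if k != full_key}
    full_acc = accuracy(results[full_key])
    best_single = max(accuracy(v) for v in singles.values())
    n = len(results[full_key])
    # Share of questions solved under at least one single-stream condition.
    solvable = sum(any(v[i] for v in singles.values()) for i in range(n)) / n
    return {
        "full": full_acc,
        "best_single": best_single,
        "gap": full_acc - best_single,
        "single_stream_solvable": solvable,
    }


# Toy flags for four questions (hypothetical).
toy_results = {
    "all_streams": [1, 1, 0, 1],
    "stream_A": [1, 0, 0, 0],
    "stream_B": [0, 0, 0, 1],
}
report = single_stream_report(toy_results)
```

A large `gap` together with a low `single_stream_solvable` share would support the multi-stream dependency claim; the reverse pattern would suggest the questions leak to individual streams.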

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for quantitative validation that our benchmark questions require genuine cross-stream reasoning. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [dataset construction paragraph] Dataset-construction paragraph: the assertion that the dual-verification pipeline 'prevents over-reliance on a single stream' is load-bearing for the headline claim that the observed ~50% score reflects multiplexing failure rather than ordinary video-understanding difficulty. No quantitative check (single-stream oracle accuracy, ablation removing one stream, or inter-annotator agreement conditioned on stream count) is described, so retained questions may still be answerable from a single stream.

    Authors: We agree that the dual-verification pipeline claim requires quantitative support to rule out single-stream solvability. In the revised manuscript we will add (i) single-stream oracle accuracy results on the full 4,220 QA pairs and (ii) ablations that remove one stream at inference time, together with inter-annotator agreement broken down by number of streams. These controls will directly test whether retained questions can be answered from any individual stream alone. revision: yes

  2. Referee: [results / online inference experiments] Results section (online inference experiments): the claim that SOTA MLLMs 'struggle significantly with concurrent streams' and exhibit 'poor proactive ability' rests on the 4,220 QA pairs being information-theoretically dependent on multiple streams. Without the missing single-stream controls or ablation, the performance gap cannot be attributed specifically to multiplexing rather than task hardness.

    Authors: We acknowledge that the attribution of the ~50% performance and poor proactive ability specifically to multiplexing limitations depends on confirming multi-stream dependency. The single-stream oracle and ablation experiments described in the response to the first comment will be added to the results section, allowing us to isolate the effect of concurrent streams from general task difficulty. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction and evaluation are independent of self-citations or fitted inputs.

Full rationale

The paper introduces X-Stream as a new benchmark with 4,220 QA pairs built via a described dual-verification pipeline and evaluates off-the-shelf MLLMs on it. No equations, parameters, or derivations appear in the abstract or described sections. No self-citation is invoked to justify uniqueness, multiplexing theory, or the pipeline itself. The central empirical claim (SOTA models at ~50%) is a direct measurement on the new data, not a reduction to prior fitted values or author-defined constructs. This is a standard benchmark paper: its content is self-contained and its evaluation targets external models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.1-grok · 5788 in / 1059 out tokens · 31800 ms · 2026-06-30T10:39:04.245168+00:00 · methodology


Reference graph

Works this paper leans on

75 extracted references · 37 canonical work pages · 20 internal anchors

  1. [1]

    arXiv preprint arXiv:2506.22554 (2025) 4, 5, 22, 29

    Agrawal, V., Akinyemi, A., Alvero, K., Behrooz, M., Buffalini, J., Carlucci, F.M., Chen, J., Chen, J., Chen, Z., Cheng, S., et al.: Seamless interaction: Dyadic audio- visual motion modeling and large-scale dataset. arXiv preprint arXiv:2506.22554 (2025) 4, 5, 22, 29

  2. [2]

    System card, Anthropic (Sep 2025), https : / / www - cdn

    Anthropic: Claude sonnet 4.5 system card. System card, Anthropic (Sep 2025), https : / / www - cdn . anthropic . com / 963373e433e489a87a10c823c52a0a013e9172dd.pdf2

  3. [3]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  4. [4]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

  5. [5]

    arXiv preprint arXiv:2503.07860 (2025) 4, 8

    Burgess, J., Wang, X., Zhang, Y., Rau, A., Lozano, A., Dunlap, L., Darrell, T., Yeung-Levy, S.: Video action differencing. arXiv preprint arXiv:2503.07860 (2025) 4, 8

  6. [6]

    Available at Volcengine ARK Platform (2026) 11, 12

    ByteDance: Doubao-seed-1.8. Available at Volcengine ARK Platform (2026) 11, 12

  7. [7]

    https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild- tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0 Model Card.pdf (2026), accessed: 2026- 02-26 3

    ByteDance Seed Team: Seed2.0 model card: Towards intelligence frontier for real-world complexity. https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild- tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0 Model Card.pdf (2026), accessed: 2026- 02-26 3

  8. [8]

    Chavdarova, T., Baqué, P., Bouquet, S., Maksai, A., Jose, C., Bagautdinov, T., Lettry, L., Fua, P., Van Gool, L., Fleuret, F.: Wildtrack: A multi-camera hd dataset for dense unscripted pedestrian detection–supplementary material– 22, 29

  9. [9]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chen, J., Lv, Z., Wu, S., Lin, K.Q., Song, C., Gao, D., Liu, J.W., Gao, Z., Mao, D., Shou, M.Z.: Videollm-online: Online video large language model for streaming video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18407–18418 (2024) 2, 4, 11, 12 16 P. Sun et al

  10. [10]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, J., Zeng, Z., Lin, Y., Li, W., Ma, Z., Shou, M.Z.: Livecc: Learning video llm with streaming speech transcription at scale. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29083–29095 (2025) 8

  11. [11]

    Advances in Neural Information Processing Systems37, 28662–28673 (2024) 3

    Chen, J.J., Liao, Y.C., Lin, H.C., Yu, Y.C., Chen, Y.C., Wang, F.: Rextime: A benchmark suite for reasoning-across-time in videos. Advances in Neural Information Processing Systems37, 28662–28673 (2024) 3

  12. [12]

    IEEE Transactions on Affective Computing13(2), 651–665 (2019) 22, 29

    Chen, X., Niu, L., Veeraraghavan, A., Sabharwal, A.: Faceengage: Robust estimation of gameplay engagement from user-contributed (youtube) videos. IEEE Transactions on Affective Computing13(2), 651–665 (2019) 22, 29

  13. [13]

    arXiv preprint arXiv:2503.11495 (2025) 3

    Cheng, Z., Hu, J., Liu, Z., Si, C., Li, W., Gong, S.: V-star: Benchmarking video-llms on video spatio-temporal reasoning. arXiv preprint arXiv:2503.11495 (2025) 3

  14. [14]

    Moshi: a speech-text foundation model for real-time dialogue

    Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., Zeghidour, N.: Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037 (2024) 4

  15. [15]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Fan, H., Qiao, Y., Zhen, Y., Zhao, T., Fan, B., Wang, Q.: All-day multi-camera multi- target tracking. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16892–16901 (2025) 22, 29

  16. [16]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24108–24118 (2025) 3

  17. [17]

    Available at Google DeepMind Model Cards (2025) 2, 3, 11, 12

    Google DeepMind: Gemini 3 pro model card. Available at Google DeepMind Model Cards (2025) 2, 3, 11, 12

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19383–19400 (2024) 4, 22, 29

  19. [19]

    ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

    Hasegawa, K., Imrattanatrai, W., Asada, M., Holm, S., Wang, Y., Zhou, V., Fukuda, K., Mitamura, T.: Promqa-assembly: Multimodal procedural qa dataset on assembly. arXiv preprint arXiv:2509.02949 (2025) 4, 8

  20. [20]

    John Wiley & Sons, Inc., New York, 4th edn

    Haykin, S.: Communication Systems. John Wiley & Sons, Inc., New York, 4th edn. (2001) 9

  21. [21]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 11, 12

  22. [22]

    Brain4Cars: Car That Knows Before You Do via Sensory-Fusion Deep Learning Architecture

    Jain, A., Koppula, H.S., Soh, S., Raghavan, B., Singh, A., Saxena, A.: Brain4cars: Car that knows before you do via sensory-fusion deep learning architecture. arXiv preprint arXiv:1601.00740 (2016) 22, 29

  23. [23]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024) 22, 29

  24. [24]

    LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

    Lang, S., Liu, J., He, H., Sun, P., Chen, Y., Liu, T., Yang, L., Guo, L., Zhang, H.: Longspace: Exploring long-horizon spatial memory from perception to recall in video. arXiv preprint arXiv:2606.05677 (2026) 3

  25. [25]

    Lin,J.,Fang,Z.,Chen,C.,Wan,Z.,Luo,F.,Li,P.,Liu,Y.,Sun,M.:Streamingbench: Assessingthegapformllmstoachievestreamingvideounderstanding.arXivpreprint arXiv:2411.03628 (2024) 4, 5, 8 X-Stream 17

  26. [26]

    arXiv preprint arXiv:2601.22575 (2026) 4, 5, 8, 11

    Lu, X., Guan, H., Bo, Y., Chen, J., Guo, X., Li, S., Liu, F., Sun, P., Li, X., Zhang, W., et al.: Phostream: Benchmarking real-world streaming for omnimodal assistants in mobile scenarios. arXiv preprint arXiv:2601.22575 (2026) 4, 5, 8, 11

  27. [27]

    Niu, J., Li, Y., Miao, Z., Ge, C., Zhou, Y., He, Q., Dong, X., Duan, H., Ding, S., Qian, R., et al.: Ovo-bench: How far is your video-llms from real-world online video understanding? In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18902–18913 (2025) 4, 8

  28. [28]

    OpenAI: Gpt realtime,https://developers.openai.com/api/docs/models/gpt- realtime4

  29. [29]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Park,S.Y., Cui,C.,Ma, Y.,Moradipari,A., Gupta,R.,Han, K.,Wang,Z.:Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8066–8076 (2025) 4

  30. [30]

    arXiv preprint arXiv:2511.07250 (2025) 4, 8

    Peng, T., Wang, H., Zhang, Y., Wang, Z., Wang, Z., Chang, G., Yang, J., Li, S., Wang, Y., Wang, X., et al.: Mvu-eval: Towards multi-video understanding evaluation for multimodal llms. arXiv preprint arXiv:2511.07250 (2025) 4, 8

  31. [31]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Qian, R., Ding, S., Dong, X., Zhang, P., Zang, Y., Cao, Y., Lin, D., Wang, J.: Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24045–24055 (2025) 2, 4, 11, 12

  32. [32]

    Qwen Team: Qwen3.5: Towards native multimodal agents (February 2026),https: //qwen.ai/blog?id=qwen3.53

  33. [33]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 10

  34. [34]

    A Commute in Data: The comma2k19 Dataset

    Schafer,H.,Santana, E.,Haden,A., Biasini,R.:Acommutein data:Thecomma2k19 dataset. arXiv preprint arXiv:1812.05752 (2018) 5, 22, 23, 29

  35. [35]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., Yao, A.: Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21096–21106 (2022) 4

  36. [36]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025) 2, 3, 11, 12

  37. [37]

    SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

    Sun, P., Lang, S., Wu, D., Ding, Y., Feng, K., Liu, H., Ye, Z., Liu, R., Liu, Y.H., Wang, J., et al.: Spacevista: All-scale visual spatial reasoning from mm to km. arXiv preprint arXiv:2510.09606 (2025) 3

  38. [38]

    In: The Fourteenth International Conference on Learning Representations 10

    Tang, C., Ek, S., Koch, D., Mullins, R.D., Weddell, A.S., Chauhan, J.: Surge: Surprise-guided token reduction for efficient video understanding with vlms. In: The Fourteenth International Conference on Learning Representations 10

  39. [39]

    arXiv preprint arXiv:2506.13654 (2025) 4

    Tian, S., Wang, R., Guo, H., Wu, P., Dong, Y., Wang, X., Yang, J., Zhang, H., Zhu, H., Liu, Z.: Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654 (2025) 4

  40. [40]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, lo- calization, and dense features. arXiv preprint arXiv:2502.14786 (2025) 10

  41. [41]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 3 18 P. Sun et al

  42. [42]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., Bohus, D., Feniello, A., Tekin, B., Frujeri, F.V., et al.: Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20270–20281 (2023) 4

  43. [43]

    arXiv preprint arXiv:2512.06810 (2025) 4, 8, 11, 12

    Wang, Y., Liu, S., Wang, D., Xu, N., Wan, G., Zhang, H., Zhao, D.: Mmduet2: Enhancing proactive interaction of video mllms with multi-turn reinforcement learning. arXiv preprint arXiv:2512.06810 (2025) 4, 8, 11, 12

  44. [44]

    arXiv preprint arXiv:2507.09313 (2025) 4, 5, 8

    Wang, Y., Meng, X., Wang, Y., Zhang, H., Zhao, D.: Proactivevideoqa: A com- prehensive benchmark evaluating proactive interactions in video large language models. arXiv preprint arXiv:2507.09313 (2025) 4, 5, 8

  45. [45]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, Y., Wang, Y., Chen, B., Wu, T., Zhao, D., Zheng, Z.: Omnimmi: A com- prehensive multi-modal interaction benchmark in streaming video contexts. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18925–18935 (2025) 4, 8

  46. [46]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025) 3

  47. [47]

    In: Proceedings of the AAAI conference on artificial intelligence

    Wu, D., Han, W., Liu, Y., Wang, T., Xu, C.z., Zhang, X., Shen, J.: Language prompt for autonomous driving. In: Proceedings of the AAAI conference on artificial intelligence. vol. 39, pp. 8359–8367 (2025) 4

  48. [48]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Wu, D., Wang, T., Zhang, Y., Zhang, X., Shen, J.: Onlinerefer: A simple online baseline for referring video object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2761–2770 (2023) 4

  49. [49]

    arXiv preprint arXiv:2512.03405 (2025) 4

    Wu, J., Li, S., Bian, Z., Chen, J., Wen, R., Ping, A., He, Y., Wang, J., Zhang, Y., Liu, J.: Vidic: Video difference captioning. arXiv preprint arXiv:2512.03405 (2025) 4

  50. [50]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024) 3

  51. [51]

    Streaming Video Instruction Tuning

    Xia, J., Chen, P., Zhang, M., Sun, X., Zhou, K.: Streaming video instruction tuning. arXiv preprint arXiv:2512.21334 (2025) 4

  52. [52]

    arXiv preprint arXiv:2503.08221 (2025) 4

    Xiao, J., Huang, N., Qiu, H., Tao, Z., Yang, X., Hong, R., Wang, M., Yao, A.: Egoblind: Towards egocentric visual assistance for the blind people. arXiv preprint arXiv:2503.08221 (2025) 4

  53. [53]

    Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., Lin, J.: Qwen2.5-omni technical report (2025), https://arxiv.org/abs/2503.2021511

  54. [54]

    Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., Lv, Y., Wang, Y., Guo, D., Wang, H., Ma, L., Zhang, P., Zhang, X., Hao, H., Guo, Z., Yang, B., Zhang, B., Ma, Z., Wei, X., Bai, S., Chen, K., Liu, X., Wang, P., Yang, M., Liu, D., Ren, X., Zheng, B., Men, R., Zhou, F., Yu, B., Yang, J., Yu, L., Zhou, J., Lin, J.: Qwen3...

  55. [55]

    Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125,

    Xu, R., Lin, H., Jeon, W., Feng, H., Zou, Y., Sun, L., Gorman, J., Tolstaya, E., Tang, S., White, B., et al.: Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125 (2025) 4, 8, 22, 29

  56. [56]

    StreamingVLM: Real-Time Understanding for Infinite Video Streams

    Xu, R., Xiao, G., Chen, Y., He, L., Peng, K., Lu, Y., Han, S.: Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608 (2025) 4, 8 X-Stream 19

  57. [57]

    arXiv preprint arXiv:2405.11936 (2024) 22, 29

    Xu, W., Yao, Y., Cao, J., Wei, Z., Liu, C., Wang, J., Peng, M.: Uav-visloc: A large-scale dataset for uav visual localization. arXiv preprint arXiv:2405.11936 (2024) 22, 29

  58. [58]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yang, J., Liu, S., Guo, H., Dong, Y., Zhang, X., Zhang, S., Wang, P., Zhou, Z., Xie, B., Wang, Z., et al.: Egolife: Towards egocentric life assistant. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28885–28900 (2025) 4, 5, 8, 22, 29

  59. [59]

    arXiv preprint arXiv:2502.10810 (2025) 4

    Yang, Z., Hu, Y., Du, Z., Xue, D., Qian, S., Wu, J., Yang, F., Dong, W., Xu, C.: Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810 (2025) 4

  60. [60]

    Yeh, C.H., Wang, C., Tong, S., Cheng, T.Y., Wang, R., Chu, T., Zhai, Y., Chen, Y., Gao, S., Ma, Y.: Seeing from another perspective: Evaluating multi-view understanding in MLLMs. arXiv preprint arXiv:2504.15280 (2025)

  61. [61]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Yu, T., Wang, Z., Wang, C., Huang, F., Ma, W., He, Z., Cai, T., Chen, W., Huang, Y., Zhao, Y., et al.: MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154 (2025)

  62. [62]

    Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., Tao, D.: ActivityNet-QA: A dataset for understanding complex web videos via question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 9127–9134 (2019)

  63. [63]

    Zandycke, G.V.: DeepSport – Image and Signal Processing Group (UCL) softwares. https://sites.uclouvain.be/ispgroup/Softwares/DeepSport, refs: 3, 4, 7

  64. [64]

    Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., Zhang, S.: Beyond attention or similarity: Maximizing conditional diversity for token pruning in MLLMs. arXiv preprint arXiv:2506.10967 (2025)

  65. [65]

    Zhang, Y., Shi, C., Wang, Y., Yang, S.: Eyes wide open: Ego proactive Video-LLM for streaming video. arXiv preprint arXiv:2510.14560 (2025)

    In this supplementary material, we provide two key components.

    • We release a preview version of the evaluation code in the attached compressed file.
    • We include additional information for the reader...

    Cross-stream Interference

    • Noise Filtering: target stream A + distracting stream B → answer from A. Example: "In a split-screen setting, identify what the blue-haired girl in Stream 2 needs to do while ignoring visually salient but irrelevant actions in Stream 1."
    • Contradiction Suppression: relevant cue in A + misleading cue in B → robust answer. Example: Determine whether pressing...

    Multi-stream Cooperation

    • Complementary Reasoning: clue in A + clue in B → answer. Example: "Did the driver looking at the phone (Inner) cause the lane deviation (Outer)?"
    • Multi-stream Evidence Aggregation: partial evidence from A and B → joint conclusion. Example: "Detect an abnormal event only after combining surveillance footage from two different viewpoints."

    Cross-stream Reference

    • Cross-view Localization: object/entity in A → corresponding object/entity in B. Example: "Where is the screw seen in the robotic arm view located in the global view?"
    • Temporal / Event Alignment: event in A ↔ event or state in B. Example: "What facial expression (Player) was caused by the character’s death (Game)?"

    Single-stream Understanding

    • Stream-specific Perception: query specifies one stream within a multi-stream context → answer from that stream. Example: "In Stream 2, what is the woman holding when she enters the room?"
    • Local Grounding in Context: grounding/counting/recognition in A while other streams are present. Example: In Stream 1, how many buttons are visible on the control p...

    Different angle of the same object

    • Robotics (shoulder + wrist view), manipulation failure diagnosis: "Why did the robot fail to insert the plug?" The shoulder view shows that the arm reached the socket area, while the wrist view reveals that the plug was slightly misaligned.
    • Egocentric video (ego + exo view), referring expression resolution: Which ingredient ...

    Different views of the same behavior

    • Autonomous driving (front + rear view), causal explanation of driving decisions: "Why did the vehicle avoid changing lanes despite an open front view?" The forward camera shows a clear lane, whereas the rear camera reveals a fast-approaching ambulance in the blind spot.
    • Collaborative gaming (player 1 + player 2 view): Tea...

    Different devices of the same goal

    • Geo-localization (street view + map), visual entity linking: "What is the name of the company located in the blue building?" The street view identifies the building, and the map stream links its location to the corresponding business entry.
    • Aerial inspection (drone + satellite view), structural condition assessment: Is the...

    Qwen3-VL: 28×28 pixel patches per token with token merging. 3) GPT-5: 85 tokens/frame + 170 tokens per 512×512 tile. Since these mechanisms require different adjustments, speeding and resizing videos are necessary compromises to establish a baseline. rn and rn (Cmax=250) depend on the pixel dimensions to ensure: 1) GPT: a maximum edge of 512; 2) Qwen: 511×383...
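The per-frame token costs quoted above can be turned into a quick budget check. The following is a minimal sketch, not the authors' preprocessing code: the function names are hypothetical, rounding partial tiles and partial patches up is an assumption (real preprocessors may round differently), and token merging is modeled only as "one token per 28×28 pixel region", as the text states.

```python
import math


def tile_rule_tokens(width: int, height: int,
                     base: int = 85, per_tile: int = 170,
                     tile: int = 512) -> int:
    """Tokens for one frame under the '85 + 170 per 512x512 tile' rule.

    Assumption: a partially covered tile counts as a full tile.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles


def patch_rule_tokens(width: int, height: int, patch: int = 28) -> int:
    """Tokens for one frame if each token covers one 28x28 pixel region.

    Assumption: a partially covered patch counts as a full patch.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


if __name__ == "__main__":
    print(tile_rule_tokens(512, 384))   # 255: one tile plus the base cost
    print(patch_rule_tokens(511, 383))  # 266: 19 x 14 patches
```

Under these assumptions, a frame whose longest edge is 512 pixels fits in a single tile, so the tile rule charges 85 + 170 = 255 tokens per frame.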

    Pseudo Reference: Invalid Temporal Anchoring

    Occurs when the target action is continuous or static, rendering the cross-stream temporal constraint meaningless. Since the answer is invariant to time, the specific timestamp from the reference stream becomes redundant.

    Query: "What is the person doing in Stream A when the door opens in Stream B?"
    Issue: The pe...
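The invalid temporal anchoring case described above can be illustrated with a simple check: if the answer read from the target stream is the same at every probed time, the timestamp supplied by the reference stream adds nothing. This is a hypothetical sketch, not the paper's verification pipeline; `is_time_invariant`, `answer_at`, and the toy streams are all illustrative names, and a real check would call a model or an annotation lookup in place of the lambdas.

```python
from typing import Callable, Sequence


def is_time_invariant(answer_at: Callable[[float], str],
                      probe_times: Sequence[float]) -> bool:
    """Return True when the target-stream answer is identical at every probed time.

    answer_at is a hypothetical stand-in for "what is the answer in the
    target stream at time t"; answers are compared case-insensitively.
    """
    answers = {answer_at(t).strip().lower() for t in probe_times}
    return len(answers) <= 1


if __name__ == "__main__":
    # Toy target stream: the person sits for the entire clip.
    static_stream = lambda t: "sitting"
    # Toy target stream: the action changes at t = 5 s.
    changing_stream = lambda t: "sitting" if t < 5 else "standing"

    print(is_time_invariant(static_stream, [1, 4, 8]))    # True
    print(is_time_invariant(changing_stream, [1, 4, 8]))  # False
```

A question whose target stream passes this check would be a candidate for rejection, since the cross-stream temporal constraint does not change the answer.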

    Pseudo Cooperation: Information Redundancy

    Happens when streams share overlapping fields of view or semantic content. The model can resolve the query using a single stream alone, bypassing the need for genuine multi-view fusion or collaboration.

    Query: "Identify the object held by the person."
    Issue: Due to overlapping views, the object is clearly visible in Strea...
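One way to picture the information redundancy problem described above is a single-stream ablation: answer the question from each stream in isolation and see whether any one of them already yields the reference answer. The sketch below is an assumption-laden illustration rather than the authors' method; `streams_sufficient_alone`, `answer_fn`, and the dictionary-based toy streams are hypothetical, with `answer_fn` standing in for a model call.

```python
from typing import Any, Callable, Dict, List


def streams_sufficient_alone(question: str,
                             streams: Dict[str, Any],
                             gold: str,
                             answer_fn: Callable[[str, Any], str]) -> List[str]:
    """Return the names of streams from which answer_fn alone reproduces gold.

    A non-empty result suggests the question does not require multi-stream
    fusion. answer_fn is a hypothetical stand-in for a single-stream model call.
    """
    target = gold.strip().lower()
    return [name for name, stream in streams.items()
            if answer_fn(question, stream).strip().lower() == target]


if __name__ == "__main__":
    # Toy stand-in for a model: each "stream" is a dict of visible facts.
    def toy_answer(question: str, stream: Dict[str, str]) -> str:
        return stream.get("held_object", "unknown")

    overlapping = {"Stream A": {"held_object": "cup"},
                   "Stream B": {"held_object": "cup"}}
    print(streams_sufficient_alone("Identify the object held by the person.",
                                   overlapping, "cup", toy_answer))
    # ['Stream A', 'Stream B']
```

In this toy case both streams answer the query on their own, which mirrors the overlapping-view situation the text flags as pseudo cooperation.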