IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

Honglei Yan; Jinzhao Li; Miao Liu; Panwang Pan; Wenxuan Song; Yichi Zhang; Yijia Lei; Yinuo Chen

arxiv: 2605.27074 · v1 · pith:3X7BXFJCnew · submitted 2026-05-26 · 💻 cs.CV

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

Jinzhao Li , Yinuo Chen , Wenxuan Song , Yijia Lei , Yichi Zhang , Honglei Yan , Panwang Pan , Miao Liu This is my paper

Pith reviewed 2026-06-29 18:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language modelsproactive intelligencestreaming videointeractive benchmarkagent frameworkreactive-proactive coordinationtemporal gating

0 comments

The pith

Existing MLLMs show unstable proactive triggering and weak coordination in streaming video, which a training-free agent with control policies and temporal gating can stabilize.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IPIBench as the first benchmark designed to test multimodal large language models on proactive monitoring, task management, and interleaved reactive-proactive requests inside continuous video streams. It demonstrates through evaluation that current models frequently fail to decide when to act on their own and struggle to balance those actions with user-driven queries. The authors then present IPI-Agent, a training-free framework that adds an interaction-control policy and temporal-gating mechanism to address these issues. A sympathetic reader would care because real assistants must operate over ongoing visual input rather than isolated questions, and the benchmark moves evaluation closer to that requirement. The work establishes both the measurement tool and a practical improvement method that works across existing models.

Core claim

The paper claims that representative MLLMs exhibit two core limitations under continuous visual streams: unstable proactive triggering and weak coordination between reactive and proactive behaviors, and that IPI-Agent, built around an interaction-control policy and temporal-gating mechanism, consistently improves performance on all IPIBench settings without any model fine-tuning or task-specific adaptation.

What carries the argument

IPIBench benchmark covering proactive monitoring, proactive task management, and interleaved reactive-proactive requests in streaming video, paired with IPI-Agent, a training-free agentic framework that applies an interaction-control policy and temporal-gating mechanism to stabilize triggering and coordinate behaviors.

If this is right

MLLMs equipped with IPI-Agent show measurable gains in proactive stability and behavioral coordination across all benchmark settings.
The interaction-control policy enables reliable decision-making about when to initiate proactive actions during multi-turn streams.
Temporal gating improves separation and integration of reactive queries with ongoing proactive monitoring.
Existing models can reach better performance on dynamic streaming tasks without retraining.
The benchmark provides a standardized way to measure progress on interactive proactive intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar control policies could be tested on audio or sensor streams to support always-on assistants in other modalities.
Production systems might adopt the gating approach to lower false-positive proactive interventions without custom training data.
The benchmark structure could guide creation of longer-horizon interaction datasets that track request history over extended sessions.
If the mechanisms prove robust, they may reduce reliance on expensive fine-tuning loops for interaction-aware models.

Load-bearing premise

The interaction-control policy and temporal-gating mechanism can be applied generally to stabilize proactive triggering and coordinate behaviors across different MLLMs without model-specific fine-tuning or benchmark-specific adaptation.

What would settle it

Applying the IPI-Agent framework to an MLLM outside the original evaluation set on a new streaming video task where proactive triggering remains unstable or reactive-proactive coordination does not improve would falsify the general applicability claim.

Figures

Figures reproduced from arXiv: 2605.27074 by Honglei Yan, Jinzhao Li, Miao Liu, Panwang Pan, Wenxuan Song, Yichi Zhang, Yijia Lei, Yinuo Chen.

**Figure 1.** Figure 1: Visual examples of our proposed IPI-Bench. For single-turn proactive monitoring tasks, our benchmark covers timing, understanding, and repeated proactiveness. For multi-turn interactive proactive tasks, the benchmark further addresses proactive task management (e.g., modification, cancellation, and multi-task management) and interleaved reactive–proactive requests. benchmark for evaluating Interactive Proa… view at source ↗

**Figure 2.** Figure 2: Statistics of our IPIBench. Left: Distribution of the three formulated task categories and [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of IPI-Agent, a training-free agentic framework for interactive proactive behavior under streaming settings. IPI-Agent operates under two workflows: Query Arrival and Continuous Monitoring. The framework maintains a unified Interaction-Control Policy through a Memory Tool and an Intent Router to coordinate reactive, proactive, and management instructions. In addition, IPI-Agent introduces a Tempor… view at source ↗

read the original abstract

Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New benchmark for proactive streaming MLLM behavior is the core addition, but the abstract supplies no construction details or results to check the claims.

read the letter

The paper's main move is defining IPIBench to test MLLMs on continuous video streams where proactive monitoring, task management, and mixed reactive-proactive turns happen together. That setting is not covered by existing single-turn or isolated benchmarks, so the gap it names is real. The abstract also flags two concrete problems in current models—unstable triggering of proactive actions and poor coordination with reactive queries—and offers IPI-Agent as a training-free fix using an interaction-control policy plus temporal gating.

What stands out is the attempt to move evaluation closer to real assistant use cases with ongoing visual input and user interruptions. The training-free angle for the agent framework is a practical choice if it works.

The soft spot is the complete lack of supporting material. No description of how the benchmark videos or queries were collected, no dataset statistics, no metric definitions, and no numbers on the reported limitations or improvements. Without those, the statements about consistent gains across settings cannot be checked for selection effects or hidden assumptions in the policy. The interaction-control and gating mechanisms are named but not shown, so it is impossible to see whether they are general or tuned to the benchmark.

This work is aimed at researchers building streaming multimodal agents who need evaluation tools beyond static QA. A reader already working on proactive or multi-turn MLLM setups could get value from the problem framing once the full construction details appear. Right now the evidence is too thin for a serious referee to spend time on; the paper would need the dataset, metrics, and results sections filled in before it merits review.

Referee Report

2 major / 0 minor

Summary. The paper introduces IPIBench as the first benchmark for evaluating interactive proactive intelligence of MLLMs under continuous streaming video inputs, covering proactive monitoring, task management, and interleaved reactive-proactive requests. It identifies two limitations in existing MLLMs (unstable proactive triggering and weak coordination between reactive and proactive behaviors) and proposes IPI-Agent, a training-free agentic framework using an interaction-control policy and temporal-gating mechanism, claiming consistent improvements across all benchmark settings.

Significance. If the benchmark construction, metrics, and experimental results hold up under scrutiny, the work would address a genuine gap in evaluating dynamic, multi-turn proactive capabilities for real-world streaming assistants, providing a new evaluation resource and a simple baseline agent framework.

major comments (2)

No details are provided on IPIBench construction (dataset sources, video stream generation, annotation protocol, or task distribution statistics), which is load-bearing for the central claim that existing MLLMs exhibit the two stated limitations.
The abstract states that IPI-Agent 'consistently improves existing MLLMs across all benchmark settings' but supplies no quantitative results, ablation studies on the interaction-control policy or temporal-gating mechanism, or comparison against fine-tuned baselines, preventing verification that the gains are not due to benchmark-specific tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work introducing IPIBench and the IPI-Agent framework. We address each major comment below with clarifications and plans for revision where appropriate.

read point-by-point responses

Referee: No details are provided on IPIBench construction (dataset sources, video stream generation, annotation protocol, or task distribution statistics), which is load-bearing for the central claim that existing MLLMs exhibit the two stated limitations.

Authors: We agree that additional details on benchmark construction would strengthen the paper and support the claims about MLLM limitations. The current manuscript provides an overview in Section 3, but we will expand it in revision to explicitly include: sources of the video streams (e.g., public datasets and synthetic generation methods), the protocol for creating continuous streaming inputs with interleaved requests, the full annotation guidelines and inter-annotator agreement, and detailed task distribution statistics (e.g., proportions of proactive monitoring, task management, and reactive-proactive interleaving). This will be added as a new subsection or appendix to ensure reproducibility. revision: yes
Referee: The abstract states that IPI-Agent 'consistently improves existing MLLMs across all benchmark settings' but supplies no quantitative results, ablation studies on the interaction-control policy or temporal-gating mechanism, or comparison against fine-tuned baselines, preventing verification that the gains are not due to benchmark-specific tuning.

Authors: The abstract summarizes the key finding, while the full quantitative results (including per-setting improvements for multiple MLLMs) are reported in Section 5 with accompanying tables. However, we acknowledge the value of component ablations and additional baselines. In the revised manuscript, we will add an ablation study isolating the interaction-control policy and temporal-gating mechanism, along with comparisons to relevant training-free and fine-tuned baselines where feasible. As IPI-Agent is explicitly training-free, we will clarify that the primary contribution is the agentic framework rather than end-to-end fine-tuning, but the added experiments will address concerns about benchmark-specific effects. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a new benchmark (IPIBench) and a training-free agentic framework (IPI-Agent) with an interaction-control policy and temporal-gating mechanism. No equations, derivations, fitted parameters, or self-citation chains are present in the provided abstract or description that reduce any claimed result to its inputs by construction. The central claims rest on empirical evaluations of existing MLLMs and the proposed framework's improvements, which are presented as independent experimental outcomes rather than self-referential definitions or renamings. This is a standard benchmark/framework paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5714 in / 1139 out tokens · 31021 ms · 2026-06-29T18:22:36.540916+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoSAT: A Comprehensive Benchmark of Egocentric Streaming Interaction Understanding
cs.CV 2026-06 unverdicted novelty 7.0

EgoSAT is the first benchmark unifying retrospective, online, and prospective reasoning tasks in egocentric streaming video to evaluate VLMs, revealing struggles with temporal modeling and mis-calibration.

Reference graph

Works this paper leans on

63 extracted references · 28 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[1]

S. Azad, V . Vineet, and Y . S. Rawat. Streamready: Learning what to answer and when in long streaming videos.arXiv preprint arXiv:2603.08620, 2026

work page arXiv 2026
[2]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report.ArXiv, abs/2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Bärmann and A

L. Bärmann and A. Waibel. Where did i leave my keys?-episodic-memory-based question answering on egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1560–1568, 2022

2022
[5]

M. Cai, R. Tan, J. Zhang, B. Zou, K. Zhang, F. Yao, F. Zhu, J. Gu, Y . Zhong, Y . Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

work page arXiv 2024
[6]

J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

2024
[7]

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

2024
[8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced rea- soning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

X. Ding, H. Wu, Y . Yang, S. Jiang, Q. Zhang, D. Bai, Z. Chen, and T. Cao. Streammind: Un- locking full frame rate streaming video dialogue through event-gated cognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13448–13459, 2025

2025
[10]

Driess, F

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023

2023
[11]

Epstein, B

D. Epstein, B. Chen, and C. V ondrick. Oops! predicting unintentional action in video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 919–929, 2020

2020
[12]

J. Gao, C. Sun, Z. Yang, and R. Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017

2017
[13]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022
[14]

C. Gu, C. Sun, D. A. Ross, C. V ondrick, C. Pantofaru, Y . Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018. 10

2018
[15]

W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Y . Hu, Z. Yang, S. Wang, S. Qian, B. Wen, F. Yang, T. Gao, and C. Xu. Streamingcot: A dataset for temporal dynamics and multimodal chain-of-thought reasoning in streaming videoqa. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13464–13470, 2025

2025
[17]

Huang, X

Z. Huang, X. Li, J. Li, J. Wang, X. Zeng, C. Liang, T. Wu, X. Chen, L. Li, and L. Wang. Online video understanding: Ovbench and videochat-online. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3328–3338, 2025

2025
[18]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

in the wild

H. Idrees, A. R. Zamir, Y .-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The thumos challenge on action recognition for videos “in the wild”.Computer Vision and Image Understanding, 155:1–23, 2017

2017
[20]

J. Kim, M. S. Kim, J. Chung, J. Cho, J. Kim, S. Kim, G. Sim, and Y . Yu. Egospeak: learning when to speak for egocentric conversational agents in the wild. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2990–3005, 2025

2025
[21]

J. Lei, T. L. Berg, and M. Bansal. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858, 2021

2021
[22]

B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

M. Li, Y . Zhang, D. Long, K. Chen, S. Song, S. Bai, Z. Yang, P. Xie, A. Yang, D. Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

W. Li, B. Hu, R. Shao, L. Shen, and L. Nie. Lion-fs: Fast & slow video-language thinker as online video assistant. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3240–3251, 2025

2025
[25]

J. Lin, Z. Fang, C. Chen, H. Cheng, Z. Wan, F. Luo, Z. Wang, P. Li, Y . Liu, and M. Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026

2026
[26]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning, 2023

2023
[27]

Z. Liu, Y . Dong, Z. Liu, W. Hu, J. Lu, and Y . Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. InThe Thirteenth International Conference on Learning Representations
[28]

M. Maaz, H. Rasheed, S. Khan, and F. Khan. Video-chatgpt: Towards detailed video under- standing via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

2024
[29]

M. Maaz, H. Rasheed, S. Khan, and F. Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding.arXiv preprint arXiv:2406.09418, 2024

work page arXiv 2024
[30]

J. Niu, Y . Li, Z. Miao, C. Ge, Y . Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 11

2025
[31]

R. Qian, S. Ding, X. Dong, P. Zhang, Y . Zang, Y . Cao, D. Lin, and J. Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025

2025
[32]

R. Qian, S. Ding, X. Dong, P. Zhang, Y . Zang, Y . Cao, D. Lin, and J. Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction, 2025

2025
[33]

Y . Shi, Q. Zhao, T. Jiang, X. Zeng, Y . Wang, and L. Wang. River: A real-time interaction bench- mark for video llms. InThe Fourteenth International Conference on Learning Representations
[34]

OpenAI GPT-5 System Card

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

H. Tang, K. J. Liang, K. Grauman, M. Feiszli, and W. Wang. Egotracks: A long-term egocentric visual object tracking dataset.Advances in Neural Information Processing Systems, 36:75716– 75739, 2023

2023
[36]

Y . Tang, D. Ding, Y . Rao, Y . Zheng, D. Zhang, L. Zhao, J. Lu, and J. Zhou. Coin: A large- scale dataset for comprehensive instructional video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019

2019
[37]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Q. Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

G. Tom, M. Mathew, S. Garcia-Bordils, D. Karatzas, and C. Jawahar. Reading between the lanes: Text videoqa on the road. InInternational Conference on Document Analysis and Recognition, pages 137–154. Springer, 2023

2023
[40]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Y . Wang, X. Li, Z. Yan, Y . He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Y . Wang, S. Liu, D. Wang, N. Xu, G. Wan, H. Zhang, and D. Zhao. Mmduet2: Enhancing proactive interaction of video mllms with multi-turn reinforcement learning.arXiv preprint arXiv:2512.06810, 2025

work page arXiv 2025
[43]

Y . Wang, X. Meng, Y . Wang, J. Liang, J. Wei, H. Zhang, and D. Zhao. Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format. arXiv preprint arXiv:2411.17991, 1(3):5, 2024

work page arXiv 2024
[44]

Y . Wang, X. Meng, Y . Wang, H. Zhang, and D. Zhao. Proactivevideoqa: A comprehensive benchmark evaluating proactive interactions in video large language models.arXiv preprint arXiv:2507.09313, 2025

work page arXiv 2025
[45]

Y . Wang, Y . Wang, B. Chen, T. Wu, D. Zhao, and Z. Zheng. Omnimmi: A comprehensive multi- modal interaction benchmark in streaming video contexts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18925–18935, 2025

2025
[46]

Z. Wen, Y . Wang, C. Liao, B. Yang, J. Li, W. Liu, H. He, B. Feng, X. Liu, Y . Lyu, X. Zheng, X. Hu, and L. Zhang. Ai for service: Proactive assistance with ai glasses, 2025

2025
[47]

J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser. Tidybot: personalized robot assistance with large language models.Autonomous Robots, 47(8):1087–1102, Nov. 2023. 12

2023
[48]

S. Wu, J. Chen, K. Q. Lin, Q. Wang, Y . Gao, Q. Xu, T. Xu, Y . Hu, E. Chen, and M. Z. Shou. Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation. Advances in Neural Information Processing Systems, 37:109922–109947, 2024

2024
[49]

J. Xia, P. Chen, M. Zhang, X. Sun, and K. Zhou. Streaming video instruction tuning.arXiv preprint arXiv:2512.21334, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Streaming video understanding and multi-round interaction with memory-enhanced knowl- edge.arXiv preprint arXiv:2501.13468, 2025

H. Xiong, Z. Yang, J. Yu, Y . Zhuge, L. Zhang, J. Zhu, and H. Lu. Streaming video un- derstanding and multi-round interaction with memory-enhanced knowledge.arXiv preprint arXiv:2501.13468, 2025

work page arXiv 2025
[51]

M. Xu, M. Gao, S. Li, J. Lu, Z. Gan, Z. Lai, M. Cao, K. Kang, Y . Yang, and A. Dehghan. Slowfast-llava-1.5: A family of token-efficient video large language models for long-form video understanding.arXiv preprint arXiv:2503.18943, 2025

work page arXiv 2025
[52]

R. Xu, G. Xiao, Y . Chen, L. He, K. Peng, Y . Lu, and S. Han. Streamingvlm: Real-time understanding for infinite video streams, 2025

2025
[53]

S. Xun, S. Tao, J. Li, Y . Shi, Z. Lin, Z. Zhu, Y . Yan, H. Li, L. Zhang, S. Wang, et al. Rtv-bench: Benchmarking mllm continuous perception, understanding and reasoning through real-time video. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track
[54]

B. Yang, B. Wen, B. Ding, C. Liu, C. Chu, C. Song, C. Rao, C. Yi, D. Li, D. Zang, et al. Kwai keye-vl 1.5 technical report.arXiv preprint arXiv:2509.01563, 2025

work page arXiv 2025
[55]

H. Yang, F. Tang, L. Zhao, X. An, M. Hu, H. Li, X. Zhuang, Y . Lu, X. Zhang, A. Swikir, et al. Streamagent: Towards anticipatory agents for streaming video understanding.arXiv preprint arXiv:2508.01875, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Y . Yang, Z. Zhao, S. N. Shukla, A. Singh, S. K. Mishra, L. Zhang, and M. Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding, 2025

2025
[57]

Z. Yang, Y . Hu, Z. Du, D. Xue, S. Qian, J. Wu, F. Yang, W. Dong, and C. Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding.arXiv preprint arXiv:2502.10810, 2025

work page arXiv 2025
[58]

Zhang, Y

H. Zhang, Y . Wang, Y . Tang, Y . Liu, J. Feng, J. Dai, and X. Jin. Flash-vstream: Memory-based real-time understanding for long video streams.arXiv preprint arXiv:2406.08085, 2024

work page arXiv 2024
[59]

Long Context Transfer from Language to Vision

P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y . Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Zhang, X

Y . Zhang, X. L. Dong, Z. Lin, A. Madotto, A. Kumar, B. Damavandi, J. Chai, and S. Moon. Proactive assistant dialogue generation from streaming egocentric videos. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12055–12079, 2025

2025
[61]

Zhang, C

Y . Zhang, C. Shi, Y . Wang, and S. Yang. Eyes wide open: Ego proactive video-llm for streaming video.arXiv preprint arXiv:2510.14560, 2025

work page arXiv 2025
[62]

L. Zhou, C. Xu, and J. Corso. Towards automatic learning of procedures from web instructional videos. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[63]

J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y . Duan, W. Su, J. Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

S. Azad, V . Vineet, and Y . S. Rawat. Streamready: Learning what to answer and when in long streaming videos.arXiv preprint arXiv:2603.08620, 2026

work page arXiv 2026

[2] [2]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report.ArXiv, abs/2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Bärmann and A

L. Bärmann and A. Waibel. Where did i leave my keys?-episodic-memory-based question answering on egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1560–1568, 2022

2022

[5] [5]

M. Cai, R. Tan, J. Zhang, B. Zou, K. Zhang, F. Yao, F. Zhu, J. Gu, Y . Zhong, Y . Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

work page arXiv 2024

[6] [6]

J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

2024

[7] [7]

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

2024

[8] [8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced rea- soning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

X. Ding, H. Wu, Y . Yang, S. Jiang, Q. Zhang, D. Bai, Z. Chen, and T. Cao. Streammind: Un- locking full frame rate streaming video dialogue through event-gated cognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13448–13459, 2025

2025

[10] [10]

Driess, F

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023

2023

[11] [11]

Epstein, B

D. Epstein, B. Chen, and C. V ondrick. Oops! predicting unintentional action in video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 919–929, 2020

2020

[12] [12]

J. Gao, C. Sun, Z. Yang, and R. Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017

2017

[13] [13]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022

[14] [14]

C. Gu, C. Sun, D. A. Ross, C. V ondrick, C. Pantofaru, Y . Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018. 10

2018

[15] [15]

W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Y . Hu, Z. Yang, S. Wang, S. Qian, B. Wen, F. Yang, T. Gao, and C. Xu. Streamingcot: A dataset for temporal dynamics and multimodal chain-of-thought reasoning in streaming videoqa. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13464–13470, 2025

2025

[17] [17]

Huang, X

Z. Huang, X. Li, J. Li, J. Wang, X. Zeng, C. Liang, T. Wu, X. Chen, L. Li, and L. Wang. Online video understanding: Ovbench and videochat-online. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3328–3338, 2025

2025

[18] [18]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

in the wild

H. Idrees, A. R. Zamir, Y .-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The thumos challenge on action recognition for videos “in the wild”.Computer Vision and Image Understanding, 155:1–23, 2017

2017

[20] [20]

J. Kim, M. S. Kim, J. Chung, J. Cho, J. Kim, S. Kim, G. Sim, and Y . Yu. Egospeak: learning when to speak for egocentric conversational agents in the wild. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2990–3005, 2025

2025

[21] [21]

J. Lei, T. L. Berg, and M. Bansal. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858, 2021

2021

[22] [22]

B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

M. Li, Y . Zhang, D. Long, K. Chen, S. Song, S. Bai, Z. Yang, P. Xie, A. Yang, D. Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[24] [24]

W. Li, B. Hu, R. Shao, L. Shen, and L. Nie. Lion-fs: Fast & slow video-language thinker as online video assistant. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3240–3251, 2025

2025

[25] [25]

J. Lin, Z. Fang, C. Chen, H. Cheng, Z. Wan, F. Luo, Z. Wang, P. Li, Y . Liu, and M. Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026

2026

[26] [26]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning, 2023

2023

[27] [27]

Z. Liu, Y . Dong, Z. Liu, W. Hu, J. Lu, and Y . Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. InThe Thirteenth International Conference on Learning Representations

[28] [28]

M. Maaz, H. Rasheed, S. Khan, and F. Khan. Video-chatgpt: Towards detailed video under- standing via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

2024

[29] [29]

M. Maaz, H. Rasheed, S. Khan, and F. Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding.arXiv preprint arXiv:2406.09418, 2024

work page arXiv 2024

[30] [30]

J. Niu, Y . Li, Z. Miao, C. Ge, Y . Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 11

2025

[31] [31]

R. Qian, S. Ding, X. Dong, P. Zhang, Y . Zang, Y . Cao, D. Lin, and J. Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025

2025

[32] [32]

R. Qian, S. Ding, X. Dong, P. Zhang, Y . Zang, Y . Cao, D. Lin, and J. Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction, 2025

2025

[33] [33]

Y . Shi, Q. Zhao, T. Jiang, X. Zeng, Y . Wang, and L. Wang. River: A real-time interaction bench- mark for video llms. InThe Fourteenth International Conference on Learning Representations

[34] [34]

OpenAI GPT-5 System Card

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

H. Tang, K. J. Liang, K. Grauman, M. Feiszli, and W. Wang. Egotracks: A long-term egocentric visual object tracking dataset.Advances in Neural Information Processing Systems, 36:75716– 75739, 2023

2023

[36] [36]

Y . Tang, D. Ding, Y . Rao, Y . Zheng, D. Zhang, L. Zhao, J. Lu, and J. Zhou. Coin: A large- scale dataset for comprehensive instructional video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019

2019

[37] [37]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Q. Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

G. Tom, M. Mathew, S. Garcia-Bordils, D. Karatzas, and C. Jawahar. Reading between the lanes: Text videoqa on the road. InInternational Conference on Document Analysis and Recognition, pages 137–154. Springer, 2023

2023

[40] [40]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Y . Wang, X. Li, Z. Yan, Y . He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Y . Wang, S. Liu, D. Wang, N. Xu, G. Wan, H. Zhang, and D. Zhao. Mmduet2: Enhancing proactive interaction of video mllms with multi-turn reinforcement learning.arXiv preprint arXiv:2512.06810, 2025

work page arXiv 2025

[43] [43]

Y . Wang, X. Meng, Y . Wang, J. Liang, J. Wei, H. Zhang, and D. Zhao. Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format. arXiv preprint arXiv:2411.17991, 1(3):5, 2024

work page arXiv 2024

[44] [44]

Y . Wang, X. Meng, Y . Wang, H. Zhang, and D. Zhao. Proactivevideoqa: A comprehensive benchmark evaluating proactive interactions in video large language models.arXiv preprint arXiv:2507.09313, 2025

work page arXiv 2025

[45] [45]

Y . Wang, Y . Wang, B. Chen, T. Wu, D. Zhao, and Z. Zheng. Omnimmi: A comprehensive multi- modal interaction benchmark in streaming video contexts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18925–18935, 2025

2025

[46] [46]

Z. Wen, Y . Wang, C. Liao, B. Yang, J. Li, W. Liu, H. He, B. Feng, X. Liu, Y . Lyu, X. Zheng, X. Hu, and L. Zhang. Ai for service: Proactive assistance with ai glasses, 2025

2025

[47] [47]

J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser. Tidybot: personalized robot assistance with large language models.Autonomous Robots, 47(8):1087–1102, Nov. 2023. 12

2023

[48] [48]

S. Wu, J. Chen, K. Q. Lin, Q. Wang, Y . Gao, Q. Xu, T. Xu, Y . Hu, E. Chen, and M. Z. Shou. Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation. Advances in Neural Information Processing Systems, 37:109922–109947, 2024

2024

[49] [49]

J. Xia, P. Chen, M. Zhang, X. Sun, and K. Zhou. Streaming video instruction tuning.arXiv preprint arXiv:2512.21334, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Streaming video understanding and multi-round interaction with memory-enhanced knowl- edge.arXiv preprint arXiv:2501.13468, 2025

H. Xiong, Z. Yang, J. Yu, Y . Zhuge, L. Zhang, J. Zhu, and H. Lu. Streaming video un- derstanding and multi-round interaction with memory-enhanced knowledge.arXiv preprint arXiv:2501.13468, 2025

work page arXiv 2025

[51] [51]

M. Xu, M. Gao, S. Li, J. Lu, Z. Gan, Z. Lai, M. Cao, K. Kang, Y . Yang, and A. Dehghan. Slowfast-llava-1.5: A family of token-efficient video large language models for long-form video understanding.arXiv preprint arXiv:2503.18943, 2025

work page arXiv 2025

[52] [52]

R. Xu, G. Xiao, Y . Chen, L. He, K. Peng, Y . Lu, and S. Han. Streamingvlm: Real-time understanding for infinite video streams, 2025

2025

[53] [53]

S. Xun, S. Tao, J. Li, Y . Shi, Z. Lin, Z. Zhu, Y . Yan, H. Li, L. Zhang, S. Wang, et al. Rtv-bench: Benchmarking mllm continuous perception, understanding and reasoning through real-time video. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

[54] [54]

B. Yang, B. Wen, B. Ding, C. Liu, C. Chu, C. Song, C. Rao, C. Yi, D. Li, D. Zang, et al. Kwai keye-vl 1.5 technical report.arXiv preprint arXiv:2509.01563, 2025

work page arXiv 2025

[55] [55]

H. Yang, F. Tang, L. Zhao, X. An, M. Hu, H. Li, X. Zhuang, Y . Lu, X. Zhang, A. Swikir, et al. Streamagent: Towards anticipatory agents for streaming video understanding.arXiv preprint arXiv:2508.01875, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

Y . Yang, Z. Zhao, S. N. Shukla, A. Singh, S. K. Mishra, L. Zhang, and M. Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding, 2025

2025

[57] [57]

Z. Yang, Y . Hu, Z. Du, D. Xue, S. Qian, J. Wu, F. Yang, W. Dong, and C. Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding.arXiv preprint arXiv:2502.10810, 2025

work page arXiv 2025

[58] [58]

Zhang, Y

H. Zhang, Y . Wang, Y . Tang, Y . Liu, J. Feng, J. Dai, and X. Jin. Flash-vstream: Memory-based real-time understanding for long video streams.arXiv preprint arXiv:2406.08085, 2024

work page arXiv 2024

[59] [59]

Long Context Transfer from Language to Vision

P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y . Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

Zhang, X

Y . Zhang, X. L. Dong, Z. Lin, A. Madotto, A. Kumar, B. Damavandi, J. Chai, and S. Moon. Proactive assistant dialogue generation from streaming egocentric videos. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12055–12079, 2025

2025

[61] [61]

Zhang, C

Y . Zhang, C. Shi, Y . Wang, and S. Yang. Eyes wide open: Ego proactive video-llm for streaming video.arXiv preprint arXiv:2510.14560, 2025

work page arXiv 2025

[62] [62]

L. Zhou, C. Xu, and J. Corso. Towards automatic learning of procedures from web instructional videos. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018

[63] [63]

J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y . Duan, W. Su, J. Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025