pith. machine review for the scientific record.

arxiv: 2605.06185 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CV

Recognition: unknown

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 10:11 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords Event-Causal RAG · retrieval-augmented generation · long video understanding · causal reasoning · event knowledge graph · State-Event-State graph · streaming video · video reasoning

The pith

Event-Causal RAG organizes long videos into causal event graphs to support reasoning over extended temporal gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Event-Causal RAG to address the limitations of current vision-language models on ultra-long video reasoning. It segments streaming videos into events represented as State-Event-State graphs, which are merged into a global knowledge graph. This structure enables bidirectional retrieval of relevant causal chains for a backbone model. A sympathetic reader would care because it offers a way to manage memory efficiently while inferring causes across temporally distant events, unlike fragmented clip-level approaches or costly full-attention methods.
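To make the State-Event-State idea concrete, here is a minimal sketch of how an event and its bracketing states could be represented and chained into candidate causal edges. It is an editor's illustration under assumed names (StateNode, EventNode, chain_events), not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class StateNode:
    """Snapshot of scene entities before or after an event (illustrative)."""
    timestamp: float
    entities: dict  # entity name -> observed state/attributes

@dataclass
class EventNode:
    """An action that transforms a pre-state into a post-state."""
    description: str
    start: float
    end: float
    pre_state: StateNode
    post_state: StateNode

def chain_events(events):
    """Link consecutive events whose states line up in time, yielding
    candidate causal edges (event_i -> event_j); a stand-in for the
    paper's graph construction, which is not reproduced here."""
    edges = []
    for prev, nxt in zip(events, events[1:]):
        if prev.post_state.timestamp <= nxt.pre_state.timestamp:
            edges.append((prev.description, nxt.description))
    return edges

# Toy example: a door is opened, then a person enters the room.
s0 = StateNode(0.0, {"door": "closed", "room": "empty"})
s1 = StateNode(3.0, {"door": "open", "room": "empty"})
s2 = StateNode(6.0, {"door": "open", "room": "person inside"})
e1 = EventNode("man in black t-shirt opens the door", 0.5, 2.5, s0, s1)
e2 = EventNode("man in black t-shirt walks into the room", 3.5, 5.5, s1, s2)
print(chain_events([e1, e2]))  # one candidate causal edge: e1 -> e2
```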

Core claim

Event-Causal RAG segments videos into semantically coherent events stored as State-Event-State graphs in a dual-store memory system. It uses causal-topological retrieval to provide relevant event chains and video evidence to a foundation model, leading to superior performance on benchmarks for multi-event causal reasoning in long videos.
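As a rough sketch of what "dual-store memory with causal-topological retrieval" could mean in code: a vector store picks semantic anchor events for a query, then a graph store is walked backward and forward from those anchors to assemble a causal chain. The class name, hop limit, and cosine-similarity scoring are assumptions, not details taken from the paper.

```python
import numpy as np

class DualStoreMemory:
    """Illustrative dual-store memory: normalized vectors for semantic
    matching, plus predecessor/successor maps for causal traversal."""

    def __init__(self):
        self.texts, self.vecs = [], []
        self.succ, self.pred = {}, {}  # event index -> causal neighbors

    def add_event(self, text, vec, causes=()):
        idx = len(self.texts)
        self.texts.append(text)
        self.vecs.append(vec / (np.linalg.norm(vec) + 1e-9))
        for c in causes:
            self.succ.setdefault(c, []).append(idx)
            self.pred.setdefault(idx, []).append(c)
        return idx

    def retrieve(self, query_vec, top_k=2, hops=1):
        """Semantic anchors first, then bidirectional expansion along causal edges."""
        q = query_vec / (np.linalg.norm(query_vec) + 1e-9)
        sims = [float(q @ v) for v in self.vecs]
        anchors = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:top_k]
        chain, frontier = set(anchors), set(anchors)
        for _ in range(hops):
            nxt = set()
            for i in frontier:
                nxt.update(self.pred.get(i, []))  # walk toward causes
                nxt.update(self.succ.get(i, []))  # walk toward effects
            chain |= nxt
            frontier = nxt
        return [self.texts[i] for i in sorted(chain)]

# Toy usage: retrieving the boiling event also pulls in its cause and effect.
rng = np.random.default_rng(0)
mem = DualStoreMemory()
a = mem.add_event("kettle placed on stove", rng.normal(size=8))
b = mem.add_event("water boils", rng.normal(size=8), causes=[a])
mem.add_event("tea is poured", rng.normal(size=8), causes=[b])
print(mem.retrieve(mem.vecs[b], top_k=1, hops=1))
```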

What carries the argument

The State-Event-State (SES) graph, which represents each event along with its preceding and following states to capture transitions, combined with the Event Knowledge Graph for global causal structure and dual-store memory for efficient retrieval.

Load-bearing premise

That automatically segmenting videos into semantically coherent events and modeling them as State-Event-State graphs will accurately capture causal dependencies, without introducing segmentation errors that propagate into retrieval and reasoning.

What would settle it

A test on videos with ambiguous or overlapping events that lead to poor segmentation, showing whether the method underperforms clip-based baselines on causal inference tasks.
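One way to run such a test (an editor's sketch, not a protocol from the paper): jitter the detected event boundaries to simulate ambiguous or overlapping events, rebuild the memory from the noisy segmentation, and track how QA accuracy degrades relative to a clip-based baseline. The helper below only shows the boundary-perturbation step; the jitter range and uniform noise model are arbitrary choices.

```python
import random

def perturb_boundaries(boundaries, jitter_s, seed=0):
    """Simulate poor segmentation by jittering event boundaries
    within +/- jitter_s seconds, keeping them sorted and non-negative."""
    rng = random.Random(seed)
    return sorted(max(0.0, b + rng.uniform(-jitter_s, jitter_s)) for b in boundaries)

# Clean boundaries every 10 s on a 30 s clip, stressed at increasing jitter.
clean = [0.0, 10.0, 20.0, 30.0]
for jitter in (0.0, 2.0, 4.0):
    print(f"jitter={jitter:>3}s ->", [round(b, 1) for b in perturb_boundaries(clean, jitter)])
```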

Figures

Figures reproduced from arXiv: 2605.06185 by Erwei Yin, Juntong Qi, Liang Xie, Mingming Wang, Peizheng Yan, Yu Zhao.

Figure 1
Figure 1: Overview of the Event-Causal RAG framework.
Figure 2
Figure 2: The hourly strict accuracy rate of the 24-hour security surveillance stream.
Figure 3
Figure 3: Qualitative pipeline of EC-RAG. (a) SES Graph Construction: Continuous video frames are parsed into alternating nodes of entity States (S) and Actions (E), compressing pixels into a discrete causal chain. (b) Dual-Store Retrieval and QA: For a complex causal query, pure vector retrieval only locates isolated semantic anchors (A1, A2, A3). EC-RAG utilizes these anchors to trigger a bidirectional graph trave…
Figure 4
Figure 4: VRAM consumption comparison during video processing. (a) The native 8B VLM accumulates KV-cache linearly, hitting a 32 GB OOM wall at ∼162 seconds. (b) EC-RAG, bounded by a 12 s maximum chunking strategy, plateaus at ∼17.6 GB, enabling infinite stream processing without memory explosion.
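The contrast in Figure 4 follows from a simple memory model: a decoder that keeps every visual token accumulates KV-cache linearly with stream length, while a bounded chunk only ever holds a fixed window of tokens. The sketch below uses illustrative parameter values (frame rate, tokens per frame, layer count, hidden dimension), not the paper's configuration, and it ignores model weights and activations, so its numbers will not match the figure's totals.

```python
def kv_cache_gb(seconds, fps, tokens_per_frame, layers, hidden_dim, bytes_per_val=2):
    """Rough KV-cache footprint when all visual tokens stay in context.
    2x accounts for keys and values; fp16 assumed via bytes_per_val=2."""
    tokens = seconds * fps * tokens_per_frame
    return 2 * tokens * layers * hidden_dim * bytes_per_val / 1e9

cfg = dict(fps=1, tokens_per_frame=256, layers=32, hidden_dim=4096)

# Unbounded streaming: the cache grows linearly with elapsed time.
for t in (30, 60, 120, 162):
    print(f"{t:>4} s  {kv_cache_gb(t, **cfg):5.1f} GB")

# Bounded chunking (e.g. a 12 s window): the live cache stays constant,
# because older content is summarized into the event memory instead.
print(f"12 s window  {kv_cache_gb(12, **cfg):5.1f} GB (constant)")
```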
read the original abstract

Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Event-Causal RAG, a lightweight retrieval-augmented framework for ultra-long or infinite video reasoning. Videos are segmented into semantically coherent events, each represented as a State-Event-State (SES) graph that captures the event and surrounding state transitions; these are merged into a global Event Knowledge Graph stored in a dual-store memory supporting semantic matching and causal-topological retrieval. A bidirectional retrieval strategy then supplies relevant event causal chains plus video evidence to a backbone video foundation model. The authors claim consistent outperformance over clip-based retrieval baselines and long-context video models on long-video benchmarks, especially for multi-event integration and causal inference across temporal gaps, together with gains in memory efficiency and streaming robustness.

Significance. If the experimental claims hold after proper validation, the work could advance long-video understanding by replacing quadratic self-attention and fragmented clip memory with structured event-level causal modeling, offering a practical path toward coherent reasoning over extended temporal spans.

major comments (2)
  1. [Abstract] Abstract: the claim of consistent outperformance on long-video understanding benchmarks is presented without any quantitative results, baseline names, dataset identifiers, or ablation studies, preventing direct assessment of the magnitude or reliability of the reported gains.
  2. [Method] Method description (SES graph and Event Knowledge Graph construction): the central experimental claim for superior performance on multi-event causal questions rests on the premise that event segmentation into SES graphs reliably encodes causal dependencies without propagating segmentation errors into the global graph or retrieval indices; however, the manuscript supplies no segmentation accuracy metrics, error-propagation analysis, or ablation isolating the contribution of the SES structure versus simpler clip-level retrieval.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of consistent outperformance on long-video understanding benchmarks is presented without any quantitative results, baseline names, dataset identifiers, or ablation studies, preventing direct assessment of the magnitude or reliability of the reported gains.

    Authors: We agree that the abstract would be more informative with concrete details. In the revised manuscript we will expand the abstract to include specific quantitative gains (e.g., accuracy deltas on named long-video QA benchmarks), the identities of the clip-based retrieval baselines and long-context video models, and a brief reference to the ablation studies. These results are already reported in Section 4; we will simply surface the most salient numbers and identifiers in the abstract. revision: yes

  2. Referee: [Method] Method description (SES graph and Event Knowledge Graph construction): the central experimental claim for superior performance on multi-event causal questions rests on the premise that event segmentation into SES graphs reliably encodes causal dependencies without propagating segmentation errors into the global graph or retrieval indices; however, the manuscript supplies no segmentation accuracy metrics, error-propagation analysis, or ablation isolating the contribution of the SES structure versus simpler clip-level retrieval.

    Authors: We accept that the current version lacks explicit segmentation accuracy metrics, error-propagation analysis, and a dedicated ablation of SES versus clip-level retrieval. We will add a new subsection (or appendix) containing: (i) segmentation accuracy measured against human-annotated event boundaries on a held-out subset, (ii) a qualitative and quantitative discussion of error propagation together with the mitigation provided by the dual-store memory and bidirectional retrieval, and (iii) an ablation that directly compares the full SES-based pipeline against a simpler clip-level retrieval baseline. These additions will be included in the revised paper. revision: yes
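For the segmentation-accuracy evaluation promised in point (i) of the second response, one plausible scoring rule is to match predicted event boundaries to human-annotated ones within a temporal tolerance and report precision, recall, and F1. The greedy matching and the 1-second tolerance below are the editor's assumptions, not the authors' stated protocol.

```python
def boundary_f1(pred, gold, tol_s=1.0):
    """Greedy one-to-one matching of predicted vs. annotated event
    boundaries within tol_s seconds; returns (precision, recall, f1)."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        hit = next((g for g in unmatched if abs(p - g) <= tol_s), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Two of three predicted boundaries fall within 1 s of an annotation.
print(boundary_f1(pred=[2.1, 9.8, 21.5], gold=[2.0, 10.0, 20.0, 30.0]))
# -> (0.666..., 0.5, 0.571...)
```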

Circularity Check

0 steps flagged

No circularity; engineering framework with independent empirical claims.

full rationale

The paper describes Event-Causal RAG as a new retrieval-augmented architecture that segments videos into State-Event-State graphs, builds a global Event Knowledge Graph, and uses dual-store memory with bidirectional causal-topological retrieval. No equations, fitted parameters, or derivations appear in the provided text. The method is presented as an explicit construction rather than a reduction of any claimed prediction or uniqueness result to its own inputs. Experimental outperformance is asserted via benchmark comparisons without self-citation chains or ansatzes that smuggle in the target behavior. This is a standard descriptive systems paper whose validity rests on external evaluation, not internal self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

No mathematical axioms, free parameters, or formal derivations are stated in the abstract; the contribution is an engineering framework whose correctness rests on empirical performance.

invented entities (2)
  • State-Event-State (SES) graph · no independent evidence
    purpose: Structured representation of each video event together with preceding and following states to encode causal transitions
    New data structure introduced to replace fixed-length clip indexing
  • Event Knowledge Graph · no independent evidence
    purpose: Global merged store of all SES graphs enabling topological causal queries
    Constructed by merging per-event graphs; no external validation provided in abstract

pith-pipeline@v0.9.0 · 5585 in / 1218 out tokens · 33931 ms · 2026-05-08T10:11:51.288538+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 11 canonical work pages · 2 internal anchors

  1. [1]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

  2. [2]

    MA-LMM: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. MA-LMM: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13504–13514, 2024

  3. [3]

    Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S. Ryoo. Understanding long videos with multimodal language models. In International Conference on Learning Representations (ICLR), 2025

  4. [4]

    From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding

    Heqing Zou, Tianze Luo, Guiyang Xie, Victor Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, and Huaijian Zhang. From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding. arXiv preprint arXiv:2409.18938, 2024

  5. [5]

    Video-XL: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-long vision language model for hour-scale video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26160–26169, 2025

  6. [6]

    LongVLM: Efficient long video understanding via large language models

    Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. LongVLM: Efficient long video understanding via large language models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 453–470, 2024

  7. [7]

    EgoSchema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track, 2023

  8. [8]

    LongVideoBench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track, 2024

  9. [9]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InP...

  10. [10]

    Towards event-oriented long video understanding, 2024

    Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen. Towards event-oriented long video understanding, 2024

  11. [11]

    Causality matters: How temporal information emerges in video language models

    Yumeng Shi, Quanyu Long, Yin Wu, and Wenya Wang. Causality matters: How temporal information emerges in video language models. arXiv preprint arXiv:2508.11576, 2026. AAAI 2026

  12. [12]

    EventVAD: Training-free event-aware video anomaly detection

    Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, and Shuyan Li. EventVAD: Training-free event-aware video anomaly detection. In Proceedings of the 33rd ACM International Conference on Multimedia (MM), pages 2586–2595, 2025

  13. [13]

    Deep bilstm attention model for spatial and temporal anomaly detection in video surveillance

    Sarfaraz Natha, Fareed Ahmed, Mohammad Siraj, Mehwish Lagari, Majid Altamimi, and Asghar Ali Chandio. Deep bilstm attention model for spatial and temporal anomaly detection in video surveillance. Sensors, 25(1):251, 2025

  14. [14]

    Anomaly detection in traffic surveillance videos using deep learning

    Sardar Waqar Khan, Qasim Hafeez, Muhammad Irfan Khalid, Roobaea Alroobaea, Saddam Hussain, Jawaid Iqbal, Jasem Almotiri, and Syed Sajid Ullah. Anomaly detection in traffic surveillance videos using deep learning. Sensors, 22(17), 2022

  15. [15]

    Ketan Pawar and Vahida Z. Attar. Deep learning based detection and localization of road accidents from traffic surveillance videos. ICT Express, 8(3):379–387, 2022

  16. [16]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  17. [17]

    Visual context window extension: A new perspective for long video understanding

    Hongchen Wei and Zhenzhong Chen. Visual context window extension: A new perspective for long video understanding. arXiv preprint arXiv:2409.20018, 2024

  18. [18]

    V2PE: Improving multimodal long-context capability of vision-language models with variable visual position encoding

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2PE: Improving multimodal long-context capability of vision-language models with variable visual position encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21070–21084, 2025

  19. [19]

    Model tells you what to discard: Adaptive kv cache compression for llms

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. In International Conference on Learning Representations (ICLR), 2024. Oral

  20. [20]

    MiniCache: Kv cache compression in depth dimension for large language models

    Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. MiniCache: Kv cache compression in depth dimension for large language models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Main Conference Track, 2024

  21. [21]

    InfiniPot-V: Memory-constrained kv cache compression for streaming video understanding

    Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. InfiniPot-V: Memory-constrained kv cache compression for streaming video understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  22. [22]

    CORM: Cache optimization with recent message for large language model inference

    Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, and Shuming Shi. CORM: Cache optimization with recent message for large language model inference. arXiv preprint arXiv:2404.15949, 2024

  23. [23]

    Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference

    Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3258–3270, 2024

  24. [24]

    CacheGen: Kv cache compression and streaming for fast large language model serving

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. CacheGen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM Conference, 2024

  25. [25]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  26. [26]

    Temporal contrastive learning for video temporal reasoning in large vision-language models

    Rafael Souza, Jia-Hao Lim, and Alexander Davis. Temporal contrastive learning for video temporal reasoning in large vision-language models. arXiv preprint arXiv:2412.11391, 2024

  27. [27]

    MECD: Unlocking multi-event causal discovery in video reasoning

    Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chaofan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, and Weiyao Lin. MECD: Unlocking multi-event causal discovery in video reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2024. Spotlight paper

  28. [28]

    VideoRAG: Retrieval-augmented generation over video corpus

    Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. VideoRAG: Retrieval-augmented generation over video corpus. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21278–21298, Vienna, Austria, 2025. Association for Computational Linguistics

  29. [29]

    Video-RAG: Visually-aligned retrieval-augmented long video comprehension

    Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-RAG: Visually-aligned retrieval-augmented long video comprehension. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  30. [30]

    VideoRAG: Retrieval-augmented generation with extreme long-context videos

    Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025

  31. [31]

    MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

    Chi-Hsiang Hsiao, Yi-Cheng Wang, Tzung-Sheng Lin, Yi-Ren Yeh, and Chu-Song Chen. MegaRAG: Multimodal knowledge graph-based retrieval augmented generation. arXiv preprint arXiv:2512.20626

  32. [32]

    Listed as ACL 2026 in the arXiv metadata available during verification

  33. [33]

    Ask in any modality: A comprehensive survey on multimodal retrieval-augmented generation

    Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, and Ehsaneddin Asgari. Ask in any modality: A comprehensive survey on multimodal retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16776–16809, Vi...

  34. [34]

    MMA-RAG: A survey on multimodal agentic retrieval-augmented generation

    Vladana Perlic, Stephane Lebailly, Vadim Malvone, Van-Tam Nguyen, and Pascal Urard. MMA-RAG: A survey on multimodal agentic retrieval-augmented generation. SSRN preprint / HAL preprint hal-05322313, 2025

  35. [35]

    Action scene graphs for long-form understanding of egocentric videos

    Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, and Giovanni Maria Farinella. Action scene graphs for long-form understanding of egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18622–18632, 2024

  36. [36]

    NExT-QA: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  37. [37]

    Multimodal event causality reasoning with scene graph enhanced interaction network

    Jintao Liu, Kaiwen Wei, and Chenglong Liu. Multimodal event causality reasoning with scene graph enhanced interaction network. Proceedings of the AAAI Conference on Artificial Intelligence, 38(8):8778–8786, 2024

  38. [38]

    Harnessing temporal causality for advanced temporal action detection

    Shuming Liu, Lin Sui, Chen-Lin Zhang, Fangzhou Mu, Chen Zhao, and Bernard Ghanem. Harnessing temporal causality for advanced temporal action detection. arXiv preprint arXiv:2407.17792, 2024

  39. [39]

    Streaming long video understanding with large language models

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Main Conference Track, 2024

  40. [40]

    StreamingVLM: Real-time understanding for infinite video streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. StreamingVLM: Real-time understanding for infinite video streams. In International Conference on Learning Representations (ICLR), 2026

  41. [41]

    Learning streaming video representation via multitask training

    Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, and Weidi Xie. Learning streaming video representation via multitask training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9900–9912, 2025

  42. [42]

    Streaming videollms for real-time procedural video understanding

    Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgoz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, and Fadime Sener. Streaming videollms for real-time procedural video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22586–22598, 2025

  43. [43]

    Streaming video understanding and multi-round interaction with memory-enhanced knowledge

    Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video understanding and multi-round interaction with memory-enhanced knowledge. In International Conference on Learning Representations (ICLR), 2025

  44. [44]

    SceneRAG: Scene-level retrieval-augmented generation for video understanding

    Nianbo Zeng, Haowen Hou, Fei Richard Yu, Si Shi, and Ying Tiffany He. SceneRAG: Scene-level retrieval-augmented generation for video understanding. arXiv preprint arXiv:2506.07600, 2025

  45. [45]

    Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding

    Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. In Advances in Neural Information Processing Systems, 2025. Spotlight

  46. [46]

    ViG-RAG: Video-aware graph retrieval-augmented generation via temporal and semantic hybrid reasoning

    Zongsheng Cao, Anran Liu, Yangfan He, Jing Li, Bo Zhang, and Zigan Wang. ViG-RAG: Video-aware graph retrieval-augmented generation via temporal and semantic hybrid reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 48–56, 2026

  47. [47]

    Temporal chain of thought: Long-video understanding by thinking in frames

    Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Temporal chain of thought: Long-video understanding by thinking in frames. arXiv preprint arXiv:2507.02001, 2025

  48. [48]

    Video-of-thought: Step-by-step video reasoning from perception to cognition

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 13109–13125. PMLR, 2024

  49. [49]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  50. [50]

    - DO NOT use vague umbrella terms (e.g., ’performing’)

    Task-Level Physical Actions (The ’Goldilocks’ Granularity): Describe specific, observable physical tasks and object interactions... - DO NOT use vague umbrella terms (e.g., ’performing’). - DO NOT over-decompose into meaningless joint kinematics

  51. [51]

    You MUST refer to entities by distinct visual attributes (e.g., write ’The man in the black t-shirt’, NOT ’E1’)

    Visual Attribute Injection (CRITICAL): NEVER use generic IDs or pronouns. You MUST refer to entities by distinct visual attributes (e.g., write ’The man in the black t-shirt’, NOT ’E1’)

  52. [52]

    Micro-Detail Exhaustion: Capture secondary background events, holding props, and screen text

  53. [53]

    If text is blurry or uncertain, output an empty string

    Direct Visual Evidence (CRITICAL): Preserve only directly visible evidence. If text is blurry or uncertain, output an empty string

  54. [54]

    Strict Causality: An ’Event’ is a directed edge bridging a ’Pre-State’ and ’Post-State’

  55. [55]

    [USER PROMPT TEMPLATE] Target Timestamp: {start_time} - {end_time}

    Output Format: Output ONLY pure JSON. [USER PROMPT TEMPLATE] Target Timestamp: {start_time} - {end_time}. {audio_context} Task: Deconstruct this clip into a chronological causal graph. Step 1: Inventory ALL interacting entities, detailing visual attributes. Step 2: Map the specific, task-level physical actions. Step 3: Preserve direct visual evidence usef...

  56. [56]

    Use the original video frames as the primary evidence

  57. [57]

    Use the retrieved graph memory as complementary evidence from the same video

  58. [58]

    Compare all five options against the visible action, object, person, and purpose

  59. [59]

    If video evidence and graph memory disagree, prefer the directly visible video evidence

  60. [60]

    ### [Retrieved RAG Memory] The following graph memory was extracted from the same video and may help identify temporal actions, objects, participants, and causal context

    If evidence is incomplete, make the best forced choice from the available options. ### [Retrieved RAG Memory] The following graph memory was extracted from the same video and may help identify temporal actions, objects, participants, and causal context. {ctx} Output exactly one line and nothing else: [FINAL ANSWER: X] 16