Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Hao Wu; Qianxi Zhang; Shiqi Jiang; Ting Cao; Weijun Wang; Xin Ding; Yifan Yang; Yikai Zheng; Yunxin Liu

arxiv: 2603.19054 · v2 · pith:XZPDMBTNnew · submitted 2026-03-19 · 💻 cs.CV · cs.AI

Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu, Qianxi Zhang, Weijun Wang, Ting Cao, Yunxin Liu

keywords: proactive, streaming, understanding, video, em-garde, framework, matching, models

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm in which models respond proactively to user queries. Current proactive VideoLLMs make a triggering decision on every frame, which leads to an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
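
The propose-match split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoder, the similarity measure, or the trigger rule, so cosine similarity, the fixed threshold, the rising-edge rule, and the names `cosine` and `match_stream` are all assumptions made for illustration. Proposals are assumed to be already embedded once at query time; each incoming frame embedding is then compared against them with cheap vector arithmetic only.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_stream(
    proposals: List[Vector],
    frames: Iterable[Vector],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Iterator[Tuple[int, int, float]]:
    """Yield (frame_index, proposal_index, score) trigger events.

    A proposal fires only on a rising edge: the first frame at which its
    similarity reaches the threshold after having been below it. This
    rule is an illustrative assumption that avoids re-triggering on
    every consecutive matching frame.
    """
    active = [False] * len(proposals)
    for t, frame in enumerate(frames):
        for i, proposal in enumerate(proposals):
            score = cosine(frame, proposal)
            hit = score >= threshold
            if hit and not active[i]:
                yield t, i, score
            active[i] = hit


# Toy usage with hand-made 2-D embeddings (hypothetical data).
proposals = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.5, 0.5], [1.0, 0.1], [1.0, 0.0], [0.1, 1.0]]
events = [(t, i) for t, i, _ in match_stream(proposals, frames)]
print(events)
```

In this toy run, the first proposal triggers once at frame 1 and stays silent at frame 2 because it is still matched; the second proposal triggers at frame 3. The expensive semantic step (turning a query into proposals) happens once, while the per-frame cost is a handful of dot products.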

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

OmniPro is the first benchmark jointly evaluating omni-modal perception, proactive responding, and diverse streaming video understanding tasks using a dual-mode protocol on 2700 samples.

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering
cs.CV 2026-05 unverdicted novelty 6.0

StreamOV proposes evidence-guided long-short term memory and a hidden-state-driven trigger for efficient online audio-visual reasoning in streaming videos, along with the SOVBench benchmark for multi-turn evaluation.