CoVR-R: Reason-Aware Composed Video Retrieval

Alaa Mostafa Lasheen; Dmitry Demidov; Fahad Khan; Omkar Thawakar; Rao Muhammad Anwer; Sai Prasanna Teja Reddy Bogireddy; Vaishnav Potlapalli; Viswanatha Reddy Gajjala

arxiv: 2603.20190 · v2 · pith:3B4EOGX2new · submitted 2026-03-20 · 💻 cs.CV


Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan


keywords: video, after-effects, covr, reasoning, retrieval, edit, benchmark, causal


Abstract

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
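The abstract reports results as recall at K. As a reading aid, here is a minimal sketch of how that metric is commonly computed for retrieval: the fraction of queries whose ground-truth target appears among the top K ranked candidates. This is the generic definition and not the authors' evaluation code; the function name `recall_at_k` and the toy video ids are hypothetical.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target video is among the top-k candidates.

    ranked_ids[i] is the candidate list for query i, best match first.
    target_ids[i] is the single ground-truth target for query i.
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one ranked list per target")
    if not target_ids:
        return 0.0
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)


# Toy example with made-up video ids (assumption, not benchmark data).
rankings = [["v3", "v1", "v7"], ["v2", "v9", "v4"], ["v5", "v6", "v8"]]
targets = ["v1", "v4", "v0"]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, targets, k):.3f}")
```

In this toy data no target is ranked first, one target is in the top two, and two are in the top three, so the metric rises with K while the third query never scores a hit.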



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReCoVR: Closing the Loop in Interactive Composed Video Retrieval
cs.IR · 2026-05 · unverdicted · novelty 6.0

ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.