TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

· 2026 · cs.CV · arXiv 2605.24470

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

open full Pith review browse 4 citing papers arXiv PDF

abstract

Video-text retrieval has witnessed remarkable progress driven by large-scale vision-language pretraining, yet most existing approaches inherit an implicit assumption from image-text retrieval: that visual semantics can be captured frame-by-frame. This assumption overlooks the temporal dynamics of egocentric videos. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge further raises the bar by providing soft-label relevance matrices rather than binary labels, demanding models that can resolve graded semantic correspondences across modalities. In this report, we present our solution, termed TempRet, to the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Our approach builds upon a CLIP-based dual-encoder backbone and introduces two key components to address the temporal and cross-modal challenges. First, a temporal transformer operates exclusively on the video side, modeling inter-frame dependencies through learnable positional encodings and multi-head self-attention over frame-level CLIP features. Second, a two-stage reranking pipeline first retrieves Top-K candidates via the dual-encoder, then refines their scores using a cross-encoder equipped with an Image-Text Matching (ITM) head. The entire system is trained with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices provided by the challenge. Our method achieves 67.97% average mAP and 82.92% average nDCG on the EK-100 MIR benchmark, demonstrating the effectiveness of temporal modeling and cross-modal refinement for egocentric video retrieval.

representative citing papers

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

cs.CV · 2026-05-31 · unverdicted · novelty 5.0

R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.

EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

EgoAdapt improves VQA on the HD-EPIC egocentric benchmark via category-conditioned routing, calibrated option scoring, and test-time consistency adaptation.

EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

EgoAction uses decoupled verb-noun temporal detectors on VideoMAE features and Dynamic Weighted Fusion of boundaries based on classification confidences for the EPIC-KITCHENS action detection challenge.

OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

OmniEgo-R² is a competition system that combines domain-specific VL models with temporal normalization, capability routing, and answer calibration to reach 66.35-66.77% accuracy on the EgoCross challenge.

citing papers explorer

Showing 4 of 4 citing papers after filters.

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking cs.CV · 2026-05-31 · unverdicted · none · ref 14 · internal anchor
R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.
EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge cs.CV · 2026-05-23 · unverdicted · none · ref 2 · internal anchor
EgoAdapt improves VQA on the HD-EPIC egocentric benchmark via category-conditioned routing, calibrated option scoring, and test-time consistency adaptation.
EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026 cs.CV · 2026-05-23 · unverdicted · none · ref 12 · internal anchor
EgoAction uses decoupled verb-noun temporal detectors on VideoMAE features and Dynamic Weighted Fusion of boundaries based on classification confidences for the EPIC-KITCHENS action detection challenge.
OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026 cs.CV · 2026-05-23 · unverdicted · none · ref 31 · internal anchor
OmniEgo-R² is a competition system that combines domain-specific VL models with temporal normalization, capability routing, and answer calibration to reach 66.35-66.77% accuracy on the EgoCross challenge.

TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

fields

years

verdicts

representative citing papers

citing papers explorer