OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

2026 · cs.CV · arXiv 2605.24481

5 Pith papers cite this work. Polarity classification is still indexing.


abstract

The 1st Cross-Domain EgoCross Challenge at EgoVis, CVPR 2026, evaluates whether multimodal large language models can reason over egocentric videos across four domains: surgery, industry, extreme sports, and animal perspective. We achieved second place in both the Source-Limited and Open-Source tracks. In this report, we formulate EgoCross as a robust cross-domain embodied video reasoning problem rather than a simple multiple-choice visual question answering task. We identify three key challenges: (C1) temporal boundary ambiguity, where critical state transitions are sparsely sampled and often occur between frames; (C2) cross-domain semantic granularity mismatch, where the same capability requires different domain-specific visual grammar; and (C3) decision instability under close options, where long multimodal reasoning can select unsupported distractors or produce malformed outputs. To address them, we propose OmniEgo-R$^2$ (Omnidomain Egocentric Routed Reasoning), a unified routed reasoning pipeline consisting of temporal-evidence normalization, domain-agnostic capability routing, structured perception-dynamics-decision reasoning, boundary-aware option verification, and defensive answer calibration. OmniEgo-R$^2$ uses the Qwen3-VL-4B-SFT checkpoints fine-tuned on each EgoCross domain as the vision-language backbone, and wraps them with lightweight test-time reasoning and parsing programs. Our final submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second on both leaderboards. The code is available at https://github.com/Lee-zixu/OmniEgo-R2
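The abstract mentions malformed outputs (C3) and "defensive answer calibration" but does not say how the parsing is done. The sketch below is only an illustration of what a defensive parser for multiple-choice output could look like. It is not the authors' implementation: the function name `parse_choice`, the regular expressions, the option letters, and the fallback letter are all assumptions made for this example.

```python
import re

def parse_choice(raw_output, valid_letters=("A", "B", "C", "D"), fallback="A"):
    """Hypothetical helper: map free-form model text to one valid option letter."""
    # Build a character class from the allowed letters (assumed single characters).
    letters = "".join(re.escape(c) for c in valid_letters)
    text = raw_output.strip()

    # 1) Prefer an explicit marker such as "Answer: B" or "answer is (C)".
    #    Only the word "answer" is matched case-insensitively; the option
    #    letter itself must be uppercase to avoid matching ordinary words.
    marker = re.search(
        rf"(?i:answer(?:\s+is)?)\s*[:\-]?\s*\(?([{letters}])\)?(?![A-Za-z])",
        text,
    )
    if marker:
        return marker.group(1)

    # 2) Otherwise take the last standalone option letter in the text,
    #    on the assumption that the final mention is the conclusion.
    standalone = re.findall(rf"(?<![A-Za-z])([{letters}])(?![A-Za-z])", text)
    if standalone:
        return standalone[-1]

    # 3) Nothing usable was found: return the fallback so the submission
    #    always contains a well-formed answer.
    return fallback


if __name__ == "__main__":
    for sample in ["Answer: B", "The answer is (C).", "Between A and D. Final: D", "lorem ipsum"]:
        print(repr(sample), "->", parse_choice(sample))
```

A fixed fallback letter is the simplest choice; a real system might instead fall back to the option with the highest model score, which this sketch does not attempt.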

representative citing papers

COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

COMBINER proposes a new architecture for composed image retrieval using adaptive semantic disentanglement, unified prototype-based composition, and dual attribute-based relation modeling to address visually similar but attribute-unrelated samples.

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

cs.CV · 2026-05-31 · unverdicted · novelty 5.0

R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

cs.CV · 2026-06-10 · unverdicted · novelty 4.0

RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

cs.CV · 2026-06-06 · unverdicted · novelty 4.0

IMAGINE uses adaptive schema-imagery via dynamic multimodal prototypes to incorporate implicit semantics into composed video retrieval, claiming SOTA results on CVR and CIR benchmarks.

EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

EgoAction uses decoupled verb-noun temporal detectors on VideoMAE features and Dynamic Weighted Fusion of boundaries based on classification confidences for the EPIC-KITCHENS action detection challenge.

citing papers explorer


COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

cs.CV · 2026-06-03 · unverdicted · none · ref 21 · internal anchor

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

cs.CV · 2026-05-31 · unverdicted · none · ref 20 · internal anchor

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

cs.CV · 2026-06-10 · unverdicted · none · ref 93 · internal anchor

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

cs.CV · 2026-06-06 · unverdicted · none · ref 96 · internal anchor

EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026

cs.CV · 2026-05-23 · unverdicted · none · ref 13 · internal anchor
