EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge

Guozhi Qiu; Liqiang Nie; Weili Guan; Yupeng Hu; Zhiheng Fu; Zhiwei Chen; Zixu Li

arxiv: 2605.24500 · v2 · pith:5DHH6GFBnew · submitted 2026-05-23 · 💻 cs.CV

keywords: hd-epic · adaptation · egoadapt · recipe · across · benchmark · challenge · consistency

This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
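The three inference-time components can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Route`, `ROUTES`, `route_for`, `score_options`, and `permutation_vote`, the frame budgets and sampling rates, and the additive `agreement_bonus` are all hypothetical choices made for illustration, since the abstract does not specify how the scores are combined. The model is replaced by a stub scoring function so the example runs without any vision-language model.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    """Per-category inference settings (all values below are illustrative)."""
    prompt: str
    num_frames: int
    fps: float


# Hypothetical routing table; the report's actual prompts and budgets are not given.
ROUTES: Dict[str, Route] = {
    "recipe": Route("Follow the full recipe trajectory before answering.", 64, 0.5),
    "gaze": Route("Attend to where the camera wearer is looking.", 16, 4.0),
}
DEFAULT_ROUTE = Route("Answer the question about the video.", 32, 1.0)


def route_for(category: str) -> Route:
    """(1) Category-conditioned routing: pick settings by question category."""
    return ROUTES.get(category, DEFAULT_ROUTE)


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_options(letter_logits: Sequence[float],
                  generated_index: Optional[int],
                  agreement_bonus: float = 0.1) -> List[float]:
    """(2) Option scoring: letter-token probabilities, plus an assumed additive
    bonus for the option that free-form generation also selected."""
    scores = softmax(letter_logits)
    if generated_index is not None:
        scores[generated_index] += agreement_bonus
    return scores


def permutation_vote(options: Sequence[str],
                     score_fn: Callable[[List[str]], List[float]],
                     perms: Sequence[Tuple[int, ...]]) -> int:
    """(3) Consistency: score each permuted ordering and sum the scores back
    onto the original option indices; return the index with the highest total."""
    totals = [0.0] * len(options)
    for perm in perms:
        ordered = [options[i] for i in perm]
        scores = score_fn(ordered)
        for position, original_index in enumerate(perm):
            totals[original_index] += scores[position]
    return max(range(len(options)), key=totals.__getitem__)


def biased_stub(ordered: List[str]) -> List[float]:
    """Stand-in for a model with a strong bias toward the first listed option."""
    return [(1.0 if text == "whisk" else 0.0) + (1.5 if pos == 0 else 0.0)
            for pos, text in enumerate(ordered)]


OPTIONS = ["knife", "whisk", "spoon"]
CYCLIC = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
single_pass = permutation_vote(OPTIONS, biased_stub, [(0, 1, 2)])
aggregated = permutation_vote(OPTIONS, biased_stub, CYCLIC)
```

With the stub above, a single pass picks the first-listed option because of its position bias, while summing over the three cyclic orderings places every option first exactly once, so the bias cancels and the content preference decides the answer.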

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations
cs.CV · 2026-06 · unverdicted · novelty 6.0
COMBINER proposes a new architecture for composed image retrieval using adaptive semantic disentanglement, unified prototype-based composition, and dual attribute-based relation modeling to address visually similar bu...

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking
cs.CV · 2026-05 · unverdicted · novelty 5.0
R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval
cs.CV · 2026-06 · unverdicted · novelty 4.0
RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval
cs.CV · 2026-06 · unverdicted · novelty 4.0
IMAGINE uses adaptive schema-imagery via dynamic multimodal prototypes to incorporate implicit semantics into composed video retrieval, claiming SOTA results on CVR and CIR benchmarks.