arxiv: 2605.24500 · v2 · pith:5DHH6GFBnew · submitted 2026-05-23 · 💻 cs.CV

EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge

classification 💻 cs.CV
keywords hd-epic, adaptation, egoadapt, recipe, across, benchmark, challenge, consistency

This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
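
The three inference-time components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the routing table values (frame budgets, sampling rates, prompts), the `agreement_bonus` weight, and the `scorer` callable (standing in for a vision-language model that returns one log-likelihood per answer-letter position) are all hypothetical assumptions made for illustration.

```python
import math
from itertools import permutations
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical scorer: (question, options in shown order) -> log-likelihood per letter position.
Scorer = Callable[[str, Sequence[str]], List[float]]

# (1) Category-conditioned routing. All numbers and prompts are placeholders,
# not values reported by the paper.
ROUTES: Dict[str, Dict[str, object]] = {
    "recipe": {"max_frames": 64, "fps": 0.5, "prompt": "Track the recipe steps across the video."},
    "gaze": {"max_frames": 16, "fps": 4.0, "prompt": "Focus on where the camera wearer is looking."},
}
DEFAULT_ROUTE: Dict[str, object] = {"max_frames": 32, "fps": 1.0, "prompt": "Answer the question."}


def route(category: str) -> Dict[str, object]:
    """Pick the inference settings for a question category."""
    return ROUTES.get(category, DEFAULT_ROUTE)


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of log-likelihoods."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# (2) Calibrated option scoring: normalize letter-token likelihoods and add a
# small bonus when free-form generation agrees (the bonus weight is an assumption).
def calibrated_scores(
    letter_logliks: Sequence[float],
    generated_index: Optional[int] = None,
    agreement_bonus: float = 0.1,
) -> List[float]:
    scores = softmax(letter_logliks)
    if generated_index is not None:
        scores[generated_index] += agreement_bonus
    return scores


# (3) Test-time consistency: score the options under several orderings and
# accumulate each option's score back onto its original index.
def permutation_consistent_answer(
    question: str, options: Sequence[str], scorer: Scorer, max_perms: int = 6
) -> int:
    totals = [0.0] * len(options)
    for count, perm in enumerate(permutations(range(len(options)))):
        if count >= max_perms:
            break
        shown = [options[i] for i in perm]
        scores = calibrated_scores(scorer(question, shown))
        for position, original_index in enumerate(perm):
            totals[original_index] += scores[position]
    return max(range(len(options)), key=totals.__getitem__)


def biased_toy_scorer(question: str, options: Sequence[str]) -> List[float]:
    """Toy stand-in for a model: prefers 'whisk' but is biased toward the first letter."""
    return [
        (2.0 if option == "whisk" else 0.0) + (2.5 if position == 0 else 0.0)
        for position, option in enumerate(options)
    ]


if __name__ == "__main__":
    opts = ["knife", "whisk", "spoon"]
    settings = route("recipe")
    best = permutation_consistent_answer("Which tool is used?", opts, biased_toy_scorer)
    print(settings["max_frames"], opts[best])
```

In this toy setup a single pass in the original order would favor the first option because of the positional bias, while averaging over orderings recovers the content-preferred option. A real system would presumably apply the more expensive permutation pass only to ambiguous cases, as the abstract suggests.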

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

    cs.CV 2026-06 unverdicted novelty 6.0

    COMBINER proposes a new architecture for composed image retrieval using adaptive semantic disentanglement, unified prototype-based composition, and dual attribute-based relation modeling to address visually similar bu...

  2. R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

    cs.CV 2026-05 unverdicted novelty 5.0

    R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.

  3. RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

    cs.CV 2026-06 unverdicted novelty 4.0

    RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

  4. IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

    cs.CV 2026-06 unverdicted novelty 4.0

    IMAGINE uses adaptive schema-imagery via dynamic multimodal prototypes to incorporate implicit semantics into composed video retrieval, claiming SOTA results on CVR and CIR benchmarks.