ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Arjun Somayazulu; Changan Chen; Efi Mavroudi; Kristen Grauman; Lorenzo Torresani

arxiv: 2504.05451 · v2 · pith:SPIUHXXUnew · submitted 2025-04-07 · 💻 cs.CV

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Arjun Somayazulu , Efi Mavroudi , Changan Chen , Lorenzo Torresani , Kristen Grauman This is my paper

Pith reviewed 2026-05-22 20:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords view-invariant learningcurriculum learningknowledge distillationvideo representation learningactivity recognitionviewpoint changeskeystep grounding

0 comments

The pith

Curriculum knowledge distillation with geometry-sorted view pairs produces video representations invariant to extreme viewpoint changes from single-view input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets learning rich video representations for activities when training involves severe view-occlusions and extreme viewpoint differences that share little visual content. It combines a knowledge distillation objective that preserves action-centric semantics with a curriculum procedure that gradually pairs more challenging views. Segments are sorted for this curriculum using a geometry-based metric that estimates occlusion levels. Training draws on multi-view data yet the resulting model accepts only uncalibrated single-view videos at inference. The method reports stronger results than prior approaches on temporal keystep grounding and fine-grained keystep recognition across Ego-Exo4D, LEMMA, and EPFL-Smart-Kitchen-30.

Core claim

ViewBridge shows that a knowledge distillation objective paired with a curriculum of incrementally harder viewpoint pairs, ordered by a geometry-based occlusion metric, yields video representations that remain effective for activity understanding under extreme view shifts, with inference performed on single uncalibrated viewpoints and superior performance on keystep tasks over three datasets.

What carries the argument

The curriculum learning procedure that uses a geometry-based metric to estimate occlusion levels and orders training segments into progressively more challenging view pairs for the knowledge distillation objective.

If this is right

Models trained with multi-view data can be deployed on single-view videos for activity analysis in cluttered real-world settings.
View-invariant representations become feasible without requiring controlled minimal-occlusion training footage.
Temporal localization and fine-grained classification of activity steps improve when viewpoint differences are bridged gradually.
The framework supports inference on uncalibrated videos while still leveraging multi-view supervision during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar curriculum strategies ordered by geometric difficulty could transfer to other video domain-adaptation problems where view or appearance gaps are large.
If the occlusion metric generalizes, it may serve as a template for quantifying training difficulty in additional invariance tasks such as lighting or motion changes.
The separation of multi-view training from single-view inference suggests a practical route for scaling activity models to mobile or wearable camera settings.

Load-bearing premise

A geometry-based metric can be defined that accurately reflects the likely occlusion level of training video segments to enable effective curriculum sorting.

What would settle it

An experiment in which random or non-geometry-based ordering of view pairs during curriculum training produces equal or better results on the same keystep grounding and recognition tasks would falsify the contribution of the proposed metric and sorting.

Figures

Figures reproduced from arXiv: 2504.05451 by Arjun Somayazulu, Changan Chen, Efi Mavroudi, Kristen Grauman, Lorenzo Torresani.

**Figure 1.** Figure 1: Edited vs. natural procedural video. Top: Whereas edited video switches between close-in shots and wide-body shots to best capture the ongoing action, natural in-the-wild video can instead experience significant object and view occlusions. Bottom: Directly distilling the best view into an impoverished viewpoint has limited utility given the lack of shared visual content. Our curriculum knowledge distilla… view at source ↗

**Figure 2.** Figure 2: Approach overview. a) Given an ego-worn camera looking down at the active workspace, we rank each exo camera by their view-alignment with the hand-object interaction region pcenter (green). To account for self-occlusion by the camera-wearer, we enforce that views facing the ego-camera (1, 2) are ranked ahead of views behind the ego-camera (3, 4). b) For each feature from a source view (highlighted in blue)… view at source ↗

**Figure 3.** Figure 3: Downstream tasks. a) Our temporal keystep grounding model is input an untrimmed video V and sequence of keysteps N and regresses the center timestamp cˆni and duration dˆni for each narration ni. We jointly optimize with our cross-view/cross-temporal knowledge distillation loss (red). b) We pre-train a keystep recognition model on randomly-selected clips from untrimmed videos. We rank the views using our m… view at source ↗

**Figure 4.** Figure 4: t-SNE of learned video features. We visualize video features learned by our grounding model’s knowledge distill head (blue), best-view video features (green), and features from other synchronized views (red) on an input chunk of video. Our model closely aligns source view features with the best-view features throughout the video, despite the time-varying nature of the ’best view’. all views (A) and separat… view at source ↗

**Figure 5.** Figure 5: Mean IoU difference (Ours - EgoVLPv2) by keystep name and task. We compute mean IoU across all instances and views of each unique keystep in the test set – for both our model and the EgoVLPv2-trained grounding model. We display signed mean IoU difference between ours and EgoVLPv2 for the top-20 keysteps (left half) and bottom-20 keysteps (right half) that have largest mean IoU difference. We outperform Ego… view at source ↗

read the original abstract

Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view-occlusions. We introduce a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time, the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks -- temporal keystep grounding and fine-grained keystep recognition -- we outperform SOTA approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ViewBridge's geometry-based curriculum for ordering view pairs during distillation is the new piece, but it lacks any check that the metric actually tracks occlusion or difficulty.

read the letter

The paper introduces a curriculum that sorts multi-view training segments by a geometry-derived occlusion metric and feeds them incrementally into a knowledge distillation loss meant to keep action semantics intact. Training uses paired views; inference is single uncalibrated video. They report gains on keystep grounding and fine-grained recognition over Ego-Exo4D, LEMMA, and EPFL-Smart-Kitchen-30. That combination of distillation plus ordered pairing by occlusion level is not in the prior work they cite, and the single-view test setting is a realistic constraint for downstream use in robotics or video analysis. The problem they target—extreme viewpoint shifts with little shared content—is real and persistent. The distillation objective itself is a reasonable way to transfer semantics without forcing pixel-level alignment. The main weakness is the one flagged in the stress test. The abstract defines the geometry metric and uses it to order pairs, but supplies no correlation with measured overlap, no human difficulty ratings, and no ablation showing ordered curriculum beats random ordering. If the metric is only loosely related to actual view difficulty, the curriculum claim does not hold and the reported improvements cannot be credited to the stated mechanism. No numbers, error bars, or implementation details appear here either, so the SOTA claim stays uncheckable. This is worth sending to review for groups working on egocentric or multi-view activity recognition. The method is concrete enough that referees can test the metric validation directly; the core idea is worth that effort even if the current evidence is thin.

Referee Report

1 major / 2 minor

Summary. The paper introduces ViewBridge, a curriculum-based knowledge distillation framework for view-invariant video representation learning under extreme viewpoint changes and occlusions. It defines a geometry-based metric to order training segments by likely occlusion level, progressively pairing more challenging views during training while preserving action-centric semantics via distillation. Training uses multi-view data but inference operates on single uncalibrated views. The method is evaluated on temporal keystep grounding and fine-grained keystep recognition, reporting outperformance over SOTA baselines across Ego-Exo4D, LEMMA, and EPFL-Smart-Kitchen-30.

Significance. If the geometry-based curriculum metric is shown to meaningfully rank view difficulty and the reported gains are attributable to the proposed mechanism rather than other factors, the work could meaningfully advance view-invariant activity understanding for in-the-wild egocentric-exocentric scenarios. The single-view inference setting and use of independent multi-view training data are practical strengths; the curriculum idea addresses a recognized challenge in gradual adaptation to severe occlusions.

major comments (1)

[Curriculum learning procedure (Section 3.2)] The geometry-based occlusion metric (defined to sort segments for the curriculum) is load-bearing for attributing performance gains to the proposed adaptation mechanism, yet the manuscript provides no quantitative validation—such as correlation with measured keypoint overlap, shared visual content, human difficulty ratings, or an ablation replacing the metric with random ordering—on Ego-Exo4D, LEMMA, or EPFL-Smart-Kitchen-30. Without this, the curriculum component risks being non-predictive of actual view difficulty.

minor comments (2)

[Abstract and Section 5] The abstract and results sections would benefit from reporting specific quantitative margins (e.g., absolute improvements in mAP or accuracy with error bars) rather than the generic claim of outperforming SOTA.
[Method] Notation for the geometry metric and curriculum progression rate should be introduced with explicit equations to improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of the single-view inference setting. We address the major comment below and will revise the manuscript accordingly to strengthen the attribution of gains to the curriculum mechanism.

read point-by-point responses

Referee: [Curriculum learning procedure (Section 3.2)] The geometry-based occlusion metric (defined to sort segments for the curriculum) is load-bearing for attributing performance gains to the proposed adaptation mechanism, yet the manuscript provides no quantitative validation—such as correlation with measured keypoint overlap, shared visual content, human difficulty ratings, or an ablation replacing the metric with random ordering—on Ego-Exo4D, LEMMA, or EPFL-Smart-Kitchen-30. Without this, the curriculum component risks being non-predictive of actual view difficulty.

Authors: We agree that quantitative validation of the geometry-based occlusion metric is necessary to more convincingly attribute performance improvements to the curriculum ordering rather than other factors. The metric is computed from projected 3D keypoints and relative camera poses to estimate the degree of view-induced occlusion without additional supervision. In the revised manuscript we will add (i) an ablation that replaces the proposed ordering with random segment ordering and reports the resulting performance on all three datasets, and (ii) correlation analysis between metric scores and keypoint overlap ratios (where 3D annotations are available) to provide direct evidence that the ordering reflects actual view difficulty. These additions will be included in Section 3.2 and the experimental section. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation relies on an externally defined geometry-based occlusion metric constructed from camera parameters or keypoint overlap to order training segments for curriculum learning, followed by a knowledge-distillation objective trained on multi-view data and evaluated on independent benchmarks (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). No equation or procedure reduces the reported performance gains to a fitted parameter, self-defined target, or load-bearing self-citation; the metric is presented as a geometric proxy rather than optimized against the final task metrics, and inference uses single-view input without reference to the training ordering. The chain is therefore self-contained against external data and standard evaluation protocols.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate all free parameters or axioms; the approach assumes availability of multi-view training data and that distillation can preserve semantics across views.

free parameters (1)

curriculum progression rate
The schedule for moving from easier to harder view pairs is not specified and is likely tuned on validation data.

axioms (1)

domain assumption Multi-view training data with varying occlusion levels is available during training
The curriculum and distillation rely on access to such paired data, as stated in the abstract.

pith-pipeline@v0.9.0 · 5732 in / 1258 out tokens · 47445 ms · 2026-05-22T20:32:08.120431+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction and orbit embedding unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We divide training into P phases … In each phase p, we choose the cross-view positive distillation target … rτ(vpos) = max(0, rτ(vi) − p). … last lP epochs reserved for the final phase (lP = 50% of M).
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J-cost uniqueness) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HOIi = cos(g′i, pcenter − t′i) … hierarchically sort first using XY cosine similarity … then sort views within each set using the HOI-based view-similarity metric

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages

[1]

Ht-step: Align- ing instructional articles with how-to videos

Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagara- jan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Align- ing instructional articles with how-to videos. In Advances 8 in Neural Information Processing Systems , pages 50310– 50326. Curran Associates, Inc., 2023. 1, 6

work page 2023
[2]

An exocentric look at ego- centric actions and vice versa

Shervin Ardeshir and Ali Borji. An exocentric look at ego- centric actions and vice versa. Computer Vision and Image Understanding, 171:61–68, 2018. 2

work page 2018
[3]

Video-mined task graphs for keystep recognition in instructional videos, 2023

Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Tri- antafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos, 2023. 3

work page 2023
[4]

Siddhant Bansal, Chetan Arora, and C. V . Jawahar. My view is the best view: Procedure learning from egocentric videos,

work page
[5]

Local- izing moments in long video via multimodal guidance, 2023

Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos- Arroyo, Fabian Caba Heilbron, and Bernard Ghanem. Local- izing moments in long video via multimodal guidance, 2023. 7

work page 2023
[6]

Is space-time attention all you need for video understanding?,

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding?,

work page
[7]

A short note about kinetics- 600, 2018

Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics- 600, 2018. 1

work page 2018
[8]

4diff: 3d- aware diffusion model for third-to-first viewpoint translation

Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, and Kristen Grauman. 4diff: 3d- aware diffusion model for third-to-first viewpoint translation. In Computer Vision – ECCV 2024 , pages 409–427, Cham,

work page 2024
[9]

Springer Nature Switzerland. 2

work page
[10]

Srijan Das and Michael S. Ryoo. Viewclr: Learning self- supervised video representation for unseen viewpoints. In 2023 IEEE/CVF Winter Conference on Applications of Com- puter Vision (WACV), pages 5562–5572, 2023. 2

work page 2023
[11]

Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsu- pervised visual representation learning by context prediction,

work page
[12]

Activitynet: A large-scale video bench- mark for human activity understanding

Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. Activitynet: A large-scale video bench- mark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 1, 6

work page 2015
[13]

Learning to recog- nize activities from the wrong view point

Ali Farhadi and Mostafa Kamali Tabrizi. Learning to recog- nize activities from the wrong view point. In Proceedings of the 10th European Conference on Computer Vision: Part I , page 154–166, Berlin, Heidelberg, 2008. Springer-Verlag. 2

work page 2008
[14]

Learning temporal sentence grounding from narrated egovideos, 2023

Kevin Flanagan, Dima Damen, and Michael Wray. Learning temporal sentence grounding from narrated egovideos, 2023. 3, 5, 7, 13

work page 2023
[15]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu- Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

work page 2024
[16]

Temporal alignment networks for long-term video, 2022

Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video, 2022. 3, 5

work page 2022
[17]

View-invariant action recognition based on artificial neu- ral networks

Alexandros Iosifidis, Anastasios Tefas, and Ioannis Pitas. View-invariant action recognition based on artificial neu- ral networks. IEEE Transactions on Neural Networks and Learning Systems, 23(3):412–424, 2012. 2

work page 2012
[18]

Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,

Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,

work page
[19]

The kinetics human action video dataset, 2017

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017. 1

work page 2017
[20]

Kuehne, H

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recogni- tion. In 2011 International Conference on Computer Vision, pages 2556–2563, 2011. 1

work page 2011
[21]

Unsupervised learning of view-invariant action repre- sentations

Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankan- halli. Unsupervised learning of view-invariant action repre- sentations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018. 2

work page 2018
[22]

Learning distortion invariant representation for image restoration from a causality perspective, 2023

Xin Li, Bingchen Li, Xin Jin, Cuiling Lan, and Zhibo Chen. Learning distortion invariant representation for image restoration from a causality perspective, 2023. 2

work page 2023
[23]

Ego-exo: Transferring visual representations from third-person to first-person videos

Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grau- man. Ego-exo: Transferring visual representations from third-person to first-person videos. 2021 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 6939–6949, 2021. 2

work page 2021
[24]

Put myself in your shoes: Lifting the egocentric perspective from exocentric videos, 2024

Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos, 2024. 2

work page 2024
[25]

Learning to ground instructional articles in videos through narrations, 2023

Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations, 2023. 3, 5

work page 2023
[26]

9 Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 9 Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. 1, 6

work page 2019
[27]

AJ Piergiovanni and Michael S. Ryoo. Recognizing ac- tions in videos from unseen viewpoints. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4122–4130, Los Alamitos, CA, USA, 2021. IEEE Computer Society. 2

work page 2021
[28]

Egovlpv2: Egocentric video-language pre-training with fusion in the backbone, 2023

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone, 2023. 6, 7, 12, 13

work page 2023
[29]

The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020

Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020. 3

work page 2020
[30]

Learning a non-linear knowledge transfer model for cross-view action recognition

Hossein Rahmani and Ajmal Mian. Learning a non-linear knowledge transfer model for cross-view action recognition. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2458–2466, 2015. 2

work page 2015
[31]

On the benefits of 3d pose and tracking for human action recognition

Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, and Jitendra Malik. On the benefits of 3d pose and tracking for human action recognition. In CVPR, 2023. 2

work page 2023
[32]

Ground- ing action descriptions in videos

Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Ground- ing action descriptions in videos. Transactions of the Associ- ation for Computational Linguistics (TACL), 1:25–36, 2013. 2

work page 2013
[33]

Unsu- pervised view-invariant human posture representation, 2024

Faegheh Sardari, Bj ¨orn Ommer, and Majid Mirmehdi. Unsu- pervised view-invariant human posture representation, 2024. 2

work page 2024
[34]

Learning state-invariant representations of objects from image collections with state, pose, and viewpoint changes, 2024

Rohan Sarkar and Avinash Kak. Learning state-invariant representations of objects from image collections with state, pose, and viewpoint changes, 2024. 2

work page 2024
[35]

M. Shah, B. Kuipers, S. Savarese, and Jingen Liu. Cross- view action recognition via view knowledge transfer. In2013 IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 3209–3216, Los Alamitos, CA, USA, 2011. IEEE Computer Society. 2

work page 2011
[36]

Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari

Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large- scale dataset of paired third and first person videos, 2018. 2, 3

work page 2018
[37]

Mad: A scalable dataset for language grounding in videos from movie audio descriptions, 2022

Mattia Soldan, Alejandro Pardo, Juan Le ´on Alc ´azar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, and Bernard Ghanem. Mad: A scalable dataset for language grounding in videos from movie audio descriptions, 2022. 1

work page 2022
[38]

Ego4d goal-step: To- ward hierarchical understanding of procedural activities

Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4d goal-step: To- ward hierarchical understanding of procedural activities. In Advances in Neural Information Processing Systems , pages 38863–38886. Curran Associates, Inc., 2023. 3

work page 2023
[39]

Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012. 1

work page 2012
[40]

Action recog- nition in the presence of one egocentric and multiple static cameras, 2014

Bilge Soran, Ali Farhadi, and Linda Shapiro. Action recog- nition in the presence of one egocentric and multiple static cameras, 2014. 2

work page 2014
[41]

View-invariant proba- bilistic embedding for human pose

Jennifer J Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, and Ting Liu. View-invariant proba- bilistic embedding for human pose. In European Conference on Computer Vision, pages 53–70. Springer, 2020. 2

work page 2020
[42]

Comprehensive in- structional video analysis: The coin dataset and performance evaluation

Yansong Tang, Jiwen Lu, and Jie Zhou. Comprehensive in- structional video analysis: The coin dataset and performance evaluation. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 43(9):3138–3153, 2021. 3

work page 2021
[43]

Repre- sentation learning with contrastive predictive coding, 2019

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding, 2019. 7

work page 2019
[44]

Cross-view action modeling, learning and recognition,

Jiang wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Cross-view action modeling, learning and recognition,

work page
[45]

Free viewpoint action recognition using motion history volumes

Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249–257, 2006. 2

work page 2006
[46]

Learning fine- grained view-invariant representations from unpaired ego- exo videos via temporal alignment

Zihui (Sherry) Xue and Kristen Grauman. Learning fine- grained view-invariant representations from unpaired ego- exo videos via temporal alignment. In Advances in Neural Information Processing Systems, pages 53688–53710. Cur- ran Associates, Inc., 2023. 2

work page 2023
[47]

What i see is what you see: Joint attention learning for first and third person video co-analysis

Huangyue Yu, Minjie Cai, Yunfei Liu, and Feng Lu. What i see is what you see: Joint attention learning for first and third person video co-analysis. Proceedings of the 27th ACM International Conference on Multimedia, 2019. 2

work page 2019
[48]

View-robust neural networks for unseen human action recognition in videos

Jiahui Yu, Tianyu Ma, Zhaojie Ju, Hang Chen, and Yingke Xu. View-robust neural networks for unseen human action recognition in videos. In 2022 IEEE International Confer- ence on Systems, Man, and Cybernetics (SMC), pages 1242– 1247, 2022. 2

work page 2022
[49]

Dense regression network for video grounding, 2020

Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding, 2020. 5

work page 2020
[50]

Temporal sentence grounding in videos: A survey and fu- ture directions

Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in videos: A survey and fu- ture directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10443–10465, 2023. 3, 5

work page 2023
[51]

View adaptive recurrent neural networks for high performance human action recog- nition from skeleton data, 2017

Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recog- nition from skeleton data, 2017. 2

work page 2017
[52]

Cross-view action recog- nition via a continuous virtual path

Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, and Cunzhao Shi. Cross-view action recog- nition via a continuous virtual path. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2013. 2

work page 2013
[53]

Luowei Zhou, Chenliang Xu, and Jason J. Corso. To- wards automatic learning of procedures from web instruc- tional videos, 2017. 3

work page 2017
[54]

Cross- task weakly supervised learning from instructional videos,

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross- task weakly supervised learning from instructional videos,

work page
[55]

8.1) — We report the full version of Table 1 across all IoU thresholdsθ, as mentioned in Sec

Full keystep grounding results (Sec. 8.1) — We report the full version of Table 1 across all IoU thresholdsθ, as mentioned in Sec. 6 (Temporal Keystep Grounding) of the main paper

work page
[56]

8.2) — We provide an analysis of our model’s performance relative to EgoVLPv2 (strongest baseline) within each unique keystep name as well as within each high-level activity

Keystep grounding results stratified by keystep name and task (Sec. 8.2) — We provide an analysis of our model’s performance relative to EgoVLPv2 (strongest baseline) within each unique keystep name as well as within each high-level activity

work page
[57]

EgoVLPv2 (Sec

Feature similarity with ego feature vs. EgoVLPv2 (Sec. 8.3) — We provide an analysis demonstrating close alignment between our learned features from any source view and the corresponding ego video features at each moment as verification of effective distillation between target and source views

work page
[58]

Results on keystep grounding in seen and unseen environments (Sec. 8.4) — We stratify our test set by videos from environments observed during training (test-seen) and from environments unseen during train- ing (test-unseen) to evaluate robustness of our approach to novel scenes

work page
[59]

Ablations of camera ranking algorithm/use. (Sec. 8.5) — We train a model with several varia- tions of our camera ranking to quantitatively validate its utility vs. selecting a random distillation target, as well as to confirm that our particular camera ranking is effective

work page
[60]

Demo video. We provide a short video on our project page with qualitative examples of our view ranking across diverse scenarios, as well as qualitative keystep grounding examples with EgoVLPv2-based grounding – our strongest baseline – for reference, on videos from diverse activities and viewpoints, as well as failure cases. 8.1. Complete keystep groundin...

work page
[61]

ge- ometric

the same view and 2) the same (synchronous) action, but a severely occluded viewpoint. 8.4. Evaluation on seen vs. unseen environments We stratify our test set into videos that are recorded in physical environments which were observed during training (test-seen), and videos recorded in five ”unseen” environ- ments that were unobserved during training (tes...

work page

[1] [1]

Ht-step: Align- ing instructional articles with how-to videos

Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagara- jan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Align- ing instructional articles with how-to videos. In Advances 8 in Neural Information Processing Systems , pages 50310– 50326. Curran Associates, Inc., 2023. 1, 6

work page 2023

[2] [2]

An exocentric look at ego- centric actions and vice versa

Shervin Ardeshir and Ali Borji. An exocentric look at ego- centric actions and vice versa. Computer Vision and Image Understanding, 171:61–68, 2018. 2

work page 2018

[3] [3]

Video-mined task graphs for keystep recognition in instructional videos, 2023

Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Tri- antafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos, 2023. 3

work page 2023

[4] [4]

Siddhant Bansal, Chetan Arora, and C. V . Jawahar. My view is the best view: Procedure learning from egocentric videos,

work page

[5] [5]

Local- izing moments in long video via multimodal guidance, 2023

Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos- Arroyo, Fabian Caba Heilbron, and Bernard Ghanem. Local- izing moments in long video via multimodal guidance, 2023. 7

work page 2023

[6] [6]

Is space-time attention all you need for video understanding?,

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding?,

work page

[7] [7]

A short note about kinetics- 600, 2018

Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics- 600, 2018. 1

work page 2018

[8] [8]

4diff: 3d- aware diffusion model for third-to-first viewpoint translation

Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, and Kristen Grauman. 4diff: 3d- aware diffusion model for third-to-first viewpoint translation. In Computer Vision – ECCV 2024 , pages 409–427, Cham,

work page 2024

[9] [9]

Springer Nature Switzerland. 2

work page

[10] [10]

Srijan Das and Michael S. Ryoo. Viewclr: Learning self- supervised video representation for unseen viewpoints. In 2023 IEEE/CVF Winter Conference on Applications of Com- puter Vision (WACV), pages 5562–5572, 2023. 2

work page 2023

[11] [11]

Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsu- pervised visual representation learning by context prediction,

work page

[12] [12]

Activitynet: A large-scale video bench- mark for human activity understanding

Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. Activitynet: A large-scale video bench- mark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 1, 6

work page 2015

[13] [13]

Learning to recog- nize activities from the wrong view point

Ali Farhadi and Mostafa Kamali Tabrizi. Learning to recog- nize activities from the wrong view point. In Proceedings of the 10th European Conference on Computer Vision: Part I , page 154–166, Berlin, Heidelberg, 2008. Springer-Verlag. 2

work page 2008

[14] [14]

Learning temporal sentence grounding from narrated egovideos, 2023

Kevin Flanagan, Dima Damen, and Michael Wray. Learning temporal sentence grounding from narrated egovideos, 2023. 3, 5, 7, 13

work page 2023

[15] [15]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu- Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

work page 2024

[16] [16]

Temporal alignment networks for long-term video, 2022

Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video, 2022. 3, 5

work page 2022

[17] [17]

View-invariant action recognition based on artificial neu- ral networks

Alexandros Iosifidis, Anastasios Tefas, and Ioannis Pitas. View-invariant action recognition based on artificial neu- ral networks. IEEE Transactions on Neural Networks and Learning Systems, 23(3):412–424, 2012. 2

work page 2012

[18] [18]

Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,

Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,

work page

[19] [19]

The kinetics human action video dataset, 2017

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017. 1

work page 2017

[20] [20]

Kuehne, H

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recogni- tion. In 2011 International Conference on Computer Vision, pages 2556–2563, 2011. 1

work page 2011

[21] [21]

Unsupervised learning of view-invariant action repre- sentations

Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankan- halli. Unsupervised learning of view-invariant action repre- sentations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018. 2

work page 2018

[22] [22]

Learning distortion invariant representation for image restoration from a causality perspective, 2023

Xin Li, Bingchen Li, Xin Jin, Cuiling Lan, and Zhibo Chen. Learning distortion invariant representation for image restoration from a causality perspective, 2023. 2

work page 2023

[23] [23]

Ego-exo: Transferring visual representations from third-person to first-person videos

Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grau- man. Ego-exo: Transferring visual representations from third-person to first-person videos. 2021 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 6939–6949, 2021. 2

work page 2021

[24] [24]

Put myself in your shoes: Lifting the egocentric perspective from exocentric videos, 2024

Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos, 2024. 2

work page 2024

[25] [25]

Learning to ground instructional articles in videos through narrations, 2023

Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations, 2023. 3, 5

work page 2023

[26] [26]

9 Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 9 Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. 1, 6

work page 2019

[27] [27]

AJ Piergiovanni and Michael S. Ryoo. Recognizing ac- tions in videos from unseen viewpoints. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4122–4130, Los Alamitos, CA, USA, 2021. IEEE Computer Society. 2

work page 2021

[28] [28]

Egovlpv2: Egocentric video-language pre-training with fusion in the backbone, 2023

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone, 2023. 6, 7, 12, 13

work page 2023

[29] [29]

The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020

Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020. 3

work page 2020

[30] [30]

Learning a non-linear knowledge transfer model for cross-view action recognition

Hossein Rahmani and Ajmal Mian. Learning a non-linear knowledge transfer model for cross-view action recognition. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2458–2466, 2015. 2

work page 2015

[31] [31]

On the benefits of 3d pose and tracking for human action recognition

Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, and Jitendra Malik. On the benefits of 3d pose and tracking for human action recognition. In CVPR, 2023. 2

work page 2023

[32] [32]

Ground- ing action descriptions in videos

Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Ground- ing action descriptions in videos. Transactions of the Associ- ation for Computational Linguistics (TACL), 1:25–36, 2013. 2

work page 2013

[33] [33]

Unsu- pervised view-invariant human posture representation, 2024

Faegheh Sardari, Bj ¨orn Ommer, and Majid Mirmehdi. Unsu- pervised view-invariant human posture representation, 2024. 2

work page 2024

[34] [34]

Learning state-invariant representations of objects from image collections with state, pose, and viewpoint changes, 2024

Rohan Sarkar and Avinash Kak. Learning state-invariant representations of objects from image collections with state, pose, and viewpoint changes, 2024. 2

work page 2024

[35] [35]

M. Shah, B. Kuipers, S. Savarese, and Jingen Liu. Cross- view action recognition via view knowledge transfer. In2013 IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 3209–3216, Los Alamitos, CA, USA, 2011. IEEE Computer Society. 2

work page 2011

[36] [36]

Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari

Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large- scale dataset of paired third and first person videos, 2018. 2, 3

work page 2018

[37] [37]

Mad: A scalable dataset for language grounding in videos from movie audio descriptions, 2022

Mattia Soldan, Alejandro Pardo, Juan Le ´on Alc ´azar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, and Bernard Ghanem. Mad: A scalable dataset for language grounding in videos from movie audio descriptions, 2022. 1

work page 2022

[38] [38]

Ego4d goal-step: To- ward hierarchical understanding of procedural activities

Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4d goal-step: To- ward hierarchical understanding of procedural activities. In Advances in Neural Information Processing Systems , pages 38863–38886. Curran Associates, Inc., 2023. 3

work page 2023

[39] [39]

Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012. 1

work page 2012

[40] [40]

Action recog- nition in the presence of one egocentric and multiple static cameras, 2014

Bilge Soran, Ali Farhadi, and Linda Shapiro. Action recog- nition in the presence of one egocentric and multiple static cameras, 2014. 2

work page 2014

[41] [41]

View-invariant proba- bilistic embedding for human pose

Jennifer J Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, and Ting Liu. View-invariant proba- bilistic embedding for human pose. In European Conference on Computer Vision, pages 53–70. Springer, 2020. 2

work page 2020

[42] [42]

Comprehensive in- structional video analysis: The coin dataset and performance evaluation

Yansong Tang, Jiwen Lu, and Jie Zhou. Comprehensive in- structional video analysis: The coin dataset and performance evaluation. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 43(9):3138–3153, 2021. 3

work page 2021

[43] [43]

Repre- sentation learning with contrastive predictive coding, 2019

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding, 2019. 7

work page 2019

[44] [44]

Cross-view action modeling, learning and recognition,

Jiang wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Cross-view action modeling, learning and recognition,

work page

[45] [45]

Free viewpoint action recognition using motion history volumes

Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249–257, 2006. 2

work page 2006

[46] [46]

Learning fine- grained view-invariant representations from unpaired ego- exo videos via temporal alignment

Zihui (Sherry) Xue and Kristen Grauman. Learning fine- grained view-invariant representations from unpaired ego- exo videos via temporal alignment. In Advances in Neural Information Processing Systems, pages 53688–53710. Cur- ran Associates, Inc., 2023. 2

work page 2023

[47] [47]

What i see is what you see: Joint attention learning for first and third person video co-analysis

Huangyue Yu, Minjie Cai, Yunfei Liu, and Feng Lu. What i see is what you see: Joint attention learning for first and third person video co-analysis. Proceedings of the 27th ACM International Conference on Multimedia, 2019. 2

work page 2019

[48] [48]

View-robust neural networks for unseen human action recognition in videos

Jiahui Yu, Tianyu Ma, Zhaojie Ju, Hang Chen, and Yingke Xu. View-robust neural networks for unseen human action recognition in videos. In 2022 IEEE International Confer- ence on Systems, Man, and Cybernetics (SMC), pages 1242– 1247, 2022. 2

work page 2022

[49] [49]

Dense regression network for video grounding, 2020

Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding, 2020. 5

work page 2020

[50] [50]

Temporal sentence grounding in videos: A survey and fu- ture directions

Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in videos: A survey and fu- ture directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10443–10465, 2023. 3, 5

work page 2023

[51] [51]

View adaptive recurrent neural networks for high performance human action recog- nition from skeleton data, 2017

Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recog- nition from skeleton data, 2017. 2

work page 2017

[52] [52]

Cross-view action recog- nition via a continuous virtual path

Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, and Cunzhao Shi. Cross-view action recog- nition via a continuous virtual path. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2013. 2

work page 2013

[53] [53]

Luowei Zhou, Chenliang Xu, and Jason J. Corso. To- wards automatic learning of procedures from web instruc- tional videos, 2017. 3

work page 2017

[54] [54]

Cross- task weakly supervised learning from instructional videos,

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross- task weakly supervised learning from instructional videos,

work page

[55] [55]

8.1) — We report the full version of Table 1 across all IoU thresholdsθ, as mentioned in Sec

Full keystep grounding results (Sec. 8.1) — We report the full version of Table 1 across all IoU thresholdsθ, as mentioned in Sec. 6 (Temporal Keystep Grounding) of the main paper

work page

[56] [56]

8.2) — We provide an analysis of our model’s performance relative to EgoVLPv2 (strongest baseline) within each unique keystep name as well as within each high-level activity

Keystep grounding results stratified by keystep name and task (Sec. 8.2) — We provide an analysis of our model’s performance relative to EgoVLPv2 (strongest baseline) within each unique keystep name as well as within each high-level activity

work page

[57] [57]

EgoVLPv2 (Sec

Feature similarity with ego feature vs. EgoVLPv2 (Sec. 8.3) — We provide an analysis demonstrating close alignment between our learned features from any source view and the corresponding ego video features at each moment as verification of effective distillation between target and source views

work page

[58] [58]

Results on keystep grounding in seen and unseen environments (Sec. 8.4) — We stratify our test set by videos from environments observed during training (test-seen) and from environments unseen during train- ing (test-unseen) to evaluate robustness of our approach to novel scenes

work page

[59] [59]

Ablations of camera ranking algorithm/use. (Sec. 8.5) — We train a model with several varia- tions of our camera ranking to quantitatively validate its utility vs. selecting a random distillation target, as well as to confirm that our particular camera ranking is effective

work page

[60] [60]

Demo video. We provide a short video on our project page with qualitative examples of our view ranking across diverse scenarios, as well as qualitative keystep grounding examples with EgoVLPv2-based grounding – our strongest baseline – for reference, on videos from diverse activities and viewpoints, as well as failure cases. 8.1. Complete keystep groundin...

work page

[61] [61]

ge- ometric

the same view and 2) the same (synchronous) action, but a severely occluded viewpoint. 8.4. Evaluation on seen vs. unseen environments We stratify our test set into videos that are recorded in physical environments which were observed during training (test-seen), and videos recorded in five ”unseen” environ- ments that were unobserved during training (tes...

work page