
arxiv: 2504.05451 · v2 · pith:SPIUHXXUnew · submitted 2025-04-07 · 💻 cs.CV

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Pith reviewed 2026-05-22 20:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords view-invariant learning · curriculum learning · knowledge distillation · video representation learning · activity recognition · viewpoint changes · keystep grounding

The pith

Curriculum knowledge distillation with geometry-sorted view pairs produces video representations invariant to extreme viewpoint changes from single-view input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets learning rich video representations for activities when training involves severe view-occlusions and extreme viewpoint differences that share little visual content. It combines a knowledge distillation objective that preserves action-centric semantics with a curriculum procedure that gradually pairs more challenging views. Segments are sorted for this curriculum using a geometry-based metric that estimates occlusion levels. Training draws on multi-view data yet the resulting model accepts only uncalibrated single-view videos at inference. The method reports stronger results than prior approaches on temporal keystep grounding and fine-grained keystep recognition across Ego-Exo4D, LEMMA, and EPFL-Smart-Kitchen-30.

Core claim

ViewBridge shows that a knowledge distillation objective paired with a curriculum of incrementally harder viewpoint pairs, ordered by a geometry-based occlusion metric, yields video representations that remain effective for activity understanding under extreme view shifts, with inference performed on single uncalibrated viewpoints and superior performance on keystep tasks over three datasets.

What carries the argument

The curriculum learning procedure that uses a geometry-based metric to estimate occlusion levels and orders training segments into progressively more challenging view pairs for the knowledge distillation objective.
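The ordering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ViewPair` type, the `occlusion_score` field, and the cumulative-stage scheme are all assumptions introduced here, and the score is simply taken as given rather than computed from geometry.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ViewPair:
    """A (source view, target view) training pair with a difficulty score."""
    segment_id: str
    source_view: str
    target_view: str
    occlusion_score: float  # hypothetical: higher means more occluded, i.e. harder


def build_curriculum(pairs: List[ViewPair], num_stages: int) -> List[List[ViewPair]]:
    """Sort pairs from easy to hard and return cumulative training stages.

    Stage k contains everything from earlier stages plus the next-hardest
    pairs, so harder view pairs are introduced gradually.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be >= 1")
    ordered = sorted(pairs, key=lambda p: p.occlusion_score)
    stages = []
    for k in range(1, num_stages + 1):
        # Round up so that the final stage always covers every pair.
        cutoff = math.ceil(len(ordered) * k / num_stages)
        stages.append(ordered[:cutoff])
    return stages
```

With four pairs and two stages, the first stage would hold the two least-occluded pairs and the second stage all four.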

If this is right

  • Models trained with multi-view data can be deployed on single-view videos for activity analysis in cluttered real-world settings.
  • View-invariant representations become feasible without requiring controlled minimal-occlusion training footage.
  • Temporal localization and fine-grained classification of activity steps improve when viewpoint differences are bridged gradually.
  • The framework supports inference on uncalibrated videos while still leveraging multi-view supervision during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar curriculum strategies ordered by geometric difficulty could transfer to other video domain-adaptation problems where view or appearance gaps are large.
  • If the occlusion metric generalizes, it may serve as a template for quantifying training difficulty in additional invariance tasks such as lighting or motion changes.
  • The separation of multi-view training from single-view inference suggests a practical route for scaling activity models to mobile or wearable camera settings.

Load-bearing premise

A geometry-based metric can be defined that accurately reflects the likely occlusion level of training video segments to enable effective curriculum sorting.
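To make the premise concrete, here is one way such a geometric ranking could look, loosely following the Figure 2 caption (rank exo cameras by alignment with the hand-object interaction region, and place cameras in front of the camera wearer ahead of cameras behind). The specific choices are assumptions made for this sketch and are not taken from the paper: alignment is the cosine between a camera's optical axis and the direction to `p_center`, and "in front" is a half-space test against the ego forward direction.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec = Sequence[float]


def _sub(a: Vec, b: Vec) -> Tuple[float, ...]:
    return tuple(x - y for x, y in zip(a, b))


def _dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def _unit(a: Vec) -> Tuple[float, ...]:
    norm = math.sqrt(_dot(a, a))
    if norm == 0:
        raise ValueError("zero-length vector")
    return tuple(x / norm for x in a)


def rank_exo_cameras(
    cameras: Dict[str, Tuple[Vec, Vec]],  # name -> (position, optical axis)
    p_center: Vec,
    ego_position: Vec,
    ego_forward: Vec,
) -> List[str]:
    """Return camera names ordered from least to most likely occluded."""
    forward = _unit(ego_forward)
    keyed = []
    for name, (position, axis) in cameras.items():
        to_center = _unit(_sub(p_center, position))
        alignment = _dot(_unit(axis), to_center)  # cosine in [-1, 1]
        in_front = _dot(_sub(position, ego_position), forward) > 0
        # Cameras in front of the wearer sort first; ties break on alignment.
        keyed.append((not in_front, -alignment, name))
    return [name for _, _, name in sorted(keyed)]
```

The two-level sort key encodes the self-occlusion rule as a hard constraint: a well-aligned camera behind the wearer never outranks a camera in front.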

What would settle it

An experiment in which random or non-geometry-based ordering of view pairs during curriculum training produces equal or better results on the same keystep grounding and recognition tasks would falsify the contribution of the proposed metric and sorting.
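A control of this kind only needs the ordering to change while the set of training pairs stays fixed. The sketch below shows that setup under assumed names (`geometry_order`, `random_order`, and the `(pair_id, score)` tuples are illustrative and not from the paper).

```python
import random
from typing import List, Sequence, Tuple

ScoredPair = Tuple[str, float]  # (pair id, hypothetical occlusion score)


def geometry_order(scored_pairs: Sequence[ScoredPair]) -> List[str]:
    """Easy-to-hard ordering: ascending occlusion score."""
    return [pair_id for pair_id, _ in sorted(scored_pairs, key=lambda x: x[1])]


def random_order(scored_pairs: Sequence[ScoredPair], seed: int) -> List[str]:
    """Control ordering: same pairs, shuffled with a fixed seed."""
    ids = [pair_id for pair_id, _ in scored_pairs]
    rng = random.Random(seed)  # local generator keeps the control reproducible
    rng.shuffle(ids)
    return ids
```

Feeding both orderings through the same pacing schedule and training budget isolates the contribution of the metric from the contribution of gradual exposure itself.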

Figures

Figures reproduced from arXiv: 2504.05451 by Arjun Somayazulu, Changan Chen, Efi Mavroudi, Kristen Grauman, Lorenzo Torresani.

Figure 1
Figure 1: Edited vs. natural procedural video. Top: Whereas edited video switches between close-in shots and wide-body shots to best capture the ongoing action, natural in-the-wild video can instead experience significant object and view occlusions. Bottom: Directly distilling the best view into an impoverished viewpoint has limited utility given the lack of shared visual content. Our curriculum knowledge distilla…
Figure 2
Figure 2: Approach overview. a) Given an ego-worn camera looking down at the active workspace, we rank each exo camera by its view-alignment with the hand-object interaction region p_center (green). To account for self-occlusion by the camera-wearer, we enforce that views facing the ego-camera (1, 2) are ranked ahead of views behind the ego-camera (3, 4). b) For each feature from a source view (highlighted in blue)…
Figure 3
Figure 3: Downstream tasks. a) Our temporal keystep grounding model takes as input an untrimmed video V and a sequence of keysteps N, and regresses the center timestamp $\hat{c}_{n_i}$ and duration $\hat{d}_{n_i}$ for each narration $n_i$. We jointly optimize with our cross-view/cross-temporal knowledge distillation loss (red). b) We pre-train a keystep recognition model on randomly selected clips from untrimmed videos. We rank the views using our m…
Figure 4
Figure 4: t-SNE of learned video features. We visualize video features learned by our grounding model's knowledge distill head (blue), best-view video features (green), and features from other synchronized views (red) on an input chunk of video. Our model closely aligns source view features with the best-view features throughout the video, despite the time-varying nature of the 'best view'. all views (A) and separat…
Figure 5
Figure 5: Mean IoU difference (Ours - EgoVLPv2) by keystep name and task. We compute mean IoU across all instances and views of each unique keystep in the test set, for both our model and the EgoVLPv2-trained grounding model. We display signed mean IoU difference between ours and EgoVLPv2 for the top-20 keysteps (left half) and bottom-20 keysteps (right half) that have largest mean IoU difference. We outperform Ego…
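The quantities in Figures 3 and 5 can be connected with a small sketch: a (center, duration) prediction is turned into an interval, scored with temporal IoU, and then averaged per keystep for two models. The function names and record layout are assumptions made for illustration; the paper's evaluation code may differ in details such as how views and instances are pooled.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Segment = Tuple[float, float]  # (center timestamp, duration), both in seconds


def to_interval(center: float, duration: float) -> Tuple[float, float]:
    """Convert a (center, duration) pair into (start, end)."""
    half = duration / 2.0
    return center - half, center + half


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    p0, p1 = to_interval(*pred)
    g0, g1 = to_interval(*gt)
    inter = max(0.0, min(p1, g1) - max(p0, g0))
    union = (p1 - p0) + (g1 - g0) - inter
    return inter / union if union > 0 else 0.0


def mean_iou_difference(
    records: Iterable[Tuple[str, float, float]],  # (keystep, IoU ours, IoU baseline)
) -> Dict[str, float]:
    """Per-keystep signed difference: mean IoU (ours) minus mean IoU (baseline)."""
    ours, base = defaultdict(list), defaultdict(list)
    for keystep, iou_ours, iou_base in records:
        ours[keystep].append(iou_ours)
        base[keystep].append(iou_base)
    return {
        k: sum(ours[k]) / len(ours[k]) - sum(base[k]) / len(base[k]) for k in ours
    }
```

Sorting the resulting dictionary by value would give the top and bottom keysteps of the kind plotted in Figure 5.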
Original abstract

Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view-occlusions. We introduce a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time, the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks -- temporal keystep grounding and fine-grained keystep recognition -- we outperform SOTA approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/ .
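The abstract does not state the form of the distillation objective. As a generic stand-in only, the sketch below uses a per-timestep cosine distance that pulls source-view features toward the features of a better view. The choice of cosine distance, the function names, and the plain-list feature format are assumptions for illustration and should not be read as the paper's loss.

```python
import math
from typing import Sequence

Feature = Sequence[float]


def cosine_similarity(a: Feature, b: Feature) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def distillation_loss(
    source_feats: Sequence[Feature], best_view_feats: Sequence[Feature]
) -> float:
    """Mean cosine distance between time-aligned source and target features.

    In a real training loop the target (teacher) features would be held
    fixed so that gradients only update the source-view (student) branch.
    """
    if not source_feats or len(source_feats) != len(best_view_feats):
        raise ValueError("need two non-empty, time-aligned feature sequences")
    total = sum(
        1.0 - cosine_similarity(s, t) for s, t in zip(source_feats, best_view_feats)
    )
    return total / len(source_feats)
```

The loss is 0 when every source feature points in the same direction as its target and grows toward 2 as they become opposed.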

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ViewBridge, a curriculum-based knowledge distillation framework for view-invariant video representation learning under extreme viewpoint changes and occlusions. It defines a geometry-based metric to order training segments by likely occlusion level, progressively pairing more challenging views during training while preserving action-centric semantics via distillation. Training uses multi-view data but inference operates on single uncalibrated views. The method is evaluated on temporal keystep grounding and fine-grained keystep recognition, reporting outperformance over SOTA baselines across Ego-Exo4D, LEMMA, and EPFL-Smart-Kitchen-30.

Significance. If the geometry-based curriculum metric is shown to rank view difficulty reliably, and the reported gains are attributable to the proposed mechanism rather than other factors, the work could advance view-invariant activity understanding for in-the-wild egocentric-exocentric scenarios. The single-view inference setting and use of independent multi-view training data are practical strengths; the curriculum idea addresses a recognized challenge in gradual adaptation to severe occlusions.

major comments (1)
  1. [Curriculum learning procedure (Section 3.2)] The geometry-based occlusion metric (defined to sort segments for the curriculum) is load-bearing for attributing performance gains to the proposed adaptation mechanism, yet the manuscript provides no quantitative validation—such as correlation with measured keypoint overlap, shared visual content, human difficulty ratings, or an ablation replacing the metric with random ordering—on Ego-Exo4D, LEMMA, or EPFL-Smart-Kitchen-30. Without this, the curriculum component risks being non-predictive of actual view difficulty.
minor comments (2)
  1. [Abstract and Section 5] The abstract and results sections would benefit from reporting specific quantitative margins (e.g., absolute improvements in mAP or accuracy with error bars) rather than the generic claim of outperforming SOTA.
  2. [Method] Notation for the geometry metric and curriculum progression rate should be introduced with explicit equations to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of the single-view inference setting. We address the major comment below and will revise the manuscript accordingly to strengthen the attribution of gains to the curriculum mechanism.

Point-by-point responses
  1. Referee: [Curriculum learning procedure (Section 3.2)] The geometry-based occlusion metric (defined to sort segments for the curriculum) is load-bearing for attributing performance gains to the proposed adaptation mechanism, yet the manuscript provides no quantitative validation—such as correlation with measured keypoint overlap, shared visual content, human difficulty ratings, or an ablation replacing the metric with random ordering—on Ego-Exo4D, LEMMA, or EPFL-Smart-Kitchen-30. Without this, the curriculum component risks being non-predictive of actual view difficulty.

    Authors: We agree that quantitative validation of the geometry-based occlusion metric is necessary to more convincingly attribute performance improvements to the curriculum ordering rather than other factors. The metric is computed from projected 3D keypoints and relative camera poses to estimate the degree of view-induced occlusion without additional supervision. In the revised manuscript we will add (i) an ablation that replaces the proposed ordering with random segment ordering and reports the resulting performance on all three datasets, and (ii) correlation analysis between metric scores and keypoint overlap ratios (where 3D annotations are available) to provide direct evidence that the ordering reflects actual view difficulty. These additions will be included in Section 3.2 and the experimental section. revision: yes
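The correlation analysis promised in the rebuttal could take the form of a rank correlation between metric scores and keypoint overlap ratios. Below is a dependency-free sketch of Spearman's coefficient with average ranks for ties; the choice of Spearman rather than another statistic, and all names, are assumptions made here.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

If the metric tracks real view difficulty, occlusion scores and overlap ratios should be strongly negatively correlated, since a more occluded view shares fewer visible keypoints with the reference view.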

Circularity Check

0 steps flagged

No significant circularity detected

Rationale

The derivation relies on an externally defined geometry-based occlusion metric constructed from camera parameters or keypoint overlap to order training segments for curriculum learning, followed by a knowledge-distillation objective trained on multi-view data and evaluated on independent benchmarks (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). No equation or procedure reduces the reported performance gains to a fitted parameter, self-defined target, or load-bearing self-citation; the metric is presented as a geometric proxy rather than optimized against the final task metrics, and inference uses single-view input without reference to the training ordering. The chain is therefore self-contained against external data and standard evaluation protocols.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The abstract provides insufficient detail to enumerate all free parameters or axioms; the approach assumes that multi-view training data is available and that distillation can preserve semantics across views.

free parameters (1)
  • curriculum progression rate
    The schedule for moving from easier to harder view pairs is not specified and is likely tuned on validation data.
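To show what a tunable progression rate might look like, the sketch below uses a linear pacing function that returns the fraction of easiest view pairs exposed at a given epoch. The linear shape and the `start_fraction` and `warmup_ratio` parameters are hypothetical; the source does not specify the schedule.

```python
def visible_fraction(
    epoch: int,
    total_epochs: int,
    start_fraction: float = 0.25,  # hypothetical: share of pairs seen at epoch 0
    warmup_ratio: float = 0.5,     # hypothetical: share of training spent ramping up
) -> float:
    """Fraction of the easy-to-hard sorted pairs available at this epoch."""
    if total_epochs < 1 or epoch < 0:
        raise ValueError("total_epochs must be >= 1 and epoch must be >= 0")
    if not 0.0 < start_fraction <= 1.0 or not 0.0 < warmup_ratio <= 1.0:
        raise ValueError("start_fraction and warmup_ratio must be in (0, 1]")
    warmup_epochs = max(1, int(total_epochs * warmup_ratio))
    progress = min(1.0, epoch / warmup_epochs)
    # Linear ramp from start_fraction to 1.0, then hold at 1.0.
    return start_fraction + (1.0 - start_fraction) * progress
```

Both parameters would need to be chosen on validation data, which is why the ledger counts the progression rate as a free parameter.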
axioms (1)
  • domain assumption: multi-view training data with varying occlusion levels is available during training
    The curriculum and distillation rely on access to such paired data, as stated in the abstract.

pith-pipeline@v0.9.0 · 5732 in / 1258 out tokens · 47445 ms · 2026-05-22T20:32:08.120431+00:00 · methodology


