ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes
Pith reviewed 2026-05-22 20:32 UTC · model grok-4.3
The pith
Curriculum knowledge distillation with geometry-sorted view pairs produces video representations invariant to extreme viewpoint changes from single-view input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViewBridge shows that a knowledge distillation objective paired with a curriculum of incrementally harder viewpoint pairs, ordered by a geometry-based occlusion metric, yields video representations that remain effective for activity understanding under extreme view shifts, with inference performed on single uncalibrated viewpoints and superior performance on keystep tasks over three datasets.
What carries the argument
The curriculum learning procedure that uses a geometry-based metric to estimate occlusion levels and orders training segments into progressively more challenging view pairs for the knowledge distillation objective.
If this is right
- Models trained with multi-view data can be deployed on single-view videos for activity analysis in cluttered real-world settings.
- View-invariant representations become feasible without requiring controlled minimal-occlusion training footage.
- Temporal localization and fine-grained classification of activity steps improve when viewpoint differences are bridged gradually.
- The framework supports inference on uncalibrated videos while still leveraging multi-view supervision during training.
Where Pith is reading between the lines
- Similar curriculum strategies ordered by geometric difficulty could transfer to other video domain-adaptation problems where view or appearance gaps are large.
- If the occlusion metric generalizes, it may serve as a template for quantifying training difficulty in additional invariance tasks such as lighting or motion changes.
- The separation of multi-view training from single-view inference suggests a practical route for scaling activity models to mobile or wearable camera settings.
Load-bearing premise
A geometry-based metric can be defined that accurately reflects the likely occlusion level of training video segments to enable effective curriculum sorting.
What would settle it
An experiment in which random or non-geometry-based ordering of view pairs during curriculum training produces equal or better results on the same keystep grounding and recognition tasks would falsify the contribution of the proposed metric and sorting.
Figures
read the original abstract
Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view-occlusions. We introduce a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time, the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks -- temporal keystep grounding and fine-grained keystep recognition -- we outperform SOTA approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/ .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ViewBridge, a curriculum-based knowledge distillation framework for view-invariant video representation learning under extreme viewpoint changes and occlusions. It defines a geometry-based metric to order training segments by likely occlusion level, progressively pairing more challenging views during training while preserving action-centric semantics via distillation. Training uses multi-view data but inference operates on single uncalibrated views. The method is evaluated on temporal keystep grounding and fine-grained keystep recognition, reporting outperformance over SOTA baselines across Ego-Exo4D, LEMMA, and EPFL-Smart-Kitchen-30.
Significance. If the geometry-based curriculum metric is shown to meaningfully rank view difficulty and the reported gains are attributable to the proposed mechanism rather than other factors, the work could meaningfully advance view-invariant activity understanding for in-the-wild egocentric-exocentric scenarios. The single-view inference setting and use of independent multi-view training data are practical strengths; the curriculum idea addresses a recognized challenge in gradual adaptation to severe occlusions.
major comments (1)
- [Curriculum learning procedure (Section 3.2)] The geometry-based occlusion metric (defined to sort segments for the curriculum) is load-bearing for attributing performance gains to the proposed adaptation mechanism, yet the manuscript provides no quantitative validation—such as correlation with measured keypoint overlap, shared visual content, human difficulty ratings, or an ablation replacing the metric with random ordering—on Ego-Exo4D, LEMMA, or EPFL-Smart-Kitchen-30. Without this, the curriculum component risks being non-predictive of actual view difficulty.
minor comments (2)
- [Abstract and Section 5] The abstract and results sections would benefit from reporting specific quantitative margins (e.g., absolute improvements in mAP or accuracy with error bars) rather than the generic claim of outperforming SOTA.
- [Method] Notation for the geometry metric and curriculum progression rate should be introduced with explicit equations to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical strengths of the single-view inference setting. We address the major comment below and will revise the manuscript accordingly to strengthen the attribution of gains to the curriculum mechanism.
read point-by-point responses
-
Referee: [Curriculum learning procedure (Section 3.2)] The geometry-based occlusion metric (defined to sort segments for the curriculum) is load-bearing for attributing performance gains to the proposed adaptation mechanism, yet the manuscript provides no quantitative validation—such as correlation with measured keypoint overlap, shared visual content, human difficulty ratings, or an ablation replacing the metric with random ordering—on Ego-Exo4D, LEMMA, or EPFL-Smart-Kitchen-30. Without this, the curriculum component risks being non-predictive of actual view difficulty.
Authors: We agree that quantitative validation of the geometry-based occlusion metric is necessary to more convincingly attribute performance improvements to the curriculum ordering rather than other factors. The metric is computed from projected 3D keypoints and relative camera poses to estimate the degree of view-induced occlusion without additional supervision. In the revised manuscript we will add (i) an ablation that replaces the proposed ordering with random segment ordering and reports the resulting performance on all three datasets, and (ii) correlation analysis between metric scores and keypoint overlap ratios (where 3D annotations are available) to provide direct evidence that the ordering reflects actual view difficulty. These additions will be included in Section 3.2 and the experimental section. revision: yes
Circularity Check
No significant circularity detected
full rationale
The derivation relies on an externally defined geometry-based occlusion metric constructed from camera parameters or keypoint overlap to order training segments for curriculum learning, followed by a knowledge-distillation objective trained on multi-view data and evaluated on independent benchmarks (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). No equation or procedure reduces the reported performance gains to a fitted parameter, self-defined target, or load-bearing self-citation; the metric is presented as a geometric proxy rather than optimized against the final task metrics, and inference uses single-view input without reference to the training ordering. The chain is therefore self-contained against external data and standard evaluation protocols.
Axiom & Free-Parameter Ledger
free parameters (1)
- curriculum progression rate
axioms (1)
- domain assumption Multi-view training data with varying occlusion levels is available during training
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat induction and orbit embedding unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We divide training into P phases … In each phase p, we choose the cross-view positive distillation target … rτ(vpos) = max(0, rτ(vi) − p). … last lP epochs reserved for the final phase (lP = 50% of M).
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel (J-cost uniqueness) unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
HOIi = cos(g′i, pcenter − t′i) … hierarchically sort first using XY cosine similarity … then sort views within each set using the HOI-based view-similarity metric
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Ht-step: Align- ing instructional articles with how-to videos
Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagara- jan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Align- ing instructional articles with how-to videos. In Advances 8 in Neural Information Processing Systems , pages 50310– 50326. Curran Associates, Inc., 2023. 1, 6
work page 2023
-
[2]
An exocentric look at ego- centric actions and vice versa
Shervin Ardeshir and Ali Borji. An exocentric look at ego- centric actions and vice versa. Computer Vision and Image Understanding, 171:61–68, 2018. 2
work page 2018
-
[3]
Video-mined task graphs for keystep recognition in instructional videos, 2023
Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Tri- antafyllos Afouras, and Kristen Grauman. Video-mined task graphs for keystep recognition in instructional videos, 2023. 3
work page 2023
-
[4]
Siddhant Bansal, Chetan Arora, and C. V . Jawahar. My view is the best view: Procedure learning from egocentric videos,
-
[5]
Local- izing moments in long video via multimodal guidance, 2023
Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos- Arroyo, Fabian Caba Heilbron, and Bernard Ghanem. Local- izing moments in long video via multimodal guidance, 2023. 7
work page 2023
-
[6]
Is space-time attention all you need for video understanding?,
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding?,
-
[7]
A short note about kinetics- 600, 2018
Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics- 600, 2018. 1
work page 2018
-
[8]
4diff: 3d- aware diffusion model for third-to-first viewpoint translation
Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, and Kristen Grauman. 4diff: 3d- aware diffusion model for third-to-first viewpoint translation. In Computer Vision – ECCV 2024 , pages 409–427, Cham,
work page 2024
-
[9]
Springer Nature Switzerland. 2
-
[10]
Srijan Das and Michael S. Ryoo. Viewclr: Learning self- supervised video representation for unseen viewpoints. In 2023 IEEE/CVF Winter Conference on Applications of Com- puter Vision (WACV), pages 5562–5572, 2023. 2
work page 2023
-
[11]
Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsu- pervised visual representation learning by context prediction,
-
[12]
Activitynet: A large-scale video bench- mark for human activity understanding
Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. Activitynet: A large-scale video bench- mark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 1, 6
work page 2015
-
[13]
Learning to recog- nize activities from the wrong view point
Ali Farhadi and Mostafa Kamali Tabrizi. Learning to recog- nize activities from the wrong view point. In Proceedings of the 10th European Conference on Computer Vision: Part I , page 154–166, Berlin, Heidelberg, 2008. Springer-Verlag. 2
work page 2008
-
[14]
Learning temporal sentence grounding from narrated egovideos, 2023
Kevin Flanagan, Dima Damen, and Michael Wray. Learning temporal sentence grounding from narrated egovideos, 2023. 3, 5, 7, 13
work page 2023
-
[15]
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu- Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...
work page 2024
-
[16]
Temporal alignment networks for long-term video, 2022
Tengda Han, Weidi Xie, and Andrew Zisserman. Temporal alignment networks for long-term video, 2022. 3, 5
work page 2022
-
[17]
View-invariant action recognition based on artificial neu- ral networks
Alexandros Iosifidis, Anastasios Tefas, and Ioannis Pitas. View-invariant action recognition based on artificial neu- ral networks. IEEE Transactions on Neural Networks and Learning Systems, 23(3):412–424, 2012. 2
work page 2012
-
[18]
Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,
-
[19]
The kinetics human action video dataset, 2017
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017. 1
work page 2017
- [20]
-
[21]
Unsupervised learning of view-invariant action repre- sentations
Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankan- halli. Unsupervised learning of view-invariant action repre- sentations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018. 2
work page 2018
-
[22]
Xin Li, Bingchen Li, Xin Jin, Cuiling Lan, and Zhibo Chen. Learning distortion invariant representation for image restoration from a causality perspective, 2023. 2
work page 2023
-
[23]
Ego-exo: Transferring visual representations from third-person to first-person videos
Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grau- man. Ego-exo: Transferring visual representations from third-person to first-person videos. 2021 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 6939–6949, 2021. 2
work page 2021
-
[24]
Put myself in your shoes: Lifting the egocentric perspective from exocentric videos, 2024
Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos, 2024. 2
work page 2024
-
[25]
Learning to ground instructional articles in videos through narrations, 2023
Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations, 2023. 3, 5
work page 2023
-
[26]
9 Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 9 Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. 1, 6
work page 2019
-
[27]
AJ Piergiovanni and Michael S. Ryoo. Recognizing ac- tions in videos from unseen viewpoints. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4122–4130, Los Alamitos, CA, USA, 2021. IEEE Computer Society. 2
work page 2021
-
[28]
Egovlpv2: Egocentric video-language pre-training with fusion in the backbone, 2023
Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone, 2023. 6, 7, 12, 13
work page 2023
-
[29]
Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The meccano dataset: Under- standing human-object interactions from egocentric videos in an industrial-like domain, 2020. 3
work page 2020
-
[30]
Learning a non-linear knowledge transfer model for cross-view action recognition
Hossein Rahmani and Ajmal Mian. Learning a non-linear knowledge transfer model for cross-view action recognition. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2458–2466, 2015. 2
work page 2015
-
[31]
On the benefits of 3d pose and tracking for human action recognition
Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, and Jitendra Malik. On the benefits of 3d pose and tracking for human action recognition. In CVPR, 2023. 2
work page 2023
-
[32]
Ground- ing action descriptions in videos
Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Ground- ing action descriptions in videos. Transactions of the Associ- ation for Computational Linguistics (TACL), 1:25–36, 2013. 2
work page 2013
-
[33]
Unsu- pervised view-invariant human posture representation, 2024
Faegheh Sardari, Bj ¨orn Ommer, and Majid Mirmehdi. Unsu- pervised view-invariant human posture representation, 2024. 2
work page 2024
-
[34]
Rohan Sarkar and Avinash Kak. Learning state-invariant representations of objects from image collections with state, pose, and viewpoint changes, 2024. 2
work page 2024
-
[35]
M. Shah, B. Kuipers, S. Savarese, and Jingen Liu. Cross- view action recognition via view knowledge transfer. In2013 IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 3209–3216, Los Alamitos, CA, USA, 2011. IEEE Computer Society. 2
work page 2011
-
[36]
Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari
Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large- scale dataset of paired third and first person videos, 2018. 2, 3
work page 2018
-
[37]
Mad: A scalable dataset for language grounding in videos from movie audio descriptions, 2022
Mattia Soldan, Alejandro Pardo, Juan Le ´on Alc ´azar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, and Bernard Ghanem. Mad: A scalable dataset for language grounding in videos from movie audio descriptions, 2022. 1
work page 2022
-
[38]
Ego4d goal-step: To- ward hierarchical understanding of procedural activities
Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4d goal-step: To- ward hierarchical understanding of procedural activities. In Advances in Neural Information Processing Systems , pages 38863–38886. Curran Associates, Inc., 2023. 3
work page 2023
-
[39]
Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012. 1
work page 2012
-
[40]
Action recog- nition in the presence of one egocentric and multiple static cameras, 2014
Bilge Soran, Ali Farhadi, and Linda Shapiro. Action recog- nition in the presence of one egocentric and multiple static cameras, 2014. 2
work page 2014
-
[41]
View-invariant proba- bilistic embedding for human pose
Jennifer J Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, and Ting Liu. View-invariant proba- bilistic embedding for human pose. In European Conference on Computer Vision, pages 53–70. Springer, 2020. 2
work page 2020
-
[42]
Comprehensive in- structional video analysis: The coin dataset and performance evaluation
Yansong Tang, Jiwen Lu, and Jie Zhou. Comprehensive in- structional video analysis: The coin dataset and performance evaluation. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 43(9):3138–3153, 2021. 3
work page 2021
-
[43]
Repre- sentation learning with contrastive predictive coding, 2019
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding, 2019. 7
work page 2019
-
[44]
Cross-view action modeling, learning and recognition,
Jiang wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Cross-view action modeling, learning and recognition,
-
[45]
Free viewpoint action recognition using motion history volumes
Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249–257, 2006. 2
work page 2006
-
[46]
Zihui (Sherry) Xue and Kristen Grauman. Learning fine- grained view-invariant representations from unpaired ego- exo videos via temporal alignment. In Advances in Neural Information Processing Systems, pages 53688–53710. Cur- ran Associates, Inc., 2023. 2
work page 2023
-
[47]
What i see is what you see: Joint attention learning for first and third person video co-analysis
Huangyue Yu, Minjie Cai, Yunfei Liu, and Feng Lu. What i see is what you see: Joint attention learning for first and third person video co-analysis. Proceedings of the 27th ACM International Conference on Multimedia, 2019. 2
work page 2019
-
[48]
View-robust neural networks for unseen human action recognition in videos
Jiahui Yu, Tianyu Ma, Zhaojie Ju, Hang Chen, and Yingke Xu. View-robust neural networks for unseen human action recognition in videos. In 2022 IEEE International Confer- ence on Systems, Man, and Cybernetics (SMC), pages 1242– 1247, 2022. 2
work page 2022
-
[49]
Dense regression network for video grounding, 2020
Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding, 2020. 5
work page 2020
-
[50]
Temporal sentence grounding in videos: A survey and fu- ture directions
Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in videos: A survey and fu- ture directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10443–10465, 2023. 3, 5
work page 2023
-
[51]
Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recog- nition from skeleton data, 2017. 2
work page 2017
-
[52]
Cross-view action recog- nition via a continuous virtual path
Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, Shuang Liu, and Cunzhao Shi. Cross-view action recog- nition via a continuous virtual path. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2013. 2
work page 2013
-
[53]
Luowei Zhou, Chenliang Xu, and Jason J. Corso. To- wards automatic learning of procedures from web instruc- tional videos, 2017. 3
work page 2017
-
[54]
Cross- task weakly supervised learning from instructional videos,
Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross- task weakly supervised learning from instructional videos,
-
[55]
8.1) — We report the full version of Table 1 across all IoU thresholdsθ, as mentioned in Sec
Full keystep grounding results (Sec. 8.1) — We report the full version of Table 1 across all IoU thresholdsθ, as mentioned in Sec. 6 (Temporal Keystep Grounding) of the main paper
-
[56]
Keystep grounding results stratified by keystep name and task (Sec. 8.2) — We provide an analysis of our model’s performance relative to EgoVLPv2 (strongest baseline) within each unique keystep name as well as within each high-level activity
-
[57]
Feature similarity with ego feature vs. EgoVLPv2 (Sec. 8.3) — We provide an analysis demonstrating close alignment between our learned features from any source view and the corresponding ego video features at each moment as verification of effective distillation between target and source views
-
[58]
Results on keystep grounding in seen and unseen environments (Sec. 8.4) — We stratify our test set by videos from environments observed during training (test-seen) and from environments unseen during train- ing (test-unseen) to evaluate robustness of our approach to novel scenes
-
[59]
Ablations of camera ranking algorithm/use. (Sec. 8.5) — We train a model with several varia- tions of our camera ranking to quantitatively validate its utility vs. selecting a random distillation target, as well as to confirm that our particular camera ranking is effective
-
[60]
Demo video. We provide a short video on our project page with qualitative examples of our view ranking across diverse scenarios, as well as qualitative keystep grounding examples with EgoVLPv2-based grounding – our strongest baseline – for reference, on videos from diverse activities and viewpoints, as well as failure cases. 8.1. Complete keystep groundin...
-
[61]
the same view and 2) the same (synchronous) action, but a severely occluded viewpoint. 8.4. Evaluation on seen vs. unseen environments We stratify our test set into videos that are recorded in physical environments which were observed during training (test-seen), and videos recorded in five ”unseen” environ- ments that were unobserved during training (tes...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.