pith. sign in

arxiv: 2606.20559 · v1 · pith:HUCCEV2Hnew · submitted 2026-06-18 · 💻 cs.CV · cs.LG

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Pith reviewed 2026-06-26 18:23 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords egocentric videoknowledge distillationmulti-teacher distillationaction recognitionvideo retrievalaction segmentationproxy modelsego-exo benchmarks
0
0 comments X

The pith

Proxy models translate knowledge from nine heterogeneous teachers into a unified egocentric video encoder that outperforms direct distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a hierarchical distillation method to build one egocentric video encoder that incorporates knowledge from multiple viewpoints, modalities, and foundation models. Instead of distilling straight from incompatible teachers, the approach inserts representation-specific proxy models that first convert each teacher's signals into a shared egocentric space. A selective stage then distills only from the proxies that are both correct and confident on each sample, and the student begins training from a convex combination of the proxy weights. This produces stronger results than naive multi-teacher baselines on action recognition, retrieval, and segmentation across ego-exo benchmarks.

Core claim

A unified egocentric encoder can be trained by first mapping nine teachers spanning ego-exo views, RGB, depth, skeleton, and foundation models into representation-specific proxy models that create a homogeneous space, then applying selective proxy distillation that retains only reliable signals per sample, with the student initialized as a learned convex combination of the proxies; this structured mediation yields richer egocentric representations than direct multi-teacher distillation.

What carries the argument

Representation-specific Proxy models that convert heterogeneous teacher knowledge into a homogeneous egocentric space, paired with Selective Proxy Distillation that adaptively keeps only correct and confident signals.

If this is right

  • UNIEGO exceeds naive multi-teacher distillation on action recognition, video retrieval, and action segmentation across three ego-exo benchmarks.
  • The two-stage proxy mediation followed by selective distillation produces more discriminative egocentric features than unmediated transfer.
  • Initializing the student as a convex combination of proxy parameters places the optimization in a better-conditioned region of the loss landscape.
  • The framework remains deployable from egocentric video alone while subsuming complementary knowledge from exo, depth, skeleton, and foundation-model sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The proxy-mediation pattern could be applied to other multi-source distillation settings where direct fusion produces gradient conflicts.
  • Selective per-sample filtering may reduce sensitivity to the choice or quality of individual teachers in large-scale distillation.
  • If the convex-combination initialization proves critical, similar warm-start strategies might stabilize other multi-teacher training regimes.

Load-bearing premise

The proxy models can convert knowledge from teachers with different architectures and feature geometries into a single homogeneous space without substantial distortion or conflicting gradients.

What would settle it

An experiment in which direct distillation from the same nine teachers without any proxy layer produces equal or higher accuracy than the full UNIEGO pipeline on the three reported benchmarks.

Figures

Figures reproduced from arXiv: 2606.20559 by Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das, Wenhao Chi.

Figure 1
Figure 1. Figure 1: (a) Naive multi-teacher distillation with heterogeneous teachers for learning unified egocentric representations results in representational gaps and conflicting gradients, as illustrated in (b). (c) In contrast, our proposed UNIEGO adopts a hierarchical distillation framework that mitigates these limitations through proxy-mediated learning, as shown in (d). Black dashed arrows illustrate the effects of th… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of UNIEGO. UNIEGO learns a unified egocentric encoder through a two￾level proxy-mediated distillation framework. In Level-I (left), heterogeneous teachers spanning viewpoints, modalities, and foundation representations independently supervise egocentric proxy models, converting diverse teacher signals into a homogeneous proxy space. In Level-II (right), the proxy parameters are first merged to ini… view at source ↗
Figure 3
Figure 3. Figure 3: Teacher vs Proxy Centered Kernel Alignment scores Conflict Rate (%) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Epoch 10 20 30 40 50 60 70 All Proxy Distillation Direct Multimodal Distillation Selective Proxy Distillation (Top 1) Selective Proxy Distillation (Top 3) [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Selective Proxy Distillation Statistics [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
read the original abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces UNIEGO, a unified egocentric video encoder trained via hierarchical multi-teacher distillation from nine teachers spanning ego-exo viewpoints, RGB/depth/skeleton modalities, and four foundation models. Representation-specific Proxy models first translate the heterogeneous teacher knowledge into a homogeneous egocentric space; Selective Proxy Distillation (SPD) then adaptively selects per-sample reliable and confident proxy signals while suppressing erroneous ones. The unified model is initialized as a learned convex combination of the proxy parameters. The paper claims this yields state-of-the-art performance on action recognition, video retrieval, and action segmentation across three ego-exo benchmarks, outperforming naive multi-teacher distillation baselines.

Significance. If the empirical superiority holds and the proxy-mediated transfer demonstrably avoids distortion, the work would offer a practical route to richer egocentric representations that remain deployable from single-view video alone. The two-stage structure with selective per-sample distillation and convex initialization provides a concrete mechanism for managing heterogeneous supervision that could generalize beyond the reported benchmarks.

major comments (2)
  1. [Abstract] Abstract: the central claim that UNIEGO achieves SOTA performance and outperforms naive multi-teacher distillation is asserted without any reported metrics, dataset identifiers, ablation numbers, or error analysis, so the performance advantage cannot be evaluated from the provided text.
  2. [Abstract] Proxy-model and SPD description (abstract): the claim that structured proxy-mediated transfer yields richer representations rests on the untested premise that the representation-specific Proxies map heterogeneous teachers into homogeneous space without substantial distortion or conflicting gradients; no alignment, reconstruction-error, or gradient-norm evidence is supplied to support this load-bearing assumption.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the three ego-exo benchmarks and the nine teachers explicitly rather than describing them generically.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will incorporate revisions to strengthen the presentation of results and supporting analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that UNIEGO achieves SOTA performance and outperforms naive multi-teacher distillation is asserted without any reported metrics, dataset identifiers, ablation numbers, or error analysis, so the performance advantage cannot be evaluated from the provided text.

    Authors: We agree that the abstract would be more informative with quantitative support. The full manuscript contains tables reporting exact metrics (e.g., top-1 accuracy on Ego4D, EPIC-Kitchens, and Charades-Ego), dataset names, and ablation results comparing UNIEGO to naive multi-teacher distillation. In the revised manuscript we will condense the key numbers, dataset identifiers, and a reference to the ablation study into the abstract. revision: yes

  2. Referee: [Abstract] Proxy-model and SPD description (abstract): the claim that structured proxy-mediated transfer yields richer representations rests on the untested premise that the representation-specific Proxies map heterogeneous teachers into homogeneous space without substantial distortion or conflicting gradients; no alignment, reconstruction-error, or gradient-norm evidence is supplied to support this load-bearing assumption.

    Authors: The manuscript demonstrates the practical benefit of the proxy layer through consistent gains over direct multi-teacher distillation and through selective-distillation ablations. We acknowledge, however, that direct measurements of proxy-teacher alignment, reconstruction error, or per-layer gradient norms are not reported. We will add these analyses (feature cosine similarity, reconstruction MSE, and gradient-norm statistics) to the revised experimental section to provide explicit support for the homogeneity assumption. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical training procedure with external benchmark validation

full rationale

The paper presents a hierarchical multi-teacher distillation method using representation-specific Proxy models followed by Selective Proxy Distillation (SPD) to train UNIEGO. All claims rest on reported empirical performance gains across action recognition, retrieval, and segmentation tasks on ego-exo benchmarks, compared against naive multi-teacher baselines. No equations, self-definitional constructs, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the derivation chain consists of a standard training pipeline whose outputs are validated externally rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full experimental details, teacher implementations, and loss formulations are unavailable, so the ledger is necessarily incomplete.

axioms (1)
  • domain assumption Knowledge distillation from heterogeneous teachers is feasible once feature geometries are aligned via intermediate proxies.
    The entire pipeline presupposes that proxy translation preserves useful supervisory signal.
invented entities (1)
  • Proxy models no independent evidence
    purpose: Translate diverse teacher representations into a homogeneous egocentric space.
    New components introduced by the framework to mediate distillation.

pith-pipeline@v0.9.1-grok · 5813 in / 1336 out tokens · 27854 ms · 2026-06-26T18:23:03.942158+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 2 linked inside Pith

  1. [1]

    Ticon: A slide-level tile contextualizer for histopathology representation learning, 2025

    Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, and Dimitris Samaras. Ticon: A slide-level tile contextualizer for histopathology representation learning, 2025

  2. [2]

    Is space-time attention all you need for video understanding? InProceedings of the International Conference on Machine Learning (ICML), July 2021

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InProceedings of the International Conference on Machine Learning (ICML), July 2021

  3. [3]

    Cambridge University Press, 2004

    Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, 2004

  4. [4]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019

    Zhe Cao, Gines Hidalgo Martinez, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019

  5. [5]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

  6. [6]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. InConference on Computer Vision and Pattern Recognition, pages 4724–4733. IEEE, 2017

  7. [7]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. InEuropean Conference on Computer Vision (ECCV), 2018

  8. [8]

    Vpn++: Rethinking video-pose embed- dings for understanding activities of daily living.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021

    Srijan Das, Rui Dai, Di Yang, and Francois Bremond. Vpn++: Rethinking video-pose embed- dings for understanding activities of daily living.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021

  9. [9]

    Vpn: Learning video-pose embedding for activities of daily living

    Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, and Monique Thonnat. Vpn: Learning video-pose embedding for activities of daily living. InEuropean Conference on Computer Vision, pages 72–90. Springer, 2020

  10. [10]

    Unlocking exocentric video-language data for egocentric video representation learning, 2024

    Zi-Yi Dou, Xitong Yang, Tushar Nagarajan, Huiyu Wang, Jing Huang, Nanyun Peng, Kris Kitani, and Fu-Jen Chu. Unlocking exocentric video-language data for egocentric video representation learning, 2024

  11. [11]

    Linear mode connectivity and the lottery ticket hypothesis

    Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carlin. Linear mode connectivity and the lottery ticket hypothesis. InInternational Conference on Machine Learning (ICML), 2020

  12. [12]

    Mmg-ego4d: Multi-modal generalization in egocentric action recognition

    Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, and Rakesh Ranjan. Mmg-ego4d: Multi-modal generalization in egocentric action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  13. [13]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Ro- hit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, 10 Vincent ...

  14. [14]

    Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh K

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zachary Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, María Escobar, Cristhian Forigua, Abrham Kahsay Gebreselasie, Sanjay Haresh, Jing Hu...

  15. [15]

    Distilling the knowledge in a neural network, 2015

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015

  16. [16]

    Averaging weights leads to wider optima and better generalization

    Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. InUncertainty in Artificial Intelligence (UAI), 2018

  17. [17]

    Learning clip representations for skeleton-based 3d action recognition.IEEE Transactions on Image Processing, 27(6):2842–2855, June 2018

    Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. Learning clip representations for skeleton-based 3d action recognition.IEEE Transactions on Image Processing, 27(6):2842–2855, June 2018

  18. [18]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInterna- tional Conference on Learning Representations (ICLR), 2015

  19. [19]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InProceedings of the 36th International Conference on Machine Learning (ICML), 2019

  20. [20]

    Uniformer: Unifying convolution and self-attention for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:12581–12600, 2022

    Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Jiao Qiao. Uniformer: Unifying convolution and self-attention for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:12581–12600, 2022

  21. [21]

    Ego-exo: Transferring visual representations from third-person to first-person videos

    Yanghao Li, Tushar Nagarajan, Bo Xiong, and Kristen Grauman. Ego-exo: Transferring visual representations from third-person to first-person videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10995–11005, 2021. 11

  22. [22]

    Egoexo-fitness: towards egocentric and exocentric full-body action understanding

    Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, and Wei-Shi Zheng. Egoexo-fitness: towards egocentric and exocentric full-body action understanding. In European Conference on Computer Vision, pages 363–382. Springer, 2024

  23. [23]

    Tsm: Temporal shift module for efficient video understand- ing

    Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understand- ing. InProceedings of the IEEE International Conference on Computer Vision, 2019

  24. [24]

    Egocentric video- language pretraining.arXiv preprint arXiv:2206.01670, 2022

    Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhong- cong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video- language pretraining.arXiv preprint arXiv:2206.01670, 2022

  25. [25]

    Conflict-averse gradient descent for multi-task learning

    Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 18873–18885. Curran Associates, Inc., 2021

  26. [26]

    Adaptive multi-teacher multi-level knowledge distillation

    Yuang Liu, Wei Zhang, and Jun Wang. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 2020

  27. [27]

    Viewpoint rosetta stone: Unlocking unpaired ego-exo videos for view-invariant representation learning

    Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Viewpoint rosetta stone: Unlocking unpaired ego-exo videos for view-invariant representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  28. [28]

    Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  29. [29]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Hervé Jégou, Julien Mairal, Patrick Laba...

  30. [30]

    3d human pose estimation in video with temporal convolutions and semi-supervised training

    Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. InConference on Computer Vision and Pattern Recognition (CVPR), 2019

  31. [31]

    Egovlpv2: Egocentric video-language pre- training with fusion in the backbone

    Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre- training with fusion in the backbone. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023

  32. [32]

    Synchronization is all you need: Exocentric-to-egocentric transfer for temporal action segmentation with unlabeled synchronized video pairs

    Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro, Mario Valerio Giuffrida, and Giovanni Maria Farinella. Synchronization is all you need: Exocentric-to-egocentric transfer for temporal action segmentation with unlabeled synchronized video pairs. InEuropean Conference on Computer Vision (ECCV), 2024

  33. [33]

    Multimodal distillation for egocentric action recognition

    Gorjan Radevski, Dusan Grujicic, Marie-Francine Moens, Matthew Blaschko, and Tinne Tuytelaars. Multimodal distillation for egocentric action recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  34. [34]

    Am-radio: Agglomera- tive vision foundation model reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomera- tive vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  35. [35]

    Finetuned clip models are efficient video learners

    Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fa- had Shahbaz Khan. Finetuned clip models are efficient video learners. InThe IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  36. [36]

    Just add π! pose induced video transformers for understanding activities of daily living

    Dominick Reilly and Srijan Das. Just add π! pose induced video transformers for understanding activities of daily living. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024. 12

  37. [37]

    From my view to yours: Ego-augmented learning in large vision language models for understanding exocentric daily living activities, 2025

    Dominick Reilly, Manish Kumar Govind, Le Xue, and Srijan Das. From my view to yours: Ego-augmented learning in large vision language models for understanding exocentric daily living activities, 2025

  38. [38]

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21064–21074, 2022

  39. [39]

    Temporal aggregate representations for long-range video understanding

    Fadime Sener, Dipika Singhania, and Angela Yao. Temporal aggregate representations for long-range video understanding. InEuropean conference on computer vision, pages 154–171. Springer, 2020

  40. [40]

    Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

    Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

  41. [41]

    Recon: Reducing conflicting gradients from the root for multi-task learning

    Guangyuan Shi, Qimai Li, Wenlong Zhang, Jiaxin Chen, and Xiao-Ming Wu. Recon: Reducing conflicting gradients from the root for multi-task learning. InInternational Conference on Learning Representations (ICLR), 2023

  42. [42]

    Charades-ego: A large-scale dataset of paired third and first person videos.arXiv preprint arXiv:1804.09626, 2018

    Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large-scale dataset of paired third and first person videos.arXiv preprint arXiv:1804.09626, 2018

  43. [43]

    Ms- temba: Multi-scale temporal mamba for efficient temporal action detection.arXiv preprint arXiv:2501.06138, 2025

    Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, and Srijan Das. Ms- temba: Multi-scale temporal mamba for efficient temporal action detection.arXiv preprint arXiv:2501.06138, 2025

  44. [44]

    Egodistill: Egocentric head motion distillation for efficient video understanding.arXiv preprint arXiv:2301.02217, 2023

    Shuhan Tan, Tushar Nagarajan, and Kristen Grauman. Egodistill: Egocentric head motion distillation for efficient video understanding.arXiv preprint arXiv:2301.02217, 2023

  45. [45]

    Roy-Chowdhury

    Anirudh Thatipelli, Shao-Yuan Lo, and Amit K. Roy-Chowdhury. Egocentric and exocentric methods: A short survey.Computer Vision and Image Understanding, 257:104371, 2025

  46. [46]

    Anyup: Universal feature upsampling

    Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. Anyup: Universal feature upsampling. InProceedings of the International Conference on Learning Representations (ICLR), 2026

  47. [47]

    Egodtm: Towards 3d-aware egocentric video-language pretraining

    Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, and Qin Jin. Egodtm: Towards 3d-aware egocentric video-language pretraining. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  48. [48]

    Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment

    Zihui Xue and Kristen Grauman. Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  49. [49]

    Spatial temporal graph convolutional networks for skeleton-based action recognition

    Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. InThirty-second AAAI conference on artificial intelligence, 2018

  50. [50]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. InAdvances in Neural Information Processing Systems, 2024

  51. [51]

    Gradient surgery for multi-task learning

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 5434–5445. Curran Associates, Inc., 2020

  52. [52]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2023. 13

  53. [53]

    Confidence-aware multi-teacher knowledge distillation

    Hailin Zhang, Defang Chen, and Can Wang. Confidence-aware multi-teacher knowledge distillation. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  54. [54]

    Learning video representations from large language models.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6586–6597, 2022

    Yue Zhao, Ishan Misra, Philipp Krahenbuhl, and Rohit Girdhar. Learning video representations from large language models.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6586–6597, 2022. 14 A Overview The appendix is categorized into the following parts: • Section B: Detailed Data Description • Section C: Proxy Performances...