pith. machine review for the scientific record.

arxiv: 2604.19564 · v2 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

EgoSelf: From Memory to Personalized Egocentric Assistant

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords egocentric vision · personalized assistant · graph-based memory · interaction prediction · user profiling · first-person view · temporal relationships

The pith

EgoSelf builds a graph memory from past egocentric observations to derive user profiles and predict future interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Egocentric assistants must adapt to each person's distinct habits and routines, but long-term data integration has remained difficult. EgoSelf creates a graph-based interaction memory that records temporal and semantic connections among events and entities drawn from observations. User-specific profiles are extracted from this graph, and the system learns to predict possible future interactions based on an individual's recorded history. Experiments show this produces more effective personalized assistance than generic approaches. Readers would value assistants that shift from broad defaults toward routines shaped by personal patterns.

Core claim

EgoSelf consists of a graph-based interaction memory constructed from past observations that captures temporal and semantic relationships among interaction events and entities. User-specific profiles are derived from this memory. The personalized learning task is cast as a prediction problem in which the model forecasts possible future interactions from the user's historical behavior stored in the graph.
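
As a rough illustration of how that prediction task could be posed (this is not the authors' implementation; the class, field, and function names below are hypothetical), a training example might pair a user's historical interaction graph with the next observed interaction:

    from dataclasses import dataclass, field

    @dataclass
    class InteractionEvent:
        # One observed interaction, e.g. action="make_coffee", entities=["mug", "kettle"]
        action: str
        entities: list[str]
        timestamp: float

    @dataclass
    class InteractionGraph:
        # Hypothetical per-user memory: events plus typed edges between event indices
        events: list[InteractionEvent] = field(default_factory=list)
        edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, relation)

    def make_training_example(graph: InteractionGraph, cutoff: float):
        # Split one user's history at `cutoff`: events before it form the input
        # context, the first event after it is the prediction target.
        past = [e for e in graph.events if e.timestamp <= cutoff]
        future = [e for e in graph.events if e.timestamp > cutoff]
        if not past or not future:
            return None  # not enough history on either side of the cutoff
        return past, future[0]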

What carries the argument

Graph-based interaction memory that encodes temporal and semantic relationships among events and entities to derive profiles and enable future-interaction prediction.

Load-bearing premise

A graph constructed from past observations can reliably capture the temporal and semantic relationships needed to derive accurate user-specific profiles and enable meaningful future-interaction prediction.

What would settle it

A controlled test in which EgoSelf predictions from the graph memory show no accuracy gain over non-graph, non-personalized baselines on held-out future interaction data would undermine the central claim.
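
A minimal version of that comparison, sketched under the assumption that both systems emit a single predicted next interaction per held-out example (all names below are assumptions, not the paper's evaluation code):

    def accuracy(predictions, targets):
        # Fraction of held-out future interactions predicted exactly.
        hits = sum(p == t for p, t in zip(predictions, targets))
        return hits / len(targets)

    def compare_on_heldout(graph_model, baseline_model, heldout):
        # heldout: list of (user_history, true_next_interaction) pairs.
        # A null result here -- no gain for the personalized graph model over the
        # generic baseline -- is the outcome that would undermine the central claim.
        targets = [target for _, target in heldout]
        graph_preds = [graph_model.predict(history) for history, _ in heldout]
        base_preds = [baseline_model.predict(history) for history, _ in heldout]
        return accuracy(graph_preds, targets), accuracy(base_preds, targets)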

Figures

Figures reproduced from arXiv: 2604.19564 by Chang Wen Chen, Jie Hong, Wentao Zhu, Xuesong Li, Yanshuo Wang, Yizhou Wang, Yuan Xu.

Figure 1. An example: egocentric personal assistants capture multi-day egocentric activity history to construct structured interaction memory. From this memory, they summarize general events and temporal patterns to build a persistent user model that encodes personal preferences and daily habits. Finally, they generate consistent and personalized responses that are well aligned with the user's inherent behavior pat…

Figure 2. Framework overview of EgoSelf: a personalized egocentric assistant framework. EgoSelf constructs a heterogeneous graph-structured personal interaction memory, where nodes represent historical user interaction events, involved objects, and persons, while edges encode temporal, causal, and semantic relations among them. Based on this structured memory, the system extracts user-specific habit profiles that s…

Figure 3. Illustration of habit learning task generation. We identify a suitable partition point in the interaction graph using a reasoning LLM that leverages event relationships, and then split the video into past and future segments to form training pairs. An additional LLM verifier examines the generated pairs and refines captions to ensure reliable training data.

Figure 4. The retrieval and accuracy performance comparison between EgoSelf and EgoRAG on the 7-day task. Yellow line represents our EgoSelf performance, and green line represents the compared baseline.

Figure 5. Qualitative comparison between EgoSelf and EgoRAG. EgoSelf retrieves structured, relationally connected events that capture consistent behavioral patterns, with the user profile supplementing long-term habit context. In contrast, EgoRAG retrieves isolated events without relational modeling, leading to a fragmented and less reliable reasoning context.
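
The task-generation pipeline in Figure 3 could be approximated roughly as follows; the LLM calls are placeholders and the function names are assumptions about the interface, not the paper's code:

    def generate_habit_pairs(events, propose_split, verify_pair):
        # Sketch of the Figure 3 pipeline: a reasoning step proposes a partition
        # point in the event sequence, the history is split into past and future
        # segments, and a verifier filters unreliable pairs.
        # `propose_split(events) -> int` and `verify_pair(past, future) -> bool`
        # stand in for the reasoning LLM and the LLM verifier from the caption.
        split = propose_split(events)
        past, future = events[:split], events[split:]
        if not past or not future:
            return None
        if not verify_pair(past, future):
            return None
        return {"context": past, "target": future}
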
Original abstract

Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at https://abie-e.github.io/EgoSelf/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces EgoSelf, a system for personalized egocentric assistants. It constructs a graph-based interaction memory from past observations to capture temporal and semantic relationships among events and entities, derives user-specific profiles from the graph, and formulates a dedicated learning task as next-interaction prediction from historical behavior. The authors report quantitative results on real egocentric datasets, including ablations and baseline comparisons, claiming these demonstrate effectiveness; code is released.

Significance. If the reported gains hold, the work offers a concrete mechanism for long-term personalization in egocentric vision by leveraging graph memory and a future-prediction objective. The availability of code supports reproducibility and is a clear strength.

minor comments (2)
  1. Abstract: the claim of 'extensive experiments' would be strengthened by briefly naming the datasets, primary metrics, and key baselines so readers can immediately gauge the scope of the evaluation.
  2. The graph-construction procedure (temporal and semantic edges) is described at a high level; a short pseudocode or explicit edge-type enumeration in the methods section would improve clarity without lengthening the paper (one possible enumeration is sketched below).
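
In the spirit of that request, here is one purely illustrative enumeration of edge types and a construction loop. It is a sketch under assumed interfaces, not the paper's method: the edge labels follow the temporal/causal/semantic relations named in Figure 2, while `are_related`, `similarity`, and the threshold `tau` are hypothetical.

    from enum import Enum

    class EdgeType(Enum):
        TEMPORAL = "temporal"   # event A directly precedes event B
        CAUSAL = "causal"       # event A is judged to enable event B
        SEMANTIC = "semantic"   # events or entities share meaning or objects

    def build_edges(events, are_related, similarity, tau=0.8):
        # Illustrative construction only: consecutive events get TEMPORAL edges,
        # judged dependencies get CAUSAL edges, and embedding similarity above a
        # threshold yields SEMANTIC edges. `are_related` and `similarity` are
        # assumed callables, not components described in the paper.
        edges = []
        for i in range(len(events) - 1):
            edges.append((i, i + 1, EdgeType.TEMPORAL))
        for i in range(len(events)):
            for j in range(i + 1, len(events)):
                if are_related(events[i], events[j]):
                    edges.append((i, j, EdgeType.CAUSAL))
                if similarity(events[i], events[j]) >= tau:
                    edges.append((i, j, EdgeType.SEMANTIC))
        return edges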

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of EgoSelf, particularly the recognition of our graph-based memory construction, user profile derivation, and next-interaction prediction formulation as a concrete mechanism for long-term personalization in egocentric vision. We appreciate the recommendation for minor revision and the note that code release aids reproducibility.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a standard construction: a graph memory is built from historical observations to capture temporal/semantic relations, user profiles are derived from that graph, and a predictor is trained to forecast future interactions from the same historical graph data. This is a conventional next-event prediction setup with no equations or claims that reduce the output to the input by definition, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems invoked. The central claim rests on empirical experiments rather than tautological derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard graph modeling and supervised prediction concepts already present in the broader literature.

pith-pipeline@v0.9.0 · 5451 in / 1080 out tokens · 51437 ms · 2026-05-10T02:20:14.192907+00:00 · methodology

discussion (0)

