pith. machine review for the scientific record.

arxiv: 2604.19564 · v2 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

EgoSelf: From Memory to Personalized Egocentric Assistant

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords egocentric vision · personalized assistant · graph-based memory · interaction prediction · user profiling · first-person view · temporal relationships

The pith

EgoSelf builds a graph memory from past egocentric observations to derive user profiles and predict future interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Egocentric assistants must adapt to each person's distinct habits and routines, but long-term data integration has remained difficult. EgoSelf creates a graph-based interaction memory that records temporal and semantic connections among events and entities drawn from observations. User-specific profiles are extracted from this graph, and the system learns to predict possible future interactions based on an individual's recorded history. Experiments show this produces more effective personalized assistance than generic approaches. Readers would value assistants that shift from broad defaults toward routines shaped by personal patterns.

Core claim

EgoSelf consists of a graph-based interaction memory constructed from past observations that captures temporal and semantic relationships among interaction events and entities. User-specific profiles are derived from this memory. The personalized learning task is cast as a prediction problem in which the model forecasts possible future interactions from the user's historical behavior stored in the graph.
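
As a rough illustration of how that prediction task could be posed (this is not the authors' implementation; the class, field, and function names below are hypothetical), a training example might pair a user's historical interaction graph with the next observed interaction:

    from dataclasses import dataclass, field

    @dataclass
    class InteractionEvent:
        # One observed interaction, e.g. action="make_coffee", entities=["mug", "kettle"]
        action: str
        entities: list[str]
        timestamp: float

    @dataclass
    class InteractionGraph:
        # Hypothetical per-user memory: events plus typed edges between event indices
        events: list[InteractionEvent] = field(default_factory=list)
        edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, relation)

    def make_training_example(graph: InteractionGraph, cutoff: float):
        # Split one user's history at `cutoff`: events before it form the input
        # context, the first event after it is the prediction target.
        past = [e for e in graph.events if e.timestamp <= cutoff]
        future = [e for e in graph.events if e.timestamp > cutoff]
        if not past or not future:
            return None  # not enough history on either side of the cutoff
        return past, future[0]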

What carries the argument

Graph-based interaction memory that encodes temporal and semantic relationships among events and entities to derive profiles and enable future-interaction prediction.

Load-bearing premise

A graph constructed from past observations can reliably capture the temporal and semantic relationships needed to derive accurate user-specific profiles and enable meaningful future-interaction prediction.

What would settle it

A controlled test in which EgoSelf predictions from the graph memory show no accuracy gain over non-graph, non-personalized baselines on held-out future interaction data would undermine the central claim.
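
A minimal version of that comparison, sketched under the assumption that both systems emit a single predicted next interaction per held-out example (all names below are assumptions, not the paper's evaluation code):

    def accuracy(predictions, targets):
        # Fraction of held-out future interactions predicted exactly.
        hits = sum(p == t for p, t in zip(predictions, targets))
        return hits / len(targets)

    def compare_on_heldout(graph_model, baseline_model, heldout):
        # heldout: list of (user_history, true_next_interaction) pairs.
        # A null result here -- no gain for the personalized graph model over the
        # generic baseline -- is the outcome that would undermine the central claim.
        targets = [target for _, target in heldout]
        graph_preds = [graph_model.predict(history) for history, _ in heldout]
        base_preds = [baseline_model.predict(history) for history, _ in heldout]
        return accuracy(graph_preds, targets), accuracy(base_preds, targets)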

Figures

Figures reproduced from arXiv: 2604.19564 by Chang Wen Chen, Jie Hong, Wentao Zhu, Xuesong Li, Yanshuo Wang, Yizhou Wang, Yuan Xu.

Figure 1. An example: egocentric personal assistants capture multi-day egocentric activity history to construct structured interaction memory. From this memory, they summarize general events and temporal patterns to build a persistent user model that encodes personal preferences and daily habits. Finally, they generate consistent and personalized responses that are well aligned with the user's inherent behavior pat…

Figure 2. Framework overview of EgoSelf: a personalized egocentric assistant framework. EgoSelf constructs a heterogeneous graph-structured personal interaction memory, where nodes represent historical user interaction events, involved objects, and persons, while edges encode temporal, causal, and semantic relations among them. Based on this structured memory, the system extracts user-specific habit profiles that s…

Figure 3. Illustration of habit learning task generation. We identify a suitable partition point in the interaction graph using a reasoning LLM that leverages event relationships, and then split the video into past and future segments to form training pairs. An additional LLM verifier examines the generated pairs and refines captions to ensure reliable training data.

Figure 4. The retrieval and accuracy performance comparison between EgoSelf and EgoRAG on the 7-day task. Yellow line represents our EgoSelf performance, and green line represents the compared baseline.

Figure 5. Qualitative comparison between EgoSelf and EgoRAG. EgoSelf retrieves structured, relationally connected events that capture consistent behavioral patterns, with the user profile supplementing long-term habit context. In contrast, EgoRAG retrieves isolated events without relational modeling, leading to a fragmented and less reliable reasoning context.
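
The task-generation pipeline in Figure 3 could be approximated roughly as follows; the LLM calls are placeholders and the function names are assumptions about the interface, not the paper's code:

    def generate_habit_pairs(events, propose_split, verify_pair):
        # Sketch of the Figure 3 pipeline: a reasoning step proposes a partition
        # point in the event sequence, the history is split into past and future
        # segments, and a verifier filters unreliable pairs.
        # `propose_split(events) -> int` and `verify_pair(past, future) -> bool`
        # stand in for the reasoning LLM and the LLM verifier from the caption.
        split = propose_split(events)
        past, future = events[:split], events[split:]
        if not past or not future:
            return None
        if not verify_pair(past, future):
            return None
        return {"context": past, "target": future}
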
Original abstract

Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at https://abie-e.github.io/EgoSelf/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces EgoSelf, a system for personalized egocentric assistants. It constructs a graph-based interaction memory from past observations to capture temporal and semantic relationships among events and entities, derives user-specific profiles from the graph, and formulates a dedicated learning task as next-interaction prediction from historical behavior. The authors report quantitative results on real egocentric datasets, including ablations and baseline comparisons, claiming these demonstrate effectiveness; code is released.

Significance. If the reported gains hold, the work offers a concrete mechanism for long-term personalization in egocentric vision by leveraging graph memory and a future-prediction objective. The availability of code supports reproducibility and is a clear strength.

minor comments (2)
  1. Abstract: the claim of 'extensive experiments' would be strengthened by briefly naming the datasets, primary metrics, and key baselines so readers can immediately gauge the scope of the evaluation.
  2. The graph-construction procedure (temporal and semantic edges) is described at a high level; a short pseudocode or explicit edge-type enumeration in the methods section would improve clarity without lengthening the paper (one possible enumeration is sketched below).
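
In the spirit of that request, here is one purely illustrative enumeration of edge types and a construction loop. It is a sketch under assumed interfaces, not the paper's method: the edge labels follow the temporal/causal/semantic relations named in Figure 2, while `are_related`, `similarity`, and the threshold `tau` are hypothetical.

    from enum import Enum

    class EdgeType(Enum):
        TEMPORAL = "temporal"   # event A directly precedes event B
        CAUSAL = "causal"       # event A is judged to enable event B
        SEMANTIC = "semantic"   # events or entities share meaning or objects

    def build_edges(events, are_related, similarity, tau=0.8):
        # Illustrative construction only: consecutive events get TEMPORAL edges,
        # judged dependencies get CAUSAL edges, and embedding similarity above a
        # threshold yields SEMANTIC edges. `are_related` and `similarity` are
        # assumed callables, not components described in the paper.
        edges = []
        for i in range(len(events) - 1):
            edges.append((i, i + 1, EdgeType.TEMPORAL))
        for i in range(len(events)):
            for j in range(i + 1, len(events)):
                if are_related(events[i], events[j]):
                    edges.append((i, j, EdgeType.CAUSAL))
                if similarity(events[i], events[j]) >= tau:
                    edges.append((i, j, EdgeType.SEMANTIC))
        return edges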

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of EgoSelf, particularly the recognition of our graph-based memory construction, user profile derivation, and next-interaction prediction formulation as a concrete mechanism for long-term personalization in egocentric vision. We appreciate the recommendation for minor revision and the note that code release aids reproducibility.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a standard construction: a graph memory is built from historical observations to capture temporal/semantic relations, user profiles are derived from that graph, and a predictor is trained to forecast future interactions from the same historical graph data. This is a conventional next-event prediction setup with no equations or claims that reduce the output to the input by definition, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems invoked. The central claim rests on empirical experiments rather than tautological derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard graph modeling and supervised prediction concepts already present in the broader literature.

pith-pipeline@v0.9.0 · 5451 in / 1080 out tokens · 51437 ms · 2026-05-10T02:20:14.192907+00:00 · methodology

discussion (0)

