pith. machine review for the scientific record.

arxiv: 2605.10936 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Personal Visual Context Learning in Large Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: Personal Visual Context Learning · Large Multimodal Models · Personalization · Visual Context · Memory Bank · Wearable Devices · Inference-time Adaptation · Context Utilization

The pith

Large multimodal models improve on personalized visual queries when user context is stored in a self-refining memory bank instead of raw prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Wearable devices will soon stream continuous first-person video to large multimodal models, requiring them to answer questions based on a specific user's visual experiences. The paper defines Personal Visual Context Learning as the prompt-time ability to draw on that unique visual history. Current models fall short in using this information effectively, especially when multiple observations must be combined. To close the gap, the authors introduce a benchmark spanning persons, objects, and behaviors, and test a baseline method called the Agentic Context Bank. This method builds a memory structure from the visual context and picks relevant pieces for each query, yielding better results than standard prompting across several models and tasks.

Core claim

We formalize Personal Visual Context Learning as the capability of large multimodal models to resolve personalized queries using user-specific visual context at prompt time. Analysis of frontier LMMs on the new Personal-VCL-Bench reveals a profound context utilization gap: both the mechanisms for leveraging visual evidence and the aggregation of multiple observations remain critically understudied. The Agentic Context Bank addresses this by structuring visual context into a self-refining memory bank and employing query-adaptive evidence selection, consistently outperforming standard context prompting regimes.

What carries the argument

The Agentic Context Bank, a memory structure that self-refines and selects query-adaptive visual evidence from personal context to improve inference in large multimodal models.
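
The excerpt does not include the authors' implementation, so the following is only a minimal sketch of the two mechanisms named here: a memory bank that refines itself as observations arrive, and query-adaptive selection of evidence at answer time. Every name in it (ContextBank, MemoryEntry, embed, the 0.9 duplicate threshold) is a hypothetical placeholder, not the paper's API.

```python
# Hypothetical sketch of a self-refining memory bank with query-adaptive
# evidence selection. Names, thresholds, and structure are illustrative
# assumptions, not the paper's implementation.
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0


@dataclass
class MemoryEntry:
    summary: str          # text description of an observed event
    frame_ids: list       # pointers back to the raw egocentric frames
    uses: int = 0         # how often this entry was selected for a query


class ContextBank:
    def __init__(self, embed, max_entries=512):
        self.embed = embed            # caller-supplied text embedding function
        self.entries = []
        self.max_entries = max_entries

    def ingest(self, summary, frame_ids):
        """Add a new observation; merge it into a near-duplicate entry if one exists."""
        vec = self.embed(summary)
        for entry in self.entries:
            if cosine(vec, self.embed(entry.summary)) > 0.9:   # crude duplicate check
                entry.frame_ids = sorted(set(entry.frame_ids) | set(frame_ids))
                return
        self.entries.append(MemoryEntry(summary, list(frame_ids)))
        self._refine()

    def _refine(self):
        """Self-refinement placeholder: keep the most-used entries under a budget."""
        if len(self.entries) > self.max_entries:
            self.entries.sort(key=lambda e: e.uses, reverse=True)
            self.entries = self.entries[: self.max_entries]

    def select(self, query, k=5):
        """Query-adaptive evidence selection: return the top-k entries by similarity."""
        qv = self.embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: cosine(qv, self.embed(e.summary)),
                        reverse=True)[:k]
        for entry in ranked:
            entry.uses += 1
        return ranked
```

At inference, the selected summaries (and optionally their source frames) would be placed in the LMM prompt in place of the full raw context; the actual Agentic Context Bank presumably uses richer, model-driven refinement and selection than these heuristics.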

Load-bearing premise

The Personal-VCL-Bench dataset and its evaluation protocol represent real-world personal visual contexts without significant collection biases or incomplete coverage of user behaviors.
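
One hedged way to probe this premise, assuming access to event-category counts from the benchmark and from a reference sample of real wearable logs, is a simple distributional comparison. The counts and category names below are illustrative placeholders, not data from the paper.

```python
# Illustrative check of benchmark representativeness: compare event-category
# frequencies in the benchmark against a reference sample of real wearable logs.
# Counts are placeholders, not numbers from the paper.
from scipy.stats import chi2_contingency

categories = ["persons", "objects", "behaviors"]
benchmark_counts = [420, 530, 310]   # hypothetical benchmark event counts
reference_counts = [390, 560, 340]   # hypothetical counts from held-out real logs

chi2, p_value, dof, _ = chi2_contingency([benchmark_counts, reference_counts])
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3f}")
# A small p-value would flag a category-level mismatch between the benchmark and
# real usage; a large one is consistent with, but does not prove, representativeness.
```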

What would settle it

Observing no improvement or a decrease in performance when applying the Agentic Context Bank to queries from actual users wearing smart glasses over extended periods.

Figures

Figures reproduced from arXiv: 2605.10936 by Ami Baid, Kristen Grauman, Mi Luo, Sangho Kim, Zihui Xue.

Figure 1. Personal Visual Context Learning: continuous egocentric capture from wearable devices.
Figure 2. We propose Personal-VCL-Bench to evaluate how LMMs use personal visual context.
Figure 3. Qualitative example of the full Agentic Context Bank pipeline.
Figure 4. Examples from the Persons, Objects, and EgoWearer identification tasks in Personal-VCL-Bench.
Figure 5. Examples from the Behavior tasks in Personal-VCL-Bench, built from CaptainCook4D.
Figure 6. Failure case of our Agentic Context Bank.
Original abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formalizes Personal Visual Context Learning (Personal VCL) as the prompt-time ability of large multimodal models to leverage user-specific visual context from first-person streams to answer personalized queries. It introduces Personal-VCL-Bench to evaluate this across persons, objects, and behaviors, documents a context utilization gap in frontier LMMs, and proposes the Agentic Context Bank—an inference-time baseline that organizes visual context into a self-refining memory bank and performs query-adaptive evidence selection—claiming consistent gains over standard prompting regimes.

Significance. If the reported gains prove robust and the benchmark is shown to be free of collection artifacts, the work would be significant for wearable LMM applications: it supplies the first dedicated benchmark for personal visual context and a practical, training-free baseline that structures memory and retrieval, directly addressing the identified gap in evidence aggregation.

major comments (2)
  1. [§4] §4 (Personal-VCL-Bench construction): no quantitative validation is reported for the benchmark (inter-annotator agreement on personal-relevance labels, coverage statistics across user demographics, or comparison to held-out real wearable logs). This is load-bearing for the central claim, because any over-representation of short, high-salience events or query distributions that favor explicit memory-bank retrieval would inflate the measured delta between Agentic Context Bank and naive prompting.
  2. [§5] §5 (Experiments): the abstract states that the baseline “consistently improves” across tasks and backbones, yet no quantitative tables, error bars, ablation results on the self-refining mechanism or query-adaptive selection, or statistical significance tests are referenced. Without these details the empirical support for the practical-path claim cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract introduces the term “Agentic Context Bank” without a concise one-sentence definition; adding this in the abstract or §2 would improve immediate readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work formalizing Personal Visual Context Learning and introducing the Agentic Context Bank baseline. We address the major comments point-by-point below.

point-by-point responses
  1. Referee: [§4] §4 (Personal-VCL-Bench construction): no quantitative validation is reported for the benchmark (inter-annotator agreement on personal-relevance labels, coverage statistics across user demographics, or comparison to held-out real wearable logs). This is load-bearing for the central claim, because any over-representation of short, high-salience events or query distributions that favor explicit memory-bank retrieval would inflate the measured delta between Agentic Context Bank and naive prompting.

    Authors: We acknowledge the importance of quantitative validation for the benchmark. While the construction process involved multiple annotators and careful selection to cover diverse persons, objects, and behaviors, we did not report inter-annotator agreement or demographic coverage in the initial submission. In the revised version, we will add these statistics, including inter-annotator agreement scores and coverage across demographics. Regarding comparison to held-out real wearable logs, we will include an analysis comparing the benchmark's event distributions to a small set of anonymized real-world logs to address potential artifacts. revision: yes

  2. Referee: [§5] §5 (Experiments): the abstract states that the baseline “consistently improves” across tasks and backbones, yet no quantitative tables, error bars, ablation results on the self-refining mechanism or query-adaptive selection, or statistical significance tests are referenced. Without these details the empirical support for the practical-path claim cannot be assessed.

    Authors: We apologize if the experimental details were not sufficiently highlighted. The manuscript in §5 includes quantitative tables showing performance improvements across multiple tasks and LMM backbones, with results averaged over multiple queries. We will add error bars, detailed ablations on the self-refining memory bank and query-adaptive selection components, and statistical significance tests (such as p-values from t-tests) in the revised manuscript to better support the claims. These elements were partially present but will be expanded for clarity. revision: partial
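
For the promised significance tests, one conventional choice on paired per-query outcomes is a paired t-test plus a bootstrap confidence interval on the accuracy delta. The sketch below uses fabricated scores, not results from the paper.

```python
# Illustrative paired comparison of per-query correctness for the Agentic Context
# Bank (ACB) versus naive context prompting. Scores are fabricated placeholders.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
naive = rng.binomial(1, 0.55, size=200).astype(float)          # per-query correctness, naive prompting
acb = np.clip(naive + rng.binomial(1, 0.15, size=200), 0, 1)   # hypothetical improved correctness

t_stat, p_value = ttest_rel(acb, naive)                         # paired t-test on the same queries
print(f"mean delta = {(acb - naive).mean():+.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")

# Paired bootstrap for a confidence interval on the accuracy delta.
deltas = acb - naive
boot = [rng.choice(deltas, size=len(deltas), replace=True).mean() for _ in range(2000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for delta: [{low:+.3f}, {high:+.3f}]")
```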

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and baseline evaluation

full rationale

The paper introduces Personal-VCL-Bench and evaluates the Agentic Context Bank baseline through direct experiments on frontier LMMs. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on measured performance deltas rather than any reduction to self-defined quantities or self-citation chains. The evaluation protocol is presented as new data collection, not as a tautological fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Work rests on standard assumptions about LMM prompt-based visual reasoning and the representativeness of the new benchmark; no free parameters are introduced, and the single invented entity is a methodological construct rather than a physical one.

axioms (1)
  • domain assumption: Frontier LMMs possess basic mechanisms for visual evidence leveraging and multi-observation aggregation that can be improved via structured prompting.
    Invoked when identifying the context utilization gap and motivating the baseline.
invented entities (1)
  • Agentic Context Bank (no independent evidence)
    purpose: Structures user visual context into a self-refining memory bank with query-adaptive evidence selection.
    New inference-time construct proposed to address identified gaps.

pith-pipeline@v0.9.0 · 5494 in / 1180 out tokens · 35050 ms · 2026-05-12T03:37:56.001425+00:00 · methodology

