Personal Visual Context Learning in Large Multimodal Models
Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3
The pith
Large multimodal models answer personalized visual queries better when user context is organized in a self-refining memory bank rather than supplied as raw prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize Personal Visual Context Learning as the capability of large multimodal models to resolve personalized queries using user-specific visual context at prompt time. Analysis of frontier LMMs on the new Personal-VCL-Bench reveals a profound gap in context utilization, spanning both the leveraging of visual evidence and the aggregation of multiple observations. The Agentic Context Bank addresses this by structuring visual context into a self-refining memory bank and employing query-adaptive evidence selection, consistently outperforming standard context-prompting regimes.
What carries the argument
The Agentic Context Bank, a memory structure that self-refines and selects query-adaptive visual evidence from personal context to improve inference in large multimodal models.
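The paper does not spell out the bank's mechanics in the text above. As a rough, hedged illustration of the two ideas it names (self-refinement of stored context and query-adaptive evidence selection), a minimal memory bank might look like the following; all names, the merge-by-similarity rule, and the embedding representation are hypothetical, not the authors' method:

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u)) or 1.0
    nv = sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)


@dataclass
class ContextBank:
    """Toy self-refining memory bank: near-duplicate observations are
    merged on insert, and retrieval returns the top-k entries most
    similar to the query (query-adaptive evidence selection)."""
    merge_threshold: float = 0.95
    entries: list = field(default_factory=list)  # (embedding, caption, count)

    def add(self, embedding, caption):
        for i, (emb, cap, count) in enumerate(self.entries):
            if cosine(emb, embedding) >= self.merge_threshold:
                # Self-refinement: fold the new observation into the
                # existing entry via a running average of embeddings.
                merged = [(a * count + b) / (count + 1)
                          for a, b in zip(emb, embedding)]
                self.entries[i] = (merged, cap, count + 1)
                return
        self.entries.append((embedding, caption, 1))

    def select(self, query_embedding, k=2):
        """Rank stored entries by similarity to the query embedding and
        return the captions of the top k as candidate evidence."""
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[0], query_embedding),
                        reverse=True)
        return [cap for _, cap, _ in ranked[:k]]
```

In this sketch, repeated sightings of the same object collapse into one refined entry rather than piling up as raw prompt context, which is the contrast the pith draws with standard context prompting.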
Load-bearing premise
The Personal-VCL-Bench dataset and its evaluation protocol represent real-world personal visual contexts without significant collection biases or incomplete coverage of user behaviors.
What would settle it
Observing no improvement or a decrease in performance when applying the Agentic Context Bank to queries from actual users wearing smart glasses over extended periods.
Original abstract
As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes Personal Visual Context Learning (Personal VCL) as the prompt-time ability of large multimodal models to leverage user-specific visual context from first-person streams to answer personalized queries. It introduces Personal-VCL-Bench to evaluate this across persons, objects, and behaviors, documents a context utilization gap in frontier LMMs, and proposes the Agentic Context Bank—an inference-time baseline that organizes visual context into a self-refining memory bank and performs query-adaptive evidence selection—claiming consistent gains over standard prompting regimes.
Significance. If the reported gains prove robust and the benchmark is shown to be free of collection artifacts, the work would be significant for wearable LMM applications: it supplies the first dedicated benchmark for personal visual context and a practical, training-free baseline that structures memory and retrieval, directly addressing the identified gap in evidence aggregation.
major comments (2)
- [§4] §4 (Personal-VCL-Bench construction): no quantitative validation is reported for the benchmark (inter-annotator agreement on personal-relevance labels, coverage statistics across user demographics, or comparison to held-out real wearable logs). This is load-bearing for the central claim, because any over-representation of short, high-salience events or query distributions that favor explicit memory-bank retrieval would inflate the measured delta between Agentic Context Bank and naive prompting.
- [§5] §5 (Experiments): the abstract states that the baseline “consistently improves” across tasks and backbones, yet no quantitative tables, error bars, ablation results on the self-refining mechanism or query-adaptive selection, or statistical significance tests are referenced. Without these details the empirical support for the practical-path claim cannot be assessed.
minor comments (1)
- [Abstract] The abstract introduces the term “Agentic Context Bank” without a concise one-sentence definition; adding this in the abstract or §2 would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work formalizing Personal Visual Context Learning and introducing the Agentic Context Bank baseline. We address the major comments point-by-point below.
Point-by-point responses
- Referee: [§4] §4 (Personal-VCL-Bench construction): no quantitative validation is reported for the benchmark (inter-annotator agreement on personal-relevance labels, coverage statistics across user demographics, or comparison to held-out real wearable logs). This is load-bearing for the central claim, because any over-representation of short, high-salience events or query distributions that favor explicit memory-bank retrieval would inflate the measured delta between Agentic Context Bank and naive prompting.
Authors: We acknowledge the importance of quantitative validation for the benchmark. While the construction process involved multiple annotators and careful selection to cover diverse persons, objects, and behaviors, we did not report inter-annotator agreement or demographic coverage in the initial submission. In the revised version, we will add these statistics, including inter-annotator agreement scores and coverage across demographics. Regarding comparison to held-out real wearable logs, we will include an analysis comparing the benchmark's event distributions to a small set of anonymized real-world logs to address potential artifacts. revision: yes
- Referee: [§5] §5 (Experiments): the abstract states that the baseline “consistently improves” across tasks and backbones, yet no quantitative tables, error bars, ablation results on the self-refining mechanism or query-adaptive selection, or statistical significance tests are referenced. Without these details the empirical support for the practical-path claim cannot be assessed.
Authors: We apologize if the experimental details were not sufficiently highlighted. The manuscript in §5 includes quantitative tables showing performance improvements across multiple tasks and LMM backbones, with results averaged over multiple queries. We will add error bars, detailed ablations on the self-refining memory bank and query-adaptive selection components, and statistical significance tests (such as p-values from t-tests) in the revised manuscript to better support the claims. These elements were partially present but will be expanded for clarity. revision: partial
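The significance tests the authors promise are not specified; one self-contained, distribution-free alternative to the mentioned t-tests is a paired permutation test on per-query score differences. This is a generic sketch, not the paper's protocol, and the function name and defaults are hypothetical:

```python
import random


def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test: under the null hypothesis,
    each per-query score difference is equally likely to carry either
    sign, so we compare the observed mean difference against means
    computed from random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each paired difference.
        perm_mean = sum(d if rng.random() < 0.5 else -d
                        for d in diffs) / len(diffs)
        if abs(perm_mean) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an exact zero p-value.
    return (hits + 1) / (n_perm + 1)
```

Applied per task and backbone to paired per-query scores (method vs. naive prompting), this yields the kind of p-values the rebuttal commits to reporting.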
Circularity Check
No circularity: purely empirical benchmark and baseline evaluation
full rationale
The paper introduces Personal-VCL-Bench and evaluates the Agentic Context Bank baseline through direct experiments on frontier LMMs. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on measured performance deltas rather than any reduction to self-defined quantities or self-citation chains. The evaluation protocol is presented as new data collection, not as a tautological fit.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: frontier LMMs possess basic mechanisms for visual-evidence leveraging and multi-observation aggregation that can be improved via structured prompting.
invented entities (1)
- Agentic Context Bank: no independent evidence