arxiv: 2605.10936 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Personal Visual Context Learning in Large Multimodal Models

Zihui Xue , Ami Baid , Sangho Kim , Mi Luo , Kristen Grauman

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords Personal Visual Context LearningLarge Multimodal ModelsPersonalizationVisual ContextMemory BankWearable DevicesInference-time AdaptationContext Utilization

0 comments

The pith

Large multimodal models improve on personalized visual queries when user context is stored in a self-refining memory bank instead of raw prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Wearable devices will soon stream continuous first-person video to large multimodal models, requiring them to answer questions based on a specific user's visual experiences. The paper defines Personal Visual Context Learning as the prompt-time ability to draw on that unique visual history. Current models fall short in using this information effectively, especially when multiple observations must be combined. To close the gap, the authors introduce a benchmark spanning persons, objects, and behaviors, and test a baseline method called the Agentic Context Bank. This method builds a memory structure from the visual context and picks relevant pieces for each query, yielding better results than standard prompting across several models and tasks.

Core claim

We formalize Personal Visual Context Learning as the capability of large multimodal models to resolve personalized queries using user-specific visual context at prompt time. Analysis of frontier LMMs on the new Personal-VCL-Bench reveals a profound gap in context utilization, including mechanisms for leveraging visual evidence and aggregating multiple observations. The Agentic Context Bank addresses this by structuring visual context into a self-refining memory bank and employing query-adaptive evidence selection, consistently outperforming standard context prompting regimes.

What carries the argument

The Agentic Context Bank, a memory structure that self-refines and selects query-adaptive visual evidence from personal context to improve inference in large multimodal models.

Load-bearing premise

The Personal-VCL-Bench dataset and its evaluation protocol represent real-world personal visual contexts without significant collection biases or incomplete coverage of user behaviors.

What would settle it

Observing no improvement or a decrease in performance when applying the Agentic Context Bank to queries from actual users wearing smart glasses over extended periods.

Figures

Figures reproduced from arXiv: 2605.10936 by Ami Baid, Kristen Grauman, Mi Luo, Sangho Kim, Zihui Xue.

**Figure 2.** Figure 2: We propose Personal-VCL-Bench to evaluate how LMMs use personal visual context [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative example of the full Agentic Context Bank pipeline. The left side shows [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Examples from the Persons, Objects, and EgoWearer identification tasks in Personal-VCL [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Examples from the Behavior tasks in Personal-VCL-Bench, built from CaptainCook4D [ [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Failure case of our Agentic Context Bank. The model requests visual evidence for entry [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗

read the original abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper names Personal VCL as a distinct prompt-time capability, releases a benchmark, and shows a memory-bank baseline beats plain prompting, but the numbers and benchmark checks are still thin.

read the letter

This paper formalizes Personal Visual Context Learning as the prompt-time capability of using user-specific visual context to resolve personalized queries. It introduces Personal-VCL-Bench to test this across persons, objects, and behaviors, and proposes the Agentic Context Bank, which turns the visual stream into a self-refining memory structure and picks evidence adaptively per query. The analysis of frontier LMMs shows they struggle to aggregate multiple personal observations, which is a useful observation. Their baseline delivers consistent gains over standard context prompting across tasks and backbones, giving a practical inference-time starting point that does not require retraining. That is the main contribution worth noting. The work is clear on the problem setup and separates this from generic prompting literature. On the soft spots, the abstract supplies no numbers, ablations, or error breakdowns, so the size of the improvement and the conditions where it holds are hard to judge from the summary alone. The stress-test point about possible benchmark artifacts lands: first-person streams can easily over-represent salient short events or queries that reward explicit retrieval, and without inter-annotator agreement, demographic coverage stats, or comparison to held-out real logs, it is difficult to know how diagnostic the deltas are. If the full paper has those checks, they would address the main concern. This is for researchers working on egocentric vision and practical multimodal personalization for wearables. A reader who wants a benchmark and a reproducible baseline to build on will get value from it. The thinking is straightforward and the contributions are distinct enough that it deserves a serious referee to pressure-test the evaluation protocol and see the full results.

Referee Report

2 major / 1 minor

Summary. The paper formalizes Personal Visual Context Learning (Personal VCL) as the prompt-time ability of large multimodal models to leverage user-specific visual context from first-person streams to answer personalized queries. It introduces Personal-VCL-Bench to evaluate this across persons, objects, and behaviors, documents a context utilization gap in frontier LMMs, and proposes the Agentic Context Bank—an inference-time baseline that organizes visual context into a self-refining memory bank and performs query-adaptive evidence selection—claiming consistent gains over standard prompting regimes.

Significance. If the reported gains prove robust and the benchmark is shown to be free of collection artifacts, the work would be significant for wearable LMM applications: it supplies the first dedicated benchmark for personal visual context and a practical, training-free baseline that structures memory and retrieval, directly addressing the identified gap in evidence aggregation.

major comments (2)

[§4] §4 (Personal-VCL-Bench construction): no quantitative validation is reported for the benchmark (inter-annotator agreement on personal-relevance labels, coverage statistics across user demographics, or comparison to held-out real wearable logs). This is load-bearing for the central claim, because any over-representation of short, high-salience events or query distributions that favor explicit memory-bank retrieval would inflate the measured delta between Agentic Context Bank and naive prompting.
[§5] §5 (Experiments): the abstract states that the baseline “consistently improves” across tasks and backbones, yet no quantitative tables, error bars, ablation results on the self-refining mechanism or query-adaptive selection, or statistical significance tests are referenced. Without these details the empirical support for the practical-path claim cannot be assessed.

minor comments (1)

[Abstract] The abstract introduces the term “Agentic Context Bank” without a concise one-sentence definition; adding this in the abstract or §2 would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work formalizing Personal Visual Context Learning and introducing the Agentic Context Bank baseline. We address the major comments point-by-point below.

read point-by-point responses

Referee: [§4] §4 (Personal-VCL-Bench construction): no quantitative validation is reported for the benchmark (inter-annotator agreement on personal-relevance labels, coverage statistics across user demographics, or comparison to held-out real wearable logs). This is load-bearing for the central claim, because any over-representation of short, high-salience events or query distributions that favor explicit memory-bank retrieval would inflate the measured delta between Agentic Context Bank and naive prompting.

Authors: We acknowledge the importance of quantitative validation for the benchmark. While the construction process involved multiple annotators and careful selection to cover diverse persons, objects, and behaviors, we did not report inter-annotator agreement or demographic coverage in the initial submission. In the revised version, we will add these statistics, including inter-annotator agreement scores and coverage across demographics. Regarding comparison to held-out real wearable logs, we will include an analysis comparing the benchmark's event distributions to a small set of anonymized real-world logs to address potential artifacts. revision: yes
Referee: [§5] §5 (Experiments): the abstract states that the baseline “consistently improves” across tasks and backbones, yet no quantitative tables, error bars, ablation results on the self-refining mechanism or query-adaptive selection, or statistical significance tests are referenced. Without these details the empirical support for the practical-path claim cannot be assessed.

Authors: We apologize if the experimental details were not sufficiently highlighted. The manuscript in §5 includes quantitative tables showing performance improvements across multiple tasks and LMM backbones, with results averaged over multiple queries. We will add error bars, detailed ablations on the self-refining memory bank and query-adaptive selection components, and statistical significance tests (such as p-values from t-tests) in the revised manuscript to better support the claims. These elements were partially present but will be expanded for clarity. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and baseline evaluation

full rationale

The paper introduces Personal-VCL-Bench and evaluates the Agentic Context Bank baseline through direct experiments on frontier LMMs. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on measured performance deltas rather than any reduction to self-defined quantities or self-citation chains. The evaluation protocol is presented as new data collection, not as a tautological fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Work rests on standard assumptions about LMM prompt-based visual reasoning and the representativeness of the new benchmark; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption Frontier LMMs possess basic mechanisms for visual evidence leveraging and multi-observation aggregation that can be improved via structured prompting.
Invoked when identifying the context utilization gap and motivating the baseline.

invented entities (1)

Agentic Context Bank no independent evidence
purpose: Structures user visual context into a self-refining memory bank with query-adaptive evidence selection.
New inference-time construct proposed to address identified gaps.

pith-pipeline@v0.9.0 · 5494 in / 1180 out tokens · 35050 ms · 2026-05-12T03:37:56.001425+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 10 internal anchors

[1]

MyVLM: Personalizing VLMs for user-specific queries.arXiv preprint arXiv:2403.14599, 2024

Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. MyVLM: Personalizing VLMs for user-specific queries.arXiv preprint arXiv:2403.14599, 2024

work page arXiv 2024
[2]

MC- LLaV A: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024

Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, and Jiwen Cao. MC- LLaV A: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024

work page arXiv 2024
[3]

UniCTokens: Boosting personalized understanding and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, and Gaole Dai. UniCTokens: Boosting personalized understanding and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

work page arXiv 2025
[4]

Concept-as-tree: A controllable synthetic data framework makes stronger personalized VLMs.arXiv preprint arXiv:2503.12999, 2025

Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Hao Liang, and Wentao Zhang. Concept-as-tree: A controllable synthetic data framework makes stronger personalized VLMs.arXiv preprint arXiv:2503.12999, 2025

work page arXiv 2025
[5]

Temporal chain of thought: Long-video understanding by thinking in frames, 2025

Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Tem- poral chain of thought: Long-video understanding by thinking in frames.arXiv preprint arXiv:2507.02001, 2025

work page arXiv 2025
[6]

Online-PVLM: Advancing personalized VLMs with online concept learning.arXiv preprint arXiv:2511.20056, 2025

Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang, Haoyu Chen, Xiaoyong Zhu, Bo Zheng, and Xuejiao Zhao. Online-PVLM: Advancing personalized VLMs with online concept learning.arXiv preprint arXiv:2511.20056, 2025

work page arXiv 2025
[7]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

EAGLE: Egocentric AGgregated language-video engine.arXiv preprint arXiv:2409.17523, 2024

Jing Bi, Yunlong Huang, Lianggong Wang, and Jiebo Luo. EAGLE: Egocentric AGgregated language-video engine.arXiv preprint arXiv:2409.17523, 2024

work page arXiv 2024
[9]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[10]

Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. HourVideo: 1-hour video- language understanding. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[11]

Memory-efficient streaming VideoLLMs for real-time procedural video understanding

Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, and Fadime Sener. Memory-efficient streaming VideoLLMs for real-time procedural video understanding. arXiv preprint arXiv:2504.13915, 2025

work page arXiv 2025
[12]

this is my unicorn, Fluffy

Niv Cohen, Rinon Gal, Eli A. Meirom, Gal Chechik, and Yuval Atzmon. “this is my unicorn, Fluffy”: Personalizing frozen vision-language representations. InEuropean Conference on Computer Vision (ECCV), 2022

work page 2022
[13]

Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision (IJCV), 2022

work page 2022
[14]

Grounded question-answering in long egocentric videos

Shangzhe Di and Weidi Xie. Grounded question-answering in long egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[15]

ReWind: Understanding long videos with instructed learnable memory

Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, and Ioannis Patras. ReWind: Understanding long videos with instructed learnable memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[16]

A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2023. 10

work page internal anchor Pith review arXiv 2023
[17]

Demo-ICL: In-context learning for procedural video knowledge acquisition.arXiv preprint arXiv:2602.08439, 2026

Yuhao Dong, Shulin Tian, Shuai Liu, Shuangrui Ding, Yuhang Zang, and Xiao Dong. Demo-ICL: In-context learning for procedural video knowledge acquisition.arXiv preprint arXiv:2602.08439, 2026

work page arXiv 2026
[18]

Cl-bench: A benchmark for context learning

Shihan Dou, Ming Zhang, Zhangyue Yin, et al. CL-bench: A benchmark for context learning. arXiv preprint arXiv:2602.03587, 2026

work page arXiv 2026
[19]

Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky

Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky. Towards multimodal in-context learning for vision & language models. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[20]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Andreas Yuan, Bilal Souti, Brighid Meredith, et al. Project Aria: A new tool for egocentric multi-modal AI research.arXiv preprint arXiv:2308.13561, 2023

work page internal anchor Pith review arXiv 2023
[21]

EgoVQA: An egocentric video question answering benchmark dataset

Chenyou Fan. EgoVQA: An egocentric video question answering benchmark dataset. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019

work page 2019
[22]

Gemma 4: Byte for byte, the most capable open models

Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open models. Technical report, Google, April 2026

work page 2026
[23]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv prepri...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

VIOLA: Towards video in-context learning with minimal annotations.arXiv preprint arXiv:2601.15549, 2026

Ryo Fujii, Hideo Saito, and Ryo Hachiuma. VIOLA: Towards video in-context learning with minimal annotations.arXiv preprint arXiv:2601.15549, 2026

work page arXiv 2026
[25]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Google. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Gemini 3 Flash model card

Gemini Team, Google DeepMind. Gemini 3 Flash model card. Technical report, Google DeepMind, December 2025

work page 2025
[28]

Gemini 3.1 Pro model card

Gemini Team, Google DeepMind. Gemini 3.1 Pro model card. Technical report, Google DeepMind, February 2026

work page 2026
[29]

Ego4D: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[30]

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[31]

RAP: Retrieval- augmented personalization for multimodal large language models

Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, and Xiangyu Yue. RAP: Retrieval- augmented personalization for multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[32]

TAMEing long contexts in personalization: Towards training-free and state-aware MLLM personalized assistant

Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, and Fan Zhou. TAMEing long contexts in personalization: Towards training-free and state-aware MLLM personalized assistant. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2026. arXiv:2512.21616

work page arXiv 2026
[33]

An egocentric look at video photographer identity

Yedid Hoshen and Shmuel Peleg. An egocentric look at video photographer identity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4284–4292, 2016. 11

work page 2016
[34]

EgoExoLearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world

Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Dong Lu, Yali Wang, Limin Wang, and Yu Qiao. EgoExoLearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[35]

An egocentric vision-language model based portable real-time smart assistant.arXiv preprint arXiv:2503.04250, 2025

Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, and Limin Wang. An egocentric vision-language model based portable real-time smart assistant.arXiv preprint arXiv:2503.04250, 2025

work page arXiv 2025
[36]

BIMBA: Selective-scan compression for long-range video question answering

Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, and Lorenzo Torresani. BIMBA: Selective-scan compression for long-range video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[37]

EgoTaskQA: Understanding human tasks in egocentric videos

Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. EgoTaskQA: Understanding human tasks in egocentric videos. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[38]

Many-shot in-context learning in multimodal foundation models.arXiv preprint arXiv:2405.09798, 2024

Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, and Jonathan Chen. Many-shot in-context learning in multimodal foundation models.arXiv preprint arXiv:2405.09798, 2024

work page arXiv 2024
[39]

MMPB: It’s time for multi-modal personalization.arXiv preprint arXiv:2509.22820, 2025

Jaeik Kim, Woojin Kim, Woohyeon Park, and Jaeyoung Do. MMPB: It’s time for multi-modal personalization.arXiv preprint arXiv:2509.22820, 2025

work page arXiv 2025
[40]

MA-EgoQA: Question answer- ing over egocentric videos from multiple embodied agents.arXiv preprint arXiv:2603.09827, 2026

Kangsan Kim, Geon Park, Youngwan Lee, and Sung Ju Hwang. MA-EgoQA: Question answer- ing over egocentric videos from multiple embodied agents.arXiv preprint arXiv:2603.09827, 2026

work page arXiv 2026
[41]

VideoICL: Confidence-based iterative in-context learning for out-of-distribution video understanding.arXiv preprint arXiv:2412.02186, 2024

Kangsan Kim, Geon Park, Youngwan Lee, Woongyeong Yeo, and Sung Ju Hwang. VideoICL: Confidence-based iterative in-context learning for out-of-distribution video understanding.arXiv preprint arXiv:2412.02186, 2024

work page arXiv 2024
[42]

arXiv preprint arXiv:2407.11016 , year=

Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, and Hamed Zamani. LongLaMP: A benchmark for personalized long-form text generation.arXiv preprint arXiv:2407.11016, 2024

work page arXiv 2024
[43]

Berg, and Mohit Bansal

Jie Lei, Tamara L. Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. InAdvances in Neural Information Processing Systems, 2021

work page 2021
[44]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[45]

LensWalk: Agentic video understanding by planning how you see in videos.arXiv preprint arXiv:2603.24558, 2026

Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, and Shiguang Shan. LensWalk: Agentic video understanding by planning how you see in videos.arXiv preprint arXiv:2603.24558, 2026

work page arXiv 2026
[46]

MVBench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[47]

EgoPrivacy: What your first-person camera says about you? InInternational Conference on Machine Learning (ICML), 2025

Yijiang Li et al. EgoPrivacy: What your first-person camera says about you? InInternational Conference on Machine Learning (ICML), 2025. arXiv:2506.12258

work page arXiv 2025
[48]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Aria Everyday Activities Dataset,

Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria Everyday Activities dataset.arXiv preprint arXiv:2402.13349, 2024. 12

work page arXiv 2024
[50]

EgoSchema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[51]

Online episodic memory visual query localization with egocentric streaming object memory.arXiv preprint arXiv:2411.16934, 2024

Zaira Manigrasso, Matteo Milani, and Rita Cucchiara. Online episodic memory visual query localization with egocentric streaming object memory.arXiv preprint arXiv:2411.16934, 2024

work page arXiv 2024
[52]

Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

work page 2022
[53]

PEARL: Personalizing large language model writing assistants with generation-calibrated retrievers.arXiv preprint arXiv:2311.09180, 2024

Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jake Hofman, and Jennifer Neville. PEARL: Personalizing large language model writing assistants with generation-calibrated retrievers.arXiv preprint arXiv:2311.09180, 2024

work page arXiv 2024
[54]

Yo’LLaV A: Your personalized language and vision assistant

Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’LLaV A: Your personalized language and vision assistant. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[55]

Repic: Reinforced post-training for personalizing multi-modal lan- guage models.arXiv:2506.18369, 2025

Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, and Sungroh Yoon. RePIC: Reinforced post-training for personalizing multi-modal language models.arXiv preprint arXiv:2506.18369, 2025

work page arXiv 2025
[56]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

GPT-5.4 Thinking system card

OpenAI. GPT-5.4 Thinking system card. Technical report, OpenAI, March 2026

work page 2026
[58]

CaptainCook4D: A dataset for understanding errors in procedural activities.arXiv preprint arXiv:2312.14556, 2023

Rohith Peddi, Shivvrat Arya Tirumala, Mohammad Khan, Yufei Ji, Tushar Sridhar, Vib- hav Gogate Sridhar, and Nicholas Ruozzi. CaptainCook4D: A dataset for understanding errors in procedural activities.arXiv preprint arXiv:2312.14556, 2023

work page arXiv 2023
[59]

Personalized large vision-language models.arXiv preprint arXiv:2412.17610, 2024

Chau Pham, Hoang Phan, David Doermann, and Yunjie Tian. Personalized large vision-language models.arXiv preprint arXiv:2412.17610, 2024

work page arXiv 2024
[60]

Personalized visual instruction tuning

Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, and Tong Zhang. Personalized visual instruction tuning.arXiv preprint arXiv:2410.07113, 2024

work page arXiv 2024
[61]

Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, and Chai. Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

work page arXiv 2026
[62]

Optimization methods for personalizing large language models through retrieval augmentation

Alireza Salemi, Surya Kallumadi, and Hamed Zamani. Optimization methods for personalizing large language models through retrieval augmentation. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024

work page 2024
[63]

LaMP: When large language models meet personalization.arXiv preprint arXiv:2304.11406, 2023

Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. LaMP: When large language models meet personalization.arXiv preprint arXiv:2304.11406, 2023

work page arXiv 2023
[64]

LaMP-QA: A benchmark for personalized long-form question answering.arXiv preprint arXiv:2506.00137, 2025

Alireza Salemi and Hamed Zamani. LaMP-QA: A benchmark for personalized long-form question answering.arXiv preprint arXiv:2506.00137, 2025

work page arXiv 2025
[65]

Personalization Toolkit: Training Free Personalization of Large Vision Language Models

Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, and Rahaf Aljundi. Personalization toolkit: Training free personalization of large vision language models.arXiv preprint arXiv:2502.02452, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Encode-store-retrieve: Augmenting human memory through language-encoded egocentric perception.arXiv preprint arXiv:2308.05822, 2023

Junxiao Shen, John Dudley, and Per Ola Kristensson. Encode-store-retrieve: Augmenting human memory through language-encoded egocentric perception.arXiv preprint arXiv:2308.05822, 2023

work page arXiv 2023
[67]

PVChat: Personalized video chat with one-shot learning.arXiv preprint arXiv:2503.17069, 2025

Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yucheng Chen, Zhenxi Li, Fei Richard Yu, Ming Li, and Si Yong Yeo. PVChat: Personalized video chat with one-shot learning.arXiv preprint arXiv:2503.17069, 2025. 13

work page arXiv 2025
[68]

Personalized pieces: Efficient personalized large language models through collaborative efforts

Zhaoxuan Tan, Zheyuan Liu, and Meng Jiang. Personalized pieces: Efficient personalized large language models through collaborative efforts. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024
[69]

Democratizing large lan- guage models via personalized parameter-efficient fine-tuning

Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. Democra- tizing large language models via personalized parameter-efficient fine-tuning.arXiv preprint arXiv:2402.04401, 2024

work page arXiv 2024
[70]

Instant personalized large language model adaptation via hypernetwork.arXiv preprint arXiv:2510.16282, 2025

Zhaoxuan Tan, Zixuan Zhang, Haoyang Wen, Zheng Li, Rongzhi Zhang, Pei Chen, Fengran Mo, Zheyuan Liu, Qingkai Zeng, Qingyu Yin, and Meng Jiang. Instant personalized large language model adaptation via hypernetwork.arXiv preprint arXiv:2510.16282, 2025

work page arXiv 2025
[71]

Is sharing of egocentric video giving away your biometric signature? InEuropean Conference on Computer Vision (ECCV)

Daksh Thapar, Aditya Nigam, and Chetan Arora. Is sharing of egocentric video giving away your biometric signature? InEuropean Conference on Computer Vision (ECCV). Springer, 2020

work page 2020
[72]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[73]

VideoAgent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. VideoAgent: Long-form video understanding with large language model as agent. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[74]

LifelongMem- ory: Leveraging LLMs for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

Ying Wang, Yanlai He, Cuiling Wang, Huaiyu Bian, and Jianlong Chen. LifelongMem- ory: Leveraging LLMs for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

work page arXiv 2023
[75]

Ryoo, and Juan Carlos Niebles

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, and Juan Carlos Niebles. Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

work page arXiv 2025
[76]

LongVideoBench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[77]

Retrieval- augmented egocentric video captioning.arXiv preprint arXiv:2401.00789, 2024

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval- augmented egocentric video captioning.arXiv preprint arXiv:2401.00789, 2024

work page arXiv 2024
[78]

EgoLife: Towards egocentric life assistant.arXiv preprint arXiv:2503.03803, 2025

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, et al. EgoLife: Towards egocentric life assistant.arXiv preprint arXiv:2503.03803, 2025

work page arXiv 2025
[79]

Eliciting in-context learning in vision-language models for videos through curated data distributional properties.arXiv preprint arXiv:2311.17041, 2023

Keunwoo Peter Yu, Zheyuan Zhang, Fengyuan Hu, Shane Storks, and Joyce Chai. Eliciting in-context learning in vision-language models for videos through curated data distributional properties.arXiv preprint arXiv:2311.17041, 2023

work page arXiv 2023
[80]

A simple LLM framework for long-range video question-answering

Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple LLM framework for long-range video question-answering. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024

Showing first 80 references.