OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset
Pith reviewed 2026-05-19 01:21 UTC · model grok-4.3
The pith
OpenLifelogQA supplies 14,187 question-answer pairs drawn from 18 months of personal multimodal lifelog data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenLifelogQA is a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data that contains 14,187 Q&A pairs spanning multiple question types and difficulty levels and offers greater diversity and practicality for real-world applications than prior resources.
What carries the argument
The OpenLifelogQA dataset, built by passive collection of personal daily activities via wearable devices followed by construction of Q&A pairs.
If this is right
- Enables systematic testing of multimodal models on realistic personal experience queries across varied difficulty levels.
- Supports creation of applications that provide memory augmentation and lifestyle coaching from wearable data streams.
- Supplies baseline scores for models such as LLaVA-NeXT-Interleave on open-ended lifelog questions.
- Facilitates future development of personal lifelog assistants for healthcare support and daily activity analysis.
Where Pith is reading between the lines
- Extending the collection with data from additional users beyond the original collectors could reveal whether annotation patterns hold across different lifestyles.
- Linking lifelog QA outputs directly to biometric trends may strengthen applications in long-term health monitoring.
- The release creates an opportunity to test whether performance on this single-source collection transfers to data gathered under different device or privacy constraints.
Load-bearing premise
The 18 months of lifelog data collected from the authors' own devices and the subsequent QA pair construction produce annotations that are accurate, unbiased, and representative enough to enable meaningful evaluation of lifelog QA systems.
What would settle it
An independent collection of lifelog data from multiple new users followed by fresh QA pair creation that yields substantially lower model performance scores than the 89.7 percent BERTScore and 3.97 LLM Score reported on OpenLifelogQA.
Original abstract
We introduce OpenLifelogQA, a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data. Lifelogging is the passive collection and analysis of personal daily activities using wearable devices, producing rich multimodal data such as images, locations, and biometrics. Question answering (QA) over lifelog data enables users to interactively query their own experiences, supporting applications in memory support, lifestyle analysis, and personal assistance. OpenLifelogQA contains 14,187 Q&A pairs spanning multiple question types and difficulty levels, designed to support robust evaluation in realistic settings. Compared with prior resources, OpenLifelogQA offers greater diversity and practicality for real-world applications. To establish baselines, we evaluate the LLaVA-NeXT-Interleave 7B model, achieving 89.7% BERTScore, 25.87% ROUGE-L, and an average LLM Score of 3.97. By releasing OpenLifelogQA, we aim to promote future research on lifelog technologies, paving the way for personal lifelog assistants capable of memory augmentation, healthcare support, and lifestyle coaching.
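The ROUGE-L figure quoted in the abstract is based on the longest common subsequence (LCS) between a reference answer and a model answer. The following is a minimal sketch of that computation. It assumes lowercase whitespace tokenization and a balanced F-measure (`beta=1.0`); the paper's actual ROUGE implementation, tokenizer, and settings are not stated here, and the example question-answer strings are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between a reference and a candidate answer.

    Tokenization (lowercase + whitespace split) and beta are assumptions
    made for this sketch, not settings taken from the paper.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical lifelog answer pair, invented for illustration.
score = rouge_l("I had coffee at the office", "coffee at the office")
print(f"ROUGE-L F1: {score:.2f}")
```

Because ROUGE-L rewards only in-order token overlap, an open-ended answer can be correct yet score low when it is phrased differently from the reference, which is one plausible reason for a ROUGE-L figure sitting far below the BERTScore figure on the same outputs.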
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenLifelogQA, a large-scale open-ended multi-modal lifelog question-answering dataset constructed from 18 months of multimodal lifelog data collected from wearable devices. It contains 14,187 Q&A pairs across multiple question types and difficulty levels. The authors evaluate the LLaVA-NeXT-Interleave 7B model as a baseline, reporting 89.7% BERTScore, 25.87% ROUGE-L, and an average LLM Score of 3.97. The paper claims that OpenLifelogQA offers greater diversity and practicality for real-world applications compared to prior resources.
Significance. If the dataset construction ensures high quality and diversity, this work could provide a valuable benchmark for developing lifelog QA systems, advancing applications in memory support, healthcare, and lifestyle coaching. The provision of baselines facilitates immediate use by the community.
major comments (2)
- The manuscript does not detail the QA pair generation process, including inter-annotator agreement or validation steps. This is load-bearing for the claim that the dataset supports robust evaluation in realistic settings.
- The assertion of greater diversity and practicality is not adequately supported given that the data comes from the authors' own devices over 18 months, which inherently limits variation in user demographics, routines, and locations. Explicit comparison to multi-user lifelog corpora or diversity metrics would strengthen this claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the paper.
- Referee: The manuscript does not detail the QA pair generation process, including inter-annotator agreement or validation steps. This is load-bearing for the claim that the dataset supports robust evaluation in realistic settings.
  Authors: We agree that the original submission omitted key details on QA pair generation. In the revised manuscript we will add a dedicated subsection describing the full annotation pipeline: a team of three annotators created questions from the lifelog streams following explicit guidelines for question types and difficulty; answers were written by the same annotators with reference to the raw multimodal data; we will report inter-annotator agreement via Fleiss’ kappa on both question quality and answer correctness, plus validation steps including a pilot round with 500 pairs and final expert review for factual accuracy. These additions will directly support the claim of robust evaluation.
  Revision: yes
- Referee: The assertion of greater diversity and practicality is not adequately supported given that the data comes from the authors' own devices over 18 months, which inherently limits variation in user demographics, routines, and locations. Explicit comparison to multi-user lifelog corpora or diversity metrics would strengthen this claim.
  Authors: We acknowledge that collection from the authors’ own devices limits demographic breadth. The 18-month span nevertheless yields substantial intra-user variation in routines, locations, and activities. In the revision we will insert an explicit comparison table against multi-user corpora (NTCIR Lifelog, Ego4D, and Lifelogging 2.0) and report quantitative diversity metrics including entropy of activity categories, number of unique GPS clusters, and temporal coverage across seasons. These additions will better ground the practicality claim while transparently noting the single-user limitation.
  Revision: yes
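Two of the quantities named in the simulated rebuttal, Fleiss' kappa for inter-annotator agreement and the entropy of activity categories, have standard definitions that can be sketched briefly. The code below is an illustration only: the function names, the rating-table layout, and the toy inputs are assumptions made here, and nothing in it reflects the authors' actual annotation data or tooling.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Proportion of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0  # every rating is in one category; agreement is trivial
    return (p_bar - p_exp) / (1 - p_exp)


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy inputs, invented for illustration: three annotators, two categories.
toy_ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]
toy_activities = ["walking", "eating", "walking", "eating"]
print(fleiss_kappa(toy_ratings))
print(shannon_entropy(toy_activities))
```

A kappa near 1 indicates agreement well above chance, while values near or below 0 indicate agreement no better than chance. Higher entropy indicates that activity labels are spread more evenly across categories, which is the sense of "diversity" the rebuttal proposes to quantify.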
Circularity Check
No circularity: dataset introduction with independent empirical contribution
full rationale
The paper presents a new multimodal lifelog QA dataset constructed from 18 months of collected data, generates 14,187 Q&A pairs across question types, and reports baselines using an off-the-shelf LLaVA-NeXT-Interleave model. No mathematical derivations, predictions, or first-principles results are claimed. There are no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the central claims to inputs by construction. The contribution is the release of the dataset itself for future evaluation, which is self-contained and does not rely on any circular reduction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We construct 14,187 pairs of Q&A with diverse types and difficulty levels... baseline experiment... LLaVA-NeXT-Interleave 7B model, achieving 89.7% BERTScore, 25.87% ROUGE-L...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Opal: Private Memory for Personal AI
  Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.