arxiv: 2508.03583 · v2 · submitted 2025-08-05 · 💻 cs.MM · cs.IR

OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset

Pith reviewed 2026-05-19 01:21 UTC · model grok-4.3

classification 💻 cs.MM cs.IR
keywords lifelog · question answering · multimodal dataset · open-ended QA · personal data · wearable devices · memory support

The pith

OpenLifelogQA supplies 14,187 question-answer pairs drawn from 18 months of personal multimodal lifelog data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs OpenLifelogQA as a large open-ended question answering dataset from 18 months of lifelog data that includes images, locations, and biometrics collected by wearable devices. It seeks to give researchers a more diverse and practical collection than earlier resources for testing systems that answer questions about daily personal experiences. The dataset covers multiple question types and difficulty levels to support evaluation in settings close to actual use. This work targets applications such as memory support, lifestyle analysis, and personal assistance through interactive queries over one's own recorded life data.

Core claim

OpenLifelogQA is a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data that contains 14,187 Q&A pairs spanning multiple question types and difficulty levels and offers greater diversity and practicality for real-world applications than prior resources.

What carries the argument

The OpenLifelogQA dataset, built by passive collection of personal daily activities via wearable devices followed by construction of Q&A pairs.

If this is right

  • Enables systematic testing of multimodal models on realistic personal experience queries across varied difficulty levels.
  • Supports creation of applications that provide memory augmentation and lifestyle coaching from wearable data streams.
  • Supplies baseline scores for models such as LLaVA-NeXT-Interleave on open-ended lifelog questions.
  • Facilitates future development of personal lifelog assistants for healthcare support and daily activity analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing the dataset with additional users beyond the original collectors could reveal whether annotation patterns hold across different lifestyles.
  • Linking lifelog QA outputs directly to biometric trends may strengthen applications in long-term health monitoring.
  • The release creates an opportunity to test whether performance on this single-source collection transfers to data gathered under different device or privacy constraints.

Load-bearing premise

The 18 months of lifelog data collected from the authors' own devices and the subsequent QA pair construction produce annotations that are accurate, unbiased, and representative enough to enable meaningful evaluation of lifelog QA systems.

What would settle it

An independent collection of lifelog data from multiple new users followed by fresh QA pair creation that yields substantially lower model performance scores than the 89.7 percent BERTScore and 3.97 LLM Score reported on OpenLifelogQA.

Figures

Figures reproduced from arXiv: 2508.03583 by Binh Nguyen, Cathal Gurrin, Gareth J. F. Jones, Hoang-Bao Le, Quang-Linh Tran, Tuong-Nghiem Diep.

Figure 1: Annotation Process
… generate 10 QA pairs for each day of lifelog data, while we provided all event descriptions and asked GPT-4o to generate 20 QA pairs. The number of required Q&As for the volunteers and GPT-4o was chosen to balance quality, scalability, and human effort in the generation process. Each QA pair also links from one to several events in stage 2, as these events provide information to answer …

Figure 2: Analysis on questions
… data protection regulations such as GDPR are essential to maintain public trust and ensure responsible research practices. 5 Baseline Experiment. In this section, we run a baseline experiment on the OpenLifelogQA dataset with a SOTA multi-modal LLM, LLaVA-NeXT-Interleave. The experimental setting and results, as well as some analysis, are discussed in the following sections. 5.1 Settin…

Original abstract

We introduce OpenLifelogQA, a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data. Lifelogging is the passive collection and analysis of personal daily activities using wearable devices, producing rich multimodal data such as images, locations, and biometrics. Question answering (QA) over lifelog data enables users to interactively query their own experiences, supporting applications in memory support, lifestyle analysis, and personal assistance. OpenLifelogQA contains 14,187 Q&A pairs spanning multiple question types and difficulty levels, designed to support robust evaluation in realistic settings. Compared with prior resources, OpenLifelogQA offers greater diversity and practicality for real-world applications. To establish baselines, we evaluate the LLaVA-NeXT-Interleave 7B model, achieving 89.7% BERTScore, 25.87% ROUGE-L, and an average LLM Score of 3.97. By releasing OpenLifelogQA, we aim to promote future research on lifelog technologies, paving the way for personal lifelog assistants capable of memory augmentation, healthcare support, and lifestyle coaching.
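
To make the ROUGE-L figure concrete, here is a minimal sketch of the sentence-level ROUGE-L F-measure, computed from the longest common subsequence (LCS) of whitespace tokens. The tokenization, the `beta=1.0` weighting, the function names, and the example sentences are assumptions for illustration only; the abstract does not state which ROUGE implementation or settings the authors used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """Sentence-level ROUGE-L F-measure (assumed: lowercase, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


# Hypothetical lifelog-style answer pair, not taken from the dataset.
reference = "I had coffee at the office"
candidate = "You had a coffee at work"
print(rouge_l(reference, candidate))  # 0.5
```

In this example the candidate conveys much the same event as the reference but shares only three tokens in order ("had", "coffee", "at"), so the score is 0.5. This illustrates one reason a lexical-overlap metric can sit well below an embedding-based metric such as BERTScore on open-ended answers.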

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces OpenLifelogQA, a large-scale open-ended multi-modal lifelog question-answering dataset constructed from 18 months of multimodal lifelog data collected from wearable devices. It contains 14,187 Q&A pairs across multiple question types and difficulty levels. The authors evaluate the LLaVA-NeXT-Interleave 7B model as a baseline, reporting 89.7% BERTScore, 25.87% ROUGE-L, and an average LLM Score of 3.97. The paper claims that OpenLifelogQA offers greater diversity and practicality for real-world applications compared to prior resources.

Significance. If the dataset construction ensures high quality and diversity, this work could provide a valuable benchmark for developing lifelog QA systems, advancing applications in memory support, healthcare, and lifestyle coaching. The provision of baselines facilitates immediate use by the community.

major comments (2)
  1. The manuscript does not detail the QA pair generation process, including inter-annotator agreement or validation steps. This is load-bearing for the claim that the dataset supports robust evaluation in realistic settings.
  2. The assertion of greater diversity and practicality is not adequately supported given that the data comes from the authors' own devices over 18 months, which inherently limits variation in user demographics, routines, and locations. Explicit comparison to multi-user lifelog corpora or diversity metrics would strengthen this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: The manuscript does not detail the QA pair generation process, including inter-annotator agreement or validation steps. This is load-bearing for the claim that the dataset supports robust evaluation in realistic settings.

    Authors: We agree that the original submission omitted key details on QA pair generation. In the revised manuscript we will add a dedicated subsection describing the full annotation pipeline: a team of three annotators created questions from the lifelog streams following explicit guidelines for question types and difficulty; answers were written by the same annotators with reference to the raw multimodal data; we will report inter-annotator agreement via Fleiss’ kappa on both question quality and answer correctness, plus validation steps including a pilot round with 500 pairs and final expert review for factual accuracy. These additions will directly support the claim of robust evaluation.

    revision: yes

  2. Referee: The assertion of greater diversity and practicality is not adequately supported given that the data comes from the authors' own devices over 18 months, which inherently limits variation in user demographics, routines, and locations. Explicit comparison to multi-user lifelog corpora or diversity metrics would strengthen this claim.

    Authors: We acknowledge that collection from the authors’ own devices limits demographic breadth. The 18-month span nevertheless yields substantial intra-user variation in routines, locations, and activities. In the revision we will insert an explicit comparison table against multi-user corpora (NTCIR Lifelog, Ego4D, and Lifelogging 2.0) and report quantitative diversity metrics including entropy of activity categories, number of unique GPS clusters, and temporal coverage across seasons. These additions will better ground the practicality claim while transparently noting the single-user limitation.

    revision: yes
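
The two statistics named in the simulated responses, Fleiss' kappa for inter-annotator agreement and entropy of activity categories as a diversity measure, can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function names, the rating-matrix layout, and the toy inputs are all assumptions, and the rebuttal itself is simulated, so no real annotation data is implied.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix ratings[i][j] = number of raters who
    assigned item i to category j. Every row must sum to the same rater count."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(ratings[0])
    # Share of all assignments that fell into each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return 1.0  # all ratings in one category; kappa is undefined, 1.0 by convention here
    return (p_bar - p_e) / (1 - p_e)


def shannon_entropy(labels):
    """Shannon entropy in bits of a list of category labels."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical data: three annotators judging four QA pairs as
# [acceptable, not acceptable].
toy_ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(toy_ratings))

# Hypothetical activity labels for four lifelog events.
toy_activities = ["walking", "eating", "working", "driving"]
print(shannon_entropy(toy_activities))
```

Higher entropy indicates that events are spread more evenly across activity categories. A single-user corpus can still score well on this measure, which is why the referee's request for comparison against multi-user corpora remains a separate question.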

Circularity Check

0 steps flagged

No circularity: dataset introduction with independent empirical contribution

Full rationale

The paper presents a new multimodal lifelog QA dataset constructed from 18 months of collected data, generates 14,187 Q&A pairs across question types, and reports baselines using an off-the-shelf LLaVA-NeXT-Interleave model. No mathematical derivations, predictions, or first-principles results are claimed. There are no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the central claims to inputs by construction. The contribution is the release of the dataset itself for future evaluation, which is self-contained and does not rely on any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution rests on standard lifelog data collection practices and existing multimodal QA evaluation metrics; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5756 in / 1033 out tokens · 38747 ms · 2026-05-19T01:21:11.646189+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Opal: Private Memory for Personal AI

    cs.CR · 2026-04 · unverdicted · novelty 6.0

    Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.
