LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos

2023 · arXiv 2312.05269

13 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Pro²Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.

Native Active Perception as Reasoning for Omni-Modal Understanding

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

OmniAgent formulates omni-modal video understanding as a POMDP with on-demand actions that distill cues into persistent text memory, showing positive test-time scaling and SOTA results on benchmarks like LVBench where a 7B model beats a 72B baseline.

MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

MemoryCard organizes long videos into self-contained topic-aware Memory Cards that improve long-video QA accuracy by up to 21.8% relative under fixed visual-token budgets.

EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

EgoProx benchmark shows MLLMs have some spatial knowledge but struggle to leverage it for egocentric 3D proximity reasoning VQA.

Personal Visual Context Learning in Large Multimodal Models

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.

Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.

CoVStream: Edge-Cloud Collaboration for Understanding of Long Video Streams

cs.CV · 2026-06-22 · unverdicted · novelty 4.0

CoVStream is an edge-cloud system that distills long videos into features and captions to cut bandwidth 87.6% while retaining 99.2% of full-cloud accuracy on LVBench.

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

cs.CV · 2026-02-25 · unverdicted · novelty 4.0

An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.

citing papers explorer

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning · cs.CV · 2026-05-17 · unverdicted · none · ref 8
Benchmarking Visual State Tracking in Multimodal Video Understanding · cs.CV · 2026-06-02 · unverdicted · none · ref 54
Pro²Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks · cs.AI · 2026-05-05 · unverdicted · none · ref 69
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning · cs.CV · 2026-01-22 · unverdicted · none · ref 32
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning · cs.CV · 2026-06-19 · unverdicted · none · ref 92
Native Active Perception as Reasoning for Omni-Modal Understanding · cs.CV · 2026-06-17 · unverdicted · none · ref 5
MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering · cs.CV · 2026-06-04 · unverdicted · none · ref 32
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy · cs.CV · 2026-05-23 · unverdicted · none · ref 63
Personal Visual Context Learning in Large Multimodal Models · cs.CV · 2026-05-11 · unverdicted · none · ref 74
HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration · cs.AI · 2026-04-23 · unverdicted · none · ref 18
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents · cs.CV · 2025-09-29 · unverdicted · none · ref 33
CoVStream: Edge-Cloud Collaboration for Understanding of Long Video Streams · cs.CV · 2026-06-22 · unverdicted · none · ref 11
Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge · cs.CV · 2026-02-25 · unverdicted · none · ref 11
