EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 10roles
background 1polarities
background 1representative citing papers
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
MemoryCard organizes long videos into self-contained topic-aware Memory Cards that improve long-video QA accuracy by up to 21.8% relative under fixed visual-token budgets.
EgoProx benchmark shows MLLMs have some spatial knowledge but struggle to leverage it for egocentric 3D proximity reasoning VQA.
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.
citing papers explorer
-
EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.
-
Benchmarking Visual State Tracking in Multimodal Video Understanding
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
-
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
-
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
-
MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering
MemoryCard organizes long videos into self-contained topic-aware Memory Cards that improve long-video QA accuracy by up to 21.8% relative under fixed visual-token budgets.
-
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy
EgoProx benchmark shows MLLMs have some spatial knowledge but struggle to leverage it for egocentric 3D proximity reasoning VQA.
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration
HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.
-
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
-
Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.