
arxiv: 2411.16934 · v3 · pith:KKKRJRV5new · submitted 2024-11-25 · 💻 cs.CV

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

classification 💻 cs.CV
keywords memory, object, online, video, episodic, esom, module, ovq2d

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of the full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.
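
The online setting described in the abstract (each frame seen once, queries answered from a compact memory rather than the stored video) can be illustrated with a toy sketch. This is not ESOM's implementation: the class `StreamingObjectMemory`, its parameters, and the use of cosine similarity over precomputed embeddings are all assumptions made for illustration, standing in for the paper's discovery and tracking modules, whose details the abstract does not give.

```python
from dataclasses import dataclass, field
from math import sqrt

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class ObjectEntry:
    embedding: list                              # appearance descriptor (assumed given)
    track: list = field(default_factory=list)    # [(frame_idx, box), ...]


class StreamingObjectMemory:
    """Toy compact memory: one entry per discovered object, bounded track length."""

    def __init__(self, match_threshold=0.9, max_track_len=32):
        self.match_threshold = match_threshold
        self.max_track_len = max_track_len  # must be >= 1
        self.entries = []

    def _best_match(self, emb):
        # Return the stored entry most similar to emb, or None if below threshold.
        best, best_sim = None, self.match_threshold
        for entry in self.entries:
            sim = cosine(entry.embedding, emb)
            if sim >= best_sim:
                best, best_sim = entry, sim
        return best

    def observe(self, frame_idx, detections):
        """Process one frame's detections exactly once; the frame itself is not stored."""
        for emb, box in detections:
            entry = self._best_match(emb)
            if entry is None:
                # No stored object is similar enough: treat as a newly discovered object.
                entry = ObjectEntry(embedding=list(emb))
                self.entries.append(entry)
            entry.track.append((frame_idx, box))
            # Keep only the most recent positions so memory stays bounded.
            del entry.track[:-self.max_track_len]

    def query(self, query_emb):
        """Return the stored track for the best-matching object, or None."""
        entry = self._best_match(query_emb)
        return entry.track if entry else None


mem = StreamingObjectMemory(match_threshold=0.9, max_track_len=2)
mem.observe(0, [([1.0, 0.0], (0, 0, 10, 10))])
mem.observe(1, [([0.99, 0.05], (1, 1, 11, 11)), ([0.0, 1.0], (5, 5, 8, 8))])
mem.observe(2, [([1.0, 0.0], (2, 2, 12, 12))])
result = mem.query([1.0, 0.0])
```

In this sketch the memory footprint grows with the number of distinct objects and the track cap, not with video length, which is the property the OVQ2D task asks for. A real system would need learned detectors and trackers in place of the assumed embeddings.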



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Personal Visual Context Learning in Large Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.