Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Antonino Furnari; Antonio Finocchiaro; Christian Micheloni; Davide Marana; Giovanni Maria Farinella; Matteo Dunnhofer; Moritz Nottebaum; Rosario Forte; Zaira Manigrasso

arxiv: 2411.16934 · v3 · pith:KKKRJRV5new · submitted 2024-11-25 · 💻 cs.CV

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni

keywords: memory, object, online, video, episodic, esom, module, ovq2d

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.
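
The online setting described in the abstract can be illustrated with a toy sketch. This is not the authors' ESOM implementation: the class `StreamingObjectMemory`, the IoU-based association, the use of a string label in place of a visual descriptor, and the choice to answer a query with the most recent matching track are all assumptions made purely for illustration. The sketch only shows the general shape of the idea: each frame is observed once, detections are linked into tracks, and queries are answered from the compact memory rather than from stored video.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class MemoryEntry:
    # Hypothetical: a string label stands in for a learned visual descriptor.
    label: str
    # Spatio-temporal track: (frame index, bounding box) pairs.
    track: List[Tuple[int, Box]] = field(default_factory=list)


class StreamingObjectMemory:
    """Toy memory: stores per-object tracks, never the frames themselves."""

    def __init__(self, iou_threshold: float = 0.5) -> None:
        self.entries: Dict[int, MemoryEntry] = {}
        self.iou_threshold = iou_threshold
        self._next_id = 0

    def observe(self, frame_idx: int, detections: List[Tuple[str, Box]]) -> None:
        """Process one frame's detections once, then discard the frame."""
        for label, box in detections:
            obj_id = self._match(frame_idx, label, box)
            if obj_id is None:  # newly discovered object
                obj_id = self._next_id
                self._next_id += 1
                self.entries[obj_id] = MemoryEntry(label)
            self.entries[obj_id].track.append((frame_idx, box))

    def _match(self, frame_idx: int, label: str, box: Box) -> Optional[int]:
        """Associate a detection with a track that was active in the previous frame."""
        best_id, best_score = None, self.iou_threshold
        for obj_id, entry in self.entries.items():
            last_frame, last_box = entry.track[-1]
            if entry.label != label or last_frame != frame_idx - 1:
                continue
            score = iou(last_box, box)
            if score >= best_score:
                best_id, best_score = obj_id, score
        return best_id

    def query(self, label: str) -> Optional[List[Tuple[int, Box]]]:
        """Return the most recently observed track matching the query, if any."""
        candidates = [e for e in self.entries.values() if e.label == label]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e.track[-1][0]).track


# Toy stream: the "mug" is seen in frames 0-1, disappears, and reappears in frame 3.
stream = [
    (0, [("mug", (0.0, 0.0, 10.0, 10.0))]),
    (1, [("mug", (1.0, 0.0, 11.0, 10.0))]),
    (2, []),
    (3, [("mug", (50.0, 50.0, 60.0, 60.0))]),
]
memory = StreamingObjectMemory()
for frame_idx, detections in stream:
    memory.observe(frame_idx, detections)
latest_mug_track = memory.query("mug")
```

In this sketch the memory grows with the number of tracked boxes rather than with the number of pixels seen, which is the property the online formulation relies on. A real system would replace the given detections with an object discovery model and the IoU rule with a proper tracker.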

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Personal Visual Context Learning in Large Multimodal Models
cs.CV · 2026-05 · unverdicted · novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.