Embodied Visual Recognition

Jianwei Yang , Zhile Ren , Mingze Xu , Xinlei Chen , David Crandall , Devi Parikh , Dhruv Batra

classification cs.CV cs.AI cs.LG cs.RO

keywords visual, object, recognition, agents, embodied, environment, amodal, move

Passive visual systems typically fail to recognize objects in the amodal setting, where the objects are heavily occluded. In contrast, humans and other embodied agents can move through the environment and actively control the viewing angle to better understand object shapes and semantics. In this work, we introduce the task of Embodied Visual Recognition (EVR): an agent is instantiated in a 3D environment close to an occluded target object and is free to move in the environment to perform object classification, amodal object localization, and amodal object segmentation. To address this task, we develop a new model called Embodied Mask R-CNN, with which agents learn to move strategically to improve their visual recognition abilities. We conduct experiments using the House3D environment. Experimental results show that (1) agents with embodiment (movement) achieve better visual recognition performance than passive ones, and (2) to improve their visual recognition abilities, agents can learn strategic movement paths that differ from shortest paths.
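
As a toy illustration of the amodal setting described in the abstract, the sketch below contrasts an object's full (amodal) extent with the part left visible behind an occluder. It is not the paper's evaluation code: the grid layout, the helper names `box_pixels` and `iou`, and the choice of intersection-over-union as the score are all assumptions made for illustration.

```python
def box_pixels(x0, y0, x1, y1):
    """Return the set of (x, y) pixels in the half-open box [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def iou(pred, target):
    """Intersection-over-union of two pixel sets (1.0 if both are empty)."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0


# Hypothetical scene: a 4x4 target object whose right half is hidden by an occluder.
amodal_gt = box_pixels(0, 0, 4, 4)   # full extent of the object, hidden part included
occluder = box_pixels(2, 0, 6, 4)    # covers the object's columns x = 2 and x = 3
visible = amodal_gt - occluder       # what a single fixed viewpoint actually sees

# Fraction of the object hidden from the current viewpoint.
occlusion_ratio = 1 - len(visible) / len(amodal_gt)

# A prediction that only covers the visible pixels scores poorly against the amodal mask.
visible_only_iou = iou(visible, amodal_gt)

print(occlusion_ratio, visible_only_iou)
```

In this toy scene, a prediction limited to the visible pixels recovers only half of the amodal mask. Closing that gap is the motivation for letting an agent move to a less occluded viewpoint.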

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
cs.AI 2026-05 unverdicted novelty 6.0

Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.