arxiv: 2206.13396 · v2 · pith:HD2IORZUnew · submitted 2022-06-21 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords rearrangement · objects · semantic · visual · search · approach · method · need

Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, voxel-based semantic map, and semantic search policy to efficiently find objects that need to be rearranged. On the AI2-THOR Rearrangement Challenge, our method improves on current state-of-the-art end-to-end reinforcement learning-based methods that learn visual rearrangement policies from 0.53% correct rearrangement to 16.56%, using only 2.7% as many samples from the environment.
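
The map-then-compare idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the voxel size, the majority-vote labeling, and the function names `voxel_index`, `build_semantic_map`, and `objects_to_rearrange` are all assumptions made for illustration. It assumes the segmentation model has already produced 3D points tagged with class labels, builds one voxel map for the goal state and one for the current state, and reports the labels whose occupied voxels differ.

```python
import math
from collections import Counter, defaultdict

VOXEL_SIZE = 0.25  # metres per voxel edge; illustrative value, not from the paper


def voxel_index(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point (x, y, z) in metres to an integer voxel coordinate."""
    return tuple(math.floor(c / voxel_size) for c in point)


def build_semantic_map(labeled_points, voxel_size=VOXEL_SIZE):
    """Aggregate (point, label) pairs into {voxel: majority label}."""
    votes = defaultdict(Counter)
    for point, label in labeled_points:
        votes[voxel_index(point, voxel_size)][label] += 1
    # Each voxel keeps the label that received the most votes.
    return {voxel: counts.most_common(1)[0][0] for voxel, counts in votes.items()}


def voxels_by_label(semantic_map):
    """Invert {voxel: label} into {label: set of voxels}."""
    grouped = defaultdict(set)
    for voxel, label in semantic_map.items():
        grouped[label].add(voxel)
    return grouped


def objects_to_rearrange(goal_map, current_map):
    """Return the sorted labels whose voxel sets differ between the two maps."""
    goal = voxels_by_label(goal_map)
    current = voxels_by_label(current_map)
    labels = set(goal) | set(current)
    return sorted(l for l in labels if goal.get(l, set()) != current.get(l, set()))


# Hypothetical observations: the mug is unchanged, the book has moved.
goal_points = [((0.1, 0.1, 0.1), "mug"), ((1.0, 0.0, 0.5), "book")]
current_points = [((0.1, 0.1, 0.1), "mug"), ((2.0, 0.0, 0.5), "book")]

goal_map = build_semantic_map(goal_points)
current_map = build_semantic_map(current_points)
moved = objects_to_rearrange(goal_map, current_map)
```

In this toy case `moved` contains only the book. A real system would also need to handle segmentation noise and partially observed rooms, which is where a search policy such as the one described in the abstract would come in.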

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  2. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 conditional novelty 6.0

    SpaMEM is a diagnostic benchmark showing that current vision-language models exhibit a sharp collapse in spatial reasoning when transitioning from text-aided state tracking to purely visual memory in dynamic environments.