A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

Jianwei Zhang; Kaixin Bai; Lik Hang Kenny Wong; Xueyang Kang

arxiv: 2505.01458 · v2 · pith:KAWTC63Gnew · submitted 2025-05-01 · 💻 cs.RO · cs.AI

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

keywords: manipulation, navigation, embodied, hardware, physics, sim-to-real, simulators, survey

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
cs.RO 2026-05 unverdicted novelty 6.0

SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over the baseline) and showing encouraging transfer to a physical robot.

ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings
cs.CV 2026-05 unverdicted novelty 6.0

A point-Transformer interactive 3D instance segmentation model handles multiple clicks jointly in one pass and reports over 20% mIoU gains versus baselines plus 8-10% cross-dataset improvement for one-click-per-instan...

ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings
cs.CV 2026-05 unverdicted novelty 6.0

ClickSeg3D uses a point Transformer encoder and hierarchical mask decoder with semantic embeddings to enable single-pass multi-object 3D interactive segmentation from sparse points, reporting over 20% mIoU gains versu...

PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit–Explicit Optimization
cs.CV 2026-04 unverdicted novelty 6.0

PhyMix unifies a new multi-aspect physics evaluator with implicit policy optimization and explicit test-time correction to produce single-image 3D indoor scenes that are both visually faithful and physically plausible.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 conditional novelty 6.0

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 unverdicted novelty 6.0

MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
cs.RO 2026-04 unverdicted novelty 4.0

A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.