Target-Driven Structured Transformer Planner for Vision-Language Navigation

Chen Gao; Haibing Ren; Huaxia Xia; Jinyu Chen; Lirong Yang; Si Liu; Wenguan Wang; Yusheng Zhao

arxiv: 2207.11201 · v1 · pith:IF5O5EOCnew · submitted 2022-07-19 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing Ren, Huaxia Xia, Si Liu

classification 💻 cs.CV cs.AI cs.CL cs.LG

keywords navigation structured planner td-stp transformer agent long-term planning

Vision-language navigation is the task of directing an embodied agent to navigate 3D scenes by following natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, yet this has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon, goal-guided, and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target, even when it is located in unexplored environments. In addition, we design a Structured Transformer Planner that incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that TD-STP improves on the previous best methods' success rates by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
