AD-H: Language-guided Autonomous Driving with Hierarchical Agents

Huchuan Lu; Lijun Wang; Shiyu Tang; Talas Fu; Yifan Wang; Yuanhang Zhang; Zaibin Zhang

arxiv: 2406.03474 · v2 · pith:G4AJWDTUnew · submitted 2024-06-05 · 💻 cs.CV

Zaibin Zhang, Talas Fu, Shiyu Tang, Yuanhang Zhang, Yifan Wang, Lijun Wang, Huchuan Lu

keywords: driving, AD-H, hierarchical, instructions, mid-level, actions, autonomous, commands

Language-guided autonomous driving requires bridging a large abstraction gap between high-level natural-language instructions and low-level vehicle control. End-to-end approaches that use a single multimodal large language model (MLLM) to map language directly to actions struggle with this mismatch, often failing to exploit the reasoning capabilities of the model and exhibiting limited generalization beyond the distributions of driving datasets used for fine-tuning. To address this issue, we propose AD-H, a hierarchical multi-agent framework that explicitly separates high-level decision-making from low-level vehicle execution. At the upper level, an MLLM-based planner interprets natural-language commands and environmental context to generate coherent mid-level driving instructions. At the lower level, a lightweight controller converts these mid-level instructions into precise, continuous control actions. This decomposition aligns with the functional strengths of each component: the planner focuses on semantic reasoning and task decomposition, while the controller ensures stable and accurate actuation. To support large-scale training under this hierarchy, we design a rule-based pipeline that reconstructs mid-level commands from driving signals, producing 1.15 million hierarchical annotation pairs. Extensive experiments show that AD-H outperforms state-of-the-art models despite using fewer parameters, namely 3B plus 350M compared with 7B, and achieves superior long-horizon generalization and instruction-following performance. We make our data and code publicly accessible at https://github.com/zhangzaibin/AD-H
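
The rule-based reconstruction of mid-level commands from driving signals can be pictured with a minimal sketch. This is not the authors' pipeline: the `ControlSignal` schema, the sign convention for steering, the threshold values, and the command vocabulary are all hypothetical, chosen only to show how fixed rules might turn low-level signals into (command, action) annotation pairs.

```python
from dataclasses import dataclass


@dataclass
class ControlSignal:
    """One timestep of low-level driving signals (hypothetical schema)."""
    steer: float     # in [-1, 1]; negative means left (assumed convention)
    throttle: float  # in [0, 1]
    brake: float     # in [0, 1]


def mid_level_command(sig: ControlSignal,
                      steer_thresh: float = 0.1,
                      brake_thresh: float = 0.2,
                      throttle_thresh: float = 0.3) -> str:
    """Map one control signal to a short textual command using fixed rules.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Lateral intent is decided by the steering angle alone.
    if sig.steer <= -steer_thresh:
        lateral = "turn left"
    elif sig.steer >= steer_thresh:
        lateral = "turn right"
    else:
        lateral = "go straight"

    # Longitudinal intent: braking takes priority over throttle.
    if sig.brake >= brake_thresh:
        longitudinal = "slow down"
    elif sig.throttle >= throttle_thresh:
        longitudinal = "speed up"
    else:
        longitudinal = "keep speed"

    return f"{lateral}, {longitudinal}"


def annotate(signals: list[ControlSignal]) -> list[tuple[str, ControlSignal]]:
    """Pair each control signal with its reconstructed mid-level command."""
    return [(mid_level_command(s), s) for s in signals]
```

In a hierarchy like the one described, pairs of this kind could serve as targets for the planner (which emits the command text) and as inputs for the controller (which maps the command back to continuous actions).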

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
cs.CV · 2026-04 · unverdicted · novelty 6.0

LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.