HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
Pith reviewed 2026-05-10 18:15 UTC · model grok-4.3
The pith
HTNav uses staged hybrid IL-RL training and tiered planning to reach state-of-the-art results on urban aerial navigation benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
To address challenges in complex urban environments for aerial VLN, including insufficient generalization to unseen scenes, suboptimal long-range path planning, and inadequate spatial continuity understanding, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. A map representation learning module further deepens its understanding of spatial continuity in open domains.
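The staged mechanism described above — an imitation phase to stabilize the base policy, then reinforcement learning to add exploration — can be sketched as follows. The function names, the `il_fraction` parameter, and the hard phase switch are illustrative assumptions, not details taken from the paper.

```python
def staged_training(policy_update_il, policy_update_rl, episodes, il_fraction=0.5):
    """Hypothetical staged schedule: imitation learning first, RL fine-tuning second.

    `policy_update_il` / `policy_update_rl` stand in for the per-episode update
    rules of each stage; `il_fraction` controls when the switch happens.
    """
    n_il = int(len(episodes) * il_fraction)
    log = []
    for ep in episodes[:n_il]:
        # Stage 1: supervised updates on expert demonstrations (stable base policy)
        log.append(("il", policy_update_il(ep)))
    for ep in episodes[n_il:]:
        # Stage 2: RL updates that allow exploration beyond the demonstrations
        log.append(("rl", policy_update_rl(ep)))
    return log
```

In practice the switch would be driven by validation performance rather than a fixed episode count, but the two-stage structure is the point of the sketch.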
What carries the argument
Tiered decision-making mechanism that links macro-level path planning to fine-grained action control inside a staged hybrid IL-RL training loop, augmented by a map representation learning module for spatial continuity.
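A minimal sketch of the tiered idea (not the paper's implementation): a macro layer emits coarse waypoints toward the goal, while a micro layer executes fine-grained steps toward the current waypoint. The straight-line planner and the `stride`/`step` parameters are placeholder assumptions standing in for the learned components.

```python
import math

def macro_plan(pos, goal, stride=5.0):
    # Macro layer: choose the next waypoint at most `stride` away along the
    # straight line toward the goal (a stand-in for learned path planning).
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    d = math.hypot(dx, dy)
    if d <= stride:
        return goal
    return (pos[0] + dx / d * stride, pos[1] + dy / d * stride)

def micro_step(pos, waypoint, step=1.0):
    # Micro layer: one fine-grained move of length at most `step`.
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    d = math.hypot(dx, dy)
    if d <= step:
        return waypoint
    return (pos[0] + dx / d * step, pos[1] + dy / d * step)

def navigate(start, goal, stride=5.0, step=1.0, max_waypoints=100):
    # Tiered loop: the macro layer refreshes the waypoint; the micro layer
    # executes fine actions until the waypoint (and eventually the goal) is hit.
    pos = start
    for _ in range(max_waypoints):
        wp = macro_plan(pos, goal, stride)
        while math.hypot(wp[0] - pos[0], wp[1] - pos[1]) > 1e-9:
            pos = micro_step(pos, wp, step)
        if pos == goal:
            return pos
    return pos
```

The scale separation is what the framework bets on: the macro layer reasons about long-range structure, and the micro layer only ever solves short, local control problems.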
If this is right
- Improved generalization to unseen urban scenes.
- Better performance on long-range path planning tasks.
- Stronger grasp of spatial continuity in open urban domains.
- Increased navigation precision and robustness overall.
- State-of-the-art scores across all scene levels and task difficulties on CityNav.
Where Pith is reading between the lines
- The tiered structure may transfer to ground-based or multi-agent navigation where scale separation helps.
- Map learning could incorporate real-time updates from additional sensors to handle dynamic city traffic.
- Staged training might shorten the data needed for new cities if the imitation phase is made more efficient.
- The approach could support safety constraints by letting the macro layer enforce regulatory flight zones.
Load-bearing premise
That adding staged IL-RL training, tiered decision layers, and map representation will reliably raise generalization and long-range performance in unseen urban scenes.
What would settle it
Run HTNav on the CityNav unseen test scenes and check whether success rate and path efficiency remain higher than prior methods when paths are long and environments are novel.
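Success rate and SPL, the metrics that would settle it, have standard definitions from Anderson et al., "On Evaluation of Embodied Navigation Agents" (2018): SPL weights each success by the ratio of shortest-path length to the path actually taken. A minimal implementation, with an illustrative (not paper-specified) success radius:

```python
def success_rate_and_spl(episodes, success_radius=20.0):
    """Each episode is (final_dist_to_goal, agent_path_len, shortest_path_len).

    Returns (success rate, SPL); `success_radius` is an assumed threshold.
    """
    n = len(episodes)
    sr = spl = 0.0
    for final_dist, p, l in episodes:
        s = 1.0 if final_dist <= success_radius else 0.0
        sr += s
        # SPL term: success weighted by shortest-path / actual-path ratio,
        # so detours reduce the score even on successful episodes.
        spl += s * l / max(p, l)
    return sr / n, spl / n
```

Comparing these two numbers against prior methods on the unseen CityNav splits, stratified by path length, is exactly the check proposed above.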
Original abstract
Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance across all scene levels and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HTNav, a hybrid imitation learning (IL) and reinforcement learning (RL) framework for urban aerial vision-and-language navigation (VLN). It incorporates a staged training mechanism to stabilize basic navigation policies while improving exploration, a tiered decision-making structure for collaborative macro-level path planning and fine-grained action control, and a map representation learning module to enhance spatial continuity understanding in open domains. The central claim is that HTNav achieves state-of-the-art performance on the CityNav benchmark across all scene levels and task difficulties, addressing challenges of generalization to unseen scenes, long-range planning, and spatial continuity.
Significance. If the empirical claims hold with proper validation, this would represent a meaningful advance in aerial VLN by demonstrating how hybrid IL-RL training combined with tiered control and explicit map learning can improve robustness in complex urban settings. Such a framework could inform practical systems for applications like logistics delivery and urban inspection, where long-range navigation and spatial awareness are critical.
major comments (1)
- Abstract: The central SOTA claim on CityNav is asserted without any reported metrics (e.g., success rate, SPL, or navigation error), baseline comparisons, ablation studies on the staged training/tiered mechanism/map module, or error analysis. This makes the performance improvements from the proposed components unverifiable and undermines the load-bearing empirical contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that the abstract would be strengthened by including specific quantitative metrics to support the SOTA claim. We will revise the abstract in the next version to incorporate key results from our experiments while preserving the overall structure and claims.
Point-by-point responses
- Referee: Abstract: The central SOTA claim on CityNav is asserted without any reported metrics (e.g., success rate, SPL, or navigation error), baseline comparisons, ablation studies on the staged training/tiered mechanism/map module, or error analysis. This makes the performance improvements from the proposed components unverifiable and undermines the load-bearing empirical contribution.
- Authors: We acknowledge the validity of this observation. The full manuscript (Sections 4 and 5) provides detailed quantitative results, including success rate, SPL, navigation error, comparisons against multiple baselines, ablation studies on the staged training, tiered decision-making, and map representation learning modules, as well as error analysis across scene levels and task difficulties. However, the abstract summarizes these findings at a high level without specific numbers. We will revise the abstract to include representative metrics (e.g., overall success rate improvements and SPL values) and brief mentions of the ablations to make the empirical contribution immediately verifiable. This change will not alter any experimental findings or conclusions. (Revision: yes)
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes an empirical hybrid IL-RL navigation framework evaluated on the external CityNav benchmark. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims of SOTA performance rest on benchmark results rather than any self-referential construction. The staged training, tiered mechanism, and map module are presented as design choices whose effectiveness is tested externally, satisfying the criteria for a self-contained non-circular empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- staged training schedule parameters
axioms (2)
- Domain assumption: Imitation learning followed by reinforcement learning produces stable yet exploratory navigation policies.
- Domain assumption: Tiered decision-making separates macro path planning from fine action control without loss of performance.
invented entities (2)
- tiered decision-making mechanism (no independent evidence)
- map representation learning module (no independent evidence)