arxiv: 2604.18223 · v1 · submitted 2026-04-20 · 💻 cs.CV

Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation

Zhen Liu , Yuhan Liu , Jinjun Wang , Jianyi Liu , Wei Song , Jingwen Fu This is my paper

Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language navigationembodied navigationinstruction understandingstate-conditioned semanticsenvironment-guided refinementtoken-level groundingdynamic language encoding

0 comments

The pith

Modeling instructions as an evolving state updated by each new observation improves how agents follow directions in changing environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static encodings of instructions limit navigation because word meanings shift with the agent's changing views and position. Instead, it treats the instruction as a live state that refines itself token by token using the current perceptual context at every step. A coarse stage picks the relevant instruction segment based on what is seen, while a fine stage sharpens token meanings under that observation. This keeps language aligned with the scene without discarding earlier or later parts of the command. The result is better path efficiency and success on standard vision-language navigation tasks.

Core claim

We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation -

What carries the argument

The Instruction-as-State variable, a token-level representation that is activated in segments and then refined at the token level according to the agent's current observation.

Load-bearing premise

That activating the matching instruction segment and then refining its tokens using the current view will keep the full instruction's meaning coherent and correctly aligned without losing earlier context or adding errors as the agent continues moving.

What would settle it

Running the method on trajectories where one observation matches two different instruction segments equally well and checking whether later actions still follow the remaining parts of the original instruction correctly.

Figures

Figures reproduced from arXiv: 2604.18223 by Jianyi Liu, Jingwen Fu, Jinjun Wang, Wei Song, Yuhan Liu, Zhen Liu.

**Figure 2.** Figure 2: Overview of the S-EGIU framework. CGIP estimates a clause-relevance distri [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of S-EGIU’s state-conditioned instruction understanding. [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗

**Figure 4.** Figure 4: A representative failure case on the R2R dataset. The agent (blue trajectory) [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

read the original abstract

Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent's perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction--perception entanglement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames instruction understanding as an evolving per-step state in VLN via coarse segment activation plus observation-guided refinement, but the gains are modest and the mechanism for preserving cross-segment context looks underspecified.

read the letter

The main thing here is the shift from static instruction encodings to an Instruction-as-State that updates token by token with the agent's current observation. S-EGIU does this in two stages: first activate the matching instruction segment, then refine its tokens under the visual input. That setup directly targets the dynamic entanglement problem the abstract describes, and the reported +2.68% SPL on REVERIE Test Unseen plus efficiency numbers on other VLN sets give it some empirical grounding. The framing itself is clean and the coarse-to-fine split is a practical way to avoid full-instruction recomputation at every step. Those are the parts that feel like actual forward movement rather than re-labeling of attention tricks. The soft spot is exactly the one the stress test flags. Once a segment is activated and the others are set aside, there is no obvious persistent memory or merging step to recover global dependencies or conditional clauses if the activation turns out wrong later. The abstract does not spell out how the state carries non-local information across steps, so the claim that the evolving state stays semantically coherent rests on an assumption that may not hold under visual ambiguity. Without ablations on segment switching errors or full-instruction baselines, it is hard to know how much of the SPL lift comes from the new state modeling versus other engineering choices. This work is for researchers already running VLN experiments who want a concrete alternative to fixed language encoders. A reader who needs to improve robustness in changing environments will get usable ideas from the architecture even if the numbers stay incremental. It is coherent enough on its own terms to warrant referee time rather than a desk reject, though any review should press on the context-preservation details and demand clearer error breakdowns. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that static global encodings of instructions limit adaptation in VLN tasks due to dynamic language-perception entanglement. It introduces the Instruction-as-State variable—a token-level, evolving representation conditioned on the agent's perceptual state—and realizes it via the S-EGIU coarse-to-fine framework: observation-aligned segment activation at the coarse level followed by observation-guided token grounding and contextual modeling at the fine level. This produces a continuously updated instruction state during navigation. The authors report empirical gains, including +2.68% SPL on REVERIE Test Unseen, plus efficiency improvements across VLN benchmarks.

Significance. If the reported gains hold under rigorous controls and the framework demonstrably preserves cross-segment dependencies, the work would offer a concrete mechanism for state-conditioned instruction semantics that addresses a recognized limitation in embodied navigation models. The coarse-to-fine design is a clear modeling choice that could be adopted more broadly if shown to be robust.

major comments (2)

[§3.2] §3.2 (Coarse-to-Fine Framework): The central claim that local segment activation plus intra-segment token refinement maintains semantic coherence rests on the unstated assumption that de-activated segments' context is dispensable. No persistent cross-segment memory, full-instruction attention, or state-merging operator is described; if visual ambiguity triggers an early incorrect activation, the paper provides no documented recovery path for conditional or sequential clauses spanning multiple segments. This directly bears on whether the +2.68% SPL gain can be attributed to the proposed dynamic entanglement rather than to other factors.
[§4.3] §4.3 (Ablation and Baseline Comparisons): The ablation isolating the contribution of observation-guided token refinement reports gains, yet the static-instruction baselines are not shown to have been re-trained with equivalent capacity or the same observation encoder. Without this control, the performance delta cannot be unambiguously credited to the Instruction-as-State construction.

minor comments (2)

[§2] Notation: The term 'perceptual state' is used interchangeably with 'observation-grounded navigation context' without an explicit definition or diagram showing its relation to the visual encoder output.
[Figure 2] Figure 2: The diagram of segment activation lacks an arrow or label indicating whether the Instruction-as-State is carried forward from the previous timestep or reset per segment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications on our design and experimental setup while outlining the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Coarse-to-Fine Framework): The central claim that local segment activation plus intra-segment token refinement maintains semantic coherence rests on the unstated assumption that de-activated segments' context is dispensable. No persistent cross-segment memory, full-instruction attention, or state-merging operator is described; if visual ambiguity triggers an early incorrect activation, the paper provides no documented recovery path for conditional or sequential clauses spanning multiple segments. This directly bears on whether the +2.68% SPL gain can be attributed to the proposed dynamic entanglement rather than to other factors.

Authors: We appreciate the referee highlighting the importance of cross-segment semantic coherence. In S-EGIU, the Instruction-as-State is updated at every step by conditioning on the current perceptual state: the coarse stage activates the observation-aligned segment while the fine stage performs observation-guided token grounding and contextual modeling within it. Although we do not maintain an explicit persistent memory buffer for de-activated segments, the continuous evolution of the state via perceptual conditioning enables re-alignment when subsequent observations provide clarifying visual evidence, offering an implicit recovery path. The +2.68% SPL gain is supported by component ablations that isolate the contribution of state-conditioned activation and refinement. To address the concern more explicitly, the revised manuscript will include additional discussion of cross-segment dependency handling together with a targeted ablation on multi-segment sequential instructions. revision: partial
Referee: [§4.3] §4.3 (Ablation and Baseline Comparisons): The ablation isolating the contribution of observation-guided token refinement reports gains, yet the static-instruction baselines are not shown to have been re-trained with equivalent capacity or the same observation encoder. Without this control, the performance delta cannot be unambiguously credited to the Instruction-as-State construction.

Authors: We agree that rigorous attribution requires matched capacity and encoders. The static-instruction baselines in our experiments already share the identical observation encoder architecture with the proposed model. Nevertheless, we acknowledge that explicit re-training under strictly equivalent parameter budgets would further strengthen the ablation. In the revised manuscript we will re-train the static baselines with matched capacity, report the updated numbers, and clarify the controls, allowing the performance improvements to be more unambiguously attributed to the Instruction-as-State construction. revision: yes

Circularity Check

0 steps flagged

No circularity: modeling framework is an independent architectural choice with empirical validation

full rationale

The paper introduces Instruction-as-State and the S-EGIU coarse-to-fine framework as a modeling decision to address dynamic language-observation entanglement. The description of segment activation followed by token refinement is presented as a design choice, not derived from prior equations or self-citations. Reported gains (+2.68% SPL on REVERIE) are empirical outcomes on standard benchmarks rather than predictions forced by fitted inputs or self-referential definitions. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text that would reduce the central claim to its own inputs by construction. The derivation chain remains self-contained as a proposed architecture evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work rests on standard VLN domain assumptions and introduces two new conceptual entities; no numerical free parameters are mentioned.

axioms (1)

domain assumption Instruction semantics in VLN are dynamically entangled with the agent's changing visual and spatial context
Stated as the central challenge motivating the Instruction-as-State model.

invented entities (2)

Instruction-as-State variable no independent evidence
purpose: Token-level instruction representation that evolves conditioned on perceptual state
Core modeling innovation introduced to replace static global encoding.
S-EGIU framework no independent evidence
purpose: Coarse-to-fine mechanism for state-conditioned segment activation and token refinement
Concrete realization of the Instruction-as-State principle.

pith-pipeline@v0.9.0 · 5570 in / 1348 out tokens · 44500 ms · 2026-05-10T04:42:04.083087+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 13 canonical work pages

[1]

Vision-language navi- gation: A survey and taxonomy,

W. Wu, T. Chang, X. Li, Q. Yin, and Y. Hu, “Vision-language navi- gation: A survey and taxonomy,”Neural Computing and Applications, vol. 36, no. 7, pp. 3291–3316, 2024, doi: 10.1007/s00521-023-09217-1

work page doi:10.1007/s00521-023-09217-1 2024
[2]

, author Qing, J

J. Li and M. Bansal, “Improving Vision-and-Language Navigation by Generating Future-View Image Semantics,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10803–10812, doi: 10.1109/CVPR52729.2023.01040

work page doi:10.1109/cvpr52729.2023.01040 2023
[3]

Vision-and-Language Navi- gation: Interpreting Visually-Grounded Navigation Instructions in Real Environments,

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I.D. Reid, S. Gould, A. van den Hengel, “Vision-and-Language Navi- gation: Interpreting Visually-Grounded Navigation Instructions in Real Environments,” inProceedings of the 2018 IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2018, pp. 3674–3683

2018
[4]

Episodic Transformer for Vision- and-Language Navigation,

A. Pashevich, C. Schmid, and C. Sun, “Episodic Transformer for Vision- and-Language Navigation,” inProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2021, pp. 15942–15952

2021
[5]

Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor,

K. He, K. Chen, J. Bai, Y. Huang, Q. Wu, S.-T. Xia, and L. Wang, “Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024

2024
[6]

VELMA: Verbalization Embodiment of LLM Agents for Vision and 25 Language Navigation in Street View,

R. Schumann, W. Zhu, W. Feng, T.-J. Fu, S. Riezler, and W. Y. Wang, “VELMA: Verbalization Embodiment of LLM Agents for Vision and 25 Language Navigation in Street View,” inProceedings of the AAAI Con- ference on Artificial Intelligence, vol. 38, no. 17, pp. 19039–19047, 2024, doi: 10.1609/AAAI.V38I17.29858

work page doi:10.1609/aaai.v38i17.29858 2024
[7]

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation,

C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong, “Self-Monitoring Navigation Agent via Auxiliary Progress Estimation,” inProceedings of the International Conference on Learning Representa- tions, 2019

2019
[8]

Object-and- action aware model for visual language navigation,

Y. Qi, Z. Pan, S. Zhang, A. van den Hengel, and Q. Wu, “Object-and- action aware model for visual language navigation,” inComputer Vision – ECCV 2020, 2020

2020
[9]

Speaker- follower models for vision-and-language navigation,

D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, “Speaker- follower models for vision-and-language navigation,”Advances in Neural Information Processing Systems (NeurIPS), 2018

2018
[10]

Sub-instruction aware vision-and-language navigation,

Y. Hong, C. Rodriguez, Q. Wu, and S. Gould, “Sub-instruction aware vision-and-language navigation,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

2020
[11]

Sub-instruction and local map relationship enhanced model for vision and language navigation,

Y. Zhang, Y. Li, J. Bai, Y. Feng, and M. Tao, “Sub-instruction and local map relationship enhanced model for vision and language navigation,” inProceedings of the International Conference on Neural Information Processing (ICONIP), 2023, pp. 518–529

2023
[12]

Structured Scene Memory for Vision-Language Navigation,

H. Wang, W. Wang, W. Liang, C. Xiong, and J. Shen, “Structured Scene Memory for Vision-Language Navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021
[13]

Explicit Object Relation Alignment for Vision and Language Navigation,

Y. Zhang and P. Kordjamshidi, “Explicit Object Relation Alignment for Vision and Language Navigation,” inProceedings of the 60th An- nual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2022, pp. 322–331

2022
[14]

Think Deeply, Act Locally: Memory-Driven Transformers for Vision-and-Language Navigation,

X. Chen, Z. Liu, W. Bai, and S. K. Y. Lee, “Think Deeply, Act Locally: Memory-Driven Transformers for Vision-and-Language Navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), 2022. 26

2022
[15]

Causal learning with uncertainty-aware transformer for vision-and- language navigation,

K. Zhang, W. Xu, Z. Miao, Y. Tian, Y. Cen, Y. Liu, and W. He, “Causal learning with uncertainty-aware transformer for vision-and- language navigation,”Neurocomputing, p. 132196, 2025

2025
[16]

Instruction- guided path planning with 3D semantic maps for vision-language navi- gation,

Z. Wang, M. Li, M. Wu, M.-F. Moens, and T. Tuytelaars, “Instruction- guided path planning with 3D semantic maps for vision-language navi- gation,”Neurocomputing, vol. 625, p. 129457, 2025

2025
[17]

NavGPT: Explicit Reasoning in Vision- and-Language Navigation with Large Language Models,

G. Zhou, Y. Hong, and Q. Wu, “NavGPT: Explicit Reasoning in Vision- and-Language Navigation with Large Language Models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, pp. 7641–7649, Mar. 2024, doi: 10.1609/AAAI.V38I7.28597

work page doi:10.1609/aaai.v38i7.28597 2024
[18]

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation,

J. Zhang, K. Wang, R. Xu, G. Zhou, Y. Hong, X. Fang, Q. Wu, Z. Zhang, and H. Wang, “NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation,” inProceedings of Robotics: Sci- ence and Systems (RSS), 2024

2024
[19]

NavGPT-2: Un- leashing Navigational Reasoning Capability for Large Vision-Language Models,

G. Zhou, Y. Hong, Z. Wang, X. E. Wang, and Q. Wu, “NavGPT-2: Un- leashing Navigational Reasoning Capability for Large Vision-Language Models,” inProceedings of the European Conference on Computer Vi- sion (ECCV), Lecture Notes in Computer Science, pp. 260–278, 2024, doi: 10.1007/978-3-031-72667-5_15

work page doi:10.1007/978-3-031-72667-5_15 2024
[20]

Vision-and-language navigation today and tomorrow: A survey in the era of foundation models

Y. Zhang, Z. Ma, J. Li, Y. Qiao, Z. Wang, J. Chai, Q. Wu, M. Bansal, and P. Kordjamshidi, “Vision-and-Language Navigation Today and To- morrow: A Survey in the Era of Foundation Models,”arXiv preprint arXiv:2407.07035, 2024, doi: 10.48550/arXiv.2407.07035

work page doi:10.48550/arxiv.2407.07035 2024
[21]

Language and visual entity relationship graph for agent navigation,

Y. Hong, C. Rodriguez, Q. Wu, and S. Gould, “Language and visual entity relationship graph for agent navigation,” inAdvances in Neu- ral Information Processing Systems (NeurIPS), vol. 33, pp. 7685–7696, 2020

2020
[22]

Evolving graphical plan- ner: Contextual global planning for vision-and-language navigation,

Z. Deng, K. Narasimhan, and O. Russakovsky, “Evolving graphical plan- ner: Contextual global planning for vision-and-language navigation,” in Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[23]

Topo- logical planning with transformers for vision-and-language navigation,

C. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, “Topo- logical planning with transformers for vision-and-language navigation,” 27 inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11276-11286, 2021

2021
[24]

VLN-PETL: Parameter-efficient trans- fer learning for vision-and-language navigation,

Y. Qiao, Z. Yu, and Q. Wu, “VLN-PETL: Parameter-efficient trans- fer learning for vision-and-language navigation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15443–15452

2023
[25]

Be- yond the nav-graph: Vision-and-language navigation in continuous en- vironments,

J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Be- yond the nav-graph: Vision-and-language navigation in continuous en- vironments,” inProceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Springer International Publishing, 2020, pp. 104–120

2020
[26]

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training,

W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

2020
[27]

Air- bert: In-Domain Pretraining for Vision-and-Language Navigation,

P.-L. Guhur, M. Tapaswi, S. Chen, I. Laptev, and C. Schmid, “Air- bert: In-Domain Pretraining for Vision-and-Language Navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021
[28]

Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation,

X. Chen, J. Zhang, Q. Xu, X. Zhang, and S. K. Y. Lee, “Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021
[29]

MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-Based Vision-and-Language Navigation,

L. Zhang, H. Liao, X. Xu, Q. Zhang, X. Zhang, P. Wang, and R. Xu, “MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-Based Vision-and-Language Navigation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), 2025, pp. 13032–13056

2025
[30]

TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation,

N. Rajabi and J. Kosecka, “TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation,” inarXiv preprint arXiv:2502.07306, 2025. 28

work page arXiv 2025
[31]

Recursive bidirectional cross-modal reasoning network for vision-and-language navigation,

J. Wu, C. Wu, X. Shen, F. Wu, and L. Wang, “Recursive bidirectional cross-modal reasoning network for vision-and-language navigation,”Ex- pert Systems with Applications, vol. 270, p. 126442, 2025

2025
[32]

World-Consistent Data Generation for Vision-and-Language Nav- igation,

Y. Zhong, R. Zhang, Z. Zhang, Z. Wang, C. Fang, X. Zhang, and Q. Guo, “World-Consistent Data Generation for Vision-and-Language Nav- igation,”arXiv preprint arXiv:2412.06413, 2024

work page arXiv 2024
[33]

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments,

Y. Qi, Q. Wu, P. Anderson, X. Wang, W.Y. Wang, C. Shen, A. van den Hengel, “REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments,” inProceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),2020, pp. 9979–9988

2020
[34]

Natural language processing,

K. R. Chowdhary and K. R. Chowdhary, “Natural language processing,” Fundamentals of Artificial Intelligence, 2020, pp. 603-649

2020
[35]

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation,

P. Moudgil, T. Jain, A. Salim, and P. A. S. D. V. K., “SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation,” in Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[36]

VLnBERT: A recurrent vision-and-language BERT for navigation,

Y.Hong, Q.Wu, Y.Qi, C.Rodriguez-Opazo, andS.Gould, “VLnBERT: A recurrent vision-and-language BERT for navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pp. 1643–1653, 2021

2021
[37]

History aware multi- modal transformer for vision-and-language navigation,

S. Chen, P.-L. Guhur, C. Schmid, and I. Laptev, “History aware multi- modal transformer for vision-and-language navigation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021
[38]

SOON: Sce- nario oriented object navigation with graph-based exploration,

F. Zhu, X. Liang, Y. Zhu, Q. Yu, X. Chang, and X. Liang, “SOON: Sce- nario oriented object navigation with graph-based exploration,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12689–12699

2021
[39]

E2BA: Environment Exploration and Backtracking Agent for Visual Language Object Navigation,

Y. Shi, J. Liu, L. Sun, and X. Zheng, “E2BA: Environment Exploration and Backtracking Agent for Visual Language Object Navigation,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 7, pp. 6231–6244, 2025. 29

2025
[40]

Visual Language Maps for Robot Navigation,

C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” inProceedings of the IEEE International Con- ference on Robotics and Automation (ICRA), 2023, pp. 10608–10615

2023
[41]

Instruction- guided Path Planning with 3D Semantic Maps for Vision-Language Nav- igation,

Z. Wang, M. Li, M. Wu, M.-F. Moens, and T. Tuytelaars, “Instruction- guided Path Planning with 3D Semantic Maps for Vision-Language Nav- igation,”Neurocomputing, vol. 625, p. 129457, 2025

2025
[42]

Zero-shot Visual Grounding via Coarse-to-Fine Representation Learning,

J. Mi, S. Jin, Z. Chen, D. Liu, X. Wei, and J. Zhang, “Zero-shot Visual Grounding via Coarse-to-Fine Representation Learning,”Neurocomput- ing, vol. 610, p. 128621, 2024

2024
[43]

Vision- and-Language Navigation via Causal Learning,

L. Wang, Z. He, R. Dang, M. Shen, C. Liu, and Q. Chen, “Vision- and-Language Navigation via Causal Learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13139–13150

2024
[44]

CR-former: Single-image cloud removal with focused Taylor attention,

Y. Wu, Y. Deng, S. Zhou, Y. Liu, W. Huang, and J. Wang, “CR-former: Single-image cloud removal with focused Taylor attention,”IEEE Trans- actions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

2024
[45]

Event-Equalized Dense Video Captioning,

K. Wu, P. Li, J. Fu, Y. Li, Y. Wu, Y. Liu, J. Wang, and S. Zhou, “Event-Equalized Dense Video Captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 8417–8427

2025
[46]

Semantic-aware representation learning for homography estimation,

Y. Liu, Q. Huang, S. Hui, J. Fu, S. Zhou, K. Wu, P. Li, and J. Wang, “Semantic-aware representation learning for homography estimation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 2506–2514

2024
[47]

Mind the gap: Aligning vision foundation models to image feature matching,

Y. Liu, J. Fu, Y. Wu, K. Wu, P. Li, J. Wu, S. Zhou, and J. Xin, “Mind the gap: Aligning vision foundation models to image feature matching,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 20313–20323

2025
[48]

PatchCue: Enhancing Vision-Language Model Reasoning with Patch- Based Visual Cues,

Y. Qi, P. Fu, H. Li, Y. Liu, C. Jiang, B. Qin, Z. Luo, and J. Luan, “PatchCue: Enhancing Vision-Language Model Reasoning with Patch- Based Visual Cues,”arXiv preprint arXiv:2603.05869, 2026. 30

work page arXiv 2026
[49]

Closing the gap between the upper bound and lower bound of Adam’s iteration com- plexity,

B. Wang, J. Fu, H. Zhang, N. Zheng, and W. Chen, “Closing the gap between the upper bound and lower bound of Adam’s iteration com- plexity,”Advances in Neural Information Processing Systems, vol. 36, pp. 39006–39032, 2023

2023
[50]

Recognition of surface defects on steel sheet using transfer learning,

J. Fu, X. Zhu, and Y. Li, “Recognition of surface defects on steel sheet using transfer learning,”arXiv preprint arXiv:1909.03258, 2019

work page arXiv 1909
[51]

When and why momentum accelerates SGD : An empirical study, 2023

J.Fu, B.Wang, H.Zhang, Z.Zhang, W.Chen, andN.Zheng, “Whenand why momentum accelerates SGD: An empirical study,”arXiv preprint arXiv:2306.09000, 2023

work page arXiv 2023
[52]

Understanding mobile GUI: From pixel-words to screen-sentences,

J. Fu, X. Zhang, Y. Wang, W. Zeng, and N. Zheng, “Understanding mobile GUI: From pixel-words to screen-sentences,”Neurocomputing, vol. 601, p. 128200, 2024

2024
[53]

Breaking through the learning plateaus of in-context learning in transformer,

J. Fu, T. Yang, Y. Wang, Y. Lu, and N. Zheng, “Breaking through the learning plateaus of in-context learning in transformer,”arXiv preprint arXiv:2309.06054, 2023

work page arXiv 2023
[54]

Regnav: Room expert guided image- goal navigation,

P. Li, K. Wu, J. Fu, and S. Zhou, “Regnav: Room expert guided image- goal navigation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 5, 2025, pp. 4860–4868

2025
[55]

Camera-aware la- bel refinement for unsupervised person re-identification,

P. Li, K. Wu, W. Huang, S. Zhou, and J. Wang, “Camera-aware la- bel refinement for unsupervised person re-identification,”arXiv preprint arXiv:2403.16450, 2024. 31 Supplementary Materials S1. Purpose and Scope This supplementary material provides additional details for the two auxil- iary analyses summarized in the main paper: the controlled plug-in compa...

work page arXiv 2024