pith. machine review for the scientific record.

arxiv: 2604.18223 · v1 · submitted 2026-04-20 · 💻 cs.CV

Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation

Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language navigation · embodied navigation · instruction understanding · state-conditioned semantics · environment-guided refinement · token-level grounding · dynamic language encoding

The pith

Modeling instructions as an evolving state updated by each new observation improves how agents follow directions in changing environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static encodings of instructions limit navigation because word meanings shift with the agent's changing views and position. Instead, it treats the instruction as a live state that refines itself token by token using the current perceptual context at every step. A coarse stage picks the relevant instruction segment based on what is seen, while a fine stage sharpens token meanings under that observation. This keeps language aligned with the scene without discarding earlier or later parts of the command. The result is better path efficiency and success on standard vision-language navigation tasks.
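
The two stages described above can be sketched as a single update step. This is a minimal illustration, not the paper's implementation: the function `update_instruction_state`, the mean-pooled segment scoring, the softmax gate, and the additive refinement are all assumptions chosen for brevity, since S-EGIU's actual modules are not specified on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_instruction_state(tokens, segment_ids, obs, temperature=1.0):
    """One toy coarse-to-fine update (hypothetical, for illustration only).

    tokens:      (T, d) static token embeddings of the full instruction
    segment_ids: (T,) integer segment index for each token
    obs:         (d,) embedding of the current observation
    Returns (state, seg_probs), where state has shape (T, d).
    """
    tokens = np.asarray(tokens, dtype=float)
    segment_ids = np.asarray(segment_ids)
    obs = np.asarray(obs, dtype=float)
    n_segments = int(segment_ids.max()) + 1

    # Coarse stage: score each segment by the similarity between its
    # mean token embedding and the observation, then pick the best one.
    seg_means = np.stack(
        [tokens[segment_ids == s].mean(axis=0) for s in range(n_segments)]
    )
    seg_probs = softmax(seg_means @ obs / temperature)
    active = int(seg_probs.argmax())

    # Fine stage: inside the active segment, shift each token toward the
    # observation in proportion to how well that token matches it.
    state = tokens.copy()
    mask = segment_ids == active
    gate = softmax(tokens[mask] @ obs)
    state[mask] = tokens[mask] + gate[:, None] * obs
    return state, seg_probs

tokens = np.array([[1.0, 0, 0, 0], [1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 1.0, 0, 0]])
segment_ids = np.array([0, 0, 1, 1])
obs = np.array([0.0, 1.0, 0.0, 0.0])
state, seg_probs = update_instruction_state(tokens, segment_ids, obs)
```

Tokens outside the active segment are left unchanged in this sketch, which mirrors the claim that earlier and later parts of the command are not discarded.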

Core claim

We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation -

What carries the argument

The Instruction-as-State variable, a token-level representation that is activated in segments and then refined at the token level according to the agent's current observation.

Load-bearing premise

That activating the matching instruction segment and then refining its tokens using the current view will keep the full instruction's meaning coherent and correctly aligned without losing earlier context or adding errors as the agent continues moving.

What would settle it

Running the method on trajectories where one observation matches two different instruction segments equally well and checking whether later actions still follow the remaining parts of the original instruction correctly.
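
One way to locate such trajectories, assuming per-step segment-relevance distributions can be logged from the coarse stage, is to flag steps whose two highest segment probabilities are nearly tied. The function name `ambiguous_steps` and the `margin` threshold are hypothetical and not taken from the paper.

```python
import numpy as np

def ambiguous_steps(seg_probs, margin=0.05):
    """Return indices of steps whose top-two segment probabilities
    differ by less than `margin`.

    seg_probs: (num_steps, num_segments) array with num_segments >= 2;
               each row is a relevance distribution over segments.
    """
    probs = np.asarray(seg_probs, dtype=float)
    # After an ascending sort, the last two columns hold the two largest values.
    top2 = np.sort(probs, axis=1)[:, -2:]
    gaps = top2[:, 1] - top2[:, 0]
    return np.flatnonzero(gaps < margin).tolist()

flagged = ambiguous_steps([
    [0.70, 0.20, 0.10],
    [0.45, 0.45, 0.10],  # two segments match equally well
    [0.10, 0.30, 0.60],
])
```

The flagged steps would then be the starting points for checking whether later actions still follow the remaining parts of the instruction.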

Figures

Figures reproduced from arXiv: 2604.18223 by Jianyi Liu, Jingwen Fu, Jinjun Wang, Wei Song, Yuhan Liu, Zhen Liu.

Figure 1: Example of dynamic instruction semantics in VLN. At each viewpoint, the…
Figure 2: Overview of the S-EGIU framework. CGIP estimates a clause-relevance distri…
Figure 3: Visualization of S-EGIU’s state-conditioned instruction understanding.
Figure 4: A representative failure case on the R2R dataset. The agent (blue trajectory)…

Original abstract

Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of an instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent's perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction-perception entanglement.
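
For readers unfamiliar with the headline metric, SPL (Success weighted by Path Length) averages, over episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. A minimal sketch of that standard definition follows; the function name and argument layout are illustrative, and it assumes every shortest-path length is positive.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        iterable of 0/1 success indicators
    shortest_lengths: shortest-path length from start to goal, per episode
    path_lengths:     length of the path the agent actually took, per episode
    """
    successes = list(successes)
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        # A successful episode scores 1.0 only if the agent's path is no
        # longer than the shortest path; detours shrink the score.
        total += float(s) * l / max(p, l)
    return total / len(successes)

score = spl(
    successes=[1, 1, 0],
    shortest_lengths=[10.0, 10.0, 10.0],
    path_lengths=[10.0, 20.0, 10.0],
)
```

In this toy run the first episode is optimal, the second succeeds with a path twice as long as necessary, and the third fails, so only the first two contribute to the average.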

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that static global encodings of instructions limit adaptation in VLN tasks due to dynamic language-perception entanglement. It introduces the Instruction-as-State variable, a token-level, evolving representation conditioned on the agent's perceptual state, and realizes it via the S-EGIU coarse-to-fine framework: observation-aligned segment activation at the coarse level followed by observation-guided token grounding and contextual modeling at the fine level. This produces a continuously updated instruction state during navigation. The authors report empirical gains, including +2.68% SPL on REVERIE Test Unseen, plus efficiency improvements across VLN benchmarks.

Significance. If the reported gains hold under rigorous controls and the framework demonstrably preserves cross-segment dependencies, the work would offer a concrete mechanism for state-conditioned instruction semantics that addresses a recognized limitation in embodied navigation models. The coarse-to-fine design is a clear modeling choice that could be adopted more broadly if shown to be robust.

major comments (2)
  1. [§3.2] Coarse-to-Fine Framework: The central claim that local segment activation plus intra-segment token refinement maintains semantic coherence rests on the unstated assumption that deactivated segments' context is dispensable. No persistent cross-segment memory, full-instruction attention, or state-merging operator is described; if visual ambiguity triggers an early incorrect activation, the paper provides no documented recovery path for conditional or sequential clauses spanning multiple segments. This directly bears on whether the +2.68% SPL gain can be attributed to the proposed dynamic entanglement rather than to other factors.
  2. [§4.3] Ablation and Baseline Comparisons: The ablation isolating the contribution of observation-guided token refinement reports gains, yet the static-instruction baselines are not shown to have been re-trained with equivalent capacity or the same observation encoder. Without this control, the performance delta cannot be unambiguously credited to the Instruction-as-State construction.
minor comments (2)
  1. [§2] Notation: The term 'perceptual state' is used interchangeably with 'observation-grounded navigation context' without an explicit definition or diagram showing its relation to the visual encoder output.
  2. [Figure 2] Figure 2: The diagram of segment activation lacks an arrow or label indicating whether the Instruction-as-State is carried forward from the previous timestep or reset per segment.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications on our design and experimental setup while outlining the revisions we will make to strengthen the manuscript.

  1. Referee: [§3.2] Coarse-to-Fine Framework: The central claim that local segment activation plus intra-segment token refinement maintains semantic coherence rests on the unstated assumption that deactivated segments' context is dispensable. No persistent cross-segment memory, full-instruction attention, or state-merging operator is described; if visual ambiguity triggers an early incorrect activation, the paper provides no documented recovery path for conditional or sequential clauses spanning multiple segments. This directly bears on whether the +2.68% SPL gain can be attributed to the proposed dynamic entanglement rather than to other factors.

    Authors: We appreciate the referee highlighting the importance of cross-segment semantic coherence. In S-EGIU, the Instruction-as-State is updated at every step by conditioning on the current perceptual state: the coarse stage activates the observation-aligned segment while the fine stage performs observation-guided token grounding and contextual modeling within it. Although we do not maintain an explicit persistent memory buffer for deactivated segments, the continuous evolution of the state via perceptual conditioning enables re-alignment when subsequent observations provide clarifying visual evidence, offering an implicit recovery path. The +2.68% SPL gain is supported by component ablations that isolate the contribution of state-conditioned activation and refinement. To address the concern more explicitly, the revised manuscript will include additional discussion of cross-segment dependency handling together with a targeted ablation on multi-segment sequential instructions. revision: partial

  2. Referee: [§4.3] Ablation and Baseline Comparisons: The ablation isolating the contribution of observation-guided token refinement reports gains, yet the static-instruction baselines are not shown to have been re-trained with equivalent capacity or the same observation encoder. Without this control, the performance delta cannot be unambiguously credited to the Instruction-as-State construction.

    Authors: We agree that rigorous attribution requires matched capacity and encoders. The static-instruction baselines in our experiments already share the identical observation encoder architecture with the proposed model. Nevertheless, we acknowledge that explicit re-training under strictly equivalent parameter budgets would further strengthen the ablation. In the revised manuscript we will re-train the static baselines with matched capacity, report the updated numbers, and clarify the controls, allowing the performance improvements to be more unambiguously attributed to the Instruction-as-State construction. revision: yes

Circularity Check

0 steps flagged

No circularity: modeling framework is an independent architectural choice with empirical validation

The paper introduces Instruction-as-State and the S-EGIU coarse-to-fine framework as a modeling decision to address dynamic language-observation entanglement. The description of segment activation followed by token refinement is presented as a design choice, not derived from prior equations or self-citations. Reported gains (+2.68% SPL on REVERIE) are empirical outcomes on standard benchmarks rather than predictions forced by fitted inputs or self-referential definitions. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text that would reduce the central claim to its own inputs by construction. The derivation chain remains self-contained as a proposed architecture evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The work rests on standard VLN domain assumptions and introduces two new conceptual entities; no numerical free parameters are mentioned.

axioms (1)
  • domain assumption: Instruction semantics in VLN are dynamically entangled with the agent's changing visual and spatial context
    Stated as the central challenge motivating the Instruction-as-State model.
invented entities (2)
  • Instruction-as-State variable · no independent evidence
    purpose: Token-level instruction representation that evolves conditioned on perceptual state
    Core modeling innovation introduced to replace static global encoding.
  • S-EGIU framework · no independent evidence
    purpose: Coarse-to-fine mechanism for state-conditioned segment activation and token refinement
    Concrete realization of the Instruction-as-State principle.

pith-pipeline@v0.9.0 · 5570 in / 1348 out tokens · 44500 ms · 2026-05-10T04:42:04.083087+00:00 · methodology

