Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3
The pith
Modeling instructions as an evolving state updated by each new observation improves how agents follow directions in changing environments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation -
What carries the argument
The Instruction-as-State variable, a token-level representation that is activated in segments and then refined at the token level according to the agent's current observation.
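The two-stage update described above can be made concrete with a toy sketch. This is not the authors' implementation: the text here gives no equations, so the cosine scoring of segments, the softmax weighting of tokens, and every name used (`update_instruction_state`, `mean_vector`, the 2-D vectors) are assumptions chosen only to illustrate the coarse-to-fine idea.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def update_instruction_state(segments, observation):
    """One hypothetical step of a coarse-to-fine instruction-state update.

    segments: list of segments, each a list of token vectors.
    observation: a single vector standing in for the perceptual state.
    """
    # Coarse stage (assumed form): activate the segment whose mean
    # embedding is most similar to the current observation.
    seg_scores = [cosine(mean_vector(seg), observation) for seg in segments]
    active = max(range(len(segments)), key=seg_scores.__getitem__)

    # Fine stage (assumed form): reweight the tokens of the active segment
    # by their similarity to the observation.
    tokens = segments[active]
    weights = softmax([cosine(tok, observation) for tok in tokens])
    refined = [[w * x for x in tok] for w, tok in zip(weights, tokens)]
    return active, weights, refined


# Toy instruction with two segments of two tokens each.
segments = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.2, 0.8]],
]
active, weights, refined = update_instruction_state(segments, [0.1, 0.9])
print(active, weights)
```

Note that this sketch recomputes the state from scratch at every step and keeps nothing from de-activated segments, which is exactly the property the load-bearing premise depends on.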
Load-bearing premise
That activating the matching instruction segment and then refining its tokens using the current view will keep the full instruction's meaning coherent and correctly aligned without losing earlier context or adding errors as the agent continues moving.
What would settle it
Running the method on trajectories where one observation matches two different instruction segments equally well and checking whether later actions still follow the remaining parts of the original instruction correctly.
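Selecting the trajectories for such a test could look like the sketch below. It assumes per-step segment alignment scores are available as plain lists of floats; the function names and the `tolerance` threshold are hypothetical and do not come from the paper.

```python
def activation_margin(scores):
    """Gap between the best and second-best segment scores at one step."""
    if len(scores) < 2:
        raise ValueError("need at least two segment scores")
    top, runner_up = sorted(scores, reverse=True)[:2]
    return top - runner_up


def ambiguous_steps(trajectory_scores, tolerance=1e-3):
    """Indices of steps where two segments match the observation
    (almost) equally well, i.e. the margin is within `tolerance`."""
    return [
        step
        for step, scores in enumerate(trajectory_scores)
        if activation_margin(scores) <= tolerance
    ]


# Three steps, three segments each; only the middle step is a tie.
trajectory_scores = [
    [0.9, 0.1, 0.2],
    [0.5, 0.5, 0.1],
    [0.3, 0.7, 0.2],
]
print(ambiguous_steps(trajectory_scores))
```

Episodes containing at least one flagged step would then be the ones on which to check whether later actions still follow the rest of the instruction.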
Original abstract
Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of an instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent's perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction–perception entanglement.
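For readers unfamiliar with the headline metric, SPL (Success weighted by Path Length) averages, over episodes, the success indicator multiplied by the ratio of the shortest-path length to the longer of the shortest path and the path actually taken. The sketch below follows that standard definition; the function name and the tuple layout are illustrative choices, not taken from the paper, and benchmark tables typically report the value as a percentage.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_len, path_len) tuples, where
    success is a bool, shortest_len is the shortest-path distance from start
    to goal, and path_len is the length of the path the agent actually took.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest_len, path_len in episodes:
        denom = max(path_len, shortest_len)
        # Failed episodes contribute zero; successful ones contribute
        # the efficiency ratio, which is at most 1.
        if success and denom > 0:
            total += shortest_len / denom
    return total / len(episodes)


# One optimal success, one success with a path twice as long as needed,
# and one failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])
print(score)
```

Because the ratio penalizes detours, an SPL gain can come from higher success, shorter paths, or both, which is why the referee comments below ask for controls that isolate the source of the improvement.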
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that static global encodings of instructions limit adaptation in VLN tasks due to dynamic language-perception entanglement. It introduces the Instruction-as-State variable—a token-level, evolving representation conditioned on the agent's perceptual state—and realizes it via the S-EGIU coarse-to-fine framework: observation-aligned segment activation at the coarse level followed by observation-guided token grounding and contextual modeling at the fine level. This produces a continuously updated instruction state during navigation. The authors report empirical gains, including +2.68% SPL on REVERIE Test Unseen, plus efficiency improvements across VLN benchmarks.
Significance. If the reported gains hold under rigorous controls and the framework demonstrably preserves cross-segment dependencies, the work would offer a concrete mechanism for state-conditioned instruction semantics that addresses a recognized limitation in embodied navigation models. The coarse-to-fine design is a clear modeling choice that could be adopted more broadly if shown to be robust.
major comments (2)
- [§3.2] §3.2 (Coarse-to-Fine Framework): The central claim that local segment activation plus intra-segment token refinement maintains semantic coherence rests on the unstated assumption that de-activated segments' context is dispensable. No persistent cross-segment memory, full-instruction attention, or state-merging operator is described; if visual ambiguity triggers an early incorrect activation, the paper provides no documented recovery path for conditional or sequential clauses spanning multiple segments. This directly bears on whether the +2.68% SPL gain can be attributed to the proposed dynamic entanglement rather than to other factors.
- [§4.3] §4.3 (Ablation and Baseline Comparisons): The ablation isolating the contribution of observation-guided token refinement reports gains, yet the static-instruction baselines are not shown to have been re-trained with equivalent capacity or the same observation encoder. Without this control, the performance delta cannot be unambiguously credited to the Instruction-as-State construction.
minor comments (2)
- [§2] Notation: The term 'perceptual state' is used interchangeably with 'observation-grounded navigation context' without an explicit definition or diagram showing its relation to the visual encoder output.
- [Figure 2] Figure 2: The diagram of segment activation lacks an arrow or label indicating whether the Instruction-as-State is carried forward from the previous timestep or reset per segment.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications on our design and experimental setup while outlining the revisions we will make to strengthen the manuscript.
- Referee: [§3.2] §3.2 (Coarse-to-Fine Framework): The central claim that local segment activation plus intra-segment token refinement maintains semantic coherence rests on the unstated assumption that de-activated segments' context is dispensable. No persistent cross-segment memory, full-instruction attention, or state-merging operator is described; if visual ambiguity triggers an early incorrect activation, the paper provides no documented recovery path for conditional or sequential clauses spanning multiple segments. This directly bears on whether the +2.68% SPL gain can be attributed to the proposed dynamic entanglement rather than to other factors.
Authors: We appreciate the referee highlighting the importance of cross-segment semantic coherence. In S-EGIU, the Instruction-as-State is updated at every step by conditioning on the current perceptual state: the coarse stage activates the observation-aligned segment while the fine stage performs observation-guided token grounding and contextual modeling within it. Although we do not maintain an explicit persistent memory buffer for de-activated segments, the continuous evolution of the state via perceptual conditioning enables re-alignment when subsequent observations provide clarifying visual evidence, offering an implicit recovery path. The +2.68% SPL gain is supported by component ablations that isolate the contribution of state-conditioned activation and refinement. To address the concern more explicitly, the revised manuscript will include additional discussion of cross-segment dependency handling together with a targeted ablation on multi-segment sequential instructions. revision: partial
- Referee: [§4.3] §4.3 (Ablation and Baseline Comparisons): The ablation isolating the contribution of observation-guided token refinement reports gains, yet the static-instruction baselines are not shown to have been re-trained with equivalent capacity or the same observation encoder. Without this control, the performance delta cannot be unambiguously credited to the Instruction-as-State construction.
Authors: We agree that rigorous attribution requires matched capacity and encoders. The static-instruction baselines in our experiments already share the identical observation encoder architecture with the proposed model. Nevertheless, we acknowledge that explicit re-training under strictly equivalent parameter budgets would further strengthen the ablation. In the revised manuscript we will re-train the static baselines with matched capacity, report the updated numbers, and clarify the controls, allowing the performance improvements to be more unambiguously attributed to the Instruction-as-State construction. revision: yes
Circularity Check
No circularity: modeling framework is an independent architectural choice with empirical validation
The paper introduces Instruction-as-State and the S-EGIU coarse-to-fine framework as a modeling decision to address dynamic language-observation entanglement. The description of segment activation followed by token refinement is presented as a design choice, not derived from prior equations or self-citations. Reported gains (+2.68% SPL on REVERIE) are empirical outcomes on standard benchmarks rather than predictions forced by fitted inputs or self-referential definitions. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text that would reduce the central claim to its own inputs by construction. The derivation chain remains self-contained as a proposed architecture evaluated externally.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Instruction semantics in VLN are dynamically entangled with the agent's changing visual and spatial context
invented entities (2)
- Instruction-as-State variable: no independent evidence
- S-EGIU framework: no independent evidence