SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation
Pith reviewed 2026-05-20 13:27 UTC · model grok-4.3
The pith
A dual-system VLN framework pairs a fast spatially-aware vision-language model for actions with a slow MLLM planner using real-time 3D maps to reach state-of-the-art results on unseen environments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a spatially-enhanced dual-system VLN framework succeeds through a division of labor. System 1, a vision-language model augmented with global and local spatial awareness, generates actions rapidly. System 2 pairs a multimodal large language model with a mapping module and plans waypoints from top-down 3D map views and streams of rendered path images. The two systems cooperate through a fast-slow coordinated approach to complete navigation tasks, and the paper reports state-of-the-art performance on VLN-CE benchmarks.
What carries the argument
A spatially-enhanced dual system: System 1 supplies quick actions from a vision-language model with added spatial awareness, and System 2 supplies waypoint plans from an MLLM operating on top-down 3D maps and path images.
If this is right
- The approach extends reliable navigation to longer trajectories where end-to-end models typically lose coherence.
- Spatial enhancements in both systems improve grounding for planning compared with pure zero-shot MLLM pipelines.
- Coordination between the systems reduces overall reasoning time while preserving generalization to new scenes.
- Ablation results indicate that removing either the global-local awareness or the 3D map module lowers final performance.
Where Pith is reading between the lines
- Real-time map updates in System 2 could let the agent recover from temporary obstacles by replanning without restarting the entire task.
- The same fast-slow split might transfer to other language-guided embodied tasks such as object search or rearrangement in homes.
- Rendered path images could be varied at inference time to let the planner preview alternate routes before committing to a waypoint.
Load-bearing premise
The fast action system and the slow planning system can coordinate without producing conflicts or deadlocks when the agent faces environments it has not seen during training.
What would settle it
Deploy the agent in a long-horizon unseen test environment and measure whether the success rate falls below current single-system baselines or whether the agent frequently stalls while the two systems resolve differing suggestions.
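One way to make this test concrete is to log each episode's outcome and per-step agent positions, then report the success rate alongside the fraction of episodes that contain a stall. The sketch below is only an illustration, not the paper's evaluation code: the `Episode` record, the `stall_window` of 10 steps, and the `min_move` of 0.05 m are all hypothetical choices.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Episode:
    """Hypothetical log record: final outcome plus the agent's (x, y) at each step."""
    success: bool
    positions: list


def has_stall(positions, stall_window=10, min_move=0.05):
    """True if the agent moves less than `min_move` for `stall_window` steps in a row."""
    still = 0
    for prev, cur in zip(positions, positions[1:]):
        # Count consecutive near-zero displacements; any real move resets the count.
        still = still + 1 if dist(prev, cur) < min_move else 0
        if still >= stall_window:
            return True
    return False


def summarize(episodes, stall_window=10, min_move=0.05):
    """Return (success_rate, stall_rate) over a non-empty list of episodes."""
    n = len(episodes)
    success_rate = sum(e.success for e in episodes) / n
    stall_rate = sum(
        has_stall(e.positions, stall_window, min_move) for e in episodes
    ) / n
    return success_rate, stall_rate
```

Running `summarize` on the dual-system agent and on a single-system baseline over the same episodes would give the two numbers the test asks for: a success rate to compare, and a stall rate that indicates how often the agent stops making progress.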
Original abstract
Vision-Language Navigation (VLN) approaches have currently followed two primary paradigms: the end-to-end Vision-Language Model (VLM) policy fine-tuned on navigation trajectories to directly predict actions, and the zero-shot modular pipeline integrating pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and also require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM model enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging top-down views of the real-time 3D map alongside streams of rendered path images. Both systems leverage different forms of spatial enhancement to cultivate the agent's sense of direction in VLN tasks. Ultimately, they cooperate to complete the navigation task through a fast-slow coordinated approach. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and further ablation studies demonstrate the effectiveness of each system and module.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SEDualVLN, a dual-system framework for Vision-Language Navigation that pairs a fast VLM-based System 1 (enhanced with global and local spatial awareness for direct action generation) with a slower System 2 (MLLM plus real-time 3D mapping module that plans waypoints from top-down views and rendered path images). The two systems cooperate via a fast-slow coordinated approach to address limitations of pure end-to-end and zero-shot methods, claiming state-of-the-art results on VLN-CE benchmarks together with ablation studies validating each component.
Significance. If the empirical claims hold, the work offers a practical bridge between reactive end-to-end policies and modular planning, potentially improving long-horizon reliability in unseen environments through explicit spatial enhancements. The dual-system design and emphasis on cultivating directional awareness constitute a clear incremental contribution to VLN-CE.
major comments (1)
- [Abstract and dual-system cooperation description] The central claim of reliable navigation rests on the fast-slow coordination between System 1 action generation and System 2 waypoint planning, yet the manuscript provides only a high-level description of their cooperation. No priority rules, override conditions, deadlock detection mechanism, or fusion procedure for reconciling incompatible proposals (e.g., when rendered path images and 3D map updates disagree) are specified. This omission directly affects the weakest assumption identified for unseen long-horizon episodes.
minor comments (1)
- [Abstract] Quantitative results, error bars, and exact VLN-CE dataset splits should be stated explicitly in the abstract or early results section to allow immediate verification of the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We have carefully addressed the major comment concerning the dual-system cooperation mechanism and revised the paper to provide greater clarity and detail on this aspect.
- Referee: [Abstract and dual-system cooperation description] The central claim of reliable navigation rests on the fast-slow coordination between System 1 action generation and System 2 waypoint planning, yet the manuscript provides only a high-level description of their cooperation. No priority rules, override conditions, deadlock detection mechanism, or fusion procedure for reconciling incompatible proposals (e.g., when rendered path images and 3D map updates disagree) are specified. This omission directly affects the weakest assumption identified for unseen long-horizon episodes.
  Authors: We agree that the original manuscript described the fast-slow coordination at a high level, which limited the transparency of how the systems interact in practice. In the revised version, we have added a new subsection (Section 3.4) that explicitly details the coordination protocol. System 1 serves as the default reactive controller for low-latency action generation. System 2 intervenes at fixed intervals or upon detecting map inconsistencies (e.g., via rendered path image mismatches with the 3D map). Priority rules assign precedence to System 2 for waypoint overrides when long-horizon discrepancies exceed a confidence threshold from the MLLM. A simple deadlock detector monitors consecutive failed actions from System 1 and triggers a System 2 replan. The fusion procedure reconciles proposals by selecting the System 2 waypoint if the rendered path deviates beyond a spatial threshold, otherwise blending compatible actions. These additions directly strengthen the description for long-horizon unseen episodes.
  Revision: yes
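The protocol described in the rebuttal can be sketched as a small arbiter. This is an illustrative reading under stated assumptions, not the authors' Section 3.4: the class and method names, the replan interval of 8 steps, the failure limit of 3, and the 1.0 m deviation threshold are all hypothetical, the path deviation is assumed to be computed elsewhere, and the "blending" branch is reduced to passing the System 1 action through.

```python
from dataclasses import dataclass


@dataclass
class Arbiter:
    """Hypothetical fast-slow coordinator; every threshold here is illustrative."""
    replan_interval: int = 8    # System 2 runs every N steps (assumed value)
    max_failures: int = 3       # consecutive failed System 1 actions before a replan
    max_deviation: float = 1.0  # metres of allowed drift from the planned path
    step: int = 0
    failures: int = 0

    def should_replan(self, map_inconsistent: bool) -> bool:
        """Invoke System 2 on a fixed schedule, on map inconsistency, or on deadlock."""
        on_schedule = self.step % self.replan_interval == 0
        deadlocked = self.failures >= self.max_failures
        return on_schedule or map_inconsistent or deadlocked

    def record(self, action_succeeded: bool) -> None:
        """Advance the step counter and track consecutive System 1 failures."""
        self.step += 1
        self.failures = 0 if action_succeeded else self.failures + 1

    def mark_replanned(self) -> None:
        """Clear the failure streak once System 2 has produced a new plan."""
        self.failures = 0

    def choose(self, s1_action, waypoint, path_deviation: float):
        """Prefer the System 2 waypoint when drift from the planned path is too large."""
        if waypoint is not None and path_deviation > self.max_deviation:
            return ("goto", waypoint)
        # Assumption: compatible proposals are "blended" by keeping the System 1 action.
        return ("act", s1_action)
```

Keeping System 1 as the default branch of `choose` mirrors the rebuttal's description of it as the reactive controller; System 2 only takes over when one of the stated triggers fires.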
Circularity Check
No circularity: architectural proposal without equations or self-referential derivations
The paper introduces SEDualVLN as a dual-system architecture (System 1 VLM with spatial awareness for fast actions; System 2 MLLM with 3D mapping for slower waypoints) that cooperates via an unspecified fast-slow approach to achieve SOTA on VLN-CE. No equations, fitted parameters, uniqueness theorems, or self-citations appear in the provided text as load-bearing elements of any derivation. The central claims rest on empirical benchmark results and ablation studies rather than reducing to quantities defined by the authors' own prior constructs or by construction. This is a standard non-circular architectural contribution.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: System 1 is a VLM model enhanced with both global and local spatial awareness... System 2 integrates a general MLLM with a mapping module... cooperate through a fast-slow coordinated approach.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: L_SE = L_action + α · (1/N) Σ [1 - cos(V_t, S_t + p_t)] ... Channel Connectivity Extraction Module ... A* path rendering and cosine-similarity pruning
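The loss quoted in the second passage adds a cosine-alignment penalty to the action loss. A minimal pure-Python sketch of that arithmetic follows. It reproduces only the quoted expression: treating V, S and p as sequences of N equal-length vectors, and the default `alpha` of 0.1, are assumptions, since the passage does not define these symbols.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def spatial_loss(action_loss, V, S, p, alpha=0.1):
    """Compute L_action + alpha * (1/N) * sum_t [1 - cos(V_t, S_t + p_t)]."""
    n = len(V)
    penalty = 0.0
    for v_t, s_t, p_t in zip(V, S, p):
        # Element-wise sum S_t + p_t, then penalize misalignment with V_t.
        target = [s + q for s, q in zip(s_t, p_t)]
        penalty += 1.0 - cosine(v_t, target)
    return action_loss + alpha * penalty / n
```

With perfectly aligned vectors the penalty term vanishes and the result equals the action loss, so `alpha` controls only how strongly misalignment is punished.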
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.