LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation

Xiangchen Wang, Weiye Zhu, Teng Wang, TianTian Geng, Zekai Zhang, Zhiyuan Qi, Jinyu Yang, Feng Zheng

arxiv: 2604.19536 · v1 · submitted 2026-04-21 · 💻 cs.RO

Pith reviewed 2026-05-10 01:46 UTC · model grok-4.3

keywords vision-language navigation · continuous motion · multi-step action continuation · stop-and-go loop · VLM navigator · real-world deployment · waiting time reduction · action availability

The pith

LiveVLN augments pretrained VLM navigators with multi-step action continuation to enable continuous motion by overlapping execution with observation processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language navigation systems achieve strong benchmark scores yet often move in a stop-and-go pattern because each new observation must be fully sensed, transmitted, and inferred before the next action can start. LiveVLN counters this by generating sequences of future actions from the existing model and supplying updated plans before the current executable prefix runs out. The approach runs at inference time without retraining and can be added to compatible pretrained navigators. A reader would care because it converts high-performing but visibly jerky deployments into smoother, lower-waiting executions while leaving benchmark numbers intact.

Core claim

LiveVLN is a training-free runtime framework that augments pretrained VLM navigators by producing multi-step action sequences. Execution of the current prefix proceeds in parallel with the arrival and processing of new observations, so refreshed future actions can be handed off before the prefix is exhausted. This design keeps actions continuously available, removes redundant idle intervals, and yields shorter wall-clock episodes in both simulated benchmarks and physical robot deployments.
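
To make the overlap argument concrete, here is a minimal timing sketch. It is not the paper's implementation. It assumes a fixed sense-plus-inference latency per round, a fixed number of actions per continuation, a fixed per-action execution time, and that the next sense-and-inference pass is launched the moment the current continuation starts executing. The function names and every number are hypothetical and chosen only for illustration.

```python
def blocking_wait(rounds: int, latency: float) -> float:
    """Idle time when every round waits for sense+inference before moving."""
    return rounds * latency


def overlapped_wait(rounds: int, latency: float, k: int, act_time: float) -> float:
    """Idle time when inference for the next round overlaps with execution.

    Assumption: the first round still pays the full latency (cold start).
    Every later round only stalls for the part of the latency that is not
    hidden behind the k * act_time seconds spent executing the current chunk.
    """
    if rounds <= 0:
        return 0.0
    chunk_time = k * act_time
    stall = max(0.0, latency - chunk_time)
    return latency + (rounds - 1) * stall


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    rounds, latency, k, act_time = 10, 0.8, 4, 0.25
    b = blocking_wait(rounds, latency)
    o = overlapped_wait(rounds, latency, k, act_time)
    print(f"blocking wait:   {b:.2f} s")
    print(f"overlapped wait: {o:.2f} s")
    print(f"relative reduction: {100 * (b - o) / b:.1f}%")
```

Under this toy model the idle time vanishes after the first round whenever a continuation takes at least as long to execute as one inference pass, which is the condition the core claim relies on. When continuations are shorter than the latency, some stall remains in every round.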

What carries the argument

Multi-step action continuation: the mechanism that generates sequences of future actions from a pretrained VLM and supplies them to the controller while the current executable prefix runs concurrently with the processing of new observations.

If this is right

Benchmark success rates on R2R and RxR remain unchanged.
Average episode waiting time drops by up to 77.7 percent in physical deployments.
Wall-clock episode duration shortens by 12.6 percent on StreamVLN and 19.6 percent on NaVIDA.
Action availability improves because motion continues without blocking on each inference cycle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlap technique could be tested on other sequential control tasks where model inference latency currently forces pauses.
Pretrained VLMs appear to contain enough multi-step foresight that runtime decoupling of planning and execution can improve online performance without retraining.
In environments with fast-moving obstacles the absence of mid-sequence correction may accumulate errors, suggesting lightweight safety monitors as a possible addition.
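
One possible shape for such a lightweight safety monitor is sketched below. This is an editorial illustration in the same spirit as the points above, not something the paper describes. `is_safe` and `execute` are hypothetical callbacks standing in for a collision check and a motion controller.

```python
from typing import Callable, List, Sequence


def execute_with_monitor(
    actions: Sequence[str],
    is_safe: Callable[[str], bool],
    execute: Callable[[str], None],
) -> List[str]:
    """Run queued actions in order and stop at the first one the monitor rejects.

    Returns the actions that were actually executed, so the caller can tell
    how much of the continuation was consumed before the stop.
    """
    done: List[str] = []
    for action in actions:
        if not is_safe(action):
            # Hypothetical policy: abandon the rest of the continuation
            # and let the caller request a fresh plan.
            break
        execute(action)
        done.append(action)
    return done


if __name__ == "__main__":
    log: List[str] = []
    plan = ["forward", "forward", "left", "forward"]
    # Toy monitor that pretends turning left is currently blocked.
    ran = execute_with_monitor(plan, lambda a: a != "left", log.append)
    print(ran)
```

The check runs once per action, so its cost must stay well below the per-action execution time for the continuous-motion benefit to survive.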

Load-bearing premise

The multi-step action sequences produced by the pretrained VLM stay safe and correct when executed continuously in dynamic, partially observable real-world settings without extra verification or recovery steps.

What would settle it

A real-world trial in which the robot encounters unexpected changes midway through a multi-step sequence and collides or fails at a higher rate than the single-step baseline.

Figures

Figures reproduced from arXiv: 2604.19536 by Feng Zheng, Jinyu Yang, Teng Wang, TianTian Geng, Weiye Zhu, Xiangchen Wang, Zekai Zhang, Zhiyuan Qi.

**Figure 1.** Blocking VLN serializes sense (S), inference (I), and execution (E). After each action continuation, the robot pauses until the next sense-and-inference round finishes. LiveVLN instead keeps the execution thread running on a short guard buffer while the next sense-and-inference pass refreshes the following continuation in the background.

**Figure 2.** Native blocking NaVIDA diagnosis on Unitree G1. The waiting …

**Figure 3.** Overview of LiveVLN when the next update arrives before the current guard buffer is exhausted. The short-horizon action state contains executed …

**Figure 4.** Real-robot deployment in the shared office scene for the instruction …

Abstract

Recent navigation systems achieve strong benchmark results, yet real-world deployment often remains visibly stop-and-go. This bottleneck arises because the sense-inference-execution loop is still blocking: after each new observation, the controller must wait for sensing, transmission, and inference before motion can continue. Reducing action-generation cost alone therefore does not remove redundant waiting. To address this issue, we present LiveVLN, a training-free framework for more continuous embodied navigation by augmenting pretrained VLM navigators with multi-step action continuation. Instead of pausing for each full sense-and-inference round, LiveVLN overlaps execution with the processing of newly arrived observations, allowing refreshed future actions to be handed off before the current executable prefix is exhausted. This design keeps actions continuously available during motion, reducing idle waiting and enabling smoother online execution. The framework operates at runtime and can be integrated with compatible pretrained VLM navigators. Across R2R and RxR, LiveVLN preserves benchmark performance while reducing waiting time and improving action availability. In real-world deployments, it cuts average episode waiting time by up to $77.7\%$ and shortens wall-clock episode time by $12.6\%$ on StreamVLN and $19.6\%$ on NaVIDA, yielding more coherent execution during deployment. Code is available at https://github.com/NIneeeeeem/LiveVLN.
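
The handoff described in the abstract can be pictured as a small thread-safe action queue. The sketch below is one assumed reading of the guard-buffer idea in Figure 1, not code from the LiveVLN repository. The class `ActionBuffer`, its methods, and the `guard_size` parameter are hypothetical names introduced for illustration.

```python
import threading
from collections import deque
from typing import Deque, Iterable, Optional


class ActionBuffer:
    """Queue of pending actions shared by an executor and an inference thread."""

    def __init__(self, guard_size: int) -> None:
        self.guard_size = guard_size
        self._pending: Deque[str] = deque()
        self._lock = threading.Lock()

    def next_action(self) -> Optional[str]:
        """Called by the executor; returns None if the buffer ran dry."""
        with self._lock:
            return self._pending.popleft() if self._pending else None

    def needs_refresh(self) -> bool:
        """True once only the guard prefix (or less) is left to execute."""
        with self._lock:
            return len(self._pending) <= self.guard_size

    def handoff(self, fresh_actions: Iterable[str]) -> None:
        """Keep the guard prefix and replace everything after it.

        Assumption: the first guard_size pending actions are already
        committed to the controller, so only the tail is swapped out.
        """
        with self._lock:
            keep = min(self.guard_size, len(self._pending))
            guard = [self._pending.popleft() for _ in range(keep)]
            self._pending.clear()
            self._pending.extend(guard)
            self._pending.extend(fresh_actions)
```

In this reading the executor keeps calling `next_action` while a background thread calls `handoff` whenever a new inference result arrives, so motion pauses only if `next_action` returns `None`.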

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LiveVLN offers a lightweight runtime overlap of inference and execution to cut VLN waiting time, but the handoff of multi-step plans lacks described safeguards for when they go stale.

The core idea is straightforward: take a pretrained VLM navigator, have it output a short prefix of future actions, and keep executing that prefix while the next observation is sensed and processed in parallel. When the prefix runs low, hand off a refreshed one. This removes the full pause after every observation without changing the underlying model or training anything new. On R2R and RxR the success rates stay comparable to the base systems, and the real-robot runs with StreamVLN and NaVIDA show clear drops in idle time: up to 77.7% less waiting, and wall-clock episodes that are 12.6% (StreamVLN) and 19.6% (NaVIDA) shorter. That part is useful and directly measurable. The code release also helps anyone who wants to try the integration quickly.

What the work does well is focus on the deployment friction that benchmark papers often ignore: the actual seconds the robot sits still while the loop catches up. The numbers come from both simulation and physical setups, which is better than abstract-only claims.

The soft spot is exactly the one the stress test flags. The framework assumes the multi-step continuations stay safe and goal-directed while the robot moves under partial observability. Nothing in the description indicates runtime checks, drift detection, or a recovery policy if a new obstacle appears or the observation contradicts the prefix. If those cases are rare in the test environments, the gains hold; if not, the smoother motion could mask occasional unsafe or inefficient segments. The paper would be stronger with explicit failure-mode analysis or at least a simple replanning trigger.

This is for people already running VLM-based navigation on robots who need less choppy motion without a full redesign. It is not a new model or training method, so it will not shift the research frontier, but it addresses a practical bottleneck that matters for real use. I would send it to peer review. The empirical results are concrete enough to justify referee time, provided the authors can clarify how (or whether) they handle invalid continuations during handoff.

Referee Report

2 major / 2 minor

Summary. The manuscript presents LiveVLN, a training-free framework that augments pretrained VLM navigators with multi-step action continuation to overlap execution with new observation processing. This keeps actions continuously available, reducing the blocking sense-inference-execution loop in vision-language navigation. The paper claims that LiveVLN preserves benchmark performance on R2R and RxR while reducing waiting time and action unavailability; in real-world tests it cuts average episode waiting time by up to 77.7% and shortens wall-clock episode time by 12.6% (StreamVLN) and 19.6% (NaVIDA).

Significance. If the results hold, the work targets a visible practical bottleneck in embodied VLM deployment by enabling smoother, less stop-and-go motion without retraining. The training-free runtime integration and public code release are strengths that support adoption and reproducibility. The reported real-world time savings could be impactful for robotics applications if the safety of handoff is adequately addressed.

major comments (2)

[Method / framework description] Framework description: the runtime handoff of multi-step action prefixes generated by the pretrained VLM is presented without any runtime verification, replanning trigger, or recovery policy for cases where the continuation becomes invalid due to partial observability, drift, or new obstacles. This assumption is load-bearing for the real-world claims of preserved performance and 77.7% waiting-time reduction, as unverified continuations may invalidate during dynamic motion.
[Experiments and Results] Experimental evaluation: concrete percentage improvements (77.7% waiting time, 12.6%/19.6% episode time) are reported, yet the manuscript provides no details on experimental controls, run-to-run variance, statistical significance, or explicit validation that handed-off continuations remained safe and goal-directed in the dynamic real-world deployments. This limits assessment of the central empirical claims.

minor comments (2)

[Abstract] The abstract states the framework 'can be integrated with compatible pretrained VLM navigators' but does not name the specific models or continuation lengths used in the reported experiments; adding this would improve clarity.
[Overall presentation] Ensure all tables and figures in the full manuscript explicitly distinguish benchmark (R2R/RxR) results from real-world StreamVLN/NaVIDA deployments and label the exact baselines for waiting-time and wall-clock metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of the framework's robustness and the strength of the empirical claims. Below we respond point-by-point to the major comments. We are prepared to revise the manuscript to incorporate clarifications and additional details where appropriate, while preserving the core contribution of the training-free overlapping mechanism.

Referee: [Method / framework description] Framework description: the runtime handoff of multi-step action prefixes generated by the pretrained VLM is presented without any runtime verification, replanning trigger, or recovery policy for cases where the continuation becomes invalid due to partial observability, drift, or new obstacles. This assumption is load-bearing for the real-world claims of preserved performance and 77.7% waiting-time reduction, as unverified continuations may invalidate during dynamic motion.

Authors: We agree that the current manuscript description could more explicitly address runtime robustness. LiveVLN is designed as a lightweight, training-free overlay that hands off multi-step prefixes while new observations are processed in parallel; the base VLM navigator can generate a fresh prefix before the current one is exhausted, providing an implicit refresh mechanism. However, we acknowledge that the paper does not detail explicit verification, replanning triggers, or recovery policies for cases of invalidation due to drift or new obstacles. In the revised manuscript we will add a dedicated subsection under the framework description that (1) states the core assumption of prefix reliability within short horizons, (2) describes how new observations can trigger prefix regeneration before exhaustion, and (3) outlines straightforward extensions for integrating external verification or safety layers (e.g., collision checks or replanning on discrepancy detection). This addition clarifies the design without changing the reported results or requiring new experiments.

revision: yes

Referee: [Experiments and Results] Experimental evaluation: concrete percentage improvements (77.7% waiting time, 12.6%/19.6% episode time) are reported, yet the manuscript provides no details on experimental controls, run-to-run variance, statistical significance, or explicit validation that handed-off continuations remained safe and goal-directed in the dynamic real-world deployments. This limits assessment of the central empirical claims.

Authors: We appreciate the referee's emphasis on rigorous reporting of the real-world results. The reported time savings were obtained from repeated deployments in controlled indoor environments using the same robot platform and navigation tasks for both baseline and LiveVLN conditions. In the revised manuscript we will expand the experimental section to include: (i) explicit description of the hardware, sensor setup, and environmental controls; (ii) the number of independent runs per method and per environment; (iii) observed run-to-run variance (standard deviations) for waiting time and wall-clock episode time; and (iv) a qualitative validation summary based on logged trajectories and video review confirming that handed-off continuations remained goal-directed and collision-free in the tested scenarios. While formal statistical significance tests were not performed in the original submission, we can report the variance measures and note the consistent directional improvements across trials. These additions will strengthen the empirical support without altering the numerical claims.

revision: yes
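
The run-to-run reporting promised in this response could be computed along the following lines. This is a minimal sketch using Python's standard `statistics` module. The helper names are hypothetical, and the sample values in the usage section are placeholders, not measurements from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Number of runs, mean, and sample standard deviation for one condition."""
    return {"n": len(samples), "mean": mean(samples), "std": stdev(samples)}


def relative_reduction(baseline: Sequence[float], treated: Sequence[float]) -> float:
    """Fractional drop of the treated mean relative to the baseline mean."""
    base = mean(baseline)
    return (base - mean(treated)) / base


if __name__ == "__main__":
    # Placeholder waiting times in seconds, invented purely for illustration.
    baseline_wait = [10.0, 12.0, 14.0]
    overlapped_wait = [3.0, 3.0, 3.0]
    print(summarize(baseline_wait))
    print(summarize(overlapped_wait))
    print(f"reduction: {100 * relative_reduction(baseline_wait, overlapped_wait):.1f}%")
```

Reporting the per-condition standard deviation next to each headline percentage would let a reader judge whether figures such as the 77.7% waiting-time reduction are stable across runs.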

Circularity Check

0 steps flagged

No circularity; empirical integration with no derived predictions or self-referential definitions

The abstract and provided text describe LiveVLN as a training-free runtime framework that overlaps VLM inference with action execution via multi-step continuations. No equations, fitted parameters, predictions derived from inputs, or derivation chains appear. Claims of preserved benchmark performance and reduced waiting times (e.g., 77.7%) are framed as direct empirical outcomes from deployment on R2R, RxR, StreamVLN, and NaVIDA, not as quantities forced by internal definitions or self-citations. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The central mechanism is an engineering handoff design whose validity rests on external evaluation rather than tautological reduction to its own inputs. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are identifiable; the approach relies on existing pretrained VLMs and unspecified runtime handoff logic.


[1] [1]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683

work page 2018

[2] [2]

Room- across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding,

A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge, “Room- across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 4392– 4412

work page 2020

[3] [3]

VLN BERT: A recurrent vision-and-language BERT for navigation,

Y . Hong, Q. Wu, Y . Qi, C. Rodriguez-Opazo, and S. Gould, “VLN BERT: A recurrent vision-and-language BERT for navigation,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1643–1653

work page 2021

[4] [4]

Think global, act local: Dual-scale graph transformer for vision-and-language navigation,

S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Think global, act local: Dual-scale graph transformer for vision-and-language navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 537–16 547

work page 2022

[5] [5]

NaVid: Video-based VLM plans the next step for vision- and-language navigation,

J. Zhang, K. Wang, R. Xu, G. Zhou, Y . Hong, X. Fang, Q. Wu, Z. Zhang, and H. Wang, “NaVid: Video-based VLM plans the next step for vision- and-language navigation,” inRobotics: Science and Systems, 2024

work page 2024

[6] [6]

Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

M. Wei, C. Wan, X. Yu, T. Wang, Y . Yang, X. Mao, C. Zhu, W. Cai, H. Wang, Y . Chen, X. Liu, and J. Pang, “StreamVLN: Streaming vision-and-language navigation via slowfast context modeling,”arXiv preprint arXiv:2507.05240, 2025. [Online]. Available: https://arxiv.org/ abs/2507.05240

work page arXiv 2025

[7] [7]

thinking with images

W. Zhu, Z. Zhang, X. Wang, H. Pan, T. Wang, T. Geng, R. Xu, and F. Zheng, “NaVIDA: Vision-language navigation with inverse dynamics augmentation,”arXiv preprint arXiv:2601.18188, 2026. [Online]. Available: https://arxiv.org/abs/2601.18188

work page arXiv 2026

[8] [8]

NaVILA: Legged robot vision-language- action model for navigation,

A.-C. Cheng, Y . Ji, Z. Yang, Z. Gongye, X. Zou, J. Kautz, E. Biyik, H. Yin, S. Liu, and X. Wang, “NaVILA: Legged robot vision-language- action model for navigation,” inProceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025

work page 2025

[9] [9]

Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks,

J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang, “Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks,” inProceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. [Online]. Available: https://roboticsconference.org/program/papers/13/

work page 2025

[10] L. Zhang, X. Hao, Q. Xu, Q. Zhang, X. Zhang, P. Wang, J. Zhang, Z. Wang, S. Zhang, and R. Xu, “MapNav: A novel memory representation via annotated semantic maps for VLM-based vision-and-language navigation,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for C...

[11] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W. Y. Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6629–6638.

[12] W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards learning a generic agent for vision-and-language navigation via pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 137–13 146.

[13] S. Chen, P.-L. Guhur, C. Schmid, and I. Laptev, “History aware multimodal transformer for vision-and-language navigation,” in Advances in Neural Information Processing Systems, 2021.

[14] K. Chen, J. K. Chen, J. Chuang, M. Vazquez, and S. Savarese, “Topological planning with transformers for vision-and-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 276–11 286.

[15] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” arXiv preprint arXiv:2210.05714, 2023. [Online]. Available: https://arxiv.org/abs/2210.05714

[16] D. An, H. Wang, W. Wang, Z. Wang, Y. Huang, K. He, and L. Wang, “ETPNav: Evolving topological planning for vision-language navigation in continuous environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5130–5145, 2025.

[17] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the nav-graph: Vision-and-language navigation in continuous environments,” in Computer Vision – ECCV 2020, 2020, pp. 104–120.

[18] Y. Hong, Z. Wang, Q. Wu, and S. Gould, “Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 439–15 449.

[19] Z. Wang, J. Li, Y. Hong, Y. Wang, Q. Wu, M. Bansal, S. Gould, H. Tan, and Y. Qiao, “Scaling data generation in vision-and-language navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 12 009–12 020.

[20] G. Zhou, Y. Hong, and Q. Wu, “NavGPT: Explicit reasoning in vision-and-language navigation with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

[21] S. Zeng, D. Qi, X. Chang, F. Xiong, S. Xie, X. Wu, S. Liang, M. Xu, X. Wei, and N. Guo, “JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation,” arXiv preprint arXiv:2509.22548, 2025. [Online]. Available: https://arxiv.org/abs/2509.22548

[23] “ActiveVLN: Towards active exploration via multi-turn RL in vision-and-language navigation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.12618

[24] H. Li, R. Liu, H. Fan, and Y. Yang, “Let’s reward step-by-step: Step-aware contrastive alignment for vision-language navigation in continuous environments,” arXiv preprint arXiv:2603.09740, 2026. [Online]. Available: https://arxiv.org/abs/2603.09740

[25] Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “LLaVA-Video: Video instruction tuning with synthetic data,” Transactions on Machine Learning Research, 2025.

[26] R. Qian, X. Dong, P. Zhang, Y. Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,” in Advances in Neural Information Processing Systems, 2024.

[27] J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou, “VideoLLM-online: Online video large language model for streaming video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 18 407–18 418.

[28] J. Liu, Z. Yu, S. Lan, S. Wang, R. Fang, J. Kautz, H. Li, and J. M. Alvarez, “StreamChat: Chatting with streaming video,” arXiv preprint arXiv:2412.08646, 2025. [Online]. Available: https://arxiv.org/abs/2412.08646

[29] H. Wang, B. Feng, Z. Lai, M. Xu, S. Li, W. Ge, A. Dehghan, M. Cao, and P. Huang, “StreamBridge: Turning your offline video large language model into a proactive streaming assistant,” arXiv preprint arXiv:2505.05467, 2025. [Online]. Available: https://arxiv.org/abs/2505.05467

[30] J. Chen, Z. Zeng, Y. Lin, W. Li, Z. Ma, and M. Z. Shou, “LiveCC: Learning video LLM with streaming speech transcription at scale,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 29 083–29 095.

[31] J. Lin, Z. Fang, C. Chen, Z. Wan, F. Luo, P. Li, Y. Liu, and M. Sun, “StreamingBench: Assessing the gap for MLLMs to achieve streaming video understanding,” arXiv preprint arXiv:2411.03628, 2024. [Online]. Available: https://arxiv.org/abs/2411.03628

[32] J. Lin, J. Tong, H. Wu, J. Zhang, J. Liu, X. Jin, and X. Shen, “Speak While Watching: Unleashing true real-time video understanding capability of multimodal large language models,” arXiv preprint arXiv:2601.06843, 2026. [Online]. Available: https://arxiv.org/abs/2601.06843

[33] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 19 274–19 286.

[34] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert, “Constrained model predictive control: Stability and optimality,” Automatica, vol. 36, no. 6, pp. 789–814, 2000.