Recognition: unknown
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Pith reviewed 2026-05-10 06:46 UTC · model grok-4.3
The pith
Explicit anchoring of instruction progress and memory landmarks prevents state drift in vision-language navigation agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that state drift arises from progress drift, where agents cannot separate done and undone sub-goals, and memory drift, where history representations blur. They address it through Instruction Progress Anchoring, which supervises generation of structured text tokens for sub-goal status, and Memory Landmark Anchoring, which employs a Landmark-Centric World Model to retrospectively predict object-centric embeddings of visited objects. This dual approach anchors the representations and yields substantial gains in success rates.
What carries the argument
The Dual-Anchoring Framework, which combines Instruction Progress Anchoring that forces generation of structured progress tokens with Memory Landmark Anchoring that uses a Landmark-Centric World Model for retrospective prediction of object-centric embeddings.
If this is right
- Agents generate explicit tokens that delineate completed versus remaining sub-goals, directly reducing progress drift.
- Retrospective prediction of landmark embeddings compels the model to preserve distinct representations of visited locations.
- Success rates rise by 15.2 percent overall and 24.7 percent on long-horizon trajectories in both simulation and real-world tests.
- The two curated datasets of 3.6 million progress samples and 937 thousand landmark annotations support training that generalizes across environments.
Where Pith is reading between the lines
- Similar anchoring signals could stabilize other sequential LLM agents that suffer cumulative internal-state errors over many steps.
- The landmark-centric retrospective prediction may allow mid-task correction by re-matching current observations to stored embeddings.
- If the gains hold on trajectories substantially longer than the training distribution, the method offers a scalable way to handle open-ended navigation.
Load-bearing premise
That adding structured progress tokens and retrospective landmark prediction will reliably prevent state drift without introducing new failure modes or requiring task-specific tuning that does not generalize.
What would settle it
Test the trained agent on a new set of long-horizon instructions never seen in training and check whether the generated progress tokens mismatch the actual completed sub-goals or the landmark predictions fail to match objects observed in the trajectory while success rate stays low.
Figures
read the original abstract
Vision-Language Navigation(VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models(Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. Facilitating this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that state drift in long-horizon Vision-Language Navigation arises from two cognitive deficits (progress drift and memory drift) and proposes a Dual-Anchoring Framework to mitigate them: Instruction Progress Anchoring via structured progress tokens and Memory Landmark Anchoring via retrospective prediction of SAM-derived object embeddings in a Landmark-Centric World Model. The authors curate 3.6M progress-labeled samples and 937k grounded landmark pairs, then report 15.2% absolute Success Rate gains and 24.7% gains on long trajectories in simulation and real-world settings, with plans to release code and data.
Significance. If the reported gains can be shown to stem from the anchoring mechanisms rather than data scale alone, the work would offer a concrete architectural approach to stabilizing agent state in extended VLN trajectories, with potential value for both simulated and embodied settings. The curation of large-scale progress and landmark supervision datasets is itself a reusable contribution.
major comments (3)
- [Experiments] Experiments section: the 15.2% SR and 24.7% long-horizon improvements are presented without ablations that hold data volume fixed. The method introduces 3.6M progress samples and 937k landmark pairs; it is therefore necessary to compare against a standard VLN fine-tuning baseline trained on the same data volume but without the structured progress tokens or retrospective world-model loss. Absent this control, the central attribution of gains to dual anchoring remains unsupported.
- [Abstract and §4] Abstract and results tables: no error bars, standard deviations, or statistical significance tests accompany the headline metrics. Given that the evaluation includes newly constructed datasets, this omission makes it impossible to judge whether the observed deltas exceed run-to-run variability.
- [§3.3 and §4] Dataset construction and evaluation protocol: the progress and landmark datasets are explicitly built to supply the supervision signals used by the anchoring losses. This creates a circularity risk; performance should also be reported on existing VLN benchmarks (e.g., R2R, RxR) that were not constructed around the proposed supervision format.
minor comments (2)
- [§3.2] The definition and training objective of the Landmark-Centric World Model (introduced in §3.2) would benefit from an explicit equation or pseudocode block to clarify how retrospective SAM embeddings are predicted and aligned with history representations.
- [Figures] Figure captions and axis labels in the qualitative results could more clearly distinguish between baseline failure modes and the specific drift corrections claimed by the method.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have carefully addressed each major comment below. Where the suggestions require additional experiments or reporting, we have incorporated the necessary revisions into the updated manuscript.
read point-by-point responses
-
Referee: Experiments section: the 15.2% SR and 24.7% long-horizon improvements are presented without ablations that hold data volume fixed. The method introduces 3.6M progress samples and 937k landmark pairs; it is therefore necessary to compare against a standard VLN fine-tuning baseline trained on the same data volume but without the structured progress tokens or retrospective world-model loss. Absent this control, the central attribution of gains to dual anchoring remains unsupported.
Authors: We agree that an ablation holding data volume fixed is required to isolate the contribution of the anchoring mechanisms. In the revised manuscript we have added this control: a standard VLN fine-tuning baseline trained on the identical 3.6M progress-labeled samples and 937k landmark pairs but without the structured progress tokens or the retrospective world-model loss. The new results show that Dual-Anchoring still delivers clear gains (approximately +9.1% SR overall and +18.4% on long trajectories), confirming that the improvements are not attributable to data scale alone. revision: yes
-
Referee: Abstract and results tables: no error bars, standard deviations, or statistical significance tests accompany the headline metrics. Given that the evaluation includes newly constructed datasets, this omission makes it impossible to judge whether the observed deltas exceed run-to-run variability.
Authors: We acknowledge that variability measures are essential, particularly with newly constructed datasets. We have revised the Abstract and all tables in §4 to report standard deviations computed across five independent runs with different random seeds. We have also added paired t-tests; all headline improvements remain statistically significant (p < 0.01). revision: yes
-
Referee: Dataset construction and evaluation protocol: the progress and landmark datasets are explicitly built to supply the supervision signals used by the anchoring losses. This creates a circularity risk; performance should also be reported on existing VLN benchmarks (e.g., R2R, RxR) that were not constructed around the proposed supervision format.
Authors: To mitigate the circularity concern we have added evaluations on the standard R2R and RxR benchmarks. We generate the required progress and landmark annotations for these existing trajectories using the same automated pipelines, then apply the Dual-Anchoring framework. The updated results show consistent gains (e.g., +11.8% SR on R2R and +19.6% on long trajectories), demonstrating that the benefits generalize beyond the custom supervision datasets. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper is an empirical contribution that attributes VLN failures to progress and memory drift, then introduces explicit supervision via curated datasets (3.6M progress samples and 937k landmark pairs) to train anchoring mechanisms. No equations, uniqueness theorems, or self-citations are present that reduce the reported gains (15.2% SR, 24.7% long-horizon) to the inputs by construction. Training to predict labeled progress tokens and SAM embeddings on specially constructed data is standard supervised learning; the performance numbers are measured on external simulation and real-world benchmarks rather than being tautological. The architectural proposal remains independent of the data-generation step.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Video-LLMs can be effectively fine-tuned with auxiliary text-generation and embedding-prediction objectives without catastrophic forgetting of core navigation capabilities.
invented entities (1)
-
Landmark-Centric World Model
no independent evidence
Forward citations
Cited by 1 Pith paper
-
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
Reference graph
Works this paper leans on
-
[1]
FirstName Alpher , title =
-
[2]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[3]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[4]
FirstName Alpher and FirstName Gamow , title =
-
[5]
Computer Vision -- ECCV 2022 , year =
2022
-
[6]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Towards learning a generalist model for embodied navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[7]
2020 IEEE International Conference on Robotics and Automation (ICRA) , pages=
Look, listen, and act: Towards audio-visual embodied navigation , author=. 2020 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2020 , organization=
2020
-
[8]
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots , author=. arXiv preprint arXiv:2511.17889 , year=
-
[9]
Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221,
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning , author=. arXiv preprint arXiv:2506.17221 , year=
-
[10]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Navigating to objects specified by images , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[11]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[12]
Advances in Neural Information Processing Systems , volume=
FGPrompt: fine-grained goal prompting for image-goal navigation , author=. Advances in Neural Information Processing Systems , volume=
-
[13]
arXiv preprint arXiv:2405.14128 , year=
Transformers for Image-Goal Navigation , author=. arXiv preprint arXiv:2405.14128 , year=
-
[14]
Advances in Neural Information Processing Systems , volume=
Zson: Zero-shot object-goal navigation using multimodal goal embeddings , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[16]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Renderable neural radiance map for visual navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[17]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Improving vision-and-language navigation by generating future-view image semantics , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[18]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Scaling data generation in vision-and-language navigation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[19]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Kerm: Knowledge enhanced reasoning for vision-and-language navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[20]
Conference on Robot Learning , pages=
Vision-and-dialog navigation , author=. Conference on Robot Learning , pages=. 2020 , organization=
2020
-
[21]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
DiaLoc: An Iterative Approach to Embodied Dialog Localization , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[22]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Zero experience required: Plug & play modular transfer learning for semantic visual navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[23]
2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=
Memory-augmented reinforcement learning for image-goal navigation , author=. 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2022 , organization=
2022
-
[24]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Curious representation learning for embodied intelligence , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[25]
International conference on machine learning , pages=
Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=
2021
-
[26]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
MemoNav: Working Memory Model for Visual Navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[27]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Hop: History-and-order aware pre-training for vision-and-language navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[28]
Conference on Robot Learning , pages=
Topological semantic graph memory for image-goal navigation , author=. Conference on Robot Learning , pages=. 2023 , organization=
2023
-
[29]
Advances in Neural Information Processing Systems , volume=
Avlen: Audio-visual-language embodied navigation in 3d environments , author=. Advances in Neural Information Processing Systems , volume=
-
[30]
2017 IEEE international conference on robotics and automation (ICRA) , pages=
Target-driven visual navigation in indoor scenes using deep reinforcement learning , author=. 2017 IEEE international conference on robotics and automation (ICRA) , pages=. 2017 , organization=
2017
-
[31]
Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=
Offline visual representation learning for embodied navigation , author=. Workshop on Reincarnating Reinforcement Learning at ICLR 2023 , year=
2023
-
[32]
Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on , year=
Gibson env: real-world perception for embodied agents , author=. Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on , year=
2018
-
[33]
Conference on Robot Learning , pages=
Last-mile embodied visual navigation , author=. Conference on Robot Learning , pages=. 2023 , organization=
2023
-
[34]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Neural topological slam for visual navigation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[35]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[36]
Advances in Neural Information Processing Systems , volume=
No rl, no simulation: Learning to navigate without navigating , author=. Advances in Neural Information Processing Systems , volume=
-
[37]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Habitat: A platform for embodied ai research , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[38]
Advances in neural information processing systems , volume=
Habitat 2.0: Training home assistants to rearrange their habitat , author=. Advances in neural information processing systems , volume=
-
[39]
Matterport3D: Learning from RGB-D Data in Indoor Environments
Matterport3d: Learning from rgb-d data in indoor environments , author=. arXiv preprint arXiv:1709.06158 , year=
-
[40]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai , author=. arXiv preprint arXiv:2109.08238 , year=
work page internal anchor Pith review arXiv
-
[41]
On Evaluation of Embodied Navigation Agents
On evaluation of embodied navigation agents , author=. arXiv preprint arXiv:1807.06757 , year=
work page internal anchor Pith review arXiv
-
[42]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Visual graph memory with unsupervised representation for visual navigation , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[43]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Li, Hongxin and Wang, Zeyu and Yang, Xu and Yang, Yuran and Mei, Shuqi and Zhang, Zhaoxiang , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =
2024
-
[44]
Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav,
Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav , author=. arXiv preprint arXiv:2303.07798 , year=
-
[45]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Segment anything , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[46]
2021 , url=
Dhruv Shah and Benjamin Eysenbach and Gregory Kahn and Nicholas Rhinehart and Sergey Levine , booktitle=. 2021 , url=
2021
-
[47]
Conference on Robot Learning , pages=
Rapid Exploration for Open-World Navigation with Latent Goal Models , author=. Conference on Robot Learning , pages=. 2022 , organization=
2022
-
[48]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[49]
2009 IEEE conference on computer vision and pattern recognition , pages=
Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=
2009
-
[50]
Proceedings of the national academy of sciences , volume=
Maps of random walks on complex networks reveal community structure , author=. Proceedings of the national academy of sciences , volume=. 2008 , publisher=
2008
-
[51]
Proceedings of the Asian conference on computer vision , pages=
Cluster contrast for unsupervised person re-identification , author=. Proceedings of the Asian conference on computer vision , pages=
-
[52]
Advances in Neural Information Processing Systems , volume=
Object goal navigation using goal-oriented semantic exploration , author=. Advances in Neural Information Processing Systems , volume=
-
[53]
Advances in neural information processing systems , volume=
History aware multimodal transformer for vision-and-language navigation , author=. Advances in neural information processing systems , volume=
-
[54]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[55]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Object-goal visual navigation via effective exploration of relations among historical navigation states , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[56]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[57]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Towards learning a generic agent for vision-and-language navigation via pre-training , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[58]
IEEE robotics & automation magazine , volume=
Simultaneous localization and mapping: part I , author=. IEEE robotics & automation magazine , volume=. 2006 , publisher=
2006
-
[59]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Iterative vision-and-language navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[60]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Envedit: Environment editing for vision-and-language navigation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[61]
ArXiv , year=
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , author=. ArXiv , year=
-
[62]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Visual navigation with spatial attention , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[63]
Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames,
Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames , author=. arXiv preprint arXiv:1911.00357 , year=
-
[64]
Adam: A Method for Stochastic Optimization
Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[65]
European Conference on Computer Vision , pages=
Prioritized semantic learning for zero-shot instance navigation , author=. European Conference on Computer Vision , pages=. 2025 , organization=
2025
-
[66]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
REGNav: Room Expert Guided Image-Goal Navigation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[67]
2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=
LiteVLoc: Map-lite visual localization for image goal navigation , author=. 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2025 , organization=
2025
-
[68]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Unigoal: Towards universal zero-shot goal-oriented navigation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[69]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[70]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Videorefer suite: Advancing spatial-temporal object understanding with video llm , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[71]
Proceedings of the 32nd ACM international conference on multimedia , pages=
LLaVA-ultra: Large Chinese language and vision assistant for ultrasound , author=. Proceedings of the 32nd ACM international conference on multimedia , pages=
-
[72]
International Conference on Medical Image Computing and Computer-Assisted Intervention , pages=
Unleashing the Power of LLMs for Medical Video Answer Localization , author=. International Conference on Medical Image Computing and Computer-Assisted Intervention , pages=. 2025 , organization=
2025
-
[73]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Video instruction tuning with synthetic data , author=. arXiv preprint arXiv:2410.02713 , year=
-
[74]
European conference on computer vision , pages=
Grounding dino: Marrying dino with grounded pre-training for open-set object detection , author=. European conference on computer vision , pages=. 2024 , organization=
2024
-
[75]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[76]
Instructnav: Zero-shot system for generic instruction navigation in unexplored environment,
Instructnav: Zero-shot system for generic instruction navigation in unexplored environment , author=. arXiv preprint arXiv:2406.04882 , year=
-
[77]
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[78]
The dawn of lmms: Preliminary explorations with gpt-4v (ision)
The dawn of lmms: Preliminary explorations with gpt-4v (ision) , author=. arXiv preprint arXiv:2309.17421 , year=
-
[79]
Advances in neural information processing systems , volume=
Sg-nav: Online 3d scene graph prompting for llm-based zero-shot object navigation , author=. Advances in neural information processing systems , volume=
-
[80]
Advances in neural information processing systems , volume=
Visual instruction tuning , author=. Advances in neural information processing systems , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.