pith. machine review for the scientific record.

arxiv: 2510.20685 · v4 · submitted 2025-10-23 · 💻 cs.RO

C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

Pith reviewed 2026-05-18 04:29 UTC · model grok-4.3

classification 💻 cs.RO
keywords continual object navigation · anti-forgetting mechanism · feature distillation · feature replay · adaptive sampling · embodied agents · visual navigation · open world

The pith

A dual-path anti-forgetting system lets navigation agents learn new objects in changing environments while retaining old skills and using far less memory than full-history methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to enable embodied agents to acquire navigation skills for new object categories in dynamic open-world settings without losing proficiency on previously learned categories. It introduces a dedicated continual object navigation benchmark to measure this ability and presents the C-Nav framework as a solution. C-Nav combines feature distillation to align multi-modal inputs into a stable representation space with feature replay to preserve policy behavior in the action decoder, plus an adaptive sampling step that chooses diverse experiences to limit storage. Experiments across model architectures show the approach surpasses prior methods and even exceeds baselines that retain complete past trajectories, all while cutting memory needs. If correct, this would allow agents to operate continuously in evolving real-world spaces without periodic full retraining or unbounded data accumulation.
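
As a rough illustration of the adaptive sampling idea: the paper's Figure 3 caption says keyframes are identified via LOF (Local Outlier Factor) in the representation space. The sketch below is a minimal brute-force LOF in pure Python, written under assumptions that do not come from the paper. The names `lof_scores` and `select_keyframes`, the neighbourhood size `k`, the `keep_ratio` parameter, and the rule of keeping the highest-scoring frames are all hypothetical. It also uses exactly `k` neighbours per point and ignores distance ties, which simplifies the original LOF definition.

```python
import math


def lof_scores(points, k=2):
    """Brute-force Local Outlier Factor, O(n^2). Simplified: exactly k neighbours."""
    n = len(points)
    if n <= k:
        raise ValueError("need more points than neighbours")
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_dist = [], []
    for i in range(n):
        # Other points sorted by distance to point i; keep the k closest.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))  # guard duplicate points

    # LOF: mean neighbour density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (k * lrd[i])
        for i in range(n)
    ]


def select_keyframes(features, keep_ratio=0.2, k=2):
    """Hypothetical rule: keep the frames with the highest LOF scores."""
    scores = lof_scores(features, k=k)
    n_keep = max(1, round(keep_ratio * len(features)))
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])  # indices in temporal order
```

Scores near 1 indicate a frame whose neighbourhood is about as dense as its neighbours' neighbourhoods; larger scores mark frames that sit apart in feature space. Whether the paper keeps high-LOF or low-LOF frames, and at what ratio, is not stated in this summary.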

Core claim

C-Nav integrates a dual-path anti-forgetting mechanism with an adaptive sampling strategy. Feature distillation aligns multi-modal inputs into a consistent representation space (representation consistency), and feature replay retains temporal features within the action decoder (policy consistency). The adaptive sampling strategy selects diverse and informative experiences to reduce redundancy and memory overhead. Together, these yield superior performance on the continual object navigation benchmark.

What carries the argument

Dual-path anti-forgetting mechanism of feature distillation for representation consistency and feature replay for policy consistency, combined with adaptive sampling for experience selection.
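
To make the two paths concrete, here is a minimal sketch of how a combined training objective could look. It is not the paper's loss: this summary does not give the loss forms, so the mean-squared-error distillation term, the cross-entropy replay term, the weights `w_distill` and `w_replay`, and every function name below are assumptions chosen for illustration.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, target):
    """Negative log-likelihood of the target action under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def dual_path_loss(new_logits, new_action, new_feat, old_feat,
                   replay_logits, replay_action,
                   w_distill=1.0, w_replay=1.0):
    """Assumed objective: task loss plus two consistency terms."""
    # Imitation loss on data from the newly introduced categories.
    task = cross_entropy(new_logits, new_action)
    # Representation path: keep current features close to the previous model's.
    distill = mse(new_feat, old_feat)
    # Policy path: the decoder should still pick the stored action when fed
    # stored features from earlier stages.
    replay = cross_entropy(replay_logits, replay_action)
    return task + w_distill * distill + w_replay * replay
```

In this sketch `old_feat` stands for the output of a frozen copy of the previous-stage encoder on the same input, and `replay_logits` for the current decoder's output on features stored from earlier stages. Setting either weight to zero switches off the corresponding path, which is one way an ablation of the two paths could be run.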

If this is right

  • Agents maintain high success rates on old object categories while gaining competence on new ones without retraining from scratch.
  • Memory footprint stays low because only selected experiences are retained rather than full trajectories.
  • The same framework delivers gains across different underlying model architectures.
  • Navigation systems can handle open-world evolution with less risk of performance collapse over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dual-consistency paths could transfer to other embodied continual tasks such as manipulation or exploration.
  • Real-robot deployments with streaming sensor data would provide a direct test of the sampling strategy under non-stationary conditions.
  • Scaling the number of categories to dozens or hundreds would clarify where the consistency mechanisms begin to degrade.

Load-bearing premise

The dual-path mechanism of feature distillation and feature replay plus adaptive sampling will continue to block catastrophic forgetting when object categories and environmental shifts grow much larger than the tested benchmark scenarios.

What would settle it

An experiment that doubles or triples the number of object categories, or introduces more frequent scene changes, and then checks whether success rate on previously learned navigation tasks drops sharply.
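
One way to quantify such a drop is the per-group forgetting measure commonly used in continual learning: the best success rate a category group ever reached at an earlier stage, minus its success rate after the final stage. The sketch below assumes a triangular table of success rates; the function name `forgetting` and the data layout are illustrative and not taken from the paper.

```python
def forgetting(sr_by_stage):
    """Per-group forgetting from a triangular success-rate table.

    sr_by_stage[t][c] is the success rate on category group c, measured
    after training stage t. Row t therefore has t + 1 entries.
    """
    final = sr_by_stage[-1]
    drops = []
    # The group introduced at the last stage has had no chance to be forgotten.
    for c in range(len(final) - 1):
        best_earlier = max(row[c] for row in sr_by_stage[c:-1])
        drops.append(best_earlier - final[c])
    return drops


# Made-up numbers, purely to show the expected input shape.
example = [
    [0.40],
    [0.35, 0.30],
    [0.30, 0.28, 0.25],
]
per_group = forgetting(example)  # one value per earlier category group
```

A scaling experiment would track how these values grow as stages are added; a sharp rise on the earliest groups would be the signal described above.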

Figures

Figures reproduced from arXiv: 2510.20685 by Fei Zhu, Jing Liu, Ming-Ming Yu, Qunbo Wang, Wenjun Wu, Wenzhuo Liu, Yirong Yang.

Figure 1: Continual Object Navigation: The robot must continually learn from new data while …
Figure 2: Different navigation model architectures. According to the action decoder, they can be …
Figure 3: Overview of the proposed C-Nav framework for continual object navigation. It consists of two key components: (1) adaptive experience selection, which identifies semantically meaningful keyframes via LOF in the representation space, reducing storage and redundancy; and (2) dual-path anti-forgetting, which mitigates representation and policy drift through a feature distillation path (bottom left) and a featu…
Figure 4: Results of SR and SPL on the MP3D and HM3D datasets. Solid lines represent SR, while …
Figure 5: Ablation on experience selection and retention ratio on …
Figure 6: Comparison with the Zero-Shot Method.
Figure 9: Demonstration trajectory length distribution per stage in HM3D dataset.
Figure 10: Demonstration trajectory length distribution per stage in MP3D dataset.
Figure 11: Ablation study: Freezing vs. unfreezing the multimodal encoder on MP3D Continual …
Figure 12: Results for the dual-path anti-forgetting performance at each stage.
Figure 13: Average success rate for each component across four architectures.
Figure 14: Results for the adaptive experience selection at each stage.
Figure 15: Average success rate for different sample methods across four architectures.
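
Figure 4 reports SR and SPL. As a reminder of what these measure, the sketch below follows the standard definitions: SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) weights each success by the shortest-path length divided by the longer of the agent's path and the shortest path. The function and argument names are illustrative and not from the paper's code, and the sketch assumes every shortest-path length is positive.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (entries are 0 or 1)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by Path Length, averaged over episodes."""
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        # A failed episode contributes 0; an optimal successful one contributes 1.
        total += s * shortest / max(taken, shortest)
    return total / len(successes)
```

An agent that succeeds but takes twice the shortest path scores 0.5 on that episode, so SPL is never higher than SR.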
Original abstract

Embodied agents are expected to perform object navigation in dynamic, open-world environments. However, existing approaches typically rely on static trajectories and a fixed set of object categories during training, overlooking the real-world requirement for continual adaptation to evolving scenarios. To facilitate related studies, we introduce the continual object navigation benchmark, which requires agents to acquire navigation skills for new object categories while avoiding catastrophic forgetting of previously learned knowledge. To tackle this challenge, we propose C-Nav, a continual visual navigation framework that integrates two key innovations: (1) A dual-path anti-forgetting mechanism, which comprises feature distillation that aligns multi-modal inputs into a consistent representation space to ensure representation consistency, and feature replay that retains temporal features within the action decoder to ensure policy consistency. (2) An adaptive sampling strategy that selects diverse and informative experiences, thereby reducing redundancy and minimizing memory overhead. Extensive experiments across multiple model architectures demonstrate that C-Nav consistently outperforms existing approaches, achieving superior performance even compared to baselines with full trajectory retention, while significantly lowering memory requirements. The code will be publicly available at https://bigtree765.github.io/C-Nav-project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a continual object navigation benchmark requiring agents to learn navigation for new object categories in dynamic open-world settings without catastrophic forgetting. It proposes the C-Nav framework featuring a dual-path anti-forgetting mechanism (feature distillation for multi-modal representation consistency and feature replay for policy consistency in the action decoder) together with an adaptive sampling strategy that selects diverse experiences to reduce memory overhead. Experiments across model architectures claim consistent outperformance over prior methods, including baselines that retain full trajectories, while using substantially less memory.

Significance. If the central claims hold after addressing experimental controls, the work would be significant for continual learning in embodied robotics: it provides a new benchmark for open-world adaptation and demonstrates a memory-efficient approach that could scale navigation agents beyond static training regimes. The dual-path design and adaptive sampling directly target representation and policy consistency, which are load-bearing for long-term deployment in evolving environments.

major comments (2)
  1. [Abstract / Experimental Evaluation] Abstract and experimental evaluation: the claim of outperforming full-trajectory-retention baselines while using less memory is load-bearing for the central contribution. However, no ablation is reported that applies the identical adaptive sampling strategy to those full-retention baselines. Without this control, any observed performance edge cannot be unambiguously attributed to the dual-path anti-forgetting mechanism rather than improved data selection.
  2. [Method] Method section on adaptive sampling and dual-path components: the paper does not specify whether the adaptive sampling criteria (diversity and informativeness) or the distillation/replay weights were tuned on the same benchmark splits used for final evaluation. This raises a moderate risk of circularity that could inflate the reported gains relative to untuned baselines.
minor comments (2)
  1. [Figures and Tables] Figure captions and table headers should explicitly state the number of runs, random seeds, and statistical tests used to support the 'consistently outperforms' claim.
  2. [Benchmark Description] The continual object navigation benchmark definition would benefit from an explicit description of the train/test split protocol and the schedule of new object category introductions to facilitate future comparisons.
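
The first minor comment asks for run counts and variability to be reported. A minimal sketch of that reporting, using only the Python standard library, might look like the following. The function name `summarize_runs` and the sample values are hypothetical; the paper's actual seed count and results are not given in this review.

```python
import statistics


def summarize_runs(sr_per_seed):
    """Summarise one method's success rate across independent seeds."""
    n = len(sr_per_seed)
    mean = statistics.mean(sr_per_seed)
    # Sample standard deviation is undefined for a single run.
    std = statistics.stdev(sr_per_seed) if n > 1 else 0.0
    return {"n_runs": n, "mean": mean, "std": std}


# Made-up success rates from three seeds, for illustration only.
summary = summarize_runs([0.30, 0.32, 0.34])
caption = f"SR = {summary['mean']:.3f} ± {summary['std']:.3f} over {summary['n_runs']} seeds"
```

Stating the result in this form in each caption would let readers judge whether a gap between two methods exceeds seed-to-seed variation.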

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the experimental controls and transparency of the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Experimental Evaluation] Abstract and experimental evaluation: the claim of outperforming full-trajectory-retention baselines while using less memory is load-bearing for the central contribution. However, no ablation is reported that applies the identical adaptive sampling strategy to those full-retention baselines. Without this control, any observed performance edge cannot be unambiguously attributed to the dual-path anti-forgetting mechanism rather than improved data selection.

    Authors: We agree this control is valuable for unambiguous attribution. The adaptive sampling strategy is a core part of C-Nav for memory reduction, but to isolate the dual-path anti-forgetting contribution, we will add a new ablation in the revised manuscript. This ablation will apply the identical adaptive sampling criteria to the full-trajectory-retention baselines and report the resulting performance and memory metrics. We expect this to confirm that the observed gains are not solely due to data selection.

    Revision: yes

  2. Referee: [Method] Method section on adaptive sampling and dual-path components: the paper does not specify whether the adaptive sampling criteria (diversity and informativeness) or the distillation/replay weights were tuned on the same benchmark splits used for final evaluation. This raises a moderate risk of circularity that could inflate the reported gains relative to untuned baselines.

    Authors: We thank the referee for noting this potential issue. The adaptive sampling criteria and distillation/replay weights were tuned on a held-out validation split that is disjoint from the test splits used for final reporting. We will revise the Method section to explicitly describe the tuning protocol, including the validation split construction, the search ranges, and the selection criteria, thereby removing any ambiguity about circularity.

    Revision: yes
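
The tuning protocol described in the second response can be checked mechanically. The sketch below shows one possible way to build a deterministic validation split and to verify that it shares no episodes with the test set. It is an assumption about how such a protocol could be implemented, not the authors' procedure; the names `assign_split` and `check_disjoint`, the hashing scheme, and the `val_fraction` default are all hypothetical.

```python
import hashlib


def assign_split(episode_id, val_fraction=0.1):
    """Deterministically place a training episode in 'train' or 'val'."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_fraction else "train"


def check_disjoint(val_ids, test_ids):
    """Raise if any episode appears in both the validation and test sets."""
    overlap = set(val_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} episode(s) appear in both splits")
    return True
```

Hashing the episode identifier keeps the split stable across runs and machines without storing a separate index file. If episodes from the same scene are strongly correlated, splitting by scene identifier instead would be the stricter choice.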

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper proposes C-Nav as a practical continual navigation framework combining dual-path anti-forgetting (feature distillation and replay) with adaptive sampling, then validates it through experiments on a new benchmark. No mathematical derivation chain exists that reduces predictions or uniqueness claims to fitted parameters or self-citations by construction. The central performance claims rest on comparative results against baselines (including full-trajectory retention), which are externally falsifiable via the reported metrics and code release rather than being forced by internal definitions or prior author work. Adaptive sampling and distillation weights are implementation choices whose impact is measured, not smuggled in as predictions. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed mechanisms rather than on new mathematical axioms or invented physical entities. No free parameters are explicitly named in the abstract beyond standard training hyperparameters.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

    cs.AI 2026-05 conditional novelty 6.0

    LongAct benchmark reveals top VLMs reach only 59% goal completion and 16% full success on long-horizon household tasks, while HoloMind agent improves results via DAG planner, multimodal spatial memory, episodic memory...
