AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
Pith reviewed 2026-05-20 11:21 UTC · model grok-4.3
The pith
AtlasVA lets VLM agents build and reuse visually grounded memory through self-evolving atlases without any teacher models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AtlasVA is a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. It evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, then reuses these atlases as potential-based shaping rewards for reinforcement learning. This design unifies perception, memory, and optimization without external LLM supervision.
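To make the shaping step concrete, here is a minimal Python sketch of potential-based reward shaping over a grid. It assumes, purely for illustration, that the potential is the affinity value minus a weighted danger value at the agent's cell. The names `potential`, `shaped_reward`, and `beta` are hypothetical, and the paper's actual potential function is not stated in the text reviewed here. Only the shaping term itself, gamma * Phi(s') - Phi(s), is the standard potential-based form.

```python
from typing import List, Tuple

Grid = List[List[float]]
Cell = Tuple[int, int]


def potential(affinity: Grid, danger: Grid, cell: Cell, beta: float = 1.0) -> float:
    """Hypothetical potential: affinity minus beta-weighted danger at one cell."""
    r, c = cell
    return affinity[r][c] - beta * danger[r][c]


def shaped_reward(
    env_reward: float,
    affinity: Grid,
    danger: Grid,
    cell: Cell,
    next_cell: Cell,
    gamma: float = 0.99,
    beta: float = 1.0,
) -> float:
    """Environment reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    phi_now = potential(affinity, danger, cell, beta)
    phi_next = potential(affinity, danger, next_cell, beta)
    return env_reward + gamma * phi_next - phi_now
```

With gamma set to 1, the shaping terms telescope along a trajectory, so a path that returns to its starting cell collects zero net shaping reward. That cancellation is the reason potential-based shaping is generally regarded as safe with respect to the optimal policy.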
What carries the argument
The three-layer memory structure of spatial heatmaps, visual exemplars, and symbolic text skills, together with self-evolving danger and affinity atlases built from trajectory statistics and grid heuristics.
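One way to picture atlases that evolve from trajectory statistics alone is as per-cell counters updated after each episode. The sketch below is an assumption-laden illustration, not the paper's update rule: the class `CountAtlas`, the function `evolve`, and the choice to blame only the final cell of a failed episode are all hypothetical.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]


class CountAtlas:
    """Per-cell ratio of flagged episodes to episodes that touched the cell."""

    def __init__(self, rows: int, cols: int) -> None:
        self.visits: List[List[int]] = [[0] * cols for _ in range(rows)]
        self.hits: List[List[int]] = [[0] * cols for _ in range(rows)]

    def update(self, cells: Iterable[Cell], flagged: bool) -> None:
        # Each cell counts at most once per episode, so loops do not inflate it.
        for r, c in set(cells):
            self.visits[r][c] += 1
            if flagged:
                self.hits[r][c] += 1

    def value(self, cell: Cell) -> float:
        r, c = cell
        seen = self.visits[r][c]
        return self.hits[r][c] / seen if seen else 0.0


def evolve(danger: CountAtlas, affinity: CountAtlas,
           trajectory: List[Cell], success: bool) -> None:
    """Assumed rule: affinity tracks cells on successful paths,
    danger tracks cells where failed episodes ended."""
    affinity.update(trajectory, flagged=success)
    danger.update(trajectory[-1:], flagged=not success)
```

No teacher model appears anywhere in this loop, which is the property the framework claims. Whether such simple counts carry enough geometric detail is the question the referee report raises.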
Load-bearing premise
Reusable experience for VLM agents should remain visually grounded, and trajectory statistics plus lightweight grid heuristics can produce effective danger and affinity atlases without external LLM supervision or loss of critical spatial information.
What would settle it
An experiment that replaces the self-evolved danger and affinity atlases with text-only memory while keeping all other components fixed and shows no performance drop on spatial benchmarks would falsify the central claim.
Original abstract
Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AtlasVA, a teacher-free framework for visual skill memory in VLM agents. Memory is organized into three layers—spatial heatmaps, visual exemplars, and symbolic text skills—with danger and affinity atlases evolved directly from trajectory statistics via lightweight grid heuristics. These atlases supply potential-based shaping rewards for RL, unifying perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks report consistent outperformance over text-centric memory baselines and competitive VLM agents, with larger gains on spatially intensive tasks.
Significance. If the empirical claims hold, the work offers a concrete alternative to text-centric memory in VLM agents by preserving visual grounding and deriving shaping rewards from self-generated trajectory data. The three-layer design and self-evolution mechanism could improve sample efficiency on long-horizon spatial problems; the absence of teacher models is a practical advantage. Reproducible code or machine-checked components are not mentioned.
major comments (2)
- [Method (atlas evolution)] Method section (atlas evolution): the claim that lightweight grid heuristics applied to trajectories yield dense, spatially faithful danger/affinity signals in continuous 3D navigation and manipulation lacks supporting derivation or ablation. No analysis shows that quantization does not omit key geometric features that text baselines allegedly discard; this assumption is load-bearing for the central claim that the three-layer memory plus atlases produce useful shaping rewards.
- [Experiments] Experiments section: the abstract and results summary state outperformance on four benchmarks but supply no quantitative numbers, error bars, ablation tables, or statistical tests. Without these, the magnitude of gains (especially the “especially strong” spatial-task improvements) cannot be evaluated or compared to baselines.
minor comments (2)
- [Method] Notation for the three memory layers and the potential function derived from atlases should be defined explicitly with equations rather than prose descriptions.
- [Figures] Figure captions for the 3D navigation and manipulation environments should clarify the grid resolution used by the heuristics and how continuous observations are mapped.
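The grid-resolution concern in these comments can be illustrated with a small Python sketch of how a continuous position might be mapped to a grid cell. Everything here is an assumption for illustration: the function `to_cell`, its parameters, and the uniform square grid are hypothetical and are not taken from the paper.

```python
import math
from typing import Tuple


def to_cell(x: float, y: float, x_min: float, y_min: float,
            cell_size: float, rows: int, cols: int) -> Tuple[int, int]:
    """Map a continuous (x, y) position to a (row, col) cell on a uniform grid.

    Positions outside the grid are clamped to the nearest border cell.
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y - y_min) / cell_size)
    col = min(max(col, 0), cols - 1)
    row = min(max(row, 0), rows - 1)
    return row, col
```

On a unit square, the points (0.24, 0.04) and (0.3, 0.4) fall into different cells with `cell_size=0.25` but share a cell with `cell_size=0.5`. Any geometric distinction between them is therefore invisible to a coarse atlas, which is why reporting the resolution matters.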
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. The comments highlight important areas for strengthening the presentation of the atlas evolution mechanism and the experimental results. We address each major comment below and describe the revisions we will incorporate.
point-by-point responses
- Referee: [Method (atlas evolution)] Method section (atlas evolution): the claim that lightweight grid heuristics applied to trajectories yield dense, spatially faithful danger/affinity signals in continuous 3D navigation and manipulation lacks supporting derivation or ablation. No analysis shows that quantization does not omit key geometric features that text baselines allegedly discard; this assumption is load-bearing for the central claim that the three-layer memory plus atlases produce useful shaping rewards.
  Authors: We acknowledge that the current description of the atlas evolution process would benefit from additional formal support, particularly for continuous 3D settings. The manuscript outlines the lightweight grid heuristics and their derivation from trajectory statistics, but a dedicated derivation and ablation analysis for quantization effects in 3D navigation and manipulation is not fully elaborated. In the revised version, we will add a new subsection in the Method section that provides the mathematical formulation of the danger and affinity atlas updates, including the quantization step and a proof sketch showing preservation of key geometric features (e.g., via bounds on discretization error for spatial gradients). We will also include a targeted ablation comparing quantized atlases against continuous or finer-grid variants, measuring impact on shaping reward quality and downstream agent performance. These additions will directly substantiate that the three-layer memory yields useful, spatially faithful signals beyond what text-centric baselines provide.
  Revision: yes
- Referee: [Experiments] Experiments section: the abstract and results summary state outperformance on four benchmarks but supply no quantitative numbers, error bars, ablation tables, or statistical tests. Without these, the magnitude of gains (especially the “especially strong” spatial-task improvements) cannot be evaluated or compared to baselines.
  Authors: We agree that the abstract and the high-level results summary do not contain specific numerical values, error bars, or statistical tests, which makes it difficult to fully evaluate the reported gains. The full Experiments section does contain comparative tables and figures with performance metrics across Sokoban, FrozenLake, 3D navigation, and robotic manipulation. To address the concern, we will expand the results summary paragraph to include key quantitative results (e.g., success rates and improvement margins) along with standard deviations as error bars. We will add ablation tables and report statistical significance (e.g., via paired t-tests or Wilcoxon tests) for the spatial-task improvements. Due to abstract length limits, we will focus these enhancements on the main text summary and Experiments section rather than modifying the abstract itself.
  Revision: partial
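For readers who want to see what the paired significance test mentioned in the second response involves, the following Python sketch computes a paired t statistic from matched scores, for example one score per random seed for each method. The function name `paired_t_statistic` is hypothetical and no scores from the paper are used. The sketch stops at the statistic, since turning it into a p-value needs the t distribution, which is left to a statistics library.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired t statistic with n - 1 degrees of freedom.

    Each position in `ours` must be matched with the same position in
    `baseline` (same seed, same task instance, and so on).
    """
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))
```

Pairing matters here because per-seed or per-level difficulty affects both methods alike, so testing the differences removes that shared variance.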
Circularity Check
No significant circularity; empirical framework with independent experimental validation
full rationale
The paper presents an architectural proposal for a three-layer visual memory system whose atlases are constructed from trajectory statistics and grid heuristics, then deployed as shaping rewards. No equations, first-principles derivations, or fitted-parameter predictions appear in the provided text. Performance claims rest on benchmark comparisons rather than any quantity defined by construction from the inputs. The design choices are motivated by stated assumptions about spatial information loss in text-centric baselines, but these assumptions are not smuggled in via self-citation or self-definition; they are tested externally through experiments. This is the normal case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- grid heuristic parameters
axioms (1)
- domain assumption: Geometric priors are better preserved in visual form than in language for spatial decision making
invented entities (1)
- danger and affinity atlases: no independent evidence