pith. sign in

arxiv: 2605.17933 · v1 · pith:AIONU3KHnew · submitted 2026-05-18 · 💻 cs.CV

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Pith reviewed 2026-05-20 11:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords VLM agentsvisual skill memoryself-evolving atlasesreinforcement learningspatial decision makingteacher-freeembodied navigationrobotic manipulation
0
0 comments X

The pith

AtlasVA lets VLM agents build and reuse visually grounded memory through self-evolving atlases without any teacher models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current approaches to memory in vision-language model agents store experience as text and rely on external teacher models to summarize or refine it. This compresses geometric and spatial details into lossy language and delivers only delayed feedback, which limits performance on tasks that depend on precise spatial reasoning. The paper claims that reusable experience should remain visually grounded instead. AtlasVA realizes this by organizing memory into three layers—spatial heatmaps, visual exemplars, and symbolic text skills—while automatically deriving danger and affinity atlases from trajectory statistics and simple grid rules. These atlases then serve as potential-based shaping rewards that guide reinforcement learning, unifying perception, memory, and optimization with no external supervision.

Core claim

AtlasVA is a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. It evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, then reuses these atlases as potential-based shaping rewards for reinforcement learning. This design unifies perception, memory, and optimization without external LLM supervision.

What carries the argument

The three-layer memory structure of spatial heatmaps, visual exemplars, and symbolic text skills, together with self-evolving danger and affinity atlases built from trajectory statistics and grid heuristics.

Load-bearing premise

Reusable experience for VLM agents should remain visually grounded, and trajectory statistics plus lightweight grid heuristics can produce effective danger and affinity atlases without external LLM supervision or loss of critical spatial information.

What would settle it

An experiment that replaces the self-evolved danger and affinity atlases with text-only memory while keeping all other components fixed and shows no performance drop on spatial benchmarks would falsify the central claim.

read the original abstract

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose \textbf{AtlasVA}, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on \textsc{Sokoban}, \textsc{FrozenLake}, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AtlasVA, a teacher-free framework for visual skill memory in VLM agents. Memory is organized into three layers—spatial heatmaps, visual exemplars, and symbolic text skills—with danger and affinity atlases evolved directly from trajectory statistics via lightweight grid heuristics. These atlases supply potential-based shaping rewards for RL, unifying perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks report consistent outperformance over text-centric memory baselines and competitive VLM agents, with larger gains on spatially intensive tasks.

Significance. If the empirical claims hold, the work offers a concrete alternative to text-centric memory in VLM agents by preserving visual grounding and deriving shaping rewards from self-generated trajectory data. The three-layer design and self-evolution mechanism could improve sample efficiency on long-horizon spatial problems; the absence of teacher models is a practical advantage. Reproducible code or machine-checked components are not mentioned.

major comments (2)
  1. [Method (atlas evolution)] Method section (atlas evolution): the claim that lightweight grid heuristics applied to trajectories yield dense, spatially faithful danger/affinity signals in continuous 3D navigation and manipulation lacks supporting derivation or ablation. No analysis shows that quantization does not omit key geometric features that text baselines allegedly discard; this assumption is load-bearing for the central claim that the three-layer memory plus atlases produce useful shaping rewards.
  2. [Experiments] Experiments section: the abstract and results summary state outperformance on four benchmarks but supply no quantitative numbers, error bars, ablation tables, or statistical tests. Without these, the magnitude of gains (especially the “especially strong” spatial-task improvements) cannot be evaluated or compared to baselines.
minor comments (2)
  1. [Method] Notation for the three memory layers and the potential function derived from atlases should be defined explicitly with equations rather than prose descriptions.
  2. [Figures] Figure captions for the 3D navigation and manipulation environments should clarify the grid resolution used by the heuristics and how continuous observations are mapped.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. The comments highlight important areas for strengthening the presentation of the atlas evolution mechanism and the experimental results. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Method (atlas evolution)] Method section (atlas evolution): the claim that lightweight grid heuristics applied to trajectories yield dense, spatially faithful danger/affinity signals in continuous 3D navigation and manipulation lacks supporting derivation or ablation. No analysis shows that quantization does not omit key geometric features that text baselines allegedly discard; this assumption is load-bearing for the central claim that the three-layer memory plus atlases produce useful shaping rewards.

    Authors: We acknowledge that the current description of the atlas evolution process would benefit from additional formal support, particularly for continuous 3D settings. The manuscript outlines the lightweight grid heuristics and their derivation from trajectory statistics, but a dedicated derivation and ablation analysis for quantization effects in 3D navigation and manipulation is not fully elaborated. In the revised version, we will add a new subsection in the Method section that provides the mathematical formulation of the danger and affinity atlas updates, including the quantization step and a proof sketch showing preservation of key geometric features (e.g., via bounds on discretization error for spatial gradients). We will also include a targeted ablation comparing quantized atlases against continuous or finer-grid variants, measuring impact on shaping reward quality and downstream agent performance. These additions will directly substantiate that the three-layer memory yields useful, spatially faithful signals beyond what text-centric baselines provide. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract and results summary state outperformance on four benchmarks but supply no quantitative numbers, error bars, ablation tables, or statistical tests. Without these, the magnitude of gains (especially the “especially strong” spatial-task improvements) cannot be evaluated or compared to baselines.

    Authors: We agree that the abstract and the high-level results summary do not contain specific numerical values, error bars, or statistical tests, which makes it difficult to fully evaluate the reported gains. The full Experiments section does contain comparative tables and figures with performance metrics across Sokoban, FrozenLake, 3D navigation, and robotic manipulation. To address the concern, we will expand the results summary paragraph to include key quantitative results (e.g., success rates and improvement margins) along with standard deviations as error bars. We will add ablation tables and report statistical significance (e.g., via paired t-tests or Wilcoxon tests) for the spatial-task improvements. Due to abstract length limits, we will focus these enhancements on the main text summary and Experiments section rather than modifying the abstract itself. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

full rationale

The paper presents an architectural proposal for a three-layer visual memory system whose atlases are constructed from trajectory statistics and grid heuristics, then deployed as shaping rewards. No equations, first-principles derivations, or fitted-parameter predictions appear in the provided text. Performance claims rest on benchmark comparisons rather than any quantity defined by construction from the inputs. The design choices are motivated by stated assumptions about spatial information loss in text-centric baselines, but these assumptions are not smuggled in via self-citation or self-definition; they are tested externally through experiments. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full paper likely contains additional parameters and assumptions not visible here.

free parameters (1)
  • grid heuristic parameters
    Lightweight grid heuristics are invoked to evolve atlases but no specific values or fitting procedure are stated in the abstract.
axioms (1)
  • domain assumption Geometric priors are better preserved in visual form than in language for spatial decision making
    This is presented as the motivating insight for keeping memory visually grounded.
invented entities (1)
  • danger and affinity atlases no independent evidence
    purpose: Self-evolving potential-based shaping rewards derived from trajectory statistics
    New maps introduced to provide dense visual guidance without external supervision; no independent evidence outside the framework is described.

pith-pipeline@v0.9.0 · 5771 in / 1490 out tokens · 48107 ms · 2026-05-20T11:21:47.753768+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1]

    IEEE Transactions on Pattern Analysis and Machine Intelligence47(7), 5130–5145 (2025)

    An, D., Wang, H., Wang, W., Wang, Z., Huang, Y ., He, K., Wang, L.: Etpnav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence47(7), 5130–5145 (2025). https://doi.org/10.1109/TPAMI.2024.3386695

  2. [2]

    https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf(2024), model card

    Anthropic: The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf(2024), model card

  3. [3]

    Anthropic: Introducing claude sonnet 4.5.https://www.anthropic.com/news/claude-sonnet-4-5(2025)

  4. [4]

    In: The Twelfth International Conference on Learning Representations (2023)

    Asai, A., Wu, Z., Wang, Y ., Sil, A., Hajishirzi, H.: Self-rag: Learning to retrieve, generate, and critique through self-reflection. In: The Twelfth International Conference on Learning Representations (2023)

  5. [5]

    Advances in Neural Information Processing Systems37, 12461–12495 (2024)

    Bai, H., Zhou, Y ., Cemri, M., Pan, J., Suhr, A., Levine, S., Kumar, A.: Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems37, 12461–12495 (2024)

  6. [6]

    Text Reading, and Beyond2(1), 1 (2023)

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization. Text Reading, and Beyond2(1), 1 (2023)

  7. [7]

    Advances in neural information processing systems29(2016)

    Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems29(2016)

  8. [8]

    Advances in Neural Information Processing Systems37, 135062–135093 (2024)

    Cheng, A.C., Yin, H., Fu, Y ., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems37, 135062–135093 (2024)

  9. [9]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Chhikara, P., Khant, D., Aryan, S., Singh, T., Yadav, D.: Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413 (2025)

  10. [10]

    In: European Conference on Computer Vision

    Cui, X., Liu, Q., Liu, Z., Wang, H.: Frontier-enhanced topological memory with improved exploration awareness for embodied visual navigation. In: European Conference on Computer Vision. pp. 296–313. Springer (2024)

  11. [11]

    Advances in Neural Information Processing Systems36, 28091–28114 (2023)

    Deng, X., Gu, Y ., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y .: Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems36, 28091–28114 (2023)

  12. [12]

    arXiv preprint arXiv:2510.02240 (2025)

    Feng, S., Tuo, K., Wang, S., Kong, L., Zhu, J., Wang, H.: Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning. arXiv preprint arXiv:2510.02240 (2025)

  13. [13]

    https://blog.google/technology/google-deepmind/ gemini-model-thinking-updates-march-2025/(2025)

    Google: Gemini 2.5: Our most intelligent ai model. https://blog.google/technology/google-deepmind/ gemini-model-thinking-updates-march-2025/(2025)

  14. [14]

    In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=kxnoqaisCT

    Gou, B., Wang, R., Zheng, B., Xie, Y ., Chang, C., Shu, Y ., Sun, H., Su, Y .: Navigating the digital world as humans do: Universal visual grounding for GUI agents. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=kxnoqaisCT

  15. [15]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  16. [16]

    Xskill: Continual learning from experience and skills in multimodal agents,

    Jiang, G., Su, Z., Qu, X., et al.: Xskill: Continual learning from experience and skills in multimodal agents. arXiv preprint arXiv:2603.12056 (2026)

  17. [17]

    In: European Conference on Computer Vision

    Jiang, H., Lu, Z.: Visual grounding for object-level generalization in reinforcement learning. In: European Conference on Computer Vision. pp. 55–72. Springer (2024)

  18. [18]

    arXiv preprint arXiv:2501.15418 (2025)

    Jiang, Y ., Liu, Q., Yang, Y ., Ma, X., Zhong, D., Hu, H., Yang, J., Liang, B., Xu, B., Zhang, C., et al.: Episodic novelty through temporal distance. arXiv preprint arXiv:2501.15418 (2025)

  19. [19]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Kang, H., Sachdeva, E., Gupta, P., Bae, S., Lee, K.: Gflowvlm: Enhancing multi-step reasoning in vision-language models with generative flow networks. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3815–3825 (2025)

  20. [20]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers)

    Koh, J.Y ., Lo, R., Jang, L., Duvvur, V ., Lim, M., Huang, P.Y ., Neubig, G., Zhou, S., Salakhutdinov, R., Fried, D.: Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers). pp. 881–905 (2024)

  21. [21]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M., Ehsani, K., Gordon, D., Zhu, Y ., et al.: Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017) 11

  22. [22]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, D., Zhang, Y ., Cao, M., Liu, D., Xie, W., Hui, T., Lin, L., Xie, Z., Li, Y .: Towards long-horizon vision-language-action system: Reasoning, acting and memory. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6839–6848 (2025)

  23. [23]

    IEEE Transactions on Pattern Analysis and Machine Intelligence 47(7), 5945–5957 (2025)

    Lin, B., Nie, Y ., Wei, Z., Chen, J., Ma, S., Han, J., Xu, H., Chang, X., Liang, X.: Navcot: Boosting llm-based vision-and- language navigation via learning disentangled reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(7), 5945–5957 (2025). https://doi.org/10.1109/TPAMI.2025.3554559

  24. [24]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: Showui: One vision-language-action model for gui visual agent. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19498–19508 (2025)

  25. [25]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, R., Wang, W., Yang, Y .: V olumetric environment representation for vision-language navigation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16317–16328 (2024)

  26. [26]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Luo, T., Logeswaran, L., Johnson, J., Lee, H.: Visual test-time scaling for gui agent grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19989–19998 (2025)

  27. [27]

    In: Icml

    Ng, A.Y ., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml. vol. 99, pp. 278–287. Citeseer (1999)

  28. [28]

    arXiv preprint arXiv:2402.07945 (2024)

    Niu, R., Li, J., Wang, S., Fu, Y ., Hu, X., Leng, X., Kong, H., Chang, Y ., Wang, Q.: Screenagent: A vision language model-driven computer control agent. arXiv preprint arXiv:2402.07945 (2024)

  29. [29]

    OpenAI: Introducing gpt-5.https://openai.com/index/introducing-gpt-5/(2025)

  30. [30]

    OpenAI: Introducing openai o3 and o4-mini.https://openai.com/index/introducing-o3-and-o4-mini/(2025)

  31. [31]

    arXiv preprint arXiv:2411.13543 (2024)

    Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pignatelli, E., Kuci´nski, Ł., Pinto, L., Fergus, R., et al.: Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543 (2024)

  32. [32]

    arXiv preprint arXiv:2310.12921 (2023)

    Rocamonde, J., Montesinos, V ., Nava, E., Perez, E., Lindner, D.: Vision-language models are zero-shot reward models for reinforcement learning. arXiv preprint arXiv:2310.12921 (2023)

  33. [33]

    Grounded Reinforcement Learning for Visual Reasoning

    Sarch, G., Saha, S., Khandelwal, N., Jain, A., Tarr, M.J., Kumar, A., Fragkiadaki, K.: Grounded reinforcement learning for visual reasoning. arXiv preprint arXiv:2505.23678 (2025)

  34. [34]

    In: The Twelfth International Conference on Learning Representations (2024)

    Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., Manning, C.D.: Raptor: Recursive abstractive processing for tree-organized retrieval. In: The Twelfth International Conference on Learning Representations (2024)

  35. [35]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Shen, H., Liu, P., Li, J., Fang, C., Ma, Y ., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

  36. [36]

    Advances in neural information processing systems36, 8634–8652 (2023)

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems36, 8634–8652 (2023)

  37. [37]

    In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/ forum?id=NxyfSW6mLK

    Sridhar, K., Dutta, S., Jayaraman, D., Lee, I.: REGENT: A retrieval-augmented generalist agent that can act in-context in new environments. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/ forum?id=NxyfSW6mLK

  38. [38]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,

    Tao, S., Xiang, F., Shukla, A., Qin, Y ., Hinrichsen, X., Yuan, X., Bao, C., Lin, X., Liu, Y ., Chan, T.k., et al.: Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425 (2024)

  39. [39]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  40. [40]

    Gymnasium: A Standard Interface for Reinforcement Learning Environments

    Towers, M., Kwiatkowski, A., Terry, J., Balis, J.U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., et al.: Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032 (2024)

  41. [41]

    arXiv preprint arXiv:2510.16907 (2025)

    Wang, K., Zhang, P., Wang, Z., Gao, Y ., Li, L., Wang, Q., Chen, H., Wan, C., Lu, Y ., Yang, Z., et al.: Vagen: Reinforcing world model reasoning for multi-turn vlm agents. arXiv preprint arXiv:2510.16907 (2025)

  42. [42]

    arXiv preprint arXiv:2508.02694 (2025)

    Wang, N., Hu, X., Liu, P., Zhu, H., Hou, Y ., Huang, H., Zhang, S., Yang, J., Liu, J., Zhang, G., et al.: Efficient agents: Building effective agents while reducing cost. arXiv preprint arXiv:2508.02694 (2025)

  43. [43]

    arXiv preprint arXiv:2402.03681 (2024) 12

    Wang, Y ., Sun, Z., Zhang, J., Xian, Z., Biyik, E., Held, D., Erickson, Z.: Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. arXiv preprint arXiv:2402.03681 (2024) 12

  44. [44]

    Advances in Neural Information Processing Systems37, 73278–73308 (2024)

    Wang, Z., Cai, S., Mu, Z., Lin, H., Zhang, C., Liu, X., Li, Q., Liu, A., Ma, X., Liang, Y .: Omnijarvis: Unified vision-language- action tokenization enables open-world instruction following agents. Advances in Neural Information Processing Systems37, 73278–73308 (2024)

  45. [45]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, Z., Chen, W., Yang, L., Zhou, S., Zhao, S., Zhan, H., Jin, J., Li, L., Shao, Z., Bu, J.: Mp-gui: Modality perception with mllms for gui understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29711–29721 (2025)

  46. [46]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Wei, T., Yang, Y ., Xing, J., Shi, Y ., Lu, Z., Ye, D.: Gtr: Guided thought reinforcement prevents thought collapse in rl-based vlm agent training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18855–18865 (2025)

  47. [47]

    arXiv preprint arXiv:2506.03143 (2025)

    Wu, Q., Cheng, K., Yang, R., Zhang, C., Yang, J., Jiang, H., Mu, J., Peng, B., Qiao, B., Tan, R., et al.: Gui-actor: Coordinate-free visual grounding for gui agents. arXiv preprint arXiv:2506.03143 (2025)

  48. [48]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Xia, P., Chen, J., Wang, H., Liu, J., Zeng, K., Wang, Y ., Han, S., Zhou, Y ., Zhao, X., Chen, H., et al.: Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234 (2026)

  49. [49]

    A-MEM: Agentic Memory for LLM Agents

    Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., Zhang, Y .: A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110 (2025)

  50. [50]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025)

  51. [51]

    EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

    Yang, R., Chen, H., Zhang, J., Zhao, M., Qian, C., Wang, K., Wang, Q., Koripella, T.V ., Movahedi, M., Li, M., et al.: Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560 (2025)

  52. [52]

    In: International Conference on Learning Representations (ICLR), 2024 (07/05/2024-11/05/2024, Vienna, Austria) (2024)

    Yang, Y .: Text2reward: Reward shaping with language models for reinforcement learning. In: International Conference on Learning Representations (ICLR), 2024 (07/05/2024-11/05/2024, Vienna, Austria) (2024)

  53. [53]

    In: The eleventh international conference on learning representations (2022)

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y .: React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations (2022)

  54. [54]

    Advances in Neural Information Processing Systems36, 78227–78239 (2023)

    Zhang, D., Chen, L., Zhang, S., Xu, H., Zhao, Z., Yu, K.: Large language models are semi-parametric reinforcement learning agents. Advances in Neural Information Processing Systems36, 78227–78239 (2023)

  55. [55]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y .J., Huang, G.: Expel: Llm agents are experiential learners. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 19632–19642 (2024)

  56. [56]

    Advances in Neural Information Processing Systems37, 43730–43758 (2024)

    Zhong, V ., Misra, D., Yuan, X., Côté, M.A.: Policy improvement using language feedback models. Advances in Neural Information Processing Systems37, 43730–43758 (2024)

  57. [57]

    Gui-g1: Understanding r1-zero- like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025

    Zhou, Y ., Dai, S., Wang, S., Zhou, K., Jia, Q., Xu, J.: Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. arXiv preprint arXiv:2505.15810 (2025) 13 Appendix A Algorithmic and Mathematical Framework While Section 3 introduced the conceptual framework of AtlasV A, including the Visual Skill Memory and dense reward shaping, thi...

  58. [58]

    spatial blindness

    </answer> <think> The green cube is now on the left target at (75, 122, 20). The red cube is still in the buffer zone at (24, 4, 20). To complete the task, I need to pick up the red cube from the buffer zone and place it on the right target. First action: Pick up the red cube from the buffer zone. Second action: Place the red c u b e o n t h e r i g h t t...