Pith · machine review for the scientific record

arxiv: 2605.08816 · v1 · submitted 2026-05-09 · 💻 cs.AI · cs.CY

Recognition: 2 theorem links · Lean Theorem

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?


Pith reviewed 2026-05-12 03:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CY
keywords mirror self-recognition · vision-language models · embodied agents · self-identification · benchmark · VLM agents · self-grounding · perceptual grounding

The pith

Stronger vision-language models can recognize their reflections and use them to infer hidden attributes of their own bodies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether embodied vision-language model agents possess a form of mirror self-recognition analogous to that seen in some animals. It introduces a controlled three-dimensional setting in which the agent must infer a concealed attribute of its own body by looking in a mirror and then select the correct matching object. Results indicate that this skill emerges primarily in stronger models, which can translate the reflected image into useful action; weaker models tend to look at the mirror without drawing the right personal conclusion, or may even attribute the reflection to another agent. This matters because it offers a practical way to determine whether an agent's self-referential statements come from actual visual and action-based understanding or merely from language patterns and instructions, separating real perceptual grounding from superficial behavior.

Core claim

Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. To separate mirror-grounded self-identification from shortcuts, the tests include mirror removal, misleading cues, and occluded reflections. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification.

What carries the argument

A controlled 3D benchmark in which a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target while avoiding self-other misattribution, evaluated through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency.

If this is right

  • Stronger models demonstrate the ability to use mirror reflections causally for decision making.
  • The benchmark isolates perceptual self-grounding from prompt compliance and confabulation.
  • Weaker models exhibit inspection of mirrors without successful self-attribution.
  • Self-referential language does not guarantee grounded self-identification when visual evidence is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could apply similar tests to other sensory modalities or more complex environments to track the development of self-awareness.
  • This may indicate that embodied self-grounding requires sufficient model capacity to integrate vision with action planning.
  • Researchers could use the benchmark to compare different training approaches for improving self-recognition in agents.

Load-bearing premise

The controlled 3D benchmark with tests for mirror removal, misleading cues, and occluded reflections successfully isolates mirror-grounded self-identification from shortcuts such as prompt compliance or confabulation.

What would settle it

A clear falsifier would be if stronger models continue to select the correct target even when the mirror is removed or the reflection is occluded, showing they are not actually relying on the visual reflection.
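That falsifier reduces to an ablation comparison: a mirror-grounded model should be well above chance with the mirror present and collapse toward chance once it is removed or occluded. A minimal sketch of that check, with a hypothetical function name and an arbitrary 0.05 tolerance (neither is from the paper):

```python
def relies_on_mirror(acc_mirror: float, acc_removed: float,
                     chance: float, margin: float = 0.05) -> bool:
    """Return True if performance is plausibly mirror-grounded.

    Mirror-grounded behavior predicts above-chance accuracy with the
    mirror present, and near-chance accuracy once the mirror is removed
    or the reflection occluded. The margin is illustrative only.
    """
    above_chance = acc_mirror > chance + margin
    collapses = abs(acc_removed - chance) <= margin
    return above_chance and collapses
```

A model that stays accurate without the mirror fails the check, which is exactly the falsifying outcome: its success was never causally routed through the reflection.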

Figures

Figures reproduced from arXiv: 2605.08816 by Bruno Lepri, Ciro Beneduce, Filippo Ziliotto, Luciano Serafini, Massimiliano Luca, Tommaso Campari.

Figure 1: Mirror self-recognition as a test of embodied self-grounding. Across species, mirror self-recognition emerges selectively: humans, elephants, and dolphins can use mirror reflections to guide self-directed behavior, while other animals fail or rely on simpler strategies. We ask whether a similar capability emerges in vision-language model (VLM) agents. In our setting, an embodied agent must infer a hidden b… view at source ↗
Figure 2: Experimental settings. Visual examples of the proposed conditions and 3D environments. At each timestep, the agent observes the current first-person-view RGB frame, a compact textual history, and minimal navigation context. It then outputs a discrete navigation action, an image description, and an updated episode summary that is carried forward as context. view at source ↗
Figure 3: Behavioral self-recognition profiles across task variants. Model comparison across experiments using TSA, TTD, MCR, MTATO, CAAL, and CR. Higher is better on all axes; therefore, TTD and CR are reversed for visualization. The red marker on the TSA axis indicates chance level. view at source ↗
Figure 4: Compact summary of all reported results, highlighting both the strongest overall models and the changing metric profiles across experimental conditions. view at source ↗
Figure 5: TSA vs. CAAL across experiments, making the action–attribution gap visible. Several models in E3 and E4 achieve moderate or high CAAL without matching TSA, indicating that correct self-attribution does not reliably translate into correct final behavior. view at source ↗
Figure 6: Completion gap across experiments. Each cell reports TSA_complete − TSA for one model in one experimental condition. Larger values indicate models that are substantially more accurate when conditioning only on episodes that terminate with done, and therefore suggest that part of the observed failure arises from an unstable completion or commitment policy rather than uniformly incorrect task resolution. view at source ↗
Figure 7: Final self-identification outcomes in E5. Bars decompose the final outcomes for each model into correct self-attributions, wrong committed self-attributions, and unknown or non-committed responses. Unlike the cube-selection tasks, E5 often fails through non-commitment rather than explicit misidentification, which is not fully visible from TSA alone. view at source ↗
Figure 8: Qualitative examples for each experimental setting. Episodes taken from the evaluation benchmark, from different tested models in the MuJoCo environment. view at source ↗
Original abstract

In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional capability emerges in embodied vision-language model (VLM) agents: can they recognize themselves in a mirror? We introduce a controlled 3D benchmark where a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, while avoiding self-other misattribution. To separate mirror-grounded self-identification from shortcuts, we test mirror removal, misleading cues, and occluded reflections. We also evaluate the decision process through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency. Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification. Overall, mirror-based evaluation provides a diagnostic for whether embodied self-grounding is causally rooted in perception and action rather than priors, prompt compliance, or confabulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces a novel 3D benchmark designed to test mirror self-recognition capabilities in embodied vision-language model (VLM) agents. The benchmark involves agents inferring hidden body attributes from their mirror reflections to select matching targets, incorporating controls such as mirror removal, misleading cues, and occluded reflections to distinguish true self-identification from alternative explanations like prompt compliance or confabulation. The authors evaluate decision processes using metrics including mirror seeking behavior, temporal ordering, self-attribution, and reasoning-action consistency. Key findings indicate that mirror-based self-identification primarily emerges in stronger VLMs, which can leverage reflected visual evidence for appropriate actions, whereas weaker models tend to inspect mirrors without extracting self-relevant information or misattribute reflections. Additionally, a language-vision conflict test demonstrates that self-referential language alone does not indicate grounded self-identification.

Significance. If the experimental results hold under scrutiny, this work offers a valuable new tool for assessing self-grounding in AI agents by adapting a classic test from animal cognition. It highlights the importance of perceptual and action-based grounding over linguistic priors in embodied VLMs, which has implications for developing more robust and self-aware AI systems. The controlled nature of the benchmark and the differentiation between model strengths contribute to a better understanding of current limitations in VLM agents' cognitive capabilities.

major comments (2)
  1. [§4] §4 (Benchmark Design): The claim that the controls (mirror removal, occluded reflections) successfully isolate mirror-grounded self-identification is load-bearing for the central result; however, the manuscript does not report quantitative failure modes (e.g., percentage of misattribution vs. non-inspection) broken down by model strength, making it difficult to confirm the isolation holds across the tested agents.
  2. [§5.3] §5.3 (Language-Vision Conflict Test): The test is presented as evidence that self-referential language is insufficient, but the manuscript lacks detail on the exact prompting procedure for inducing conflict and the consistency metric used to score reasoning-action alignment; without these, the distinction between stronger and weaker models rests on unverified process tracing.
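For context on what the requested metric could look like, one simple form of a reasoning-action consistency score is the fraction of steps where the action named in the model's stated reasoning matches the action it actually executed. This is purely illustrative; the paper's actual definition is precisely what the referee is asking the authors to state:

```python
def consistency(pairs: list[tuple[str, str]]) -> float:
    """Fraction of steps where the stated plan matches the executed action.

    `pairs` holds (intended_action, executed_action) per step, e.g. an
    action name extracted from the model's reasoning text versus the
    discrete navigation action it actually emitted. Illustrative only;
    not the paper's metric.
    """
    if not pairs:
        return 0.0
    return sum(intent == acted for intent, acted in pairs) / len(pairs)
```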
minor comments (3)
  1. The introduction would benefit from explicit citations to prior VLM embodiment benchmarks (e.g., those testing spatial reasoning or object permanence) to better situate the novelty of the mirror test.
  2. [Table 1] Table 1 (or equivalent results table) should include per-model success rates with standard errors or confidence intervals to allow readers to assess the robustness of the 'stronger vs. weaker' distinction.
  3. Figure captions for the benchmark scenarios could be expanded to label the agent's viewpoint, mirror placement, and target options explicitly, improving reproducibility for readers attempting to replicate the 3D setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and positive assessment of the work's significance. We address each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Design): The claim that the controls (mirror removal, occluded reflections) successfully isolate mirror-grounded self-identification is load-bearing for the central result; however, the manuscript does not report quantitative failure modes (e.g., percentage of misattribution vs. non-inspection) broken down by model strength, making it difficult to confirm the isolation holds across the tested agents.

    Authors: We agree that a quantitative breakdown of failure modes would provide stronger evidence that the controls isolate mirror-grounded self-identification. In the revised manuscript we will add a new table in §4 reporting, for each model, the percentages of trials exhibiting misattribution versus non-inspection (and other failure categories) under the mirror-removal and occluded-reflection conditions. These statistics will be stratified by model strength to allow direct verification that the isolation holds across the tested agents. The core experimental design and results remain unchanged. revision: yes

  2. Referee: [§5.3] §5.3 (Language-Vision Conflict Test): The test is presented as evidence that self-referential language is insufficient, but the manuscript lacks detail on the exact prompting procedure for inducing conflict and the consistency metric used to score reasoning-action alignment; without these, the distinction between stronger and weaker models rests on unverified process tracing.

    Authors: We acknowledge that additional procedural detail would improve transparency and reproducibility. In the revision we will expand §5.3 with (i) the exact prompt templates used to create language-vision conflicts and (ii) the formal definition and computation of the reasoning-action consistency metric (including the alignment scoring rule between extracted reasoning steps and executed actions). These details were recorded in our experimental protocol but omitted from the main text for space; their inclusion will not alter the reported outcomes or the distinction between stronger and weaker models. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical benchmark study introducing a controlled 3D environment to test mirror self-recognition in VLM agents. Claims rest on observed experimental outcomes (e.g., stronger models using reflected evidence for action while weaker ones fail) and process metrics such as mirror seeking and self-attribution, with explicit controls for confounds like mirror removal and occluded reflections. No derivations, equations, fitted parameters, or load-bearing self-citations appear; the design directly measures perceptual grounding versus shortcuts without reducing to internal definitions or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper; no free parameters, mathematical axioms, or invented entities are described or required in the abstract.

pith-pipeline@v0.9.0 · 5526 in / 1138 out tokens · 55744 ms · 2026-05-12T03:30:38.138293+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 6 internal anchors
