
arxiv: 2505.19237 · v2 · submitted 2025-05-25 · 💻 cs.AI · cs.RO

Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots

Pith reviewed 2026-05-19 13:25 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords self-recognition · multimodal LLM · embodied AI · sensorimotor experience · minimal self · robot autonomy · sensory integration · artificial selfhood

The pith

Multimodal large language models integrated into robots develop self-recognition from sensorimotor experience, opening a route to artificial selfhood.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a multimodal LLM placed inside a mobile robot can build an internal sense of its own body and place in the world when given ongoing sensory streams. It reports that the combined system shows environmental awareness, identifies itself as a robot, and anticipates its own movements, with statistical models tracing how different senses feed into separate aspects of this minimal self and how memory links past and present states. A sympathetic reader would care because self-recognition is described as the starting point for autonomous behavior, so success here would mean LLMs can move beyond text patterns toward grounded, embodied cognition when properly situated in physical agents.

Core claim

Integrating a multimodal LLM into an autonomous mobile robot produces robust environmental awareness, self-identification, and predictive awareness that lets the system infer its own robotic nature and motion characteristics. Structural equation modeling shows how sensory integration shapes distinct dimensions of the minimal self and coordinates them with structured and episodic memory, while ablation of sensory channels reveals compensatory interactions among inputs and confirms memory's essential role.
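To make the structural-modeling idea concrete, here is a minimal sketch of path analysis on observed variables, a simplified relative of SEM without latent variables. Everything in it is an assumption for illustration: the data are synthetic, the three-variable chain (sensory input, one self dimension, self-identification score) and all names are hypothetical, and the paper's actual model specification is not given on this page.

```python
import random
import statistics


def standardize(values):
    """Return z-scores; with standardized variables a single-predictor slope is a correlation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def slope(x, y):
    """Ordinary least-squares slope of y regressed on a single predictor x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n=2000, seed=0):
    """Synthetic chain (assumed): sensory input -> self dimension -> self-identification."""
    rng = random.Random(seed)
    sensory = [rng.gauss(0, 1) for _ in range(n)]
    self_dimension = [0.8 * s + rng.gauss(0, 0.6) for s in sensory]
    self_identification = [0.7 * d + rng.gauss(0, 0.7) for d in self_dimension]
    return sensory, self_dimension, self_identification


sensory, self_dimension, self_identification = (standardize(v) for v in simulate())

path_a = slope(sensory, self_dimension)              # sensory -> self dimension
path_b = slope(self_dimension, self_identification)  # self dimension -> self-identification
indirect_effect = path_a * path_b                    # effect routed through the mediator
total_effect = slope(sensory, self_identification)

print(f"a = {path_a:.2f}, b = {path_b:.2f}")
print(f"indirect = {indirect_effect:.2f}, total = {total_effect:.2f}")
```

Because the simulated chain has no direct path from sensory input to self-identification, the product of the two path coefficients should come out close to the total effect. A real SEM would additionally estimate latent variables and report fit statistics.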

What carries the argument

Sensorimotor integration through the multimodal LLM, which fuses visual, proprioceptive and other streams to construct and maintain an internal representation of the robot's body within its surroundings.

If this is right

  • The robot distinguishes its own body and actions from surrounding objects using fused sensory data.
  • Removal of one sensory channel is offset by strengthened use of remaining channels to preserve self-identification.
  • Structured and episodic memory are required to link current sensations with past states for consistent self-recognition.
  • The resulting internal associations form a hierarchical structure that drives explicit self-identification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same integration pattern could be tried on different robot bodies or with newer multimodal models to check whether self-recognition generalizes beyond the specific platform used here.
  • If the approach scales, it supplies a concrete way to test whether minimal self representations can support more complex behaviors such as long-term planning or social interaction.
  • The work suggests that embodied selfhood may not require new architectures but can emerge when existing language models receive sustained, multimodal feedback from a physical body.

Load-bearing premise

The robot's spoken descriptions and action forecasts reflect a genuine internal model of itself rather than surface pattern matching from its training data or the prompt.

What would settle it

Place the robot in a novel environment, or give it contradictory sensory feedback. If it still reports the same self-identity and motion predictions as in the original trials, that would falsify the claim that the behavior arises from integrated sensorimotor self-representation.

Original abstract

Self-recognition -- the ability to maintain an internal representation of one's own body within the environment -- underpins intelligent, autonomous behavior. As a foundational component of the minimal self, self-recognition provides the initial substrate from which higher forms of self-awareness may eventually emerge. Recent advances in large language models achieve human-like performance in tasks integrating multimodal information, raising growing interest in the embodiment capabilities of AI agents deployed on nonhuman platforms such as robots. We investigate whether multimodal LLMs can develop self-recognition through sensorimotor experience by integrating an LLM into an autonomous mobile robot. The system exhibits robust environmental awareness, self-identification, and predictive awareness, enabling it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of the minimal self and their coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs demonstrate compensatory interactions among sensors and confirm the essential role of structured and episodic memory. Given appropriate sensory information about the world and itself, multimodal LLMs open the door to artificial selfhood in embodied cognitive systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that integrating a multimodal LLM into an autonomous mobile robot enables the development of self-recognition through sensorimotor experience. The system is reported to exhibit environmental awareness, self-identification, and predictive awareness of its robotic nature and motion. Structural equation modeling is used to show how sensory integration influences dimensions of the minimal self and their coordination with memory, while ablation tests demonstrate compensatory sensor interactions and the essential role of structured and episodic memory. The work concludes that appropriate sensory information allows multimodal LLMs to open the door to artificial selfhood in embodied systems.

Significance. If the central claims hold under rigorous validation, the work would be significant for embodied AI and cognitive robotics by providing empirical support for sensorimotor routes to minimal self in LLM-driven agents. It extends multimodal model capabilities to physical platforms and introduces SEM-based analysis of self-dimensions, which could inform future architectures for autonomous systems with internal self-models.

major comments (3)
  1. [Abstract] Abstract: The description of ablation tests and structural equation modeling provides no quantitative results, error bars, baseline comparisons, statistical significance values, or details on prompt engineering and data exclusion criteria. This absence makes it impossible to assess whether the reported sensory integration effects and memory ablations actually support the central claim of emergent self-recognition.
  2. [Results] Results/Interpretation sections: Evidence for self-identification and predictive awareness rests primarily on the LLM's own generated verbal reports and action predictions. Without independent behavioral metrics, external validation, or controls isolating embodiment (such as identical prompts and memory structures supplied with synthetic rather than real sensor streams), these outputs remain compatible with prompt-driven pattern completion from training data rather than an internally updated self-model.
  3. [Methods] Methods: The manuscript does not report controls that decouple the contribution of real sensorimotor loops from structured context about 'self' and 'robot body'. This leaves open the possibility that observed self-identification arises from surface-level completion rather than sensorimotor updating, directly bearing on the claim that the system infers its robotic nature through experience.
minor comments (2)
  1. [Abstract] The abstract and main text could more explicitly define the latent dimensions of the minimal self used in the SEM analysis and how they are operationalized from LLM outputs.
  2. [Figures/Tables] Figure captions and table presentations of ablation results should include exact sample sizes, variance measures, and comparison conditions to improve clarity.
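The reporting the referee asks for (sample sizes, variance measures, comparison conditions) can be sketched as follows. This is a minimal illustration, not the authors' analysis: the condition names and all scores are made-up placeholders, and the percentile bootstrap is one common choice of interval among several.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def summarize(scores):
    """Sample size, mean, sample standard deviation, and 95% bootstrap CI."""
    lower, upper = bootstrap_ci(scores)
    return {
        "n": len(scores),
        "mean": statistics.fmean(scores),
        "sd": statistics.stdev(scores),
        "ci95": (lower, upper),
    }


# Made-up placeholder scores (NOT results from the paper), one value per trial.
conditions = {
    "full_sensors": [0.82, 0.79, 0.88, 0.91, 0.85, 0.80, 0.87, 0.84],
    "camera_removed": [0.61, 0.70, 0.58, 0.66, 0.73, 0.64, 0.60, 0.69],
}

baseline = summarize(conditions["full_sensors"])
for name, scores in conditions.items():
    row = summarize(scores)
    delta = row["mean"] - baseline["mean"]  # difference from the full-sensor condition
    low, high = row["ci95"]
    print(f"{name}: n={row['n']} mean={row['mean']:.3f} sd={row['sd']:.3f} "
          f"95% CI=[{low:.3f}, {high:.3f}] delta vs full={delta:+.3f}")
```

Printing one row per condition, with the delta against the unablated baseline, gives a reader the comparison condition alongside the spread.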

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These highlight key opportunities to improve the transparency and rigor of our evidence for sensorimotor self-recognition in multimodal LLM-driven robots. We respond point by point below and commit to revisions that directly address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The description of ablation tests and structural equation modeling provides no quantitative results, error bars, baseline comparisons, statistical significance values, or details on prompt engineering and data exclusion criteria. This absence makes it impossible to assess whether the reported sensory integration effects and memory ablations actually support the central claim of emergent self-recognition.

    Authors: We agree that the abstract lacks the quantitative detail needed for full evaluation. In the revised version we will expand the abstract to report key SEM results (standardized coefficients, significance levels, and model fit statistics) along with ablation outcomes including performance deltas, error bars, and baseline comparisons. The Methods section will be updated with explicit descriptions of prompt engineering protocols and data exclusion criteria to allow readers to assess support for the sensory integration and memory claims.
    revision: yes

  2. Referee: [Results] Results/Interpretation sections: Evidence for self-identification and predictive awareness rests primarily on the LLM's own generated verbal reports and action predictions. Without independent behavioral metrics, external validation, or controls isolating embodiment (such as identical prompts and memory structures supplied with synthetic rather than real sensor streams), these outputs remain compatible with prompt-driven pattern completion from training data rather than an internally updated self-model.

    Authors: The concern is valid: verbal reports alone leave room for alternative explanations. We will add independent behavioral metrics extracted from logged robot trajectories and interaction success rates. We will also introduce control conditions that supply identical prompts and memory structures but replace real sensor streams with synthetic equivalents. These additions will allow direct comparison and help establish that self-identification reflects ongoing sensorimotor updating rather than static pattern completion.
    revision: yes

  3. Referee: [Methods] Methods: The manuscript does not report controls that decouple the contribution of real sensorimotor loops from structured context about 'self' and 'robot body'. This leaves open the possibility that observed self-identification arises from surface-level completion rather than sensorimotor updating, directly bearing on the claim that the system infers its robotic nature through experience.

    Authors: We recognize that the existing sensory-ablation results do not fully isolate real-time sensorimotor updating from pre-provided contextual knowledge. In revision we will add and report explicit control experiments in which the model receives structured self- and body-related context but operates without live sensorimotor input, contrasting these against the full embodied condition. This will more directly test whether self-recognition depends on sensorimotor experience.
    revision: yes
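One way the proposed control comparison could be analyzed is a permutation test on self-identification scores from the live-sensor condition versus the control condition. The sketch below is an assumption about a possible analysis, not something the authors describe: the scores are made-up placeholders and the function and variable names are hypothetical.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of trials across the two conditions
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up placeholder scores (NOT results from the paper).
live_sensors = [0.90, 0.80, 0.85, 0.95, 0.88, 0.92]
synthetic_sensors = [0.30, 0.40, 0.35, 0.25, 0.38, 0.33]

p = permutation_p_value(live_sensors, synthetic_sensors)
print(f"p = {p:.4f}")
```

Under this reading, a small p-value would indicate that scores differ between conditions. Scores that stay equally high without live input would instead be consistent with the referee's concern about prompt-driven completion.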

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical setup integrating a multimodal LLM into a mobile robot and reports observed behaviors (environmental awareness, self-identification via verbal reports and actions), supported by structural equation modeling on sensory integration data and ablation tests on memory and inputs. These steps rely on external experimental measurements and statistical analysis of outputs rather than defining the target phenomenon in terms of itself or renaming fitted parameters as predictions. No equations or self-citation chains reduce the central claim to its inputs by construction; the derivation remains self-contained against the described behavioral and modeling benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-generated language about self and motion constitutes evidence of internal representation; no free parameters are explicitly named in the abstract, but the structural equation model necessarily introduces latent variables whose values are fitted to the observed behaviors.

free parameters (1)
  • latent self-dimensions in SEM
    Structural equation modeling fits latent variables that represent dimensions of the minimal self; these are not measured directly and are inferred from the data.
axioms (1)
  • domain assumption: LLM outputs can be treated as veridical reports of internal states
    The paper interprets the model's self-descriptions as evidence of self-recognition without independent verification that the descriptions reflect genuine internal modeling.
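For orientation on the latent variables in the free-parameter entry above, the generic textbook form of a structural equation model is sketched below. This is standard LISREL-style notation, not the paper's actual specification, which this page does not give. Reading $\boldsymbol{\xi}$ as sensory-integration factors and $\boldsymbol{\eta}$ as minimal-self dimensions is an illustrative assumption.

```latex
\begin{align}
  \mathbf{x} &= \Lambda_x \boldsymbol{\xi} + \boldsymbol{\delta}
    && \text{(measurement model, exogenous indicators)} \\
  \mathbf{y} &= \Lambda_y \boldsymbol{\eta} + \boldsymbol{\varepsilon}
    && \text{(measurement model, endogenous indicators)} \\
  \boldsymbol{\eta} &= B \boldsymbol{\eta} + \Gamma \boldsymbol{\xi} + \boldsymbol{\zeta}
    && \text{(structural model)}
\end{align}
```

Here $\mathbf{x}$ and $\mathbf{y}$ are observed indicators, $\Lambda_x$ and $\Lambda_y$ are loading matrices, $B$ and $\Gamma$ hold the path coefficients among latent variables, and $\boldsymbol{\delta}$, $\boldsymbol{\varepsilon}$, $\boldsymbol{\zeta}$ are error terms. The latent vectors are never measured directly. That is why the ledger counts them as fitted quantities.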

pith-pipeline@v0.9.0 · 5757 in / 1390 out tokens · 41757 ms · 2026-05-19T13:25:50.600745+00:00 · methodology


