Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots
Pith reviewed 2026-05-19 13:25 UTC · model grok-4.3
The pith
Multimodal large language models integrated into robots develop self-recognition from sensorimotor experience, opening a route to artificial selfhood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Integrating a multimodal LLM into an autonomous mobile robot produces robust environmental awareness, self-identification, and predictive awareness that lets the system infer its own robotic nature and motion characteristics. Structural equation modeling shows how sensory integration shapes distinct dimensions of the minimal self and coordinates them with structured and episodic memory, while ablation of sensory channels reveals compensatory interactions among inputs and confirms memory's essential role.
What carries the argument
Sensorimotor integration through the multimodal LLM, which fuses visual, proprioceptive and other streams to construct and maintain an internal representation of the robot's body within its surroundings.
If this is right
- The robot distinguishes its own body and actions from surrounding objects using fused sensory data.
- Removal of one sensory channel is offset by strengthened use of remaining channels to preserve self-identification.
- Structured and episodic memory are required to link current sensations with past states for consistent self-recognition.
- The resulting internal associations form a hierarchical structure that drives explicit self-identification.
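The ablation logic behind these points can be sketched as a leave-one-channel-out loop. This is a minimal illustration, not the authors' pipeline: the channel names, the `toy_score` function, and the trial data are all hypothetical placeholders, and a real study would replace the scorer with an external rating of the robot's self-identification output.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Optional

# Hypothetical channel names; the paper's actual sensor set is not listed here.
CHANNELS = ["camera", "lidar", "imu", "odometry"]

Observation = Dict[str, float]
Scorer = Callable[[Observation], float]


def ablate(obs: Observation, dropped: Optional[str]) -> Observation:
    """Return a copy of the observation with one channel removed (or none)."""
    return {k: v for k, v in obs.items() if k != dropped}


def summarize(scores: List[float]) -> Dict[str, float]:
    """Sample size, mean, and sample standard deviation for one condition."""
    sd = stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": mean(scores), "sd": sd}


def run_ablation(trials: List[Observation], score: Scorer) -> Dict[str, Dict[str, float]]:
    """Score every trial under the full condition and each single-channel ablation."""
    results: Dict[str, Dict[str, float]] = {}
    for dropped in [None] + CHANNELS:
        label = "full" if dropped is None else f"no_{dropped}"
        results[label] = summarize([score(ablate(t, dropped)) for t in trials])
    baseline = results["full"]["mean"]
    for row in results.values():
        # Negative delta means the ablation hurt relative to the full condition.
        row["delta_vs_full"] = row["mean"] - baseline
    return results


def toy_score(obs: Observation) -> float:
    """Placeholder scorer (assumption): fraction of channels still present."""
    return len(obs) / len(CHANNELS)


if __name__ == "__main__":
    trials = [{c: 1.0 for c in CHANNELS} for _ in range(5)]
    for label, row in run_ablation(trials, toy_score).items():
        print(label, row)
```

Under this framing, compensation among sensors would show up as a per-channel delta that is smaller than the channel's apparent share of the input, which the toy scorer deliberately does not model.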
Where Pith is reading between the lines
- The same integration pattern could be tried on different robot bodies or with newer multimodal models to check whether self-recognition generalizes beyond the specific platform used here.
- If the approach scales, it supplies a concrete way to test whether minimal self representations can support more complex behaviors such as long-term planning or social interaction.
- The work suggests that embodied selfhood may not require new architectures but can emerge when existing language models receive sustained, multimodal feedback from a physical body.
Load-bearing premise
The robot's spoken descriptions and action forecasts reflect a genuine internal model of itself rather than surface pattern matching from its training data or the prompt.
What would settle it
The claim would be falsified if the robot, placed in a novel environment or given contradictory sensory feedback, still reported the same self-identity and motion predictions as in the original trials. Output that does not change when the input changes cannot be arising from an integrated sensorimotor self-representation.
Original abstract
Self-recognition -- the ability to maintain an internal representation of one's own body within the environment -- underpins intelligent, autonomous behavior. As a foundational component of the minimal self, self-recognition provides the initial substrate from which higher forms of self-awareness may eventually emerge. Recent advances in large language models achieve human-like performance in tasks integrating multimodal information, raising growing interest in the embodiment capabilities of AI agents deployed on nonhuman platforms such as robots. We investigate whether multimodal LLMs can develop self-recognition through sensorimotor experience by integrating an LLM into an autonomous mobile robot. The system exhibits robust environmental awareness, self-identification, and predictive awareness, enabling it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of the minimal self and their coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs demonstrate compensatory interactions among sensors and confirm the essential role of structured and episodic memory. Given appropriate sensory information about the world and itself, multimodal LLMs open the door to artificial selfhood in embodied cognitive systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that integrating a multimodal LLM into an autonomous mobile robot enables the development of self-recognition through sensorimotor experience. The system is reported to exhibit environmental awareness, self-identification, and predictive awareness of its robotic nature and motion. Structural equation modeling is used to show how sensory integration influences dimensions of the minimal self and their coordination with memory, while ablation tests demonstrate compensatory sensor interactions and the essential role of structured and episodic memory. The work concludes that appropriate sensory information allows multimodal LLMs to open the door to artificial selfhood in embodied systems.
Significance. If the central claims hold under rigorous validation, the work would be significant for embodied AI and cognitive robotics by providing empirical support for sensorimotor routes to minimal self in LLM-driven agents. It extends multimodal model capabilities to physical platforms and introduces SEM-based analysis of self-dimensions, which could inform future architectures for autonomous systems with internal self-models.
major comments (3)
- [Abstract] Abstract: The description of ablation tests and structural equation modeling provides no quantitative results, error bars, baseline comparisons, statistical significance values, or details on prompt engineering and data exclusion criteria. This absence makes it impossible to assess whether the reported sensory integration effects and memory ablations actually support the central claim of emergent self-recognition.
- [Results] Results/Interpretation sections: Evidence for self-identification and predictive awareness rests primarily on the LLM's own generated verbal reports and action predictions. Without independent behavioral metrics, external validation, or controls isolating embodiment (such as identical prompts and memory structures supplied with synthetic rather than real sensor streams), these outputs remain compatible with prompt-driven pattern completion from training data rather than an internally updated self-model.
- [Methods] Methods: The manuscript does not report controls that decouple the contribution of real sensorimotor loops from structured context about 'self' and 'robot body'. This leaves open the possibility that observed self-identification arises from surface-level completion rather than sensorimotor updating, directly bearing on the claim that the system infers its robotic nature through experience.
minor comments (2)
- [Abstract] The abstract and main text could more explicitly define the latent dimensions of the minimal self used in the SEM analysis and how they are operationalized from LLM outputs.
- [Figures/Tables] Figure captions and table presentations of ablation results should include exact sample sizes, variance measures, and comparison conditions to improve clarity.
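For readers weighing the request to define the latent dimensions, it may help to recall how a structural equation model is conventionally written. The block below uses standard LISREL-style notation. Mapping the latent variables to minimal-self dimensions and the indicators to scored LLM outputs is an assumption made here for illustration, not the specification reported in the paper.

```latex
\begin{aligned}
% Measurement model: observed indicators load on latent variables
x &= \Lambda_x \xi + \delta, \\
y &= \Lambda_y \eta + \varepsilon, \\
% Structural model: directed relations among latent variables
\eta &= B \eta + \Gamma \xi + \zeta.
\end{aligned}
```

Here $\xi$ are exogenous latent variables (for example, sensory integration), $\eta$ are endogenous latent variables (for example, dimensions of the minimal self), $x$ and $y$ are their observed indicators, $\Lambda_x$ and $\Lambda_y$ are loading matrices, $B$ and $\Gamma$ hold the path coefficients, and $\delta$, $\varepsilon$, $\zeta$ are error terms. Operationalizing the model means stating which scored outputs serve as $x$ and $y$ and which entries of $B$ and $\Gamma$ are free.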
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. These highlight key opportunities to improve the transparency and rigor of our evidence for sensorimotor self-recognition in multimodal LLM-driven robots. We respond point by point below and commit to revisions that directly address the concerns raised.
- Referee: [Abstract] Abstract: The description of ablation tests and structural equation modeling provides no quantitative results, error bars, baseline comparisons, statistical significance values, or details on prompt engineering and data exclusion criteria. This absence makes it impossible to assess whether the reported sensory integration effects and memory ablations actually support the central claim of emergent self-recognition.
  Authors: We agree that the abstract lacks the quantitative detail needed for full evaluation. In the revised version we will expand the abstract to report key SEM results (standardized coefficients, significance levels, and model fit statistics) along with ablation outcomes including performance deltas, error bars, and baseline comparisons. The Methods section will be updated with explicit descriptions of prompt engineering protocols and data exclusion criteria to allow readers to assess support for the sensory integration and memory claims.
  Revision: yes
- Referee: [Results] Results/Interpretation sections: Evidence for self-identification and predictive awareness rests primarily on the LLM's own generated verbal reports and action predictions. Without independent behavioral metrics, external validation, or controls isolating embodiment (such as identical prompts and memory structures supplied with synthetic rather than real sensor streams), these outputs remain compatible with prompt-driven pattern completion from training data rather than an internally updated self-model.
  Authors: The concern is valid: verbal reports alone leave room for alternative explanations. We will add independent behavioral metrics extracted from logged robot trajectories and interaction success rates. We will also introduce control conditions that supply identical prompts and memory structures but replace real sensor streams with synthetic equivalents. These additions will allow direct comparison and help establish that self-identification reflects ongoing sensorimotor updating rather than static pattern completion.
  Revision: yes
- Referee: [Methods] Methods: The manuscript does not report controls that decouple the contribution of real sensorimotor loops from structured context about 'self' and 'robot body'. This leaves open the possibility that observed self-identification arises from surface-level completion rather than sensorimotor updating, directly bearing on the claim that the system infers its robotic nature through experience.
  Authors: We recognize that the existing sensory-ablation results do not fully isolate real-time sensorimotor updating from pre-provided contextual knowledge. In revision we will add and report explicit control experiments in which the model receives structured self- and body-related context but operates without live sensorimotor input, contrasting these against the full embodied condition. This will more directly test whether self-recognition depends on sensorimotor experience.
  Revision: yes
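One way the promised control comparison could be analyzed is a two-sided permutation test on per-trial scores from the embodied condition and the synthetic-stream (or context-only) control. The sketch below is an assumption about a reasonable analysis, not something the authors describe. The function name, the score lists, and the choice of mean difference as the test statistic are all illustrative.

```python
import random
from statistics import mean
from typing import List, Sequence


def permutation_p_value(a: Sequence[float], b: Sequence[float],
                        n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in means.

    Returns the fraction of label shuffles whose absolute mean difference
    is at least as large as the observed one, with a +1 correction so the
    estimate is never exactly zero.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(a) - mean(b))
    pooled: List[float] = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    embodied = [0.90, 0.95, 0.92, 0.97, 0.91]
    synthetic = [0.88, 0.93, 0.90, 0.96, 0.94]
    print(permutation_p_value(embodied, synthetic))
```

A small p-value would indicate that scores differ between conditions. If the control condition scores as well as the embodied one, the referee's pattern-completion explanation remains open.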
Circularity Check
No significant circularity in derivation chain
The paper presents an empirical setup integrating a multimodal LLM into a mobile robot and reports observed behaviors (environmental awareness, self-identification via verbal reports and actions), supported by structural equation modeling on sensory integration data and ablation tests on memory and inputs. These steps rely on external experimental measurements and statistical analysis of outputs rather than defining the target phenomenon in terms of itself or renaming fitted parameters as predictions. No equations or self-citation chains reduce the central claim to its inputs by construction; the derivation remains self-contained against the described behavioral and modeling benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- latent self-dimensions in SEM
axioms (1)
- domain assumption: LLM outputs can be treated as veridical reports of internal states
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · bare_distinguishability_of_absolute_floor · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "differentiation, discriminating self-generated from external events through sensorimotor contingencies"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Acknowledgments
The authors would like to thank Rafael Sendra-Arranz and Álvaro Gutiérrez for their discussions and technical input during the development ...