
arxiv: 2604.27583 · v2 · pith:LQRIG6QZnew · submitted 2026-04-30 · 🧬 q-bio.NC · cs.RO

Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids

Pith reviewed 2026-07-01 08:07 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.RO
keywords motion retargeting · infant development · humanoid robots · sensorimotor simulation · multimodal data · pose estimation · developmental science

The pith

Motion retargeting from infant videos to humanoid robots reaches sub-centimeter accuracy on the best-matching embodiment and generates simulated multisensory streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that takes a single video of an infant, reconstructs the 3D body pose and skeletal structure frame by frame, and maps the motion onto physical and virtual humanoid platforms such as the iCub robot. Replaying the mapped motion on these platforms produces streams of proprioceptive, tactile, and visual data meant to approximate what an infant experiences. The authors show that the best-matched embodiment yields positional errors below one centimeter, which in turn supports detailed study of developmental patterns and automated labeling of actions. This approach is positioned as a bridge between video observation and internal sensorimotor states that are otherwise inaccessible.

Core claim

From a single video, the method extracts the infant's skeletal structure and estimates the full 3D pose in each frame, then maps the reconstructed motion onto the iCub, pyCub, EMFANT and MIMo embodiments. Replaying the retargeted motions on these platforms yields multisensory streams of joint and muscle proprioception, touch and vision. The retargeting reaches sub-centimeter accuracy for the best-matching embodiment, which enables multimodal analysis of infant development and automated behavior annotation.

What carries the argument

The motion retargeting pipeline that reconstructs skeletal structure and 3D pose from video then maps the pose to humanoid joint angles and sensor models to produce proprioceptive, tactile and visual streams.
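One step such a pipeline typically needs is adapting a reconstructed pose to a body with different limb lengths. The sketch below is a generic illustration, not the authors' implementation: it rebuilds a kinematic chain with the target's segment lengths while keeping each segment's direction from the source pose. The function name `rescale_chain`, the keypoint coordinates and the segment lengths are all hypothetical.

```python
import math

def rescale_chain(points, target_lengths):
    """Rebuild a kinematic chain with the target's segment lengths,
    keeping each segment's direction from the source pose."""
    if len(points) != len(target_lengths) + 1:
        raise ValueError("need exactly one target length per segment")
    out = [tuple(points[0])]  # the chain root stays where it is
    for (parent, child), length in zip(zip(points, points[1:]), target_lengths):
        direction = [c - p for p, c in zip(parent, child)]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            raise ValueError("degenerate segment of zero length")
        prev = out[-1]
        # Step from the previously placed joint along the source direction.
        out.append(tuple(a + length * d / norm for a, d in zip(prev, direction)))
    return out

# Hypothetical infant arm keypoints: shoulder, elbow, wrist (metres).
infant_arm = [(0.0, 0.0, 0.0), (0.10, 0.0, 0.0), (0.10, 0.08, 0.0)]
# Hypothetical upper-arm and forearm lengths of a target humanoid (metres).
robot_arm = rescale_chain(infant_arm, [0.15, 0.12])
print(robot_arm)
```

Preserving directions keeps the joint angles along the chain but moves the end effector, which is one reason a fixed-morphology robot may match infant keypoint positions less closely than a model whose body can be resized.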

If this is right

  • Multimodal analysis of infant development becomes possible from ordinary video recordings alone.
  • Automated annotation of infant behaviors gains an additional layer of proprioceptive and tactile context.
  • Robotics, developmental science and early neurodevelopmental screening each receive a new source of synthetic first-person data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Different humanoid platforms could be ranked by how faithfully their generated sensor streams reproduce patterns seen in real infant data.
  • The same retargeting pipeline might be applied to videos of older children or adults once embodiment parameters are adjusted accordingly.
  • Synthetic datasets produced this way could serve as training material for models that learn to predict infant actions from partial observations.

Load-bearing premise

The retargeted motion on a robot body will produce sensor streams that meaningfully approximate an infant's own experience only when the robot's proportions, joint limits and sensor placement are close enough to a baby's.

What would settle it

Direct comparison of the generated sensor streams against simultaneous physiological or behavioral recordings from real infants performing the same movements would show whether the simulated data match actual infant experience.

Figures

Figures reproduced from arXiv: 2604.27583 by Dongmin Kim, Francisco M. López, Hoshinori Kanazawa, Jochen Triesch, Lukas Rustler, Matej Hoffmann, Miles Lenz, Ondrej Fiala, Valentin Marcel, Yakov Balashov, Yasuo Kuniyoshi.

Figure 1
Figure 2: Motion retargeting pipeline. From a single view of a moving infant, the three-dimensional body pose can be estimated and reconstructed in a humanoid. This method allows the simulation of multimodal sensory streams from a first-person perspective. The examples show EMFANT’s muscle-tendon structure, MIMo’s virtual skin of touch sensors activated due to a hand-to-body contact, and MIMo’s binocular vision when…
Figure 3: Accuracy of the motion retargeting. MIMo and EMFANT achieve the lowest relative distances, since their morphology can be modified to fit that of the recorded infant. The relative orientation and velocities, measured using vectors pointing from the MidHip keypoint to each end effector, provide a more qualitative comparison of the retargeting accuracy. MIMo performs best overall, with an average MAE below 2 …
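The two kinds of measure named in this caption can be sketched as follows. This is an assumption-laden illustration rather than the paper's evaluation code: it assumes both poses are already expressed in a common frame and in metres, and the function names and the `RWrist`/`LWrist` keypoints are hypothetical (only `MidHip` appears in the caption).

```python
import math

def position_mae(source, target):
    """Mean Euclidean distance between matched keypoints of two poses."""
    dists = [math.dist(source[name], target[name]) for name in source]
    return sum(dists) / len(dists)

def direction_angle_deg(source, target, root, effector):
    """Angle in degrees between the root-to-effector vectors of two poses."""
    u = [e - r for r, e in zip(source[root], source[effector])]
    v = [e - r for r, e in zip(target[root], target[effector])]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

# Hypothetical single-frame poses (metres), keyed by keypoint name.
infant = {"MidHip": (0.0, 0.0, 0.0), "RWrist": (0.2, 0.1, 0.0), "LWrist": (-0.2, 0.1, 0.0)}
humanoid = {"MidHip": (0.0, 0.0, 0.0), "RWrist": (0.2, 0.1, 0.005), "LWrist": (-0.2, 0.1, 0.0)}

mae = position_mae(infant, humanoid)
angle = direction_angle_deg(infant, humanoid, "MidHip", "RWrist")
print(f"mean position error: {mae * 100:.3f} cm, RWrist direction error: {angle:.2f} deg")
```

Reporting the per-keypoint distances before averaging, rather than only the mean, would give a per-joint breakdown of the error.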
Figure 4: Cross-embodiment invariance and shared sensorimotor manifold. (A) Similarity triangle diagram summarizing pairwise latent correlations among the three embodiments (EMFANT, MIMo, and iCub). Edge labels indicate Spearman ρ correlations. (B) Kernel density plot from dimensionality reduction of GPA-aligned fused latent distribution (tactile + proprioception + vision) at K = 20
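For readers unfamiliar with the statistic on the triangle's edges, Spearman's ρ is the Pearson correlation of the ranks of two series. The sketch below shows only that definition, using the standard library; how the paper builds the latent series it correlates is not described here, so the two example series are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical one-dimensional latent trajectories from two embodiments.
latent_a = [0.1, 0.4, 0.35, 0.8, 0.9]
latent_b = [1.0, 2.5, 3.0, 4.0, 7.0]
rho = spearman(latent_a, latent_b)
print(round(rho, 3))
```

Because it depends only on rank order, ρ is unaffected by any monotonic rescaling of either series, which suits comparisons across embodiments whose sensor values live on different scales.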
Figure 5: Simulation of sensorimotor infant experiences for different humanoids. (A) The robot iCub performing hand regard. His eyes have cameras which allow us to simulate the visual experience of an infant looking at their hand. Due to the proximity of the hand to the face, there is a binocular disparity in the left and right images. This is shown as the red and cyan colors in the stereoscopic anaglyph. (B) Touch …
Figure 6: Distributions of self-touches. The manual coding was performed by expert annotators, whereas the humanoid touches were detected as collisions between the hands and the bodies.

… been shown that these distributions of touches change during development and are indicative of a maturing body schema [36]. Following a similar approach, we extract the hand-to-body contacts detected as collisions in the three simula…
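Turning detected collisions into a distribution like the one plotted here can be sketched as a simple normalized count. This is a minimal illustration under assumptions: the contact log format, the body-part labels and the function name `touch_distribution` are hypothetical and do not come from the paper or from any simulator's API.

```python
from collections import Counter

def touch_distribution(contacts):
    """Fraction of contact events that land on each touched body part."""
    counts = Counter(part for _hand, part in contacts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {part: n / total for part, n in counts.items()}

# Hypothetical collision log: (touching hand, touched body part) per event.
contacts = [
    ("right_hand", "head"),
    ("left_hand", "torso"),
    ("right_hand", "head"),
    ("left_hand", "left_leg"),
]
dist = touch_distribution(contacts)
print(dist)
```

The same function applied to manually coded touches would give a second distribution over the same labels, so the two could be compared part by part.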
Abstract

Motion retargeting from humans to human-like artificial agents is becoming increasingly important as humanoid robots grow more capable. However, most existing approaches focus only on reproducing kinematics and ignore the rich sensorimotor experience associated with human movement. In this work, we present a framework for simulating the multimodal sensorimotor experiences of infants using physical and virtual humanoids. From a single video, our method reconstructs the infant's body configuration by extracting its skeletal structure and estimating the full 3D pose from each frame. Then we map the reconstructed motion onto several developmental platforms: the physical iCub robot and the virtual simulators pyCub, EMFANT and MIMo. Replaying the retargeted motions on these embodiments produces simulated multisensory streams including proprioception (joints and muscles), touch, and vision. For the best-matching embodiment, the retargeting achieves sub-centimeter accuracy and enables a rich multimodal analysis of infant development as well as enhanced automated annotation of behaviors. This framework provides a unique window into the infant's sensorimotor experience, offering new tools for robotics, developmental science, and early detection of neurodevelopmental disorders. The code is available at https://github.com/ctu-vras/motion-retargeting/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a pipeline that extracts 3D skeletal poses from infant video, retargets the kinematics onto physical iCub and virtual platforms (pyCub, EMFANT, MIMo), and replays the motions to generate simulated proprioceptive, tactile, and visual streams; it reports sub-centimeter end-effector accuracy on the best-matching embodiment and claims this supplies a window into infant first-person sensorimotor experience for developmental analysis and behavior annotation.

Significance. If the retargeted streams were shown to approximate infant multisensory experience, the framework would supply otherwise inaccessible longitudinal data for developmental science and robotics; the public code release is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method): the central claim that retargeted motions 'simulate the multimodal sensorimotor experiences of infants' and provide 'a unique window into the infant's sensorimotor experience' rests on the untested assumption that kinematic mapping to iCub-scale embodiments produces proprioceptive/tactile/visual streams that meaningfully match an infant's; large differences in head-to-torso ratio, limb lengths, joint limits, and sensor density are not quantified or compensated, so sub-centimeter kinematic error on the target robot does not establish correspondence of the generated sensor streams.
  2. [§4] §4 (results): the reported sub-centimeter accuracy is stated only for the best-matching embodiment, yet no error distributions, per-joint breakdowns, or comparison against infant ground-truth sensor data (or even against a same-scale infant model) are provided; without these, it is impossible to judge whether post-processing choices preserve the claimed multimodal fidelity.
minor comments (2)
  1. The abstract states that code is available at the cited GitHub link; this should be repeated with a precise commit hash or release tag in the main text.
  2. Notation for the retargeting mapping (e.g., how joint angles and muscle lengths are scaled) is introduced without an explicit equation or pseudocode block; adding one would improve clarity.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, clarifying the scope of our claims and indicating planned revisions where appropriate.

  1. Referee: [Abstract and §3] Abstract and §3 (method): the central claim that retargeted motions 'simulate the multimodal sensorimotor experiences of infants' and provide 'a unique window into the infant's sensorimotor experience' rests on the untested assumption that kinematic mapping to iCub-scale embodiments produces proprioceptive/tactile/visual streams that meaningfully match an infant's; large differences in head-to-torso ratio, limb lengths, joint limits, and sensor density are not quantified or compensated, so sub-centimeter kinematic error on the target robot does not establish correspondence of the generated sensor streams.

    Authors: The manuscript presents a kinematic retargeting pipeline that generates simulated sensor streams on available embodiments. Because of inherent morphological mismatches, it does not assert that these streams are identical to an infant's. Sub-centimeter accuracy on the best-matching platform demonstrates faithful reproduction of the input motion, which in turn drives the simulated proprioception, touch, and vision. We agree the language in the abstract and §3 overstates the degree of correspondence and will revise it to describe the output as an approximation suitable for developmental analysis. We will also add explicit discussion of unquantified differences (e.g., limb proportions, sensor density) and their implications for sensor-stream fidelity. revision: partial

  2. Referee: [§4] §4 (results): the reported sub-centimeter accuracy is stated only for the best-matching embodiment, yet no error distributions, per-joint breakdowns, or comparison against infant ground-truth sensor data (or even against a same-scale infant model) are provided; without these, it is impossible to judge whether post-processing choices preserve the claimed multimodal fidelity.

    Authors: We will expand §4 to report full error distributions and per-joint breakdowns for all tested embodiments. However, no infant ground-truth multimodal sensor recordings exist, which is the central motivation for the simulation framework; a same-scale infant model comparison is likewise outside the present scope. We will add text stating these limitations explicitly and note that the reported kinematic accuracy is the best available proxy for assessing post-processing effects on the generated streams. revision: partial

standing simulated objections not resolved
  • Empirical comparison against real infant multimodal sensor data, which does not exist.

Circularity Check

0 steps flagged

No circularity: the pipeline consists of reconstruction and retargeting, with no fitted parameters or self-referential definitions.


The paper describes a forward pipeline: video-based skeletal extraction, 3D pose estimation, kinematic mapping to target embodiments (iCub, pyCub, etc.), and replay to generate simulated sensor streams. No equations, fitted parameters, or predictions are defined in terms of themselves. Sub-centimeter accuracy is reported as an empirical outcome of the mapping on the best-matching body, not used to define or justify the method. No self-citations are invoked as load-bearing uniqueness theorems. The central claim (that retargeted streams enable multimodal analysis) rests on the described engineering steps rather than reducing to its own inputs by construction. This is a standard non-circular methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes standard computer-vision pose estimation and retargeting techniques function adequately on infant data.

pith-pipeline@v0.9.1-grok · 5797 in / 1174 out tokens · 23288 ms · 2026-07-01T08:07:22.222474+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Embodiment Shapes Rolling Behavior in a Multimodal Infant Model

    cs.RO 2026-06 unverdicted novelty 5.0

    A reinforcement learning model of a multimodal virtual infant produces rolling behaviors that reproduce age-related improvements and coordination patterns observed in human infants, shaped by changing body morphology.

Reference graph

Works this paper leans on

36 extracted references · 1 canonical work pages · cited by 1 Pith paper

  [1] J. Piaget, M. Cook et al., The origins of intelligence in children. International Universities Press, New York, 1952, vol. 8, no. 5

  [2] R. Cusack, M. Ranzato, and C. J. Charvet, “Helpless infants are learning a foundation model,” Trends in Cognitive Sciences, vol. 28, no. 8, pp. 726–738, 2024

  [3] L. Zaadnoordijk, T. R. Besold, and R. Cusack, “Lessons from infant learning for unsupervised machine learning,” Nature Machine Intelligence, vol. 4, no. 6, pp. 510–520, 2022

  [4] N. Bayley, “Bayley scales of infant development: Manual,” New York, 1993

  [5] C. von Hofsten, “Structuring of early reaching movements: a longitudinal study,” Journal of motor behavior, vol. 23, no. 4, pp. 280–292, 1991

  [6] L. E. Bahrick and J. S. Watson, “Detection of intermodal proprioceptive–visual contingency as a potential basis of self-perception in infancy,” Developmental psychology, vol. 21, no. 6, p. 963, 1985

  [7] F. Poli, G. Serino, R. Mars, and S. Hunnius, “Infants tailor their attention to maximize learning,” Science advances, vol. 6, no. 39, p. eabb5053, 2020

  [8] A. Azhari, A. Truzzi, M. J.-Y. Neoh, J. P. M. Balagtas, H. H. Tan, P. P. Goh, X. A. Ang, P. Setoh, P. Rigo, M. H. Bornstein et al., “A decade of infant neuroimaging research: what have we learned and where are we going?” Infant Behavior and Development, vol. 58, p. 101389, 2020

  [9] K. E. Adolph and S. R. Robinson, “Sampling development,” Journal of Cognition and Development, vol. 12, no. 4, pp. 411–423, 2011

  [10] R. O. Gilmore and K. E. Adolph, “Video can make behavioural science more reproducible,” Nature human behaviour, vol. 1, no. 7, p. 0128, 2017

  [11] A. DiMercurio, J. P. Connell, M. Clark, and D. Corbetta, “A Naturalistic Observation of Spontaneous Touches to the Body and Environment in the First 2 Months of Life,” Frontiers in Psychology, vol. 9, 2018

  [12] F. Gama, M. Mísař, L. Navara, S. T. Popescu, and M. Hoffmann, “Automatic infant 2d pose estimation from videos: Comparing seven deep neural network methods,” Behavior Research Methods, vol. 57, no. 10, p. 280, 2025

  [13] N. Hesse, S. Pujades, M. J. Black, M. Arens, U. G. Hofmann, and A. S. Schroeder, “Learning and tracking the 3d body shape of freely moving infants from rgb-d sequences,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2540–2551, 2019

  [14] W. K. Vong, W. Wang, A. E. Orhan, and B. M. Lake, “Grounded language acquisition through the eyes and ears of a single child,” Science, vol. 383, no. 6682, pp. 504–511, 2024

  [15] Z. Yu, A. Aubret, C. Yu, and J. Triesch, “Simulated cortical magnification supports self-supervised object learning,” in 2025 IEEE International Conference on Development and Learning (ICDL). IEEE, 2025, pp. 1–6

  [16] T. R. Candy, S. Biehn, S. Freeman, A. Dalessandro, V. Tellez, B. Marella, K. Singh, Z. Petroff, K. Bonnen, and L. Smith, “Infants’ use of eye movements to explore their natural environment,” Journal of Vision, vol. 24, no. 10, pp. 974–974, 2024

  [17] G. Metta, L. Natale, F. Nori, G. Sandini, D. Vernon, L. Fadiga, C. Von Hofsten, K. Rosander, M. Lopes, J. Santos-Victor et al., “The icub humanoid robot: An open-systems platform for research in cognitive development,” Neural networks, vol. 23, no. 8-9, pp. 1125–1134, 2010

  [18] D. Mattern, P. Schumacher, F. M. López, M. C. Raabe, M. R. Ernst, A. Aubret, and J. Triesch, “Mimo: A multimodal infant model for studying cognitive development,” IEEE Transactions on Cognitive and Developmental Systems, vol. 16, no. 4, pp. 1291–1301, 2024

  [19] D. Kim, H. Kanazawa, and Y. Kuniyoshi, “Simulating a human fetus in soft uterus,” in 2022 IEEE International Conference on Development and Learning (ICDL). IEEE, 2022, pp. 135–141

  [20] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, and M. Shah, “Deep learning-based human pose estimation: A survey,” ACM Computing Surveys, vol. 56, no. 1, pp. 1–37, 2023

  [21] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple vision transformer baselines for human pose estimation,” in Advances in Neural Information Processing Systems, 2022

  [22] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3D hands, face, and body from a single image,” in Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019

  [23] A. Schmitz, P. Maiolino, M. Maggiali, L. Natale, G. Cannata, and G. Metta, “Methods and technologies for the implementation of large-scale robot tactile sensors,” IEEE Transactions on Robotics, vol. 27, no. 3, pp. 389–400, 2011

  [24] D. Vernon, C. Von Hofsten, and L. Fadiga, A roadmap for cognitive development in humanoid robots. Springer Science & Business Media, 2011, vol. 11

  [25] L. Natale, F. Nori, G. Metta, M. Fumagalli, S. Ivaldi, U. Pattacini, M. Randazzo, A. Schmitz, and G. Sandini, “The iCub platform: a tool for studying intrinsically motivated learning,” in Intrinsically motivated learning in natural and artificial systems. Springer, 2012, pp. 433–458

  [26] M. Hoffmann, Z. Straka, I. Farkas, M. Vavrecka, and G. Metta, “Robotic homunculus: Learning of artificial skin representation in a humanoid robot motivated by primary somatosensory cortex,” IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 2, pp. 163–176, June 2018

  [27] L. Rustler and M. Hoffmann, “Learning with pycub: A new simulation and exercise framework for humanoid robotics,” 2025. [Online]. Available: https://arxiv.org/abs/2506.01756

  [28] O. Fiala, “Retargeting infant movements to baby humanoid robots,” Bachelor’s thesis, Czech Technical University in Prague, 2023

  [29] Y. Yamada, H. Kanazawa, S. Iwasaki, Y. Tsukahara, O. Iwata, S. Yamada, and Y. Kuniyoshi, “An embodied brain model of the human foetus,” Scientific Reports, vol. 6, 2016

  [30] A. Seth, J. L. Hicks, T. K. Uchida, A. Habib, C. L. Dembia, J. J. Dunne, C. F. Ong, M. S. DeMers, A. Rajagopal, M. Millard et al., “Opensim: Simulating musculoskeletal dynamics and neuromuscular control to study human and animal movement,” PLoS computational biology, vol. 14, no. 7, p. e1006223, 2018

  [31] F. M. López, M. Lenz, M. G. Fedozzi, A. Aubret, and J. Triesch, “Mimo grows! simulating body and sensory development in a multimodal infant model,” in 2025 IEEE International Conference on Development and Learning (ICDL). IEEE, 2025

  [32] S. Ressler, “AnthroKids - Anthropometric data of children,” Nat. Inst. Standards and Technol., 1977

  [33] A. L. van der Meer, “Keeping the arm in the limelight: Advanced visual control of arm movements in neonates,” European Journal of Paediatric Neurology, vol. 1, no. 4, pp. 103–108, 1997

  [34] H. Kanazawa, Y. Yamada, K. Tanaka, M. Kawai, F. Niwa, K. Iwanaga, and Y. Kuniyoshi, “Open-ended movements structure sensorimotor information in early human development,” Proceedings of the National Academy of Sciences, vol. 120, no. 1, p. e2209953120, 2023

  [35] B. L. Thomas, J. M. Karl, and I. Q. Whishaw, “Independent development of the reach and the grasp in spontaneous self-touching by human infants in the first 6 months,” Frontiers in psychology, vol. 5, p. 1526, 2015

  [36] J. Khoury, S. T. Popescu, F. Gama, V. Marcel, and M. Hoffmann, “Self-touch and other spontaneous behavior patterns in early infancy,” in 2022 IEEE International Conference on Development and Learning (ICDL). IEEE, 2022, pp. 148–155