VibeAct: Vibration to Actions for Contact-Rich Reactive Robot Dexterity
Pith reviewed 2026-06-26 04:34 UTC · model grok-4.3
The pith
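VibeAct trains dexterous robot policies in simulation on per-finger contact and slip signals, and trains a separate estimator to recover the same signals from real vibro-acoustic recordings, so that the policies outperform a proprioception-and-point-cloud baseline and transfer to hardware.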
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VibeAct decouples real vibrotactile sensing from simulation-based reinforcement learning through a shared representation of contact and slip. Real microphone data are collected via teleoperation and replayed in a calibrated digital clone to label per-finger contact and slip; an estimator learns to predict the same quantities from live waveforms, while policies train directly on simulated contacts. The resulting reactive policies outperform a proprioception-and-point-cloud baseline in simulation on five contact-rich tasks and transfer to a physical dexterous hand-arm system.
What carries the argument
A shared physical representation of per-finger contact and slip lets policies exploit rapid tactile feedback without simulating raw audio waveforms.
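As a rough illustration of what such a representation could look like when computed from simulated contacts, the sketch below builds a binary contact flag and a continuous slip magnitude per finger. The class and function names, the force threshold, and the rule "slip is the tangential relative speed while in contact" are illustrative assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class FingerTactile:
    """Hypothetical per-finger observation: binary contact plus slip magnitude."""
    contact: bool       # True when the finger is judged to be touching the object
    slip_speed: float   # tangential relative speed in m/s, 0.0 when not in contact


def tactile_from_sim_contact(normal_force, rel_velocity, normal, force_threshold=0.1):
    """Build the observation from simulator contact quantities.

    normal_force: contact normal force in newtons (assumed available from the simulator)
    rel_velocity: (vx, vy, vz) velocity of the object relative to the fingertip
    normal: unit contact normal (nx, ny, nz)
    force_threshold: assumed cutoff in newtons below which contact is ignored
    """
    if normal_force < force_threshold:
        return FingerTactile(contact=False, slip_speed=0.0)
    # Remove the velocity component along the normal; the rest is tangential sliding.
    v_n = sum(v * n for v, n in zip(rel_velocity, normal))
    tangential = [v - v_n * n for v, n in zip(rel_velocity, normal)]
    speed = math.sqrt(sum(t * t for t in tangential))
    return FingerTactile(contact=True, slip_speed=speed)


def hand_observation(per_finger):
    """Flatten per-finger readings into one vector: [contact, slip] per finger."""
    obs = []
    for f in per_finger:
        obs.extend([1.0 if f.contact else 0.0, f.slip_speed])
    return obs
```

Under this sketch, a real-world estimator would only need to output the same two numbers per finger from microphone waveforms for a simulation-trained policy to consume them unchanged.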
If this is right
- Policies outperform a proprioception-and-point-cloud baseline across five contact-rich tasks in simulation.
- Largest gains appear on tasks that require sustained reactive control.
- The continuous slip-magnitude channel is the most informative observation.
- Learned policies transfer to a physical dexterous hand-arm platform and raise deployed success rates.
Where Pith is reading between the lines
- The replay-and-label step could be applied to other high-bandwidth sensors that are hard to simulate directly.
- Explicit contact modeling may reduce the need for full waveform simulation in other tactile policy-learning settings.
- Reactive-control improvements may appear in any manipulation domain where contacts are fast and visually occluded.
Load-bearing premise
Replaying real vibro-acoustic recordings in a calibrated digital clone produces accurate per-finger contact and slip labels that match physical dynamics well enough for policy training and transfer.
What would settle it
If VibeAct policies showed no success-rate gain over the baseline on physical insertion or reorientation tasks, the transfer claim would be falsified.
Original abstract
Dexterous manipulation depends on contact events that are fast, local, and often visually occluded. Piezoelectric microphones offer a compact and high-bandwidth way to sense these interactions, but the resulting vibro-acoustic signals are difficult to simulate faithfully enough for end-to-end sim-to-real policy learning on dexterous robot hands. We propose VibeAct, a framework that bridges real vibrotactile sensing and simulation-based reinforcement learning through a shared physical representation of contact and slip. In the real world, we embed piezoelectric microphones into a dexterous robot hand and collect vibro-acoustic data through teleoperation, then replay the recordings in a calibrated digital clone to automatically label per-finger contact and slip. A tactile estimator learns to predict contact and slip from real microphone waveforms, while manipulation policies are trained in simulation on the same representation computed directly from simulated contacts. This decoupling lets policies exploit rapid tactile feedback without simulating raw audio. Across five contact-rich tasks spanning regrasping, in-hand reorientation, and insertion, VibeAct consistently outperforms a proprioception-and-point-cloud baseline in simulation, with the largest gains on tasks requiring sustained reactive control, where the continuous slip-magnitude channel proves the most informative observation. The learned policies transfer to a physical dexterous hand-arm platform, improving success rates on deployed tasks. Project videos and additional details are at https://vibeact.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VibeAct, which collects real vibro-acoustic data from piezoelectric microphones embedded in a dexterous hand during teleoperation, replays the recordings in a calibrated digital clone to auto-label per-finger contact and slip, trains a tactile estimator to map real waveforms to this representation, and trains RL policies in simulation using the same contact/slip features computed from simulated contacts. Policies are evaluated on five contact-rich tasks (regrasping, in-hand reorientation, insertion) where they outperform a proprioception-and-point-cloud baseline, with largest gains on sustained-reactive tasks; the policies transfer to a physical hand-arm platform with improved success rates.
Significance. If the replay-derived labels faithfully capture physical contact dynamics, the decoupling of raw-audio simulation from policy training offers a practical route to high-bandwidth reactive tactile control in dexterous manipulation. The emphasis on the continuous slip-magnitude channel as the most informative observation, together with demonstrated sim-to-real transfer, would strengthen the case for vibrotactile sensing in contact-rich settings where vision is occluded.
major comments (2)
- [§3 and §4] §3 (label generation) and §4 (sim-to-real transfer): the central assumption that replaying real microphone recordings in the calibrated digital clone produces per-finger contact and slip labels whose timing, magnitude, and distribution match physical dynamics is not supported by any quantitative agreement metric against independent ground truth (force-torque sensors, high-speed video, or known slip events). This validation is load-bearing for both the policy-training claim and the transfer results.
- [Table 2 / Figure 5] Table 2 / Figure 5 (task-wise results): the reported outperformance on sustained-reactive tasks is presented without per-task trial counts, standard deviations, or statistical tests; the claim that the slip-magnitude channel is 'the most informative' therefore rests on qualitative comparison rather than an ablation that isolates its contribution while holding other channels fixed.
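The agreement check requested in the first major comment could take roughly the following form. This is a sketch under assumptions: the three metrics (contact F1, onset timing error, slip correlation), the function names, and the idea that a reference signal is available as a per-timestep sequence are choices made here for illustration, not something the paper reports.

```python
import math


def contact_f1(labels, truth):
    """F1 score of binary replay-derived contact labels against reference flags."""
    tp = sum(1 for a, b in zip(labels, truth) if a and b)
    fp = sum(1 for a, b in zip(labels, truth) if a and not b)
    fn = sum(1 for a, b in zip(labels, truth) if not a and b)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def onsets(flags):
    """Indices where a contact flag rises from 0 to 1."""
    return [i for i in range(len(flags)) if flags[i] and (i == 0 or not flags[i - 1])]


def mean_onset_error(labels, truth, dt):
    """Mean absolute offset in seconds from each reference onset to the nearest labeled onset."""
    lab, ref = onsets(labels), onsets(truth)
    if not lab or not ref:
        return math.nan
    return dt * sum(min(abs(r - l) for l in lab) for r in ref) / len(ref)


def pearson(x, y):
    """Pearson correlation, e.g. between labeled and reference slip magnitudes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan  # correlation is undefined for a constant signal
    return sxy / math.sqrt(sxx * syy)
```

Reporting these per finger would separate timing errors from magnitude errors, which matter differently for binary contact and for the continuous slip channel.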
minor comments (2)
- [Abstract] The abstract states that policies 'transfer to a physical dexterous hand-arm platform, improving success rates' but supplies no numerical deltas or task-specific success rates; these numbers should appear in the abstract or a summary table.
- [§2] Notation for the shared contact/slip representation (binary contact flag, continuous slip magnitude, per-finger aggregation) is introduced only after the method description; an early, compact definition would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of validation and statistical reporting. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
- Referee: [§3 and §4] §3 (label generation) and §4 (sim-to-real transfer): the central assumption that replaying real microphone recordings in the calibrated digital clone produces per-finger contact and slip labels whose timing, magnitude, and distribution match physical dynamics is not supported by any quantitative agreement metric against independent ground truth (force-torque sensors, high-speed video, or known slip events). This validation is load-bearing for both the policy-training claim and the transfer results.
Authors: We acknowledge that the manuscript does not include direct quantitative agreement metrics (e.g., timing or magnitude correlations) between the replay-derived labels and independent ground truth from force-torque sensors or high-speed video. The digital clone was calibrated to reproduce observed contact events from the teleoperation recordings, and downstream policy transfer success provides supporting evidence for label utility. However, we agree that explicit validation metrics would increase confidence in the approach. In revision, we will expand §3 with additional details on the calibration procedure, include any available qualitative comparisons (e.g., against synchronized video of contact events), and explicitly discuss the assumption and its limitations as a direction for future work.
Revision: partial
- Referee: [Table 2 / Figure 5] Table 2 / Figure 5 (task-wise results): the reported outperformance on sustained-reactive tasks is presented without per-task trial counts, standard deviations, or statistical tests; the claim that the slip-magnitude channel is 'the most informative' therefore rests on qualitative comparison rather than an ablation that isolates its contribution while holding other channels fixed.
Authors: We agree that reporting per-task trial counts, standard deviations, and statistical tests would improve rigor and allow readers to better assess the results. The underlying experiments used multiple trials per task, but these statistics were not included in the original tables and figures. We will revise Table 2 and Figure 5 to report means with standard deviations, trial counts, and appropriate statistical comparisons. Additionally, we will add an ablation experiment that isolates the contribution of the continuous slip-magnitude channel by training and evaluating policies with and without this observation while holding all other channels fixed.
Revision: yes
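One way the promised statistical comparison could be carried out on per-task success counts is sketched below. The choice of a Wilson score interval and a two-sided Fisher exact test is an assumption made here; the rebuttal does not say which tests the authors will use, and the function names are hypothetical.

```python
from math import comb, sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95% coverage)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def fisher_exact_two_sided(s1, n1, s2, n2):
    """Two-sided Fisher exact p-value comparing s1/n1 successes with s2/n2 successes."""
    total_s = s1 + s2
    total = n1 + n2

    def prob(k):
        # Hypergeometric probability of k successes in group 1 with margins fixed.
        return comb(n1, k) * comb(n2, total_s - k) / comb(total, total_s)

    lo = max(0, total_s - n2)
    hi = min(n1, total_s)
    p_obs = prob(s1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
```

Calling `fisher_exact_two_sided` once per task with the tactile policy's and the baseline's success counts, and `wilson_interval` for each reported rate, would make small-sample uncertainty visible alongside the means.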
Circularity Check
No significant circularity; derivation is self-contained
The paper's method collects real vibro-acoustic recordings via teleoperation, replays them in a calibrated digital clone to generate per-finger contact/slip labels, trains an estimator to map waveforms to those labels, and trains policies in simulation using the same label representation computed from simulated contacts. This chain relies on external data collection and independent simulation rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or uniqueness theorems are invoked that reduce the central claims to their inputs by construction. The empirical outperformance and sim-to-real transfer are presented as measured outcomes, not tautological restatements.
A Tactile Estimator Training
A.1 Network Architecture
Table S.1 lists the layer dimensions of each per-finger subnetwork of the VibeAct tactile estimator. The network takes per-microph...
1. Hard cap: values exceeding 0.05 m/s (a physically impossible rigid-replay artifact in our digital-clone pipeline) are replaced by the recent-valid median.
2. Causal local-median fill: capped samples are replaced by the median of the last 5 valid samples.
3. Causal sliding-window median: 5-step window over the filled stream.
4. Causal one-pole IIR: low-pass with α = 0.15.
This matches the post-processing applied to real microphone-derived slip estimates in our perception pipeline, so that the simulated and real tac...
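The four post-processing steps described in this excerpt (hard cap, causal local-median fill, causal sliding-window median, one-pole IIR) can be sketched as a small streaming filter. The constants 0.05 m/s, window 5, and α = 0.15 come from the text; everything else is an assumption made for illustration, including the class name, the fallback of 0.0 before any valid sample exists, the IIR form y[t] = α·x[t] + (1 − α)·y[t−1], and initializing the IIR state with the first smoothed sample.

```python
from collections import deque
from statistics import median


class SlipPostFilter:
    """Hypothetical causal post-filter for a per-finger slip-magnitude stream."""

    def __init__(self, cap=0.05, window=5, alpha=0.15):
        self.cap = cap                        # m/s, hard cap from the text
        self.alpha = alpha                    # one-pole IIR coefficient from the text
        self.valid = deque(maxlen=window)     # last uncapped samples
        self.filled = deque(maxlen=window)    # last filled samples for the sliding median
        self.state = None                     # previous IIR output

    def step(self, slip):
        # Steps 1-2: reject samples above the cap and fill from recent valid samples.
        if slip > self.cap:
            # Assumption: fall back to 0.0 when no valid sample has been seen yet.
            slip = median(self.valid) if self.valid else 0.0
        else:
            self.valid.append(slip)
        # Step 3: causal sliding-window median over the filled stream.
        self.filled.append(slip)
        smoothed = median(self.filled)
        # Step 4: one-pole low-pass; assumed form y = alpha * x + (1 - alpha) * y_prev.
        if self.state is None:
            self.state = smoothed
        else:
            self.state = self.alpha * smoothed + (1 - self.alpha) * self.state
        return self.state
```

Running one `SlipPostFilter` instance per finger on both the simulated slip signal and the estimator output would keep the two streams on the same footing, which is the stated purpose of matching the post-processing.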