ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data
4 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 4
Roles: background 1
Polarities: background 1
citing papers explorer
- A-SLIP: Acoustic Sensing for Continuous In-hand Slip Estimation
  A four-microphone acoustic system with a CNN achieves 14.1-degree mean directional error for continuous in-hand slip estimation and outperforms single-channel baselines.
- Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction
  Introduces the Kaiwu multimodal dataset and framework with 11,664 synchronized assembling demonstrations including hand motions, pressures, sounds, multi-view videos, motion capture, eye gaze, and EMG signals with timestamp-based and semantic annotations.
- You're Pushing My Buttons: Instrumented Learning of Gentle Button Presses
  Training-time instrumentation with audio and privileged button-state signals produces contact policies that match success rates but apply lower forces using only vision and audio at inference.
- Causal World Modeling for Robot Control
  LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.