Learning visuotactile skills with two multifingered hands
12 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.RO 12
roles
background 4
citing papers explorer
- Multimodal Diffusion Forcing for Forceful Manipulation
  Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
- TactX: Learning Shared Tactile Representations Across Diverse Sensors
  TactX learns a shared latent representation across three tactile sensor modalities via joint training on paired contacts, enabling zero-shot policy transfer and higher success on pick-and-place, insertion, wiping, and reorientation tasks.
- From Grasps to Dexterity: Large-Scale Grasp Pretraining for Dexterous Manipulation
  Grasp pretraining on 355k trajectories improves full-task success on six articulated tool-use tasks by 33.3 pp over DP3 in real-world experiments.
- CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
  CoStream composes semantic, predictive, and reactive behaviors on an SE(3) interface to enable precise, generalizable performance on eight real-world contact-rich manipulation tasks.
- MonoDuo: Using One Robot Arm to Learn Bimanual Policies
  MonoDuo generates synthetic bimanual demonstrations from single-arm teleoperation plus human collaboration to train policies achieving up to 70% zero-shot success on five manipulation tasks, with 65-70% gains from 25-shot finetuning.
- FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
  FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
- TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior
  TeleGate achieves high-precision real-time whole-body teleoperation of humanoid robots by dynamically gating between expert policies and using a VAE motion prior to infer future intent from history, outperforming distillation baselines on dynamic motions with only 2.5 hours of mocap data.
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
  DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
  FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
- Language Conditioned Multi-Finger Dexterous Manipulation Enabled by Physical Compliance and Switching of Controllers
  A hybrid event-driven switching system pairs VLA models with lightweight dexterous policies on a compliant anthropomorphic hand to perform language-conditioned multi-finger tasks with cross-embodiment modularity.
- CoDex: Learning Compositional Dexterous Functional Manipulation without Demonstrations
  CoDex combines VLMs, constrained optimization, and RL to autonomously discover grasp-move-actuate policies for functional manipulation of unseen objects with internal mechanisms.
- FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems
  FlexiTac is a scalable piezoresistive tactile sensing system with flexible FPC-Velostat-FPC pads and a 100 Hz multi-channel readout board that mounts on rigid or soft grippers and supports visuo-tactile learning.