pith. machine review for the scientific record.

arxiv: 2410.07864 · v2 · submitted 2024-10-10 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 Lean theorem links

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 07:40 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords bimanual manipulation · diffusion models · foundation models · robotics · transformer architecture · unified action space · zero-shot generalization · few-shot learning

The pith

A 1.2-billion-parameter diffusion model unifies multi-robot data to perform bimanual tasks with zero-shot generalization and few-shot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a diffusion-based foundation model called RDT can address the core difficulties of bimanual manipulation: coordinating two arms produces complex multi-modal action distributions, and robot data remains scarce. It does so by scaling a Transformer architecture to capture high-frequency robotic signals and by introducing a unified action space that keeps physical meanings intact across different robots. If successful, this would let one pre-trained model handle language instructions, adapt to new objects and scenes without retraining, and acquire dexterous skills from only a handful of demonstrations. Sympathetic readers would care because bimanual coordination is a basic requirement for useful household and industrial robots, yet current methods still demand large task-specific datasets.

Core claim

RDT is a diffusion foundation model that represents multi-modal action distributions through a scalable Transformer designed for heterogeneous inputs and high-frequency robotic signals. Pre-training on the largest multi-robot dataset collection to date, followed by fine-tuning on over 6,000 bimanual episodes, produces a 1.2B-parameter model that outperforms prior methods on real robots, exhibits zero-shot generalization to unseen objects and scenes, follows language instructions, acquires new skills from 1-5 demonstrations, and executes complex dexterous tasks.
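To make the mechanism concrete, here is a minimal sketch (assumptions throughout: a toy MLP stands in for the paper's Transformer, the dimensions are made up, and the noise schedule is a generic DDPM schedule; this is not RDT's code) of how a diffusion policy samples an action chunk by iteratively denoising Gaussian noise conditioned on an observation embedding. Multi-modality comes for free: different noise seeds can land in different modes of the learned action distribution.

```python
# Illustrative only: dimensions, network, and schedule are stand-ins, not RDT's.
import torch
import torch.nn as nn

ACTION_DIM, HORIZON, STEPS = 128, 64, 100  # hypothetical sizes

class Denoiser(nn.Module):
    """Stand-in for the paper's Transformer: predicts the noise added to an
    action chunk, given the diffusion timestep and an observation embedding."""
    def __init__(self, obs_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM * HORIZON + obs_dim + 1, 1024),
            nn.ReLU(),
            nn.Linear(1024, ACTION_DIM * HORIZON),
        )

    def forward(self, noisy_actions, t, obs):
        x = torch.cat([noisy_actions.flatten(1), obs, t], dim=1)
        return self.net(x).view(-1, HORIZON, ACTION_DIM)

@torch.no_grad()
def sample_actions(model, obs, steps=STEPS):
    """DDPM-style reverse process: start from pure noise, denoise step by step."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(obs.shape[0], HORIZON, ACTION_DIM)
    for t in reversed(range(steps)):
        t_in = torch.full((obs.shape[0], 1), t / steps)
        eps = model(a, t_in, obs)
        mean = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        a = mean + torch.sqrt(betas[t]) * torch.randn_like(a) if t > 0 else mean
    return a

model = Denoiser()
obs = torch.randn(2, 512)             # stand-in for fused vision/language/proprio features
actions = sample_actions(model, obs)  # (2, HORIZON, ACTION_DIM) action chunk
```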

What carries the argument

The Robotics Diffusion Transformer (RDT) with its Physically Interpretable Unified Action Space, which unifies action representations across robots while retaining original physical meanings to enable transfer of physical knowledge.
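Read literally, the unified action space is a fixed-width vector whose slots carry fixed physical meanings and units. The sketch below is a hypothetical illustration of that idea (the slot names, ordering, and dimensionality are invented for the example and are not the paper's actual specification): each robot fills only the slots it possesses, in its native physical units, and a mask records which entries are valid.

```python
# Hypothetical slot layout for a physically indexed action vector; not the paper's.
import numpy as np

SLOTS = (
    [f"right_arm_joint_{i}_pos_rad" for i in range(7)]
    + [f"left_arm_joint_{i}_pos_rad" for i in range(7)]
    + ["right_gripper_width_m", "left_gripper_width_m"]
    + ["base_vel_x_mps", "base_vel_y_mps", "base_ang_vel_radps"]
)
INDEX = {name: i for i, name in enumerate(SLOTS)}

def to_unified(robot_action: dict):
    """Place robot-specific values into the shared vector without rescaling
    away their physical units; absent dimensions stay zero and are masked out."""
    vec = np.zeros(len(SLOTS), dtype=np.float32)
    mask = np.zeros(len(SLOTS), dtype=bool)
    for name, value in robot_action.items():
        vec[INDEX[name]] = value
        mask[INDEX[name]] = True
    return vec, mask

# A single-arm robot fills only its own slots; a bimanual robot fills both arms'.
single_arm = {f"right_arm_joint_{i}_pos_rad": 0.1 * i for i in range(7)}
single_arm["right_gripper_width_m"] = 0.04
vec, mask = to_unified(single_arm)
print(vec.shape, int(mask.sum()))  # (19,) 8
```

The property the paper leans on is that a given slot means the same physical thing for every robot, which is what makes cross-robot transfer of pre-trained knowledge plausible.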

If this is right

  • A single pre-trained model can handle coordination of two arms across varied hardware without per-robot redesign.
  • Language-conditioned bimanual behaviors become feasible without collecting thousands of task-specific trajectories.
  • Dexterous skills such as precise grasping or assembly can be added rapidly via minimal human demonstrations.
  • Multi-robot pre-training data can be reused rather than discarded when switching between different arm configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same unification strategy could apply to multi-arm or humanoid systems where action spaces differ even more sharply.
  • If the physical-meaning preservation holds, combining RDT with existing vision or language foundation models would require only light alignment rather than full retraining.
  • Future scaling laws for robot models might be tested by measuring how performance changes when the unified action space is replaced with raw joint commands.

Load-bearing premise

The unified action space preserves enough physical meaning to let knowledge learned on one robot transfer to others without losing control fidelity.

What would settle it

Real-robot trials in which RDT shows no zero-shot success on a novel object or requires more than five demonstrations to reach baseline performance on a new bimanual skill would falsify the performance claims.

read the original abstract

Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to https://rdt-robotics.github.io/rdt-robotics/ for the code and videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RDT-1B, a 1.2B-parameter diffusion-based Transformer foundation model for bimanual robotic manipulation. It pre-trains on the largest multi-robot dataset collection to date using a Physically Interpretable Unified Action Space to handle heterogeneous robot actions while preserving physical semantics, then fine-tunes on a self-collected 6K+ episode bimanual dataset. The model is claimed to outperform prior methods on real robots, with zero-shot generalization to unseen objects/scenes, language instruction following, 1-5 demonstration adaptation, and handling of complex dexterous tasks.

Significance. If the empirical claims hold under rigorous verification, this would represent a notable advance in scaling diffusion models for robotics, providing evidence for cross-robot transfer via unified action representations and few-shot adaptation in bimanual settings. The work ships code and videos, which supports reproducibility.

major comments (2)
  1. [§3] §3 (Action Space Design): The central claim of transferable physical knowledge rests on the Physically Interpretable Unified Action Space preserving original action semantics (torques, velocities, poses) across heterogeneous robots. The manuscript asserts this property but provides no formal definition, invariance proof, or ablation showing that the unification (normalization/embedding/kinematic projection) does not distort high-frequency or multi-modal components; without this, the pre-training on mixed datasets risks learning robot-specific artifacts rather than dynamics, directly undermining the zero-shot and few-shot claims.
  2. [§5] §5 (Experiments): The abstract and results assert significant outperformance, zero-shot generalization, and 1-5 demo learning, yet the provided text lacks visible quantitative tables with metrics, baselines, error bars, or full ablation studies on the unified action space. This makes it impossible to assess the degree to which data supports the performance claims or to verify that fine-tuning on the 6K bimanual episodes recovers any distortions from pre-training.
minor comments (2)
  1. [Abstract and §4] The abstract mentions scaling to 1.2B parameters as the largest diffusion foundation model for manipulation, but the methods section should explicitly compare parameter counts and training data scale against prior diffusion robotics works for context.
  2. [§4] Figure captions and video references are helpful, but the manuscript should include a clear table summarizing dataset sizes, robot types, and action space mappings used in pre-training.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without overstating our contributions.

read point-by-point responses
  1. Referee: [§3] §3 (Action Space Design): The central claim of transferable physical knowledge rests on the Physically Interpretable Unified Action Space preserving original action semantics (torques, velocities, poses) across heterogeneous robots. The manuscript asserts this property but provides no formal definition, invariance proof, or ablation showing that the unification (normalization/embedding/kinematic projection) does not distort high-frequency or multi-modal components; without this, the pre-training on mixed datasets risks learning robot-specific artifacts rather than dynamics, directly undermining the zero-shot and few-shot claims.

    Authors: We appreciate the referee's emphasis on rigorously justifying the unified action space, as this is central to our cross-robot transfer claims. The Physically Interpretable Unified Action Space maps robot-specific actions (joint torques, velocities, end-effector poses) to a shared physical representation using consistent units (e.g., m/s for velocities, Nm for torques) and minimal kinematic projections only when necessary to align coordinate frames, with the goal of preserving original semantics. We acknowledge that the current manuscript provides only a high-level description without a formal mathematical definition or invariance proof, relying instead on empirical validation through pre-training scale and downstream performance. In revision, we will add a precise formal definition of the mapping functions, normalization steps, and embedding process in Section 3, along with a new ablation study isolating the effect of unification on high-frequency action components and multi-modal distributions. This will directly test whether robot-specific artifacts are avoided (see the illustrative round-trip sketch after these responses). revision: partial

  2. Referee: [§5] §5 (Experiments): The abstract and results assert significant outperformance, zero-shot generalization, and 1-5 demo learning, yet the provided text lacks visible quantitative tables with metrics, baselines, error bars, or full ablation studies on the unified action space. This makes it impossible to assess the degree to which data supports the performance claims or to verify that fine-tuning on the 6K bimanual episodes recovers any distortions from pre-training.

    Authors: We thank the referee for highlighting the need for clearer and more complete experimental reporting. The full manuscript includes quantitative results in Section 5 with success-rate tables comparing RDT-1B to baselines (e.g., Diffusion Policy, RT-1), metrics for zero-shot generalization, language following, and 1-5 shot adaptation, plus error bars from multiple real-robot trials and ablations on the action space. However, we accept that these elements may not have been sufficiently prominent or detailed in the reviewed version, making independent assessment difficult. In the revised manuscript, we will prominently display all tables with full metrics, add expanded ablations specifically quantifying any pre-training distortions and their recovery after fine-tuning on the 6K+ bimanual episodes, and include additional statistical details to support the performance claims. revision: yes
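The exchange above turns on whether the unification step (normalization, embedding, kinematic projection) distorts physical quantities. As a purely illustrative example of the kind of round-trip check at issue (hypothetical frames, values, and transform; nothing here comes from the paper), a frame alignment that is a pure rotation preserves both semantics and magnitudes exactly:

```python
# Round-trip sanity check on a hypothetical frame alignment; not from the paper.
import numpy as np

def rot_z(theta: float) -> np.ndarray:
    """Rotation about z; stands in for a robot's base-to-shared-frame transform."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R_base_to_shared = rot_z(np.deg2rad(30.0))   # hypothetical mounting orientation

v_base = np.array([0.10, 0.00, 0.05])        # EEF velocity in m/s, robot's base frame
v_shared = R_base_to_shared @ v_base         # representation in the shared frame
v_back = R_base_to_shared.T @ v_shared       # map back to the robot's own frame

assert np.allclose(v_back, v_base)                                   # semantics preserved
assert np.isclose(np.linalg.norm(v_shared), np.linalg.norm(v_base))  # magnitude preserved
```

The referee's worry is precisely that the real mapping may include steps for which no such identity holds; the promised ablation would quantify that.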

Circularity Check

0 steps flagged

No circularity: empirical model design and evaluation

full rationale

The paper describes an empirical pipeline: pre-training a 1.2B-parameter diffusion transformer on heterogeneous multi-robot datasets, introducing a Physically Interpretable Unified Action Space as an engineering design to unify action representations, fine-tuning on a 6K-episode bimanual dataset, and validating via real-robot experiments for generalization and few-shot adaptation. No derivation chain, first-principles prediction, or analytical result is claimed that reduces by construction to fitted inputs, self-definitions, or self-citations. The unified action space is presented as an introduced mechanism rather than a derived theorem, and all performance claims rest on external benchmarks and physical robot tests rather than internal reductions.
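In outline, the pipeline this rationale describes is two supervised training stages followed by physical evaluation. The skeleton below is a stand-in only (synthetic tensors, a small MLP, and an MSE regression loss where RDT trains a diffusion objective) meant to show the shape of the procedure, not its substance:

```python
# Two-stage training skeleton with synthetic stand-in data; not RDT's code.
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_synthetic(n, obs_dim=512, act_dim=19):
    # Stand-in for episodes already mapped into the unified action space.
    return TensorDataset(torch.randn(n, obs_dim), torch.randn(n, act_dim))

model = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 19)
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def run_stage(dataset, epochs):
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    for _ in range(epochs):
        for obs, act in loader:
            loss = torch.nn.functional.mse_loss(model(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()

run_stage(make_synthetic(10_000), epochs=1)  # stage 1: pooled multi-robot pre-training
run_stage(make_synthetic(600), epochs=5)     # stage 2: fine-tuning on bimanual episodes
# Stage 3 in the paper is real-robot evaluation: task success, zero-shot and few-shot.
```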

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of diffusion for multi-modal robotic actions and the utility of the new unified action space; these are introduced without external independent validation beyond the reported experiments.

axioms (1)
  • domain assumption Diffusion models can effectively represent multi-modal action distributions in robotics.
    Invoked to justify the base modeling choice for handling the complexity of bimanual coordination.
invented entities (1)
  • Physically Interpretable Unified Action Space · no independent evidence
    purpose: To unify action representations of various robots while preserving physical meanings to enable transfer of physical knowledge.
    New design introduced in the paper to address data scarcity and robot heterogeneity.

pith-pipeline@v0.9.0 · 5602 in / 1444 out tokens · 81198 ms · 2026-05-11T07:40:09.796844+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge

  • Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    UNCLEAR: Pith found a possible connection between this paper passage and the cited Recognition theorem, but the relation is too broad or indirect to say the theorem truly supports the claim.

    RDT builds on diffusion models to effectively represent multi-modality... scaled it up to 1.2B parameters

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...

  2. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  3. Test-time Sparsity for Extreme Fast Action Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.

  4. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  5. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  6. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...

  7. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  8. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  9. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  10. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  11. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  12. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  13. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  14. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  15. Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking

    cs.RO 2026-04 unverdicted novelty 7.0

    A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.

  16. DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

    cs.RO 2026-03 conditional novelty 7.0

    DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

  17. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  18. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  19. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  20. CUBic: Coordinated Unified Bimanual Perception and Control Framework

    cs.RO 2026-05 unverdicted novelty 6.0

    CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...

  21. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  22. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  23. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  24. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  25. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  26. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  27. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  28. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  29. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  30. Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

    cs.RO 2026-05 unverdicted novelty 6.0

    A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

  31. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  32. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  33. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  34. Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

    cs.RO 2026-05 unverdicted novelty 6.0

    Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

  35. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  36. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  37. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  38. $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

    cs.RO 2026-04 unverdicted novelty 6.0

    M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

  39. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  40. OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

  41. AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

    cs.RO 2026-04 unverdicted novelty 6.0

    AffordGen generates affordance-aware manipulation demonstrations from 3D mesh correspondences to train policies with zero-shot generalization to novel objects.

  42. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  43. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  44. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  45. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  46. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  47. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  48. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  49. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  50. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  51. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  52. BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...

  53. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  54. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  55. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  56. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  57. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  58. Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

    cs.LG 2026-04 unverdicted novelty 5.0

    VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.

  59. DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.

  60. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 59 Pith papers · 5 internal anchors

  1. [1]

    Dimensionality reduction for dynamic movement primitives and application to bimanual manipulation of clothes

    1, 2, 3, 4, 5, 21, 23 Adrià Colomé and Carme Torras. Dimensionality reduction for dynamic movement primitives and application to bimanual manipulation of clothes. IEEE Transactions on Robotics, 34(3):602–615,

  2. [2]

    PaLM-E: An Embodied Multimodal Language Model

    3 Adrià Colomé and Carme Torras. Reinforcement learning of bimanual robot skills. Springer, 2020. 3 Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset. 22 Carlos Canudas de Wit, Bruno Siciliano, and Georges Bast...

  3. [3]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    1, 3 Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024. 3, 21, 22, 24, 25 Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Lu...

  4. [4]

    LoRA: Low-Rank Adaptation of Large Language Models

    22 Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 2, 3 Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685...

  5. [5]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    22 Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

  6. [6]

    OpenVLA: An Open-Source Vision-Language-Action Model

    20, 22 Minchan Kim, Junhyek Han, Jaehyung Kim, and Beomjoon Kim. Pre- and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer. 2023. 22 Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Gr...

  7. [7]

    VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation. arXiv preprint arXiv:2407.04152, 2024

    1 Hanwen Liang, Niamul Quader, Zhixiang Chi, Lizhe Chen, Peng Dai, Juwei Lu, and Yang Wang. Self-supervised spatiotemporal representation learning by exploiting video continuity. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 1564–1573, 2022. 2, 5 Rudolf Lioutikov, Oliver Kroemer, Guilherme Maeda, and Jan Peters. Learnin...

  8. [8]

    Multi-stage cable routing through hierarchical imitation learning

    8, 25 Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. arXiv pre-print,

  9. [9]

    Multi-stage cable routing through hierarchical imitation learning, 2024

    URL https://arxiv.org/abs/2307.08927. 22 Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning, 2024. URL https://arxiv.org/abs/2401.08553. 22 Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch...

  10. [10]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    22 Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022. 22 Seyed Sina Mirrazavi Salehian, Nadia Barbara Figueroa Fernandez, and Aude Billard. Dynamical system-based motion p...

  11. [11]

    Learning and retrieval from prior data for skill-based imitation learning

    1, 4 Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning (CoRL), 2022. 22 Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp. 8162–8171. PMLR, 20...

  12. [12]

    Imitating human behaviour with diffusion models

    23 Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, and Sam Devlin. Imitating human behaviour with diffusion models. In The Eleventh International Conference on Learning Representations, 2023. 3, 5 William Peebles and Saining Xie. Scalable diffu...

  13. [13]

    Multi-resolution sensing for real-time control with vision-language models

    19 Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=WuBv9-IGDUA. 22 Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On br...

  14. [14]

    0” actually has a physical meaning. For example, a speed of “0

    into each attention layer and replace each LayerNorm with RMSNorm (Zhang & Sennrich, 2019). In each DiT block’s cross-attention layer, we alternately inject language and image tokens rather than simultaneously inject both, avoiding the issue of token imbalance between the two modalities. After L DiT blocks, we normalize the output and project it back to t...