Recognition: no theorem link
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Pith reviewed 2026-05-12 07:27 UTC · model grok-4.3
The pith
A componentized vision-language-action model with a diffusion action module achieves higher success rates in robotic manipulation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The key finding is that a componentized VLA architecture featuring a diffusion action transformer conditioned on VLM outputs for action sequence modeling leads to markedly improved task performance, generalization, and adaptability in robotic manipulation compared to previous VLA designs.
What carries the argument
Componentized VLA architecture with diffusion action transformers conditioned on VLM outputs.
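To make the carrier concrete, here is a minimal sketch (not the authors' implementation) of what "a diffusion action module conditioned on VLM outputs" can look like: a small transformer denoises a short action chunk while a projected VLM feature vector is prepended as a conditioning token. All module names, dimensions, and the linear noise schedule below are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a VLM-conditioned diffusion action head.
# Names, dimensions, and the noise schedule are assumptions for exposition,
# not the CogACT implementation.
import torch
import torch.nn as nn


class DiffusionActionHead(nn.Module):
    """Denoises a short action chunk conditioned on a VLM feature vector."""

    def __init__(self, action_dim=7, horizon=16, cond_dim=4096, d_model=256,
                 n_layers=4, n_heads=8, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        self.action_in = nn.Linear(action_dim, d_model)
        self.cond_in = nn.Linear(cond_dim, d_model)
        self.t_embed = nn.Embedding(n_steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_out = nn.Linear(d_model, action_dim)
        # Linear noise schedule (beta_t); the paper may use a different one.
        betas = torch.linspace(1e-4, 0.02, n_steps)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, noisy_actions, t, vlm_features):
        """Predicts the noise added to `noisy_actions` at diffusion step `t`."""
        # noisy_actions: (B, horizon, action_dim); vlm_features: (B, cond_dim)
        tokens = self.action_in(noisy_actions) + self.t_embed(t)[:, None, :]
        cond = self.cond_in(vlm_features)[:, None, :]          # (B, 1, d_model)
        tokens = torch.cat([cond, tokens], dim=1)              # prepend condition token
        return self.action_out(self.backbone(tokens)[:, 1:])   # drop condition slot

    def training_loss(self, actions, vlm_features):
        """Standard denoising objective: predict the injected Gaussian noise."""
        b = actions.shape[0]
        t = torch.randint(0, self.n_steps, (b,), device=actions.device)
        noise = torch.randn_like(actions)
        a_bar = self.alphas_cumprod[t][:, None, None]
        noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
        return nn.functional.mse_loss(self(noisy, t, vlm_features), noise)
```

The structural point this illustrates is that the action head is a separate module reading VLM features, so it could in principle be scaled or swapped independently of the VLM backbone.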
If this is right
- Exceeds OpenVLA's average success rates by over 35% in simulation and by over 55% in real-robot experiments at a similar model size (7B).
- Outperforms the larger RT-2-X model (55B) by 18% in absolute success rate in simulation.
- Exhibits strong adaptation to new robots and generalization to unseen objects and backgrounds.
- Shows favorable scaling behaviors with the diffusion action module.
Where Pith is reading between the lines
- The separation of cognition and action could allow for modular upgrades where better VLMs are swapped in without retraining the action part.
- Future work might explore applying this componentization to other robot tasks beyond manipulation.
- If the architecture generalizes, it could reduce the need for massive end-to-end training in robot learning.
Load-bearing premise
Performance improvements result from the componentized architecture and diffusion action module rather than from training data volume, fine-tuning recipes, or evaluation differences.
What would settle it
A controlled study training OpenVLA and CogACT on identical data and protocols to check if the architecture still yields the reported gains.
read the original abstract
The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a omponentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrates the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real work shows that our model not only significantly surpasses existing VLAs in task performance and but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA which has similar model size (7B) with ours by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (https://cogact.github.io/).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CogACT, a componentized Vision-Language-Action (VLA) model derived from a pretrained VLM backbone paired with a specialized diffusion-based action transformer module for sequence modeling. It reports large empirical gains on robotic manipulation benchmarks across five embodiments in simulation and real-world settings, claiming >35% higher average success rates than OpenVLA (7B) in simulation and >55% in real-robot experiments, plus an 18% absolute improvement over the much larger RT-2-X (55B) model in simulation. The work includes ablations on action-module variants, scaling studies, and generalization tests to new robots, objects, and backgrounds, with code and models released.
Significance. If the performance attribution holds after proper controls, the results would provide concrete evidence that separating high-level VLM-based cognition from a dedicated diffusion action module yields both higher task success and favorable scaling, offering a practical design pattern for future VLA systems. The multi-embodiment evaluation and generalization claims, together with the public release of models, would strengthen the paper's utility to the robotics community.
major comments (2)
- [§4] §4 (and appendix): The headline performance claims (35% sim / 55% real over OpenVLA at similar 7B scale; 18% over RT-2-X) rest on direct comparisons whose training corpora, episode counts, fine-tuning mixtures, and exact success criteria are not matched or quantified against the baselines. Without a controlled re-training of OpenVLA under the same data recipe or explicit disclosure of pre-training corpus overlap, the gains cannot be isolated to the componentized architecture and diffusion action module rather than data volume or evaluation-protocol differences.
- [§4] §4: No statistical significance, run-to-run variance, or confidence intervals are reported for the success-rate tables, and the abstract's aggregate percentages lack per-task breakdowns or controls for confounding factors such as episode length and success definition. These omissions make it impossible to assess whether the reported margins are robust.
minor comments (1)
- [Abstract] Abstract: Typo in 'a omponentized' (should read 'a componentized').
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below, providing clarifications based on the experiments and evaluations reported in the paper. Where appropriate, we will revise the manuscript to improve transparency and robustness of the presented results.
read point-by-point responses
-
Referee: [§4] §4 (and appendix): The headline performance claims (35% sim / 55% real over OpenVLA at similar 7B scale; 18% over RT-2-X) rest on direct comparisons whose training corpora, episode counts, fine-tuning mixtures, and exact success criteria are not matched or quantified against the baselines. Without a controlled re-training of OpenVLA under the same data recipe or explicit disclosure of pre-training corpus overlap, the gains cannot be isolated to the componentized architecture and diffusion action module rather than data volume or evaluation-protocol differences.
Authors: We appreciate the referee's emphasis on isolating architectural contributions. Our evaluations follow the exact task definitions, success criteria, and simulation environments reported in the OpenVLA and RT-2-X papers, using the same benchmark suites (e.g., the standard manipulation tasks across the five embodiments). Training data for CogACT is drawn from the publicly released Open X-Embodiment corpus, which forms the foundation for the baseline models as well. Section 4 and the appendix already describe our fine-tuning mixture and episode counts at a high level. To address the concern directly, we will expand the appendix with a side-by-side table quantifying training episode numbers, data source overlap, and precise success definitions used in our runs versus those in the baseline publications. While a full controlled re-training of OpenVLA under our exact recipe would be ideal for stronger isolation, it is computationally prohibitive at this scale; instead, we rely on the within-framework ablations (action module variants and scaling studies) to attribute gains to the diffusion transformer design. We believe these additions will allow readers to better assess the comparisons. revision: partial
-
Referee: [§4] §4: No statistical significance, run-to-run variance, or confidence intervals are reported for the success-rate tables, and the abstract's aggregate percentages lack per-task breakdowns or controls for confounding factors such as episode length and success definition. These omissions make it impossible to assess whether the reported margins are robust.
Authors: We agree that including variance estimates and per-task details would strengthen the presentation of results. Although the main tables report average success rates across tasks and embodiments, the full manuscript already contains per-task breakdowns in the appendix tables. In the revised version, we will augment the primary result tables (Section 4) with standard deviations computed over multiple evaluation runs (typically 3 seeds per task where feasible) and add 95% confidence intervals. We will also explicitly state the success definitions and average episode lengths used, confirming they match the protocols from prior VLA works to control for confounding factors. These changes will be reflected in both the main text and appendix, making the robustness of the >35% and >55% margins clearer. revision: yes
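For readers gauging what the promised variance reporting would add, the sketch below (not taken from the paper) computes a mean success rate and a 95% normal-approximation confidence interval across evaluation seeds. The three per-seed rates are placeholder values; with this few seeds, a t-distribution critical value would give a wider, more conservative interval than the 1.96 used here.

```python
# Illustrative only: mean success rate and 95% CI across evaluation seeds.
# The per-seed rates below are placeholders, not data from the paper.
import math


def mean_and_ci95(rates):
    """Normal-approximation 95% CI over per-seed success rates."""
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / (n - 1)   # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    per_seed_success = [0.72, 0.68, 0.75]  # hypothetical 3-seed evaluation
    mean, (lo, hi) = mean_and_ci95(per_seed_success)
    print(f"success rate = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```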
Circularity Check
No circularity: empirical architecture proposal and benchmark comparisons
full rationale
The paper introduces a componentized VLA model with a VLM-conditioned diffusion action module, describes its training, and reports direct empirical success rates on five robot embodiments in simulation and real-world settings. Claims rest on measured task performance (e.g., +35% over OpenVLA at 7B scale, +18% over RT-2-X at 55B) rather than any first-principles derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce to inputs by construction; ablations and comparisons are external benchmarks. The work is therefore self-contained against its own experimental data.
Axiom & Free-Parameter Ledger
free parameters (1)
- Diffusion action module hyperparameters (see the illustrative configuration sketch after this ledger)
axioms (1)
- domain assumption Standard deep-learning assumptions hold: i.i.d. train/test splits, representative robot embodiments, and comparable evaluation protocols across models.
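As a concrete reading of the single free-parameter entry above, the following hypothetical configuration lists the kinds of knobs a diffusion action module typically exposes; none of these names or values are taken from the paper.

```python
# Hypothetical knobs for a diffusion action module; values are illustrative,
# not the settings used in CogACT.
from dataclasses import dataclass


@dataclass
class DiffusionActionConfig:
    action_dim: int = 7          # e.g. 6-DoF end-effector delta plus gripper
    horizon: int = 16            # length of the predicted action chunk
    denoise_steps: int = 100     # diffusion steps used during training
    sample_steps: int = 10       # fewer steps at inference (DDIM-style sampling)
    noise_schedule: str = "linear"
    d_model: int = 256           # transformer width of the action head
    n_layers: int = 4
    n_heads: int = 8
```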
Forward citations
Cited by 45 Pith papers
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference
Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
-
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
-
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models
DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models
RoboECC delivers up to 3.28x speedup for VLA model inference via co-aware segmentation and network-aware adjustment with 2.55-2.62% overhead.
-
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
-
RLDX-1 Technical Report
RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
work page internal anchor Pith review arXiv 2023
-
[4]
Hydra: Hybrid robot actions for imitation learning
Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. arXiv, 2023.
work page 2023
-
[5]
Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation
Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024.
-
[7]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Language Models are Few-Shot Learners
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. 2
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[10]
The ycb object and model set: Towards common benchmarks for manipulation research
Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517. IEEE, 2015.
work page 2015
-
[11]
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Berkeley UR5 demonstration dataset
Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home. 13
-
[13]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 2, 3
-
[14]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
work page 2024
-
[15]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
work page 2023
-
[16]
Analysis and observations from the first amazon picking challenge
Nikolaus Correll, Kostas E Bekris, Dmitry Berenson, Oliver Brock, Albert Causo, Kris Hauser, Kei Okada, Alberto Ro- driguez, Joseph M Romano, and Peter R Wurman. Analysis and observations from the first amazon picking challenge. IEEE Transactions on Automation Science and Engineering, 15(1):172–188, 2016. 14
work page 2016
-
[17]
From play to policy: Conditional behavior generation from uncurated robot data
Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022.
-
[18]
Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. 13
work page 2023
-
[19]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
work page internal anchor Pith review arXiv 2022
-
[20]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting general- ization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021. 13
work page internal anchor Pith review arXiv 2021
-
[21]
The brain basis of language processing: from structure to function
Angela D Friederici. The brain basis of language processing: from structure to function. Physiological Reviews, 91(4):1357–1392, 2011.
work page 2011
-
[22]
Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Robotics: Sci- ence and Systems, 2023. 13
work page 2023
-
[23]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 14, 16
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, and Yuntao Chen. Diffusion transformer policy. arXiv preprint arXiv:2410.15959, 2024. 3
-
[25]
Trevor Huff, Navid Mahabadi, and Prasanna Tadi. Neuroanatomy, visual cortex. 2018.
work page 2018
-
[26]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Fred- erik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning , pages 991–
-
[27]
Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
-
[28]
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865 , 2024. 2, 3, 13
-
[29]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Bal- akrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...
work page 2024
-
[30]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1, 2, 3, 5, 6, 7, 13, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
The darpa robotics challenge finals: Results and perspectives
Eric Krotkov, Douglas Hackett, Larry Jackel, Michael Perschbacher, James Pippine, Jesse Strauss, Gill Pratt, and Christopher Orlowski. The darpa robotics challenge finals: Results and perspectives. The DARPA robotics challenge finals: Humanoid robots to the rescue, pages 1–26, 2018.
work page 2018
-
[32]
Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023.
-
[33]
Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024. 1, 2, 5, 6, 14, 15
work page internal anchor Pith review arXiv 2024
-
[34]
Vision-language foundation models as effective robot imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In The Twelfth International Con- ference on Learning Representations, 2024. 2
work page 2024
-
[35]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 2
work page 2023
-
[36]
Robot learning on the job: Human-in-the-loop autonomy and learning during deployment
Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS), 2023.
work page 2023
-
[37]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 2
work page 2024
-
[38]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipu- lation. arXiv preprint arXiv:2410.07864, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Multi-stage cable routing through hierarchical imitation learning
Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. arXiv preprint, 2023.
work page 2023
-
[40]
Fmb: a functional manipulation benchmark for generalizable robotic learning
Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. arXiv preprint arXiv:2401.08553, 2024. 13
-
[41]
Interactive language: Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023. 13
work page 2023
-
[42]
Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS), pages 1048–10...
work page 2019
-
[43]
Grounding language with visual affordances over unstruc- tured data
Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstruc- tured data. In Proceedings of the IEEE International Con- ference on Robotics and Automation (ICRA) , London, UK,
-
[44]
Structured world models from human videos
Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. CoRL, 2023.
work page 2023
-
[45]
R3m: A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909, 2023.
work page 2023
-
[46]
Learning and retrieval from prior data for skill-based imitation learning
Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning (CoRL), 2022.
work page 2022
-
[47]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
-
[48]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864 , 2023. 1, 2, 3, 5, 6, 13, 15
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patr...
work page 2024
-
[50]
Imitating human behaviour with diffusion models
Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677, 2023. 3
-
[51]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,
-
[52]
Shared Control Templates for Assistive Robotics
Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Joern Vogel. Shared Control Templates for Assistive Robotics. In 2020 IEEE International Conference on Robotics and Automation (ICRA), page 7, Paris, France, 2020.
work page 2020
-
[53]
Goal-conditioned imitation learning using score-based diffusion policies
Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.
-
[54]
Latent plans for task agnostic offline reinforcement learning
Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task agnostic offline reinforcement learning. 2022.
work page 2022
-
[55]
Multi-resolution sensing for real-time control with vision-language models
Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 7th Annual Conference on Robot Learning, 2023.
work page 2023
-
[56]
Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home, 2023. 13
work page 2023
-
[57]
MUTEX: Learning unified policies from multimodal task specifications
Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. MUTEX: Learning unified policies from multimodal task specifications. In 7th Annual Conference on Robot Learning, 2023.
-
[58]
Perceiver-actor: A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
-
[59]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 14
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[60]
Open-world object manipulation using pre-trained vision-language models
Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kir- mani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. In Conference on Robot Learning, pages 3397–3417, 2023. 2
work page 2023
-
[61]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024. 1, 2, 3, 5, 6, 7, 13, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[64]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 2, 3, 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[65]
Neural discrete representation learning
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information pro- cessing systems, 30, 2017. 2
work page 2017
-
[66]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Con- ference on Robot Learning, pages 1723–1736. PMLR, 2023. 13
work page 2023
-
[67]
Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024.
-
[68]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre- training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023. 3
work page internal anchor Pith review arXiv 2023
-
[69]
Unleashing large-scale video generative pre-training for visual robot manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, 2024. 2
work page 2024
-
[70]
Ge Yan, Kris Wu, and Xiaolong Wang. ucsd kitchens Dataset. 2023. 13
work page 2023
-
[71]
Physiology, motor cortical, 2024
Derek W Yip, Ayoola O Awosika, and Forshing Lui. Physiology, motor cortical, 2024.
work page 2024
-
[72]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
work page internal anchor Pith review arXiv 2023
-
[73]
Soundstream: An end-to-end neural audio codec
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
work page 2021
-
[74]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3, 13
work page 2023
-
[75]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. 4, 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[76]
Train offline, test online: A real robot learning benchmark
Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9197–9203. IEEE, 2023.
work page 2023
-
[77]
Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot
Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingyu Ding, Wei Zhan, and Masayoshi Tomizuka. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot. 2023. 13
work page 2023
-
[78]
Viola: Imitation learning for vision-based manipulation with object proposal priors
Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. 6th Annual Conference on Robot Learning (CoRL), 2022.
work page 2022
-
[79]
Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022.
work page 2022