pith · machine review for the scientific record

arxiv: 2411.19650 · v1 · submitted 2024-11-29 · 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG

Recognition: no theorem link

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 07:27 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG
keywords vision-language-action · robotic manipulation · diffusion action transformer · componentized architecture · generalization · robot learning · multimodal models

The pith

A componentized vision-language-action model with a diffusion action module achieves higher success rates in robotic manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CogACT, which builds a VLA model by pairing a vision-language model with a specialized action module rather than repurposing the VLM directly for action prediction. The action module uses diffusion transformers to predict action sequences conditioned on the VLM's output. Experiments across five robot embodiments in simulation and the real world show substantial gains in task success, adaptation to new robots, and generalization to unseen objects and backgrounds. The approach is claimed to outperform both similarly sized and larger existing VLA models, and the design is argued to scale favorably for language-conditioned robot control.

Core claim

The key finding is that a componentized VLA architecture featuring a diffusion action transformer conditioned on VLM outputs for action sequence modeling leads to markedly improved task performance, generalization, and adaptability in robotic manipulation compared to previous VLA designs.

What carries the argument

Componentized VLA architecture with diffusion action transformers conditioned on VLM outputs.
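As a mechanical sketch of that claim, the toy code below (not the paper's implementation; all shapes, the weight matrix, and the linear stand-in denoiser are invented for illustration) separates a "cognition" module that emits a conditioning vector from a diffusion-style "action" module that iteratively denoises an action chunk conditioned on it:

```python
# Toy sketch of the componentized VLA idea: cognition produces a
# conditioning vector; a separate diffusion-style action module
# denoises an action sequence conditioned on it. Hypothetical sizes.
import numpy as np

rng = np.random.default_rng(0)
COND_DIM, HORIZON, ACT_DIM, STEPS = 8, 4, 7, 10  # toy dimensions

def vlm_cognition(image_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Stand-in for the VLM: fuse multimodal features into one conditioning vector."""
    return np.tanh(image_feat + text_feat)  # shape (COND_DIM,)

# Placeholder denoiser weights; in the paper this role is played by a
# diffusion action transformer conditioned on the VLM output.
W_noise = rng.standard_normal((COND_DIM, HORIZON * ACT_DIM)) * 0.1

def predict_noise(actions: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Predict the noise in a noisy action chunk, given conditioning and timestep."""
    bias = (cond @ W_noise).reshape(HORIZON, ACT_DIM)
    return actions * (t / STEPS) + bias  # linear placeholder dynamics

def sample_actions(cond: np.ndarray) -> np.ndarray:
    """DDPM-style reverse loop: start from Gaussian noise, denoise stepwise."""
    x = rng.standard_normal((HORIZON, ACT_DIM))
    for t in range(STEPS, 0, -1):
        x = x - (1.0 / STEPS) * predict_noise(x, cond, t)
    return x  # a chunk of HORIZON future actions, ACT_DIM dims each

cond = vlm_cognition(rng.standard_normal(COND_DIM), rng.standard_normal(COND_DIM))
actions = sample_actions(cond)
print(actions.shape)  # (4, 7)
```

The point of the structure, as the paper argues it, is that the action module models a continuous multi-step action sequence rather than quantizing single actions into the VLM's token vocabulary.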

If this is right

  • Exceeds OpenVLA's average success rates by over 35% in simulation and 55% in real-robot experiments at similar model size (7B).
  • Outperforms the larger RT-2-X model (55B) by 18% in absolute success rate in simulation.
  • Exhibits strong adaptation to new robots and generalization to unseen objects and backgrounds.
  • Shows favorable scaling behaviors with the diffusion action module.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of cognition and action could allow for modular upgrades where better VLMs are swapped in without retraining the action part.
  • Future work might explore applying this componentization to other robot tasks beyond manipulation.
  • If the architecture generalizes, it could reduce the need for massive end-to-end training in robot learning.

Load-bearing premise

Performance improvements result from the componentized architecture and diffusion action module rather than from training data volume, fine-tuning recipes, or evaluation differences.

What would settle it

A controlled study training OpenVLA and CogACT on identical data and protocols to check if the architecture still yields the reported gains.

original abstract

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a omponentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrates the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real work shows that our model not only significantly surpasses existing VLAs in task performance and but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA which has similar model size (7B) with ours by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (https://cogact.github.io/).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CogACT, a componentized Vision-Language-Action (VLA) model derived from a pretrained VLM backbone paired with a specialized diffusion-based action transformer module for sequence modeling. It reports large empirical gains on robotic manipulation benchmarks across five embodiments in simulation and real-world settings, claiming >35% higher average success rates than OpenVLA (7B) in simulation and >55% in real-robot experiments, plus an 18% absolute improvement over the much larger RT-2-X (55B) model in simulation. The work includes ablations on action-module variants, scaling studies, and generalization tests to new robots, objects, and backgrounds, with code and models released.

Significance. If the performance attribution holds after proper controls, the results would provide concrete evidence that separating high-level VLM-based cognition from a dedicated diffusion action module yields both higher task success and favorable scaling, offering a practical design pattern for future VLA systems. The multi-embodiment evaluation and generalization claims, together with the public release of models, would strengthen the paper's utility to the robotics community.

major comments (2)
  1. [§4] §4 (and appendix): The headline performance claims (35% sim / 55% real over OpenVLA at similar 7B scale; 18% over RT-2-X) rest on direct comparisons whose training corpora, episode counts, fine-tuning mixtures, and exact success criteria are not matched or quantified against the baselines. Without a controlled re-training of OpenVLA under the same data recipe or explicit disclosure of pre-training corpus overlap, the gains cannot be isolated to the componentized architecture and diffusion action module rather than data volume or evaluation-protocol differences.
  2. [§4] §4: No statistical significance, run-to-run variance, or confidence intervals are reported for the success-rate tables, and the abstract's aggregate percentages lack per-task breakdowns or controls for confounding factors such as episode length and success definition. These omissions make it impossible to assess whether the reported margins are robust.
minor comments (1)
  1. [Abstract] Abstract: Typo in 'a omponentized' (should read 'a componentized').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below, providing clarifications based on the experiments and evaluations reported in the paper. Where appropriate, we will revise the manuscript to improve transparency and robustness of the presented results.

point-by-point responses
  1. Referee: [§4] §4 (and appendix): The headline performance claims (35% sim / 55% real over OpenVLA at similar 7B scale; 18% over RT-2-X) rest on direct comparisons whose training corpora, episode counts, fine-tuning mixtures, and exact success criteria are not matched or quantified against the baselines. Without a controlled re-training of OpenVLA under the same data recipe or explicit disclosure of pre-training corpus overlap, the gains cannot be isolated to the componentized architecture and diffusion action module rather than data volume or evaluation-protocol differences.

    Authors: We appreciate the referee's emphasis on isolating architectural contributions. Our evaluations follow the exact task definitions, success criteria, and simulation environments reported in the OpenVLA and RT-2-X papers, using the same benchmark suites (e.g., the standard manipulation tasks across the five embodiments). Training data for CogACT is drawn from the publicly released Open X-Embodiment corpus, which forms the foundation for the baseline models as well. Section 4 and the appendix already describe our fine-tuning mixture and episode counts at a high level. To address the concern directly, we will expand the appendix with a side-by-side table quantifying training episode numbers, data source overlap, and precise success definitions used in our runs versus those in the baseline publications. While a full controlled re-training of OpenVLA under our exact recipe would be ideal for stronger isolation, it is computationally prohibitive at this scale; instead, we rely on the within-framework ablations (action module variants and scaling studies) to attribute gains to the diffusion transformer design. We believe these additions will allow readers to better assess the comparisons. revision: partial

  2. Referee: [§4] §4: No statistical significance, run-to-run variance, or confidence intervals are reported for the success-rate tables, and the abstract's aggregate percentages lack per-task breakdowns or controls for confounding factors such as episode length and success definition. These omissions make it impossible to assess whether the reported margins are robust.

    Authors: We agree that including variance estimates and per-task details would strengthen the presentation of results. Although the main tables report average success rates across tasks and embodiments, the full manuscript already contains per-task breakdowns in the appendix tables. In the revised version, we will augment the primary result tables (Section 4) with standard deviations computed over multiple evaluation runs (typically 3 seeds per task where feasible) and add 95% confidence intervals. We will also explicitly state the success definitions and average episode lengths used, confirming they match the protocols from prior VLA works to control for confounding factors. These changes will be reflected in both the main text and appendix, making the robustness of the >35% and >55% margins clearer. revision: yes
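For concreteness, the uncertainty reporting the referee asks for can be as simple as a binomial interval per task. The sketch below (illustrative counts, not figures from the paper) computes a 95% Wilson score interval for a success rate:

```python
# 95% Wilson score interval for k successes in n trials.
# The 18/25 episode count below is invented for illustration.
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(18, 25)  # e.g. 18 successful episodes out of 25
print(f"{lo:.2f}-{hi:.2f}")  # 0.52-0.86
```

At the small episode counts typical of real-robot evaluations (tens of trials per task), the Wilson interval is preferable to the normal approximation, which can produce bounds outside [0, 1].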

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal and benchmark comparisons

full rationale

The paper introduces a componentized VLA model with a VLM-conditioned diffusion action module, describes its training, and reports direct empirical success rates on five robot embodiments in simulation and real-world settings. Claims rest on measured task performance (e.g., +35% over OpenVLA at 7B scale, +18% over RT-2-X at 55B) rather than any first-principles derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce to inputs by construction; ablations and comparisons are external benchmarks. The work is therefore self-contained against its own experimental data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The performance claims rest on standard supervised fine-tuning of large models plus the specific choice of diffusion for action sequence modeling; no new physical axioms or invented entities are introduced.

free parameters (1)
  • Diffusion action module hyperparameters
    Number of diffusion steps, conditioning strength, and transformer width are design choices tuned to achieve the reported performance.
axioms (1)
  • domain assumption Standard deep-learning assumptions hold: i.i.d. train/test splits, representative robot embodiments, and comparable evaluation protocols across models.
    Implicit in all empirical robotics papers comparing success rates.

pith-pipeline@v0.9.0 · 5651 in / 1302 out tokens · 60657 ms · 2026-05-12T07:27:30.041349+00:00 · methodology

discussion (0)


Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

    cs.RO 2026-05 unverdicted novelty 7.0

    BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

  2. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  3. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  4. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  5. Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

    cs.RO 2026-05 unverdicted novelty 7.0

    Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

  6. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  7. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  8. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  9. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  10. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  11. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  12. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  13. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  14. See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 6.0

    GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.

  15. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  16. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

  17. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  18. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  19. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...

  20. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.

  21. Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

    cs.CV 2026-05 unverdicted novelty 6.0

    RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

  22. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  23. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  24. ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

  25. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  26. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  27. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  28. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  29. VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

    cs.RO 2026-05 unverdicted novelty 6.0

    VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

  30. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  31. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  32. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  33. ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  34. DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

    cs.RO 2026-03 unverdicted novelty 6.0

    DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.

  35. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  36. AttenA+: Rectifying Action Inequality in Robotic Foundation Models

    cs.RO 2026-05 unverdicted novelty 5.0

    AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.

  37. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  38. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  39. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  40. DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.

  41. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  42. RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models

    cs.DC 2026-03 unverdicted novelty 5.0

    RoboECC delivers up to 3.28x speedup for VLA model inference via co-aware segmentation and network-aware adjustment with 2.55-2.62% overhead.

  43. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  44. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  45. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 39 Pith papers · 22 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

  4. [4]

    Hydra: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. arXiv preprint, 2023.

  5. [5]

    Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024.

  6. [7]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

  7. [8]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.

  8. [9]

    Language Models are Few-Shot Learners

    Tom B. Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

  9. [10]

    The YCB object and model set: Towards common benchmarks for manipulation research

    Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. The YCB object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510-517. IEEE, 2015.

  10. [11]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024.

  11. [12]

    Berkeley UR5 demonstration dataset

    Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home.

  12. [13]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.

  13. [14]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185-24198, 2024.

  14. [15]

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023. 2, 3, 4

  15. [16]

Nikolaus Correll, Kostas E Bekris, Dmitry Berenson, Oliver Brock, Albert Causo, Kris Hauser, Kei Okada, Alberto Rodriguez, Joseph M Romano, and Peter R Wurman. Analysis and observations from the first Amazon Picking Challenge. IEEE Transactions on Automation Science and Engineering, 15(1):172–188, 2016. 14

  16. [17]

Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022. 13

  17. [18]

    Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. 13

  18. [19]

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022. 2

  19. [20]

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021. 13

  20. [21]

Angela D Friederici. The brain basis of language processing: from structure to function. Physiological Reviews, 91(4):1357–1392, 2011. 2

  21. [22]

Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation. In Robotics: Science and Systems, 2023. 13

  22. [23]

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 14, 16

  23. [24]

    Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, and Yuntao Chen. Diffusion transformer policy. arXiv preprint arXiv:2410.15959, 2024. 3

  24. [25]

Trevor Huff, Navid Mahabadi, and Prasanna Tadi. Neuroanatomy, visual cortex. 2018. 2

  25. [26]

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–

  26. [27]

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

  27. [28]

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024. 2, 3, 13

  28. [29]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...

  29. [30]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1, 2, 3, 5, 6, 7, 13, 15

  30. [31]

Eric Krotkov, Douglas Hackett, Larry Jackel, Michael Perschbacher, James Pippine, Jesse Strauss, Gill Pratt, and Christopher Orlowski. The DARPA Robotics Challenge finals: Results and perspectives. The DARPA Robotics Challenge Finals: Humanoid Robots to the Rescue, pages 1–26, 2018. 14

  31. [32]

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023.

  32. [33]

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024. 1, 2, 5, 6, 14, 15

  33. [34]

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations, 2024. 2

  34. [35]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 2

  35. [36]

Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS), 2023. 13

  36. [37]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 2

  37. [38]

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024. 3

  38. [39]

Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. arXiv preprint, 2023. 13

  39. [40]

Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. FMB: A functional manipulation benchmark for generalizable robotic learning. arXiv preprint arXiv:2401.08553, 2024. 13

  40. [41]

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023. 13

  41. [42]

Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with RoboTurk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1048–10...

  42. [43]

Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.

  43. [44]

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. CoRL, 2023. 13

  44. [45]

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909, 2023. 2

  45. [46]

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning (CoRL), 2022. 13

  46. [47]

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

  47. [48]

Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023. 1, 2, 3, 5, 6, 13, 15

  48. [49]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patr...

  49. [50]

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677, 2023. 3

  50. [51]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  51. [52]

Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Joern Vogel. Shared control templates for assistive robotics. In 2020 IEEE International Conference on Robotics and Automation (ICRA), page 7, Paris, France, 2020. 13

  52. [53]

Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.

  53. [54]

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task agnostic offline reinforcement learning. 2022. 13

  54. [55]

Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 7th Annual Conference on Robot Learning, 2023. 13

  55. [56]

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home, 2023. 13

  56. [57]

Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. MUTEX: Learning unified policies from multimodal task specifications. In 7th Annual Conference on Robot Learning, 2023.

  57. [58]

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.

  58. [59]

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 14

  59. [60]

Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. In Conference on Robot Learning, pages 3397–3417, 2023. 2

  60. [61]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

  61. [62]

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024. 1, 2, 3, 5, 6, 7, 13, 15

  62. [63]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

  63. [64]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 2, 3, 13

  64. [65]

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017. 2

  65. [66]

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023. 13

  66. [67]

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, et al. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024. 2

  67. [68]

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023. 3

  68. [69]

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, 2024. 2

  69. [70]

    Ge Yan, Kris Wu, and Xiaolong Wang. ucsd kitchens Dataset. 2023. 13

  70. [71]

Derek W Yip, Ayoola O Awosika, and Forshing Lui. Physiology, motor cortical, 2024. 2

  71. [72]

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023. 2

  72. [73]

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021. 2

  73. [74]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3, 13

  74. [75]

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. 4, 8

  75. [76]

Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9197–9203. IEEE, 2023. 13, 14

  76. [77]

    Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingyu Ding, Wei Zhan, and Masayoshi Tomizuka. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot. 2023. 13

  77. [78]

Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. VIOLA: Imitation learning for vision-based manipulation with object proposal priors. 6th Annual Conference on Robot Learning (CoRL), 2022. 13

  78. [79]

Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022. 13

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Supplementary Material

A. Training Data De...