arxiv: 2508.07917 · v4 · submitted 2025-08-11 · 💻 cs.RO

MolmoAct: Action Reasoning Models that can Reason in Space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna

Pith reviewed 2026-05-14 23:30 UTC · model grok-4.3

classification 💻 cs.RO

keywords: action reasoning models, robotic foundation models, trajectory planning, depth-aware perception, zero-shot generalization, embodied AI, LIBERO benchmark, editable plans

The pith

MolmoAct encodes robot observations into depth-aware tokens, editable trajectory traces, and low-level actions through a three-stage pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most robotic foundation models map perception and instructions straight to control outputs, which restricts how well they adapt to new situations or explain their choices. MolmoAct instead follows a structured sequence: it first converts visual observations and language instructions into depth-aware perception tokens, then produces mid-level spatial plans in the form of editable trajectory traces, and finally converts those plans into precise low-level robot actions. This separation keeps the reasoning visible so humans can inspect or edit the trajectories. The paper reports strong zero-shot performance on benchmarks like SimplerEnv and LIBERO, along with higher task progression after real-world fine-tuning. The authors further release a dataset of over 10,000 robot trajectories to support training of similar models. A reader would care because the approach promises robots that are both more capable and more controllable than direct-mapping systems.

Core claim

MolmoAct encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. The 7B model reaches 70.5 percent zero-shot accuracy on SimplerEnv Visual Matching tasks, 86.6 percent average success on LIBERO with extra gains on long-horizon cases, and real-world fine-tuning improvements of 10 percent for single-arm and 22.7 percent for bimanual setups over Pi-0-FAST, plus 23.3 percent better out-of-distribution generalization.

What carries the argument

The Action Reasoning Model three-stage pipeline that converts depth-aware perception tokens into editable trajectory traces and then into low-level actions.
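
To make the three stages concrete, here is a minimal toy sketch of the data flow, not the paper's implementation. Every name in it (`encode_depth_tokens`, `plan_trace`, `trace_to_actions`, the 2D `Waypoint` type, the bin count) is a hypothetical placeholder: the real model produces these outputs autoregressively from a vision-language backbone, whereas this sketch uses simple arithmetic only to show how an edit to the mid-level trace propagates into the low-level actions.

```python
from typing import List, Tuple

# Hypothetical representation: a waypoint is an (x, y) pixel coordinate in the image.
Waypoint = Tuple[int, int]


def encode_depth_tokens(depth_map: List[List[float]], n_bins: int = 100) -> List[int]:
    """Stage 1 (toy): quantize normalized depths in [0, 1] into discrete token ids."""
    return [min(int(d * n_bins), n_bins - 1) for row in depth_map for d in row]


def plan_trace(start: Waypoint, goal: Waypoint, n_points: int = 5) -> List[Waypoint]:
    """Stage 2 (toy): a straight-line trace standing in for the learned spatial plan."""
    if n_points < 2:
        raise ValueError("a trace needs at least a start and a goal")
    return [
        (
            round(start[0] + (goal[0] - start[0]) * i / (n_points - 1)),
            round(start[1] + (goal[1] - start[1]) * i / (n_points - 1)),
        )
        for i in range(n_points)
    ]


def trace_to_actions(trace: List[Waypoint]) -> List[Waypoint]:
    """Stage 3 (toy): per-step deltas between consecutive waypoints as 'actions'."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(trace, trace[1:])]


tokens = encode_depth_tokens([[0.0, 0.5], [0.999, 1.0]])
trace = plan_trace((0, 0), (8, 4))

# A human edits one waypoint of the plan; the downstream actions change accordingly.
edited = list(trace)
edited[2] = (4, 6)

print(tokens)
print(trace_to_actions(trace))
print(trace_to_actions(edited))
```

The point of the sketch is the interface: because the trace is an explicit list of waypoints between perception and control, it can be inspected or modified before any action is issued.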

If this is right

The model achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching, exceeding the closed-source Pi-0 and GR00T N1.5 baselines.
It records 86.6% average success on LIBERO, including a 6.3% gain over ThinkAct on long-horizon tasks.
Real-world fine-tuning yields an extra 10% single-arm and 22.7% bimanual task progression over Pi-0-FAST.
Out-of-distribution generalization improves by an additional 23.3% relative to baselines.
Human preference scores rank highest for open-ended instruction following and trajectory steering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Making the intermediate spatial plans editable opens the possibility for real-time human correction during execution without retraining the entire model.
The released mid-training dataset of over 10,000 trajectories could serve as a shared resource for testing other modular planning architectures.
If the separation of perception, planning, and control generalizes, similar pipelines might appear in non-robot domains such as autonomous navigation or manipulation in simulation.
The emphasis on depth-aware tokens suggests future work could test whether explicit 3D structure remains necessary when scaling to larger models.
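
The first extension above, human correction during execution, can be illustrated with a small sketch. This is an assumption about how such a correction might be spliced into a plan, not a mechanism described in the paper; `splice_correction` and its arguments are hypothetical names, and waypoints are again simplified to 2D pixel coordinates.

```python
from typing import List, Sequence, Tuple

Waypoint = Tuple[int, int]  # hypothetical (x, y) pixel coordinate


def splice_correction(
    trace: Sequence[Waypoint],
    executed: int,
    correction: Sequence[Waypoint],
) -> List[Waypoint]:
    """Keep the already-executed prefix, follow the user's waypoints, then rejoin the goal.

    `executed` is the number of waypoints the robot has already reached.
    """
    if not 0 < executed <= len(trace):
        raise ValueError("executed must be between 1 and len(trace)")
    goal = trace[-1]
    new_trace = list(trace[:executed]) + list(correction)
    if new_trace[-1] != goal:
        new_trace.append(goal)  # preserve the original goal of the plan
    return new_trace


original = [(0, 0), (2, 1), (4, 2), (6, 3), (8, 4)]
steered = splice_correction(original, executed=2, correction=[(3, 5), (6, 6)])
print(steered)
```

Only the remaining part of the plan is rewritten, which is what would allow a correction without retraining: the low-level stage simply receives a different trace.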

Load-bearing premise

The structured three-stage pipeline of depth-aware perception, editable trajectory planning, and low-level control produces meaningfully better adaptability, generalization, and semantic grounding than direct perception-to-action models.

What would settle it

An experiment that trains a model of identical size and data but removes the intermediate trajectory-trace stage and shows equal or higher scores on the same SimplerEnv, LIBERO, and real-world tasks would falsify the claimed advantage of the pipeline.

Original abstract

Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MolmoAct defines a three-stage ARM pipeline with editable trajectories and releases a 10k-trajectory dataset that drives clear benchmark gains, but the architecture's isolated contribution over data scaling remains unproven.

The main point is that MolmoAct introduces Action Reasoning Models as a class that inserts depth-aware perception tokens, then editable mid-level trajectory traces, then low-level actions. This setup aims for better steerability and grounding than direct perception-to-action models. The 7B version hits 70.5% zero-shot on SimplerEnv visual matching, 86.6% on LIBERO with extra lift on long-horizon tasks, and real-world fine-tuning gains of 10% single-arm and 22.7% bimanual over Pi-0-FAST, plus 23.3% better out-of-distribution results and strong human preference scores for open instructions and steering. They also release the full model weights, code, and the new MolmoAct Dataset of over 10k trajectories, which alone adds 5.5% average performance when used for training. That open release is the most immediately useful part for the field.

The soft spot is exactly the one flagged in the stress-test: no ablation keeps data, backbone, and tokenization fixed while dropping the editable-trajectory stage, so it is still possible the reported margins are mostly data-driven rather than pipeline-driven. The abstract also gives percentages without error bars, protocol details, or baseline implementation notes, which makes it harder to judge how solid the comparisons are.

This paper is for robotics groups working on foundation models who want reproducible starting points and ideas for spatial planning. A reader who needs open data or concrete numbers on steerable agents will get value from it. It deserves a serious referee because the releases and benchmark numbers are concrete enough to support discussion and follow-up work, even if the central claim about the three-stage structure needs tighter evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces Action Reasoning Models (ARMs) as a class of robotic foundation models and presents MolmoAct-7B-D, which encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts low-level actions. It reports 70.5% zero-shot accuracy on SimplerEnv Visual Matching, 86.6% average success on LIBERO (with +6.3% on long-horizon tasks over ThinkAct), real-world gains of +10% (single-arm) and +22.7% (bimanual) over Pi-0-FAST after fine-tuning, +23.3% on out-of-distribution generalization, and top human-preference scores. The work also releases the MolmoAct Dataset (>10k trajectories) whose use yields a 5.5% average performance lift, along with model weights, training code, and an action reasoning dataset.

Significance. If the three-stage pipeline demonstrably outperforms direct perception-to-action baselines when data and backbone are controlled, the work would provide a concrete, open blueprint for explainable and steerable robotic foundation models, addressing limitations in adaptability and semantic grounding. The public release of the 10k-trajectory dataset, code, and weights is a clear community benefit that could accelerate follow-on research on mid-level spatial planning.

major comments (2)

[Experiments] Experiments section (and abstract performance claims): The central claim that the depth-aware perception → editable trajectory traces → low-level action pipeline produces meaningfully better adaptability, generalization, and semantic grounding than direct mapping models is not isolated by an ablation that holds the MolmoAct Dataset, model backbone, and tokenization fixed while removing only the intermediate editable-trajectory stage. The reported 5.5% average gain from the new dataset and large margins over Pi-0/GR00T/ThinkAct therefore leave open the possibility that results are largely data-driven rather than architecture-driven.
[Results] Results tables and real-world evaluation: Specific percentages (70.5% SimplerEnv, 86.6% LIBERO, +10%/+22.7% real-world) are presented without error bars, number of trials, statistical tests, or detailed baseline re-implementation protocols, which are required to substantiate claims of surpassing closed-source models and to support the generalization and human-preference assertions.
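
As an illustration of why trial counts matter for the second comment, the sketch below computes a Wilson score interval for a reported success rate. The trial counts are hypothetical, chosen only so that the point estimate equals 86.6%; the paper's actual number of evaluation episodes is not stated in the abstract.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts: both give an 86.6% success rate, with different trial budgets.
small = wilson_interval(433, 500)
large = wilson_interval(4330, 5000)
print(small, large)
```

The same headline percentage carries a visibly wider interval at the smaller trial budget, which is why the comparison against baselines cannot be judged from percentages alone.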

minor comments (2)

[Abstract] Abstract: The phrase 'top human-preference scores for open-ended instruction following and trajectory steering' lacks any description of the evaluation protocol, number of raters, or comparison setup.
[Introduction] Dataset release statement: While the release of the MolmoAct Dataset is welcome, the paper does not specify licensing, exact collection procedure, or quality-control criteria used to ensure the 'high quality' trajectories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, outlining the revisions we plan to make.

Referee: [Experiments] Experiments section (and abstract performance claims): The central claim that the depth-aware perception → editable trajectory traces → low-level action pipeline produces meaningfully better adaptability, generalization, and semantic grounding than direct mapping models is not isolated by an ablation that holds the MolmoAct Dataset, model backbone, and tokenization fixed while removing only the intermediate editable-trajectory stage. The reported 5.5% average gain from the new dataset and large margins over Pi-0/GR00T/ThinkAct therefore leave open the possibility that results are largely data-driven rather than architecture-driven.

Authors: We thank the referee for highlighting this important point. While our work emphasizes the full three-stage pipeline as the core of Action Reasoning Models, and the dataset is released to enable further research, we acknowledge that a controlled ablation isolating the editable trajectory stage—while holding the dataset, backbone, and tokenization fixed—would more definitively separate architectural contributions from data effects. In the revised manuscript, we will include such an ablation study by training a variant without the intermediate planning stage on the same data and backbone. revision: yes
Referee: [Results] Results tables and real-world evaluation: Specific percentages (70.5% SimplerEnv, 86.6% LIBERO, +10%/+22.7% real-world) are presented without error bars, number of trials, statistical tests, or detailed baseline re-implementation protocols, which are required to substantiate claims of surpassing closed-source models and to support the generalization and human-preference assertions.

Authors: We agree that including error bars, trial counts, statistical tests, and detailed baseline protocols is essential for robust claims. In the revised manuscript, we will update the results tables and real-world evaluation sections to include these details, such as standard deviations across multiple runs, the number of evaluation trials, p-values where applicable, and expanded descriptions of how each baseline was implemented or evaluated under consistent conditions. revision: yes
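
One simple form the promised p-values could take is a two-sided two-proportion z-test on success counts. The sketch below is a generic textbook test, not the authors' stated protocol, and the counts in the example are invented for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(s1: int, n1: int, s2: int, n2: int) -> Tuple[float, float]:
    """Two-sided z-test for the difference between two success rates.

    Returns (z, p_value). Uses the pooled standard error under the null
    hypothesis that both methods share the same underlying success rate.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("trial counts must be positive")
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Invented example: method A succeeds 45/50 times, baseline B succeeds 30/50 times.
z, p = two_proportion_z_test(45, 50, 30, 50)
print(z, p)
```

With small real-world trial budgets the normal approximation is rough, so an exact test or a bootstrap over episodes may be a better fit; the sketch only shows the kind of reporting the referee asks for.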

Circularity Check

0 steps flagged

No circularity: empirical performance claims with no derivations or self-referential equations

The paper describes an empirical robotics model (MolmoAct) using a three-stage pipeline of depth-aware tokens, editable trajectories, and low-level actions. All reported results (70.5% SimplerEnv, 86.6% LIBERO, real-world gains, 5.5% from new dataset) are framed as experimental comparisons against baselines, not as quantities derived from fitted parameters or first-principles equations. No mathematical derivations, uniqueness theorems, or ansatzes appear in the provided text. The central claim rests on ablation-free empirical deltas rather than any reduction of outputs to inputs by construction. Self-citations are absent from load-bearing positions; the new dataset is released openly and its contribution is stated as an additive empirical factor, not a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract provides no explicit free parameters, mathematical axioms, or additional invented entities beyond the high-level model description; the ARM class itself is the primary new framing.

invented entities (1)

Action Reasoning Models (ARMs): no independent evidence
purpose: Class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline
Newly introduced in the abstract to frame the MolmoAct approach.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing · alexander_duality_circle_linking · unclear
Paper passage: "MolmoAct encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions"

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
Paper passage: "Training with this dataset yields an average 5.5% improvement in general performance over the base model"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
cs.AI 2026-05 unverdicted novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
cs.CV 2026-04 unverdicted novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
Token Warping Helps MLLMs Look from Nearby Viewpoints
cs.CV 2026-04 unverdicted novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
cs.RO 2026-03 conditional novelty 7.0

DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-05 conditional novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
cs.RO 2026-04 unverdicted novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
cs.RO 2026-04 unverdicted novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
cs.RO 2026-03 conditional novelty 6.0

ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
cs.RO 2025-10 unverdicted novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
cs.LG 2026-05 unverdicted novelty 5.0

Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
cs.RO 2026-04 unverdicted novelty 5.0

PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
cs.RO 2026-04 unverdicted novelty 5.0

Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 5.0

VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
