arxiv: 2403.09631 · v1 · submitted 2024-03-14 · 💻 cs.CV · cs.AI· cs.CL· cs.RO

Recognition: 3 theorem links

· Lean Theorem

3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen , Xiaowen Qiu , Peihao Chen , Jincheng Yang , Xin Yan , Yilun Du , Yining Hong , Chuang Gan

Authors on Pith no claims yet

Pith reviewed 2026-05-13 18:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.RO

keywords 3D-VLAvision-language-action modelgenerative world modelembodied diffusion3D point cloudsrobotics instruction datasetembodied planning

0 comments

The pith

3D-VLA connects 3D perception to robot actions by embedding a generative world model inside a language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language-action models operate on 2D images and map perception straight to actions without modeling world dynamics. The paper introduces 3D-VLA to address this gap by building a generative world model on a 3D large language model. Interaction tokens let the model engage with the environment while aligned diffusion networks generate future goal images and point clouds. A large training set is assembled by pulling 3D information from existing robotics datasets. The result is an embodied model that reasons about possible futures before selecting actions.

Core claim

3D-VLA is built on a 3D-based large language model with interaction tokens to engage the environment, and embodied diffusion models aligned to it for predicting goal images and point clouds. This creates a generative world model that links 3D perception, reasoning, and action, trained on a curated 3D embodied instruction dataset from existing robotics data. Experiments show significant improvements in reasoning, multimodal generation, and planning capabilities in embodied environments.

What carries the argument

A 3D large language model augmented with interaction tokens and aligned embodied diffusion models that generate future goal images and point clouds.

Load-bearing premise

3D information extracted from existing robotics datasets is diverse enough to train a model that generalizes to new environments.

What would settle it

Testing the trained model on a held-out robotics task or physical robot never seen during dataset curation and measuring whether planning success rates exceed those of standard 2D vision-language-action baselines.

read the original abstract

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces 3D-VLA, a generative world model for embodied AI that integrates 3D perception, reasoning, and action via a 3D-based LLM augmented with interaction tokens and aligned diffusion models for goal image and point-cloud prediction. A large-scale 3D embodied instruction dataset is curated by extracting 3D information from existing robotics corpora, and experiments on held-in splits are reported to show gains in reasoning, multimodal generation, and planning.

Significance. If the empirical claims are substantiated with quantitative metrics and generalization tests, the work could meaningfully advance embodied foundation models by shifting from direct perception-to-action mappings toward explicit generative world models that support planning via imagined 3D futures. The dataset curation effort is a constructive contribution to the community.

major comments (2)

[§4] §4 (Experiments): The central claim of 'significant improvements' in reasoning, generation, and planning is supported only by held-in dataset results; no quantitative metrics, baselines, ablation studies, or error analysis are supplied, leaving the magnitude and sources of any gains impossible to assess.
[§4.3] §4.3 (Evaluation): No out-of-distribution, held-out, or cross-robotology tests are described. Because the dataset is extracted from the same robotics sources used for training, observed gains may reflect interpolation within the training support rather than the claimed advantages of the 3D world model for real-world planning under distributional shift.

minor comments (2)

[Abstract] Abstract: The phrase 'significantly improves' is used without any numerical results or baseline comparisons.
[§3.2] §3.2: The mechanism by which interaction tokens interface with the embodied environment would benefit from a concrete example or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the experimental evaluation requires more rigorous quantitative support and generalization analysis to substantiate the claims. We have revised the manuscript to address these points and provide point-by-point responses below.

read point-by-point responses

Referee: [§4] §4 (Experiments): The central claim of 'significant improvements' in reasoning, generation, and planning is supported only by held-in dataset results; no quantitative metrics, baselines, ablation studies, or error analysis are supplied, leaving the magnitude and sources of any gains impossible to assess.

Authors: We acknowledge that the original submission relied primarily on held-in results and qualitative examples. In the revised manuscript, §4 has been expanded with quantitative metrics (task success rates for planning, accuracy for reasoning, and perceptual quality scores for generation), direct comparisons to baselines including 2D VLA models and non-generative variants, ablation studies on the 3D LLM backbone, interaction tokens, and diffusion alignment modules, and an error analysis subsection that categorizes failure modes and links them to specific model components. revision: yes
Referee: [§4.3] §4.3 (Evaluation): No out-of-distribution, held-out, or cross-robotology tests are described. Because the dataset is extracted from the same robotics sources used for training, observed gains may reflect interpolation within the training support rather than the claimed advantages of the 3D world model for real-world planning under distributional shift.

Authors: We agree that held-in results alone cannot fully rule out interpolation effects. The revised evaluation now includes a held-out split consisting of novel instruction-object combinations excluded from training but drawn from the same source corpora; 3D-VLA shows consistent gains over baselines on this split, supporting the value of the generative 3D world model. Full cross-robotology testing (different hardware platforms) is not feasible within the current revision due to the absence of aligned multi-robot 3D data and would require new collection efforts; we explicitly discuss this limitation and outline it as future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical training on external data with no self-referential derivation

full rationale

The paper describes an empirical pipeline: curating a 3D embodied instruction dataset by extracting information from existing robotics corpora, training a 3D-based LLM augmented with interaction tokens, training and aligning embodied diffusion models for goal image/point-cloud prediction, and reporting performance on held-in dataset splits. No equations, uniqueness theorems, or ansatzes are presented that reduce a claimed prediction or result to a quantity defined inside the paper itself. The central claims rest on observed improvements in reasoning/generation/planning metrics rather than any fitted parameter being renamed as a prediction or any self-citation chain substituting for independent justification. This is a standard empirical ML construction whose validity is assessed by external benchmarks, not by internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that 3D LLMs can be extended with generative capabilities via diffusion alignment and that curated robotics data suffices for training. No explicit free parameters or invented physical entities are named in the abstract.

axioms (1)

domain assumption A 3D-based LLM can be extended with generation abilities by aligning embodied diffusion models for goal image and point cloud prediction.
This is the core construction step stated in the abstract.

pith-pipeline@v0.9.0 · 5566 in / 1263 out tokens · 30704 ms · 2026-05-13T18:13:35.728197+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment.
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities
IndisputableMonolith.Foundation.PhiForcing phi_equation unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 6.0

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
cs.RO 2026-03 unverdicted novelty 6.0

ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
cs.RO 2025-03 unverdicted novelty 6.0

GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
cs.RO 2025-02 accept novelty 6.0

OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO 2025-02 unverdicted novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
FAST: Efficient Action Tokenization for Vision-Language-Action Models
cs.RO 2025-01 unverdicted novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
cs.RO 2024-03 accept novelty 6.0

DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO 2026-05 unverdicted novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
cs.RO 2026-04 unverdicted novelty 5.0

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
R3D: Revisiting 3D Policy Learning
cs.CV 2026-04 unverdicted novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
cs.RO 2026-04 unverdicted novelty 5.0

CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
cs.CV 2025-07 unverdicted novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
WorldVLA: Towards Autoregressive Action World Model
cs.RO 2025-06 unverdicted novelty 5.0

WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
Towards Robotic Dexterous Hand Intelligence: A Survey
cs.RO 2026-05 unverdicted novelty 4.0

A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 4.0

Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
cs.RO 2026-04 unverdicted novelty 3.0

A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
eess.SY 2026-04 unverdicted novelty 2.0

A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 26 Pith papers · 12 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 0 23716--23736, 2022

work page 2022
[2]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Bhat, S. F., Birkl, R., Wofk, D., Wonka, P., and M \"u ller, M. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review arXiv 2023
[3]

Zero-shot robotic manipulation with pretrained image-editing diffusion models, 2023

Black, K., Nakamoto, M., Atreya, P., Walke, H., Finn, C., Kumar, A., and Levine, S. Zero-shot robotic manipulation with pretrained image-editing diffusion models, 2023

work page 2023
[4]

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Brooks, T., Holynski, A., and Efros, A. A. Instructpix2pix: Learning to follow image editing instructions, 2023

work page 2023
[7]

Playfusion: Skill acquisition via diffusion from language-annotated play

Chen, L., Bahl, S., and Pathak, D. Playfusion: Skill acquisition via diffusion from language-annotated play. In Conference on Robot Learning, pp.\ 2012--2029. PMLR, 2023 a

work page 2012
[8]

Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning, 2023 b

Chen, S., Chen, X., Zhang, C., Li, M., Yu, G., Fei, H., Zhu, H., Fan, J., and Chen, T. Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning, 2023 b

work page 2023
[9]

X., Savva, M., Halber, M., Funkhouser, T., and Nießner, M

Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., and Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes, 2017

work page 2017
[10]

M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al

Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pp.\ 720--736, 2018

work page 2018
[11]

Dass, S., Yapeter, J., Zhang, J., Zhang, J., Pertsch, K., Nikolaidis, S., and Lim, J. J. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

work page 2023
[12]

Objaverse: A universe of annotated 3d objects, 2022

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., and Farhadi, A. Objaverse: A universe of annotated 3d objects, 2022

work page 2022
[13]

Dreamllm: Synergistic multimodal com- prehension and creation

Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., Kong, X., Zhang, X., Ma, K., and Yi, L. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023

work page arXiv 2023
[14]

PaLM-E: An Embodied Multimodal Language Model

Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., and Florence, P. Palm-e: An embodied multimodal language model, 2023 b

work page 2023
[16]

Structure and content-guided video synthesis with diffusion models, 2023

Esser, P., Chiu, J., Atighehchian, P., Granskog, J., and Germanidis, A. Structure and content-guided video synthesis with diffusion models, 2023

work page 2023
[17]

Rh20t: A robotic dataset for learning diverse skills in one-shot

Fang, H.-S., Fang, H., Tang, Z., Liu, J., Wang, J., Zhu, H., and Lu, C. Rh20t: A robotic dataset for learning diverse skills in one-shot. arXiv preprint arXiv:2307.00595, 2023

work page arXiv 2023
[18]

Finetuning offline world models in the real world,

Feng, Y., Hansen, N., Xiong, Z., Rajagopalan, C., and Wang, X. Finetuning offline world models in the real world. arXiv preprint arXiv:2310.16029, 2023

work page arXiv 2023
[19]

Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following, 2023

Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., and Heng, P.-A. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following, 2023

work page 2023
[20]

3d- llm: Injecting the 3d world into large language models,

Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., and Gan, C. 3d-llm: Injecting the 3d world into large language models. arXiv preprint arXiv:2307.12981, 2023

work page arXiv 2023
[21]

Multiply: A multisensory object-centric embodied large language model in 3d world

Hong, Y., Zheng, Z., Chen, P., Wang, Y., Li, J., and Gan, C. Multiply: A multisensory object-centric embodied large language model in 3d world. arXiv preprint arXiv:2401.08577, 2024

work page arXiv 2024
[22]

and Montani, I

Honnibal, M. and Montani, I. spaCy 2 : Natural language understanding with B loom embeddings, convolutional neural networks and incremental parsing. To appear, 2017

work page 2017
[23]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

Chat-3d v2: Bridging 3d scene and large language models with object identifiers, 2023 a

Huang, H., Wang, Z., Huang, R., Liu, L., Cheng, X., Zhao, Y., Jin, T., and Zhao, Z. Chat-3d v2: Bridging 3d scene and large language models with object identifiers, 2023 a

work page 2023
[25]

An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

Huang, J., Yong, S., Ma, X., Linghu, X., Li, P., Wang, Y., Li, Q., Zhu, S.-C., Jia, B., and Huang, S. An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871, 2023 b

work page arXiv 2023
[26]

Language is not all you need: Aligning perception with language models

Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023 c

work page arXiv 2023
[27]

R., and Davison, A

James, S., Ma, Z., Arrojo, D. R., and Davison, A. J. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5 0 (2): 0 3019--3026, 2020

work page 2020
[28]

Bc-z: Zero-shot task generalization with robotic imitation learning

Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., and Finn, C. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp.\ 991--1002. PMLR, 2022

work page 2022
[29]

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[30]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp.\ 12888--12900. PMLR, 2022

work page 2022
[31]

Covlm: Composing visual entities and relationships in large language models via communicative decoding

Li, J., Chen, D., Hong, Y., Chen, Z., Chen, P., Shen, Y., and Gan, C. Covlm: Composing visual entities and relationships in large language models via communicative decoding. arXiv preprint arXiv:2311.03354, 2023 a

work page arXiv 2023
[32]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

3dmit: 3d multi-modal instruction tuning for scene understanding, 2024

Li, Z., Zhang, C., Wang, X., Ren, R., Xu, Y., Ma, R., and Liu, X. 3dmit: 3d multi-modal instruction tuning for scene understanding, 2024

work page 2024
[34]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Hoi4d: A 4d egocentric dataset for category-level human-object interaction

Liu, Y., Liu, Y., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., and Yi, L. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 21013--21022, 2022

work page 2022
[36]

UNIFIED - IO : A unified model for vision, language, and multi-modal tasks

Lu, J., Clark, C., Zellers, R., Mottaghi, R., and Kembhavi, A. UNIFIED - IO : A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=E01k9048soZ

work page 2023
[37]

Language conditioned imitation learning over unstructured data,

Lynch, C. and Sermanet, P. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020

work page arXiv 2005
[38]

Interactive language: Talking to robots in real time

Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., and Florence, P. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

work page 2023
[39]

Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity

Mandlekar, A., Booher, J., Spero, M., Tung, A., Gupta, A., Zhu, Y., Garg, A., Savarese, S., and Fei-Fei, L. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 1048--1055. IEEE, 2019

work page 2019
[40]

(1982) Vision: A computational investigation into the human representation and processing of visual information

Marr, D. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information . The MIT Press, 07 2010. ISBN 9780262514620. doi:10.7551/mitpress/9780262514620.001.0001. URL https://doi.org/10.7551/mitpress/9780262514620.001.0001

work page doi:10.7551/mitpress/9780262514620.001.0001 2010
[41]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

Mees, O., Hermann, L., Rosete-Beas, E., and Burgard, W. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7 0 (3): 0 7327--7334, 2022

work page 2022
[42]

Grounding language with visual affordances over unstructured data

Mees, O., Borja-Diaz, J., and Burgard, W. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023

work page 2023
[43]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., and Chen, M. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022

work page internal anchor Pith review arXiv 2022
[44]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Padalkar, A., Pooley, A., Jain, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Singh, A., Brohan, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

The effects of contextual scenes on the identification of objects

Palmer, S. The effects of contextual scenes on the identification of objects. Memory & Cognition, 3: 0 519--526, 01 1975

work page 1975
[46]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., and Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Seeing and Visualizing: It's Not What You Think

Pylyshyn, Z. Seeing and Visualizing: It's Not What You Think. 01 2003. ISBN 9780262316316. doi:10.7551/mitpress/6137.001.0001

work page doi:10.7551/mitpress/6137.001.0001 2003
[48]

Gpt4point: A unified framework for point-language understanding and generation, 2023

Qi, Z., Fang, Y., Sun, Z., Wu, X., Wu, T., Wang, J., Lin, D., and Zhao, H. Gpt4point: A unified framework for point-language understanding and generation, 2023

work page 2023
[49]

K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., Undersander, E., Galuba, W., Westbury, A., Chang, A

Ramakrishnan, S. K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., Undersander, E., Galuba, W., Westbury, A., Chang, A. X., Savva, M., Zhao, Y., and Batra, D. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai, 2021

work page 2021
[50]

Grounded sam: Assembling open-world models for diverse visual tasks, 2024

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., and Zhang, L. Grounded sam: Assembling open-world models for diverse visual tasks, 2024

work page 2024
[51]

High-resolution image synthesis with latent diffusion models

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

work page 2022
[52]

Playing with food: Learning food item representations through interactive exploration

Sawhney, A., Lee, S., Zhang, K., Veloso, M., and Kroemer, O. Playing with food: Learning food item representations through interactive exploration. In Experimental Robotics: The 17th International Symposium, pp.\ 309--322. Springer, 2021

work page 2021
[53]

J., Florence, P., Han, W., Baruch, R., Lu, Y., Mirchandani, S., Xu, P., Sanketi, P., Hausman, K., Shafran, I., Ichter, B., and Cao, Y

Sermanet, P., Ding, T., Zhao, J., Xia, F., Dwibedi, D., Gopalakrishnan, K., Chan, C., Dulac-Arnold, G., Maddineni, S., Joshi, N. J., Florence, P., Han, W., Baruch, R., Lu, Y., Mirchandani, S., Xu, P., Sanketi, P., Hausman, K., Shafran, I., Ichter, B., and Cao, Y. Robovqa: Multimodal long-horizon reasoning for robotics. In arXiv preprint arXiv:2311.00899, 2023

work page arXiv 2023
[54]

Shafiullah, N. M. M., Rai, A., Etukuru, H., Liu, Y., Misra, I., Chintala, S., and Pinto, L. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

work page arXiv 2023
[55]

MUTEX : Learning unified policies from multimodal task specifications

Shah, R., Mart \' n-Mart \' n, R., and Zhu, Y. MUTEX : Learning unified policies from multimodal task specifications. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=PwqiqaaEzJ

work page 2023
[56]

Lancon-learn: Learning with language to enable generalization in multi-task manipulation

Silva, A., Moorman, N., Silva, W., Zaidi, Z., Gopalan, N., and Gombolay, M. Lancon-learn: Learning with language to enable generalization in multi-task manipulation. IEEE Robotics and Automation Letters, 7 0 (2): 0 1635--1642, 2021

work page 2021
[57]

and Deng, J

Teed, Z. and Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part II 16, pp.\ 402--419. Springer, 2020

work page 2020
[58]

R., Black, K., Zhao, T

Walke, H. R., Black, K., Zhao, T. Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A. W., Myers, V., Kim, M. J., Du, M., et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pp.\ 1723--1736. PMLR, 2023

work page 2023
[59]

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Wu, S., Fei, H., Qu, L., Ji, W., and Chua, T.-S. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023

work page arXiv 2023
[60]

Pointllm: Empowering large language models to understand point clouds, 2023

Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., and Lin, D. Pointllm: Empowering large language models to understand point clouds, 2023

work page 2023
[61]

Uni3d: Exploring unified 3d representation at scale, 2023

Zhou, J., Wang, J., Ma, B., Liu, Y.-S., Huang, T., and Wang, X. Uni3d: Exploring unified 3d representation at scale, 2023

work page 2023
[62]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023