RVT-2: learning precise manipulation from few demonstrations

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox · 2024 · arXiv 2406.08545

Mixed citation behavior. Most common role is background (60%).

Cited by 13 Pith papers.

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 3 · unclear: 1 · use method: 1

representative citing papers

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.

Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

cs.RO · 2026-05-02 · unverdicted · novelty 6.0

Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

cs.RO · 2026-04-07 · unverdicted · novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

cs.RO · 2026-03-05 · conditional · novelty 6.0

SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

cs.RO · 2025-05-06 · unverdicted · novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

A Survey on Vision-Language-Action Models for Embodied AI

cs.RO · 2024-05-23 · unverdicted · novelty 6.0

This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

cs.RO · 2026-05-25 · unverdicted · novelty 5.0

OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

cs.RO · 2026-05-20 · unverdicted · novelty 5.0

A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

cs.CV · 2026-04-12 · unverdicted · novelty 5.0

UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

The paper quantifies the geometric gap in current VLAs via linear probing and compares three architectures for injecting geometry from GFMs while analyzing impacts of data, cameras, and reconstruction quality.

citing papers explorer

Showing 9 of 9 citing papers after filters.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
cs.RO · 2026-05-21 · conditional · none · ref 9
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
cs.RO · 2026-05-31 · unverdicted · none · ref 19
Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.

Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
cs.RO · 2026-05-02 · unverdicted · none · ref 9
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
cs.RO · 2026-04-07 · unverdicted · none · ref 12
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
cs.RO · 2026-03-05 · conditional · none · ref 12
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
cs.RO · 2026-05-25 · unverdicted · none · ref 15
OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
cs.RO · 2026-05-20 · unverdicted · none · ref 17
A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
cs.CV · 2026-04-12 · unverdicted · none · ref 19
UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models
cs.CV · 2026-05-23 · unverdicted · none · ref 17
The paper quantifies the geometric gap in current VLAs via linear probing and compares three architectures for injecting geometry from GFMs while analyzing impacts of data, cameras, and reconstruction quality.
