hub Canonical reference

Videovla: Video generators can be generalizable robot manipulators

Shen, Y · 2025 · arXiv 2512.06963

Canonical reference. 86% of citing Pith papers cite this work as background.

11 Pith papers citing it

Background 86% of classified citations

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 other 1

citation-polarity summary

background 6 unclear 1

representative citing papers

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

cs.RO · 2026-03-23 · unverdicted · novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

cs.RO · 2026-03-18 · conditional · novelty 7.0

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

cs.RO · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

cs.RO · 2026-04-06 · unverdicted · novelty 6.0

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

Embody4D: A Generalist 4D World Model for Embodied AI

cs.CV · 2026-05-03 · unverdicted · novelty 5.0

Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

cs.CV · 2026-04-30 · unverdicted · novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.

Causal World Modeling for Robot Control

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

citing papers explorer

Showing 11 of 11 citing papers.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 4
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields cs.CV · 2026-05-07 · unverdicted · none · ref 19
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 60
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models cs.RO · 2026-03-23 · unverdicted · none · ref 32
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 22
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training cs.RO · 2026-05-13 · unverdicted · none · ref 16
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising cs.RO · 2026-04-29 · unverdicted · none · ref 21 · 2 links
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation? cs.RO · 2026-04-06 · unverdicted · none · ref 37
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
Embody4D: A Generalist 4D World Model for Embodied AI cs.CV · 2026-05-03 · unverdicted · none · ref 42
Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling cs.CV · 2026-04-30 · unverdicted · none · ref 73
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
Causal World Modeling for Robot Control cs.CV · 2026-01-29 · unverdicted · none · ref 64
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

Videovla: Video generators can be generalizable robot manipulators

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer