pith. sign in

hub Mixed citations

Learning to act from actionless videos through dense correspondences

Mixed citation behavior. Most common role is background (67%).

19 Pith papers citing it
Background 67% of classified citations

hub tools

citation-role summary

background 6 baseline 1 method 1 other 1

citation-polarity summary

clear filters

representative citing papers

Any-point Trajectory Modeling for Policy Learning

cs.RO · 2023-12-28 · conditional · novelty 7.0

ATM pre-trains models to predict trajectories of any points in videos, then uses those predictions to learn strong visuomotor policies from minimal action labels, beating baselines by 80% on 130+ tasks.

Structured 4D Latent Predictive Model for Robot Planning

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

A 4D latent predictive model encodes scenes holistically to generate 3D-consistent futures that an inverse dynamics module converts into robot actions, outperforming video-based planners on manipulation tasks.

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

cs.RO · 2026-05-05 · unverdicted · novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from single human demonstrations without paired data.

Unified Video Action Model

cs.RO · 2025-02-28 · unverdicted · novelty 6.0

UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 44

    A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.

  • IGen: Scalable Data Generation for Robot Learning from Open-World Images cs.RO · 2025-12-01 · unverdicted · none · ref 35

    IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.

  • Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations cs.RO · 2025-07-01 · unverdicted · none · ref 60

    RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.

  • Unified Video Action Model cs.RO · 2025-02-28 · unverdicted · none · ref 27

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.