Pith Number

pith:UCFL54T5

pith:2025:UCFL54T5TYCJIGF4EG4OUKZIID
not attested · not anchored · not stored · refs resolved

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Chendong Xiang, Guodong Liu, Hang Su, Hengkai Tan, Jun Zhu, Shuhe Huang, Xinyi Mao, Yao Feng

A video diffusion model pre-trained on internet-scale data and 750K robot trajectories adapts to new robot embodiments with only 20 minutes of demonstrations.

arxiv:2507.12898 v4 · 2025-07-17 · cs.LG · cs.AI · cs.CV · cs.RO

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{UCFL54T5TYCJIGF4EG4OUKZIID}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

With only 20 minutes of human demonstrations on an unseen robot (1% of typical data), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts.

C2 · weakest assumption

That continuous pre-training of an internet-scale video diffusion model on 750K trajectories from only three robot platforms produces a sufficiently general visual-dynamics prior that can be grounded to arbitrary new embodiments via a lightweight masked inverse dynamics adapter.

C3 · one-line summary

Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of typical demonstration data.

References

46 extracted · 46 resolved · 20 Pith anchors

[1] Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration 2024
[2] OpenVLA: An Open-Source Vision-Language-Action Model 2024
[3] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation 2024 · arXiv:2410.07864
[4] Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting 2023
[5] $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization 2025 · arXiv:2504.16054

Formal links

1 machine-checked theorem link

Cited by

19 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:48.275613Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

a08abef27d9e049418bc21b8ea2b2840fd028bb99319803a1c9b926dadbd5add

Aliases

arxiv: 2507.12898 · arxiv_version: 2507.12898v4 · doi: 10.48550/arxiv.2507.12898 · pith_short_12: UCFL54T5TYCJ · pith_short_16: UCFL54T5TYCJIGF4 · pith_short_8: UCFL54T5
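The short aliases listed here look like plain prefixes of the full Pith Number, and the number itself appears to match the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The record does not state this rule, so the minimal Python sketch below treats it as an assumption inferred from the values on this page; the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "a08abef27d9e049418bc21b8ea2b2840fd028bb99319803a1c9b926dadbd5add"
PITH_NUMBER = "UCFL54T5TYCJIGF4EG4OUKZIID"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed rule (not documented in the record): base32-encode the
    digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_number_from_hash(CANONICAL_HASH)

# Assumed rule for the short aliases: prefixes of length 8, 12, and 16.
short_aliases = {n: derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(short_aliases)
```

If the assumption holds for other records as well, an alias can be checked against a canonical hash without contacting the service.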
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/UCFL54T5TYCJIGF4EG4OUKZIID \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a08abef27d9e049418bc21b8ea2b2840fd028bb99319803a1c9b926dadbd5add
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "b93945cdd00928251a5e5d498e11d02981f54dfc1915fba1b499718dfc9f733a",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CV",
      "cs.RO"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-07-17T08:31:55Z",
    "title_canon_sha256": "fa3c128f8b8583963347041a1d34688c91ca0ea5ec73addaaa8e286d2aaa09b0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2507.12898",
    "kind": "arxiv",
    "version": 4
  }
}
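The verification one-liner above can also be run offline against a saved copy of the record. The minimal Python sketch below applies the same canonicalization as that command (sorted keys, compact separators, no ASCII escaping, UTF-8) to the record as displayed on this page. `canonical_sha256` is a hypothetical helper name, and the sketch does not confirm that the displayed JSON is exactly what the service hashes, so compare the printed digest with the canonical hash yourself.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" shown on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "b93945cdd00928251a5e5d498e11d02981f54dfc1915fba1b499718dfc9f733a",
    "cross_cats_sorted": ["cs.AI", "cs.CV", "cs.RO"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-07-17T08:31:55Z",
    "title_canon_sha256": "fa3c128f8b8583963347041a1d34688c91ca0ea5ec73addaaa8e286d2aaa09b0"
  },
  "schema_version": "1.0",
  "source": {"id": "2507.12898", "kind": "arxiv", "version": 4}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed in this record
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the saved copy, only on its values.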