Pith Number

pith:MVQLCYDA

pith:2025:MVQLCYDAV5XYQOSFXCJZJTZ4P5
not attested · not anchored · not stored · refs resolved

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Adrian Li-Bell, Anna Walling, Brian Ichter, Chelsea Finn, Danny Driess, Haohuan Wang, James Tanner, Karl Pertsch, Lachy Groom, Liyiming Ke, Lucy Xiaoyang Shi, Michael Equi, Niccolo Fusai, Quan Vuong, Sergey Levine

A hierarchical vision-language model lets robots interpret complex instructions and real-time feedback to choose and perform next steps.

arxiv:2502.19417 v2 · 2025-02-26 · cs.RO · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MVQLCYDAV5XYQOSFXCJZJTZ4P5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

our system can reason through complex prompts and incorporate situated feedback during task execution ('that's not trash')

C2 · weakest assumption

That the high-level VLM can reliably map open-ended natural language and visual feedback into correct next-step decisions without hallucinating or misinterpreting physical context.

C3 · one-line summary

A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.

References

51 extracted · 51 resolved · 15 Pith anchors

[1] RT-H: Action Hierarchies Using Language · 2024 · arXiv:2403.01823
[2] PaliGemma: A versatile 3B VLM for transfer · 2024 · arXiv:2407.07726
[3] $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control · 2024 · arXiv:2410.24164
[4] RT-1: Robotics Transformer for Real-World Control at Scale · 2022 · arXiv:2212.06817
[5] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control · 2023 · arXiv:2307.15818

Formal links

1 machine-checked theorem link

Cited by

32 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:49.872422Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

6560b16060af6f883a45b89394cf3c7f69d4dc0f491bc183b2b52a5082e3020b

Aliases

arxiv: 2502.19417 · arxiv_version: 2502.19417v2 · doi: 10.48550/arxiv.2502.19417 · pith_short_12: MVQLCYDAV5XY · pith_short_16: MVQLCYDAV5XYQOSF · pith_short_8: MVQLCYDA
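The short aliases listed above are plain prefixes of the full Pith Number (8, 12, and 16 characters). In addition, the full 26-character identifier on this page coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks both relationships for this record only. Treat the base32 relationship as an observation drawn from the values shown here, not as a documented rule of the scheme; the helper names `short_alias` and `base32_prefix` are hypothetical.

```python
import base64

# Values copied from this page.
PITH_NUMBER = "MVQLCYDAV5XYQOSFXCJZJTZ4P5"
CANONICAL_HASH = "6560b16060af6f883a45b89394cf3c7f69d4dc0f491bc183b2b52a5082e3020b"


def short_alias(pith_number: str, length: int) -> str:
    """Return a short alias by truncating the full identifier (hypothetical helper)."""
    return pith_number[:length]


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest and keep the first `length` characters.

    Assumption: this mirrors how the identifier relates to the hash on this
    page; it is not confirmed as the official derivation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


aliases = {f"pith_short_{n}": short_alias(PITH_NUMBER, n) for n in (8, 12, 16)}
for name, value in aliases.items():
    print(name, value)

print("identifier matches base32 prefix of hash:",
      base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
```

If the observed relationship holds in general, the identifier could be recomputed from the canonical hash alone, without trusting the page.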
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MVQLCYDAV5XYQOSFXCJZJTZ4P5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 6560b16060af6f883a45b89394cf3c7f69d4dc0f491bc183b2b52a5082e3020b
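The same check can be written as a standalone Python function, which makes the canonicalization step explicit: keys are sorted, separators are compact, non-ASCII characters are kept as-is, and the UTF-8 bytes are hashed with SHA-256. This is a minimal offline sketch that reuses the serialization options from the one-liner above. The function name `canonical_sha256` is hypothetical, and the sample record is copied from the JSON shown on this page, on the assumption that it is identical to what the API returns under `canonical_record`.

```python
import hashlib
import json

# Expected digest as stated on this page.
EXPECTED = "6560b16060af6f883a45b89394cf3c7f69d4dc0f491bc183b2b52a5082e3020b"


def canonical_sha256(record) -> str:
    """Serialize a record deterministically and return its SHA-256 hex digest."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Sample record copied from the page (assumed to match the API response).
record = {
    "metadata": {
        "abstract_canon_sha256": "2b64350ff6a13afc04f9ab60c2db11011409494aafe124453dddad25a14cfe73",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.RO",
        "submitted_at": "2025-02-26T18:58:41Z",
        "title_canon_sha256": "3b1fdea721df6a4839273c19af265454d6171db77f88a82f2f1d4d419a24a8a0",
    },
    "schema_version": "1.0",
    "source": {"id": "2502.19417", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because the keys are sorted before hashing, two mirrors that hold the same record with different key order should compute the same digest.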
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2b64350ff6a13afc04f9ab60c2db11011409494aafe124453dddad25a14cfe73",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.RO",
    "submitted_at": "2025-02-26T18:58:41Z",
    "title_canon_sha256": "3b1fdea721df6a4839273c19af265454d6171db77f88a82f2f1d4d419a24a8a0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2502.19417",
    "kind": "arxiv",
    "version": 2
  }
}