Pith Number

pith:E3T33RED

pith:2024:E3T33REDAUK72EVJZT4BVCFQGB

not attested · not anchored · not stored · refs resolved

What Matters in Building Vision-Language-Action Models for Generalist Robots

Bingyi Kang, Di Guo, Dong Wang, Hanbo Zhang, Huaping Liu, Jirong Liu, Long Qian, Minghuan Liu, Peiyan Li, Tao Kong, Xiao Ma, Xinghang Li, Xinlong Wang

Specific choices in backbones, architectures, and data timing let simple Vision-Language-Action models set new robot manipulation records.

arxiv:2412.14058 v4 · 2024-12-18 · cs.RO · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{E3T33REDAUK72EVJZT4BVCFQGB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

✓ Portable graph bundle: live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
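The page does not describe the merge algorithm itself, so the following is only a minimal sketch of what a deterministic merge can look like. The event fields (`ts`, `field`, `value`), the event values, and the tie-breaking rule are hypothetical and are not taken from the Pith schema; signature checking is omitted. The point it illustrates is that sorting events by a stable key before folding them gives every mirror the same state regardless of arrival order.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content hash of an event, used as a stable tie-breaker (assumed scheme)."""
    encoded = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def merge(canonical_record: dict, events: list[dict]) -> dict:
    """Fold events into a state in an order independent of arrival order."""
    state = {"record": canonical_record, "fields": {}}
    # Deduplicate by content hash, then sort by (timestamp, hash) so every
    # mirror applies the same events in the same sequence.
    unique = {event_id(e): e for e in events}
    for _, event in sorted(unique.items(), key=lambda kv: (kv[1]["ts"], kv[0])):
        state["fields"][event["field"]] = event["value"]
    return state


# Hypothetical record stub and events, for illustration only.
record = {"source": {"id": "2412.14058", "kind": "arxiv", "version": 4}}
events = [
    {"ts": "2026-05-17T23:38:13Z", "field": "citations", "value": "open"},
    {"ts": "2026-05-18T09:00:00Z", "field": "citations", "value": "closed"},
    {"ts": "2026-05-18T08:00:00Z", "field": "replications", "value": "open"},
]

# Two mirrors receive the events in different orders, one with a duplicate.
mirror_a = merge(record, events)
mirror_b = merge(record, list(reversed(events)) + [events[0]])
print(mirror_a == mirror_b)  # True
```

A last-writer-wins fold is used here only because it is the simplest rule that is order-independent once the events are sorted; the real bundle may resolve conflicts differently.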

Claims

C1 · strongest claim

Through extensive experiments which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs... RoboVLMs... achieve a new state-of-the-art performance in three simulation tasks and real-world experiments.

C2 · weakest assumption

That the chosen simulation tasks and real-world experiments are representative enough of generalist robot manipulation that the observed ranking of design choices will transfer to new tasks, embodiments, and environments not tested in the 600+ runs.

C3 · one-line summary

Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.

References

54 extracted · 54 resolved · 25 Pith anchors

[1] Flamingo: a visual language model for few-shot learning 2022

[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966

[3] PaliGemma: A versatile 3B VLM for transfer 2024 · arXiv:2407.07726

[4] $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control 2024 · arXiv:2410.24164

[5] RoboCat: A self-improving foundation agent for robotic manipulation 2023

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

PhysBrain 1.0 Technical Report

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

GR-3 Technical Report

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

Receipt and verification

First computed: 2026-05-17T23:38:13.006948Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (`pith-v1-2026-05`)
Schema: pith-number/v1.0

Canonical hash

26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609
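For this record, the 26-character identifier coincides with the start of the RFC 4648 Base32 encoding of the canonical hash. The page does not state that this is the official derivation, so treat the rule as an observation about this one record rather than a specification. A short check using only the Python standard library:

```python
import base64

canonical_hash = "26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609"
pith_number = "E3T33REDAUK72EVJZT4BVCFQGB"

# RFC 4648 Base32 of the raw 32-byte digest (56 characters including padding).
encoded = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")

# Observation for this record: the identifier equals the first 26 characters.
derived = encoded[: len(pith_number)]
print(derived)                 # E3T33REDAUK72EVJZT4BVCFQGB
print(derived == pith_number)  # True
```

Twenty-six Base32 characters carry 130 bits, so under this reading the identifier is a truncated form of the 256-bit hash and not a full replacement for it.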

Aliases

arxiv: 2412.14058 · arxiv_version: 2412.14058v4 · doi: 10.48550/arxiv.2412.14058 · pith_short_12: E3T33REDAUK7 · pith_short_16: E3T33REDAUK72EVJ · pith_short_8: E3T33RED
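The alias values listed above are consistent with simple derivations: each `pith_short_N` is the first N characters of the full identifier, and the DOI follows the arXiv DOI prefix `10.48550/arxiv.`. The sketch below rebuilds the alias line from the identifier and the arXiv source fields; the assumption that aliases are always generated this way is inferred from this record, not from documentation.

```python
pith_number = "E3T33REDAUK72EVJZT4BVCFQGB"
arxiv_id, arxiv_version = "2412.14058", 4

aliases = {
    "arxiv": arxiv_id,
    "arxiv_version": f"{arxiv_id}v{arxiv_version}",
    "doi": f"10.48550/arxiv.{arxiv_id}",
}
# Short forms are plain prefixes of the full identifier (inferred rule).
for n in (8, 12, 16):
    aliases[f"pith_short_{n}"] = pith_number[:n]

# Keys are sorted as strings, which is why pith_short_8 comes last.
alias_line = " · ".join(f"{key}: {value}" for key, value in sorted(aliases.items()))
print(alias_line)
```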

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/E3T33REDAUK72EVJZT4BVCFQGB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "334f8ebbf897361337c756e401d8bad05d9afe3016185d1262c03aca70045cd6",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.RO",
    "submitted_at": "2024-12-18T17:07:20Z",
    "title_canon_sha256": "68aff8c3e3373b6aae6aedffb5e6c987e6510dab548baddf2b11e15244a28776"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.14058",
    "kind": "arxiv",
    "version": 4
  }
}
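The verification command above canonicalizes the record by sorting keys, using compact separators, and keeping non-ASCII characters unescaped before hashing the UTF-8 bytes. The same steps can be run offline against the record printed here. This is a sketch under the assumption that the JSON shown on this page is identical to the `canonical_record` served by the resolver; if it is not, the digest will differ from the canonical hash.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record with the same canonicalization as the shell one-liner."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "334f8ebbf897361337c756e401d8bad05d9afe3016185d1262c03aca70045cd6",
        "cross_cats_sorted": ["cs.CV"],
        "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "primary_cat": "cs.RO",
        "submitted_at": "2024-12-18T17:07:20Z",
        "title_canon_sha256": "68aff8c3e3373b6aae6aedffb5e6c987e6510dab548baddf2b11e15244a28776",
    },
    "schema_version": "1.0",
    "source": {"id": "2412.14058", "kind": "arxiv", "version": 4},
}

expected = "26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609"
digest = canonical_sha256(record)
print(digest)
print("matches canonical hash:", digest == expected)

# Key order in the input does not matter because keys are sorted before hashing.
reordered = {key: record[key] for key in reversed(list(record))}
print(canonical_sha256(reordered) == digest)
```

Sorting keys and fixing the separators is what makes the hash reproducible: two JSON documents with the same content but different whitespace or key order serialize to the same bytes.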