Pith Number

pith:E3T33RED

pith:2024:E3T33REDAUK72EVJZT4BVCFQGB
not attested · not anchored · not stored · refs resolved

What Matters in Building Vision-Language-Action Models for Generalist Robots

Bingyi Kang, Di Guo, Dong Wang, Hanbo Zhang, Huaping Liu, Jirong Liu, Long Qian, Minghuan Liu, Peiyan Li, Tao Kong, Xiao Ma, Xinghang Li, Xinlong Wang

Specific choices in backbones, architectures, and data timing let simple Vision-Language-Action models set new robot manipulation records.

arxiv:2412.14058 v4 · 2024-12-18 · cs.RO · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{E3T33REDAUK72EVJZT4BVCFQGB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
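The page does not specify the merge algorithm, so the following is only a generic sketch of how a deterministic merge can work, not Pith's implementation. The event fields (`at`, `field`, `value`), the content-derived event ID, and the last-write-wins fold are all assumptions made for illustration. The point it shows is that deduplicating events and applying them in a fixed total order makes the result independent of the order in which a mirror received them.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content-derived ID: SHA-256 of the event's sorted-key JSON (assumed scheme)."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge(record: dict, *event_sets: list) -> dict:
    """Fold events into the record in an order that ignores arrival order."""
    # Deduplicate: the same event seen by two mirrors maps to one ID.
    unique = {event_id(e): e for events in event_sets for e in events}
    state = dict(record)
    # Sort by (timestamp, content ID) so ties break identically everywhere.
    for _, e in sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0])):
        state[e["field"]] = e["value"]  # hypothetical last-write-wins rule
    return state


# Hypothetical events; the field names are illustrative only.
base = {"source": "arxiv:2412.14058"}
ev_a = {"at": "2026-05-18T00:00:00Z", "field": "citations", "value": "open"}
ev_b = {"at": "2026-05-19T00:00:00Z", "field": "citations", "value": "closed"}

state_1 = merge(base, [ev_a, ev_b])
state_2 = merge(base, [ev_b], [ev_a, ev_b])  # different arrival order, one duplicate
print(state_1 == state_2)
```

Sorting on a content-derived ID as the tie-breaker is one common way to avoid depending on per-mirror state such as insertion order.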

Claims

C1 strongest claim

Through extensive experiments which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs... RoboVLMs... achieve a new state-of-the-art performance in three simulation tasks and real-world experiments.

C2 weakest assumption

That the chosen simulation tasks and real-world experiments are representative enough of generalist robot manipulation that the observed ranking of design choices will transfer to new tasks, embodiments, and environments not tested in the 600+ runs.

C3 one line summary

Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.

References

54 extracted · 54 resolved · 25 Pith anchors

[1] Flamingo: a visual language model for few-shot learning 2022
[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966
[3] PaliGemma: A versatile 3B VLM for transfer 2024 · arXiv:2407.07726
[4] $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control 2024 · arXiv:2410.24164
[5] RoboCat: A self-improving foundation agent for robotic manipulation 2023

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.006948Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609

Aliases

arxiv: 2412.14058 · arxiv_version: 2412.14058v4 · doi: 10.48550/arxiv.2412.14058 · pith_short_12: E3T33REDAUK7 · pith_short_16: E3T33REDAUK72EVJ · pith_short_8: E3T33RED
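The identifiers on this page appear to be related: the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and the `pith_short_*` aliases are prefixes of it. This is inferred from the values shown here, not from a published specification, so treat the sketch below as an observation about this record. The function name `pith_number` is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609"


def pith_number(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    Assumption: the identifier is a Base32 prefix of the SHA-256 digest.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number(CANONICAL_HASH)
short_8, short_12, short_16 = (full[:n] for n in (8, 12, 16))
print(full)      # E3T33REDAUK72EVJZT4BVCFQGB
print(short_8)   # E3T33RED
print(short_12)  # E3T33REDAUK7
print(short_16)  # E3T33REDAUK72EVJ
```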
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/E3T33REDAUK72EVJZT4BVCFQGB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 26e7bdc4830515fd12a9ccf81a88b0304845d02a7ee8022d7163fbd4f8dbd609
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "334f8ebbf897361337c756e401d8bad05d9afe3016185d1262c03aca70045cd6",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.RO",
    "submitted_at": "2024-12-18T17:07:20Z",
    "title_canon_sha256": "68aff8c3e3373b6aae6aedffb5e6c987e6510dab548baddf2b11e15244a28776"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.14058",
    "kind": "arxiv",
    "version": 4
  }
}
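For offline checking, the canonicalization used by the shell one-liner above can be sketched as a standalone script. It serializes the record with sorted keys, compact separators, and `ensure_ascii=False`, then hashes the UTF-8 bytes with SHA-256. The `RECORD` literal is copied from the JSON on this page; the function name `canonical_sha256` is hypothetical. The printed digest should be compared against the canonical hash listed above; this sketch does not assert that it matches.

```python
import hashlib
import json

# Canonical record copied from the JSON shown on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "334f8ebbf897361337c756e401d8bad05d9afe3016185d1262c03aca70045cd6",
        "cross_cats_sorted": ["cs.CV"],
        "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "primary_cat": "cs.RO",
        "submitted_at": "2024-12-18T17:07:20Z",
        "title_canon_sha256": "68aff8c3e3373b6aae6aedffb5e6c987e6510dab548baddf2b11e15244a28776",
    },
    "schema_version": "1.0",
    "source": {"id": "2412.14058", "kind": "arxiv", "version": 4},
}


def canonical_sha256(record: dict) -> str:
    """Hash the record's sorted-key, whitespace-free JSON serialization."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()  # str.encode() defaults to UTF-8
    return hashlib.sha256(blob).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store the record's fields.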