Pith Number

pith:IWBYRN2A

pith:2023:IWBYRN2ASLXKP6H7V375PJ5X4V
not attested · not anchored · not stored · refs resolved

Vision-Language Foundation Models as Effective Robot Imitators

Chilam Cheang, Cunjun Yu, Hanbo Zhang, Hang Li, Hongtao Wu, Huaping Liu, Jie Xu, Minghuan Liu, Tao Kong, Weinan Zhang, Xinghang Li, Ya Jing

Simple fine-tuning adapts pre-trained vision-language models into robot policies that beat prior methods.

arxiv:2311.01378 v3 · 2023-11-02 · cs.RO · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{IWBYRN2ASLXKP6H7V375PJ5X4V}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
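
The page does not describe the merge algorithm itself, so the following is only a generic sketch of what "deterministic merge" implies: any mirror that holds the same canonical record and the same set of events arrives at the same state, regardless of the order in which the events were received. The `Event` fields, the `merge` function, and the last-writer-wins rule are all hypothetical, and signature checking is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All field names are hypothetical; the real bundle schema is not shown on this page.
    event_id: str   # unique identifier of the event
    timestamp: str  # ISO 8601 UTC string, so lexicographic order is chronological
    field: str      # which part of the record state the event updates
    value: str      # new value for that field


def merge(canonical: dict, events: list) -> dict:
    """Fold events into the canonical record in a fixed total order.

    Sorting by (timestamp, event_id) makes the result independent of the
    order in which a mirror happened to receive the events.
    """
    state = dict(canonical)
    unique = {e.event_id: e for e in events}  # ignore duplicate deliveries
    for e in sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id)):
        state[e.field] = e.value  # assumed rule: the later event wins
    return state


canonical = {"source": "arxiv:2311.01378", "author_claim": "open"}
events = [
    Event("e2", "2026-05-18T09:00:00Z", "author_claim", "claimed"),
    Event("e1", "2026-05-18T08:00:00Z", "citations", "open"),
]

state_a = merge(canonical, events)
state_b = merge(canonical, list(reversed(events)) + [events[0]])
print(state_a == state_b)
```

Both calls fold the same two events, one of them delivered twice and in a different order, and produce the same dictionary.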

Claims

C1 strongest claim

By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control.

C2 weakest assumption

That modest fine-tuning on existing language-conditioned manipulation datasets is sufficient to transfer the general vision-language understanding of pre-trained VLMs into reliable sequential robot policies without catastrophic forgetting or domain shift.

C3 one line summary

RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.

References

25 extracted · 25 resolved · 13 Pith anchors

[1] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances · arXiv:2204.01691
[2] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models · arXiv:2308.01390
[3] GPT-NeoX-20B: An Open-Source Autoregressive Language Model · arXiv:2204.06745
[4] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control · arXiv:2307.15818
[5] Language Models are Few-Shot Learners · arXiv:2005.14165

Formal links

2 machine-checked theorem links

Cited by

33 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:46.479380Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

458388b74092eea7f8ffaeffd7a7b7e55266a2aeb2963d1e0f1e15ceefe3d1fc

Aliases

arxiv: 2311.01378 · arxiv_version: 2311.01378v3 · doi: 10.48550/arxiv.2311.01378 · pith_short_12: IWBYRN2ASLXK · pith_short_16: IWBYRN2ASLXKP6H7 · pith_short_8: IWBYRN2A
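
The identifiers on this page appear to be related in a simple way: the `pith_short_*` aliases are prefixes of the full Pith Number, and the Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. This is an observation drawn from the values listed here, not a documented rule of the `pith-number/v1.0` schema, and the function names below are hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "458388b74092eea7f8ffaeffd7a7b7e55266a2aeb2963d1e0f1e15ceefe3d1fc"
PITH_NUMBER = "IWBYRN2ASLXKP6H7V375PJ5X4V"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: Base32-encode the raw digest bytes, keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Build the 8, 12 and 16 character aliases by truncating the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)  # True for the values on this page
print(short_aliases(derived))
```

If the relationship holds in general, a client could check that a short alias and a canonical hash refer to the same record without a network call.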
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/IWBYRN2ASLXKP6H7V375PJ5X4V \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 458388b74092eea7f8ffaeffd7a7b7e55266a2aeb2963d1e0f1e15ceefe3d1fc
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "acf5b854e35077856f99fbbcc551f23d4efec5c2d3f5deccf242275df7a9dba8",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.RO",
    "submitted_at": "2023-11-02T16:34:33Z",
    "title_canon_sha256": "a46623a0809364e13041acd5187a238e16b787a3e7101eff9db09d46cb95d7f9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.01378",
    "kind": "arxiv",
    "version": 3
  }
}
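
An offline variant of the verification command above can be sketched as follows. It applies the same canonicalization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the record shown on this page instead of fetching it. It assumes the JSON printed here is exactly what the service hashes; the function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Canonical record copied from this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "acf5b854e35077856f99fbbcc551f23d4efec5c2d3f5deccf242275df7a9dba8",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "primary_cat": "cs.RO",
        "submitted_at": "2023-11-02T16:34:33Z",
        "title_canon_sha256": "a46623a0809364e13041acd5187a238e16b787a3e7101eff9db09d46cb95d7f9",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.01378", "kind": "arxiv", "version": 3},
}


def canonical_sha256(record: dict) -> str:
    """Serialize deterministically, then hash, as the page's one-liner does."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(CANONICAL_RECORD)
print(digest)  # compare with the Canonical hash listed on this page
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a JSON parser or a mirror happens to store them.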