pith. machine review for the scientific record.
Pith Number

pith:LJHUW6AY

pith:2023:LJHUW6AYODUNYUY3UBFIV6JLS7
not attested · not anchored · not stored · refs resolved

An Embodied Generalist Agent in 3D World

Baoxiong Jia, Jiangyong Huang, Puhao Li, Qing Li, Silong Yong, Siyuan Huang, Song-Chun Zhu, Xiaojian Ma, Xiongkun Linghu, Yan Wang

LEO is trained as a 3D embodied generalist agent in two stages: 3D vision-language alignment, followed by vision-language-action instruction tuning, each on large-scale datasets.

arxiv:2311.12871 v3 · 2023-11-18 · cs.CV · cs.AI · cs.CL · cs.LG

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open · sign in to claim
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation.

C2 · weakest assumption

The central claim assumes that the collected large-scale 3D VL and VLA datasets plus the two-stage training procedure are sufficient to produce generalist performance that transfers beyond the specific benchmarks shown.

C3 · one-line summary

LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on large-scale object- and scene-level datasets.

References

27 extracted · 27 resolved · 10 Pith anchors

[1] A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity 2023 · arXiv:2302.04023
[2] RT-1: Robotics Transformer for Real-World Control at Scale 2022 · arXiv:2212.06817
[3] Scaling Instruction-Finetuned Language Models 2022 · arXiv:2210.11416
[4] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model 2023 · arXiv:2304.15010
[5] Scaling Laws for Neural Language Models 2020 · arXiv:2001.08361

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.838991Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

5a4f4b781870e8dc531ba04a8af92b97ce9163b929a16e03aaf471e3739edb0f

Aliases

arxiv: 2311.12871 · arxiv_version: 2311.12871v3 · doi: 10.48550/arxiv.2311.12871 · pith_short_12: LJHUW6AYODUN · pith_short_16: LJHUW6AYODUNYUY3 · pith_short_8: LJHUW6AY
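The identifiers on this page appear to be related: the 26-character Pith Number matches the start of the RFC 4648 base32 encoding of the canonical hash, and each pith_short_N alias is the first N characters of the Pith Number. The Python sketch below checks both observations for this record only. Whether Pith derives its numbers this way in general is an assumption, and the function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "5a4f4b781870e8dc531ba04a8af92b97ce9163b929a16e03aaf471e3739edb0f"
PITH_NUMBER = "LJHUW6AYODUNYUY3UBFIV6JLS7"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest (RFC 4648 alphabet) and keep a prefix.

    Assumption: the 26-character length is inferred from this record,
    not taken from a published Pith specification.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number: str, sizes=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the Pith Number."""
    return {f"pith_short_{n}": pith_number[:n] for n in sizes}


if __name__ == "__main__":
    print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
    for name, value in short_aliases(PITH_NUMBER).items():
        print(name, value)
```

If the relation holds for other records too, a short alias can be expanded or checked offline from the canonical hash alone, without a lookup against the service.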
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/LJHUW6AYODUNYUY3UBFIV6JLS7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 5a4f4b781870e8dc531ba04a8af92b97ce9163b929a16e03aaf471e3739edb0f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4f47ee0a5b4dedd7b27c6fe1061559319ed77d33695b15255171ac542dd21944",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-18T01:21:38Z",
    "title_canon_sha256": "d3f6a0a01ad1f88f36aa4d2acac57a9346d7c35c4738df676395e28791d80bc8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.12871",
    "kind": "arxiv",
    "version": 3
  }
}
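The verification one-liner above can also be run offline against the record shown here. The sketch below mirrors its serialization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and hashes the result with SHA-256. It assumes the JSON printed on this page is the complete canonical record; if the service returns additional fields, the digest will differ from the published hash. The names RECORD, canonical_bytes, and canonical_sha256 are illustrative.

```python
import hashlib
import json

# Canonical record as printed on this page (assumed complete).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "4f47ee0a5b4dedd7b27c6fe1061559319ed77d33695b15255171ac542dd21944",
        "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-11-18T01:21:38Z",
        "title_canon_sha256": "d3f6a0a01ad1f88f36aa4d2acac57a9346d7c35c4738df676395e28791d80bc8",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.12871", "kind": "arxiv", "version": 3},
}

EXPECTED = "5a4f4b781870e8dc531ba04a8af92b97ce9163b929a16e03aaf471e3739edb0f"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with the same json.dumps options as the one-liner."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(RECORD)
    print(digest)
    print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the byte string independent of key order and whitespace in the input, which is what allows two parties to compute the same digest from the same record.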