Pith Number

pith:4OXKXQ4P

pith:2024:4OXKXQ4PBCCPHOCVZK2PKPZTA2
not attested · not anchored · not stored · refs resolved

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

David Fan, Jiachen Zhu, Koustuv Sinha, Michael Rabbat, Saining Xie, Shengbang Tong, Xinlei Chen, Yann LeCun, Yunyang Xiong, Zhuang Liu

Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.

arxiv:2412.14164 v1 · 2024-12-18 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{4OXKXQ4PBCCPHOCVZK2PKPZTA2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Visual generation ability emerges as a natural byproduct of improved visual understanding and can be unlocked efficiently with a small amount of generation data.

C2 · weakest assumption

That the curated instruction-following multimodal datasets are sufficient to reveal general emergence of generation from understanding and that results will transfer beyond the specific models and data mixtures tested.

C3 · one-line summary

Visual-Predictive Instruction Tuning (VPiT) enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens; understanding data improves both capabilities more effectively than generation-specific data.

References

282 extracted · 282 resolved · 32 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

22 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:14.674381Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

e3aeabc38f0884f3b855cab4f53f330696ad3beecbe15677a1497d065c6c83d6

Aliases

arxiv: 2412.14164 · arxiv_version: 2412.14164v1 · doi: 10.48550/arxiv.2412.14164 · pith_short_12: 4OXKXQ4PBCCP · pith_short_16: 4OXKXQ4PBCCPHOCV · pith_short_8: 4OXKXQ4P
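
The identifiers on this page appear to be related to the canonical hash: Base32-encoding the SHA-256 digest and keeping the first 26 characters reproduces the full Pith Number, and the `pith_short_*` aliases are its 8-, 12-, and 16-character prefixes. This relationship is inferred from the values shown here, not from a stated specification, so the sketch below is only a consistency check. The helper name `pith_from_hash` and the default length of 26 are assumptions.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "e3aeabc38f0884f3b855cab4f53f330696ad3beecbe15677a1497d065c6c83d6"
PITH_NUMBER = "4OXKXQ4PBCCPHOCVZK2PKPZTA2"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: RFC 4648 Base32 of the digest bytes, truncated.

    The truncation length is an assumption inferred from the identifiers
    on this page, not a documented rule.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_from_hash(CANONICAL_HASH)
print(derived)                 # 4OXKXQ4PBCCPHOCVZK2PKPZTA2
print(derived == PITH_NUMBER)  # True

# The short aliases are plain prefixes of the derived identifier.
short_aliases = {f"pith_short_{n}": derived[:n] for n in (8, 12, 16)}
print(short_aliases["pith_short_8"])  # 4OXKXQ4P
```

A mismatch between the derived string and the listed identifier would suggest that either the hash or the identifier was copied incorrectly.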

Agent API

Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e3aeabc38f0884f3b855cab4f53f330696ad3beecbe15677a1497d065c6c83d6

Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "76dfbffcf154571b952855eb877a4fa7e54ec04dfe3914dcbd04f4a201adbe57",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-12-18T18:58:50Z",
    "title_canon_sha256": "20c025291f77e7cde39c6acc6ed7ebea9dd3abbcaa26f3b99b95e26f8161a30c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.14164",
    "kind": "arxiv",
    "version": 1
  }
}
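
The same check can be run offline against a saved copy of the record. This is a minimal Python sketch, assuming the canonicalization is exactly what the verification command on this page applies: sorted keys, compact separators, no ASCII escaping, UTF-8 bytes. The helper name `canonical_sha256` is hypothetical. Compare the printed digest with the canonical hash listed on this page; the sketch itself makes no claim about the result.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "76dfbffcf154571b952855eb877a4fa7e54ec04dfe3914dcbd04f4a201adbe57",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-12-18T18:58:50Z",
    "title_canon_sha256": "20c025291f77e7cde39c6acc6ed7ebea9dd3abbcaa26f3b99b95e26f8161a30c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.14164",
    "kind": "arxiv",
    "version": 1
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """Hypothetical helper mirroring the page's one-liner canonicalization."""
    canon = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
```

Because `sort_keys=True` normalizes key order at every nesting level, two copies of the record that differ only in key order or whitespace produce the same digest.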