Pith Number

pith:S4RNNHS2

pith:2024:S4RNNHS2J66LO45JQHFDNRTL7V
not attested · not anchored · not stored · refs resolved

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Dacheng Li, Enze Xie, Haotian Tang, Hongxu Yin, Junyu Chen, Ligeng Zhu, Li Yi, Song Han, Yao Lu, Yecheng Wu, Yunhao Fang, Zhuoyang Zhang

VILA-U integrates visual understanding and generation using a single autoregressive next-token prediction framework.

arxiv:2409.04429 v3 · 2024-09-06 · cs.CV · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{S4RNNHS2J66LO45JQHFDNRTL7V}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle · live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
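The property described above, that any mirror recomputes the same state from the same signed events, can be illustrated with a minimal sketch. Pith's actual merge algorithm and event schema are not specified on this page; the `Event` fields, the sort key, and the later-event-wins rule below are all assumptions chosen for illustration. The point is only that deduplicating events and folding them in a canonical order makes the result independent of arrival order.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle schema is not shown on this page.
    event_id: str    # unique id, for example a content hash
    timestamp: str   # ISO 8601 UTC string, which sorts lexicographically
    key: str         # which part of the record the event updates
    value: str


def merge(*bundles):
    """Fold events into a state dict, independent of input order."""
    # Deduplicate by event_id so overlapping mirrors do not double-apply.
    unique = {e.event_id: e for bundle in bundles for e in bundle}
    # Canonical total order: timestamp first, event_id as tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state = {}
    for e in ordered:
        state[e.key] = e.value  # assumed rule: the later event wins
    return state


events = [
    Event("a1", "2026-05-17T23:38:49Z", "citations", "open"),
    Event("b2", "2026-05-18T08:00:00Z", "author_claim", "open"),
    Event("c3", "2026-05-19T12:30:00Z", "citations", "closed"),
]
shuffled = events[:]
random.shuffle(shuffled)

# Two mirrors holding overlapping events in different orders agree.
state_a = merge(events[:2], events[1:])
state_b = merge(shuffled)
print(state_a == state_b)
```

Sorting on a total order (here timestamp plus id) is what removes any dependence on how the events were received; a real implementation would also verify each event's signature before folding it in.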

Claims

C1 strongest claim

VILA-U employs a single autoregressive next-token prediction framework for both visual understanding and generation tasks, eliminating the need for additional components like diffusion models while achieving near state-of-the-art performance.

C2 weakest assumption

The paper assumes that a unified vision tower can adequately align discrete visual tokens with textual inputs during pretraining, and that autoregressive generation trained on a high-quality dataset can reach quality comparable to diffusion models without additional architectural components.

C3 one line summary

VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
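The claims above describe one next-token objective shared by text and visual tokens. As a rough illustration only (this is not VILA-U's code, and the vocabulary sizes, token ids, and id-offset scheme are hypothetical), the sketch below places discrete visual tokens in the same id space as text tokens and builds the shifted context/target pairs that a next-token predictor trains on.

```python
TEXT_VOCAB = 32000      # hypothetical text vocabulary size
IMAGE_CODEBOOK = 16384  # hypothetical number of discrete visual codes


def to_unified(text_ids, image_codes):
    """Concatenate text ids and offset image codes into one id sequence."""
    for c in image_codes:
        if not 0 <= c < IMAGE_CODEBOOK:
            raise ValueError(f"image code out of range: {c}")
    # Visual codes are shifted past the text vocabulary so ids never collide.
    return list(text_ids) + [TEXT_VOCAB + c for c in image_codes]


def next_token_pairs(sequence):
    """Return (context, target) pairs: predict token t from the tokens before it."""
    return [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]


prompt = [101, 2057, 7]   # made-up text token ids
image = [5, 16000, 42]    # made-up visual code indices
seq = to_unified(prompt, image)
pairs = next_token_pairs(seq)
print(len(seq), len(pairs))
print(pairs[-1])
```

Because text and image positions yield the same kind of training pair, a single model and a single loss cover both modalities, which is the sense in which no separate generative component is required.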

References

29 extracted · 29 resolved · 13 Pith anchors

[1] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions · arXiv:2311.12793
[2] Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality · 2023
[3] ImageNet: A large-scale hierarchical image database · 2009
[4] Planting a seed of vision in large language model
[5] Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering · 2017

Formal links

3 machine-checked theorem links

Cited by

35 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:49.624534Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de

Aliases

arxiv: 2409.04429 · arxiv_version: 2409.04429v3 · doi: 10.48550/arxiv.2409.04429 · pith_short_12: S4RNNHS2J66L · pith_short_16: S4RNNHS2J66LO45J · pith_short_8: S4RNNHS2
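The short aliases above are prefixes of the full 26-character Pith Number. For this record, the full number also matches the first 26 characters of the Base32 encoding (RFC 4648 alphabet) of the canonical SHA-256 hash. The sketch below checks both relationships; treating them as a general rule for every Pith Number is an assumption based on this one record, and the helper name `pith_from_hash` is hypothetical.

```python
import base64

CANONICAL_HASH = "9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de"
PITH_NUMBER = "S4RNNHS2J66LO45JQHFDNRTL7V"


def pith_from_hash(hex_digest, length=26):
    """Base32-encode a hex digest and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


# Short aliases as listed on this page, keyed by their length.
aliases = {8: "S4RNNHS2", 12: "S4RNNHS2J66L", 16: "S4RNNHS2J66LO45J"}

derived = pith_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)
print(all(PITH_NUMBER[:n] == alias for n, alias in aliases.items()))
```

A shorter alias is easier to quote but collides more readily, so the full number or the canonical hash remains the safer thing to compare when verifying a record.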
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/S4RNNHS2J66LO45JQHFDNRTL7V \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "8602611bef010053a6e1496c2659cf959a0e0989be198015128a4e05c92f1b2b",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-09-06T17:49:56Z",
    "title_canon_sha256": "8f09cf539f61129398f14c71a553e592ce0ea211a778d0bc648039e95e389af0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.04429",
    "kind": "arxiv",
    "version": 3
  }
}
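The canonicalisation used by the verification command above (sorted keys, compact separators, no ASCII escaping, UTF-8 encoding, then SHA-256) can be reproduced offline against the record as printed here. This is a sketch that mirrors the one-liner; whether the digest equals the published canonical hash depends on this page's JSON matching the served record exactly, so the comparison is printed rather than assumed. The function name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# The canonical record exactly as shown on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "8602611bef010053a6e1496c2659cf959a0e0989be198015128a4e05c92f1b2b",
    "cross_cats_sorted": ["cs.LG"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-09-06T17:49:56Z",
    "title_canon_sha256": "8f09cf539f61129398f14c71a553e592ce0ea211a778d0bc648039e95e389af0"
  },
  "schema_version": "1.0",
  "source": {"id": "2409.04429", "kind": "arxiv", "version": 3}
}
"""

PUBLISHED_HASH = "9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de"


def canonical_sha256(record):
    """Serialise with sorted keys and compact separators, then hash the UTF-8 bytes."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == PUBLISHED_HASH)

# Key order and whitespace in the input do not affect the digest.
reordered = {k: record[k] for k in reversed(list(record))}
print(canonical_sha256(reordered) == digest)
```

Sorting keys and fixing the separators is what makes the hash reproducible: two parties that parse the same record get byte-identical serialisations regardless of how the JSON was originally formatted.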