Pith Number

pith:S4RNNHS2

pith:2024:S4RNNHS2J66LO45JQHFDNRTL7V

not attested · not anchored · not stored · refs resolved

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Dacheng Li, Enze Xie, Haotian Tang, Hongxu Yin, Junyu Chen, Ligeng Zhu, Li Yi, Song Han, Yao Lu, Yecheng Wu, Yunhao Fang, Zhuoyang Zhang

VILA-U integrates visual understanding and generation using a single autoregressive next-token prediction framework.

arxiv:2409.04429 v3 · 2024-09-06 · cs.CV · cs.LG

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{S4RNNHS2J66LO45JQHFDNRTL7V}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

VILA-U employs a single autoregressive next-token prediction framework for both visual understanding and generation tasks, eliminating the need for additional components like diffusion models while achieving near state-of-the-art performance.

C2 · weakest assumption

That a unified vision tower can sufficiently align discrete visual tokens with textual inputs during pretraining and that autoregressive generation on a high-quality dataset can reach quality comparable to diffusion models without additional architectural components.

C3 · one-line summary

VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

References

29 extracted · 29 resolved · 13 Pith anchors

[1] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions · arXiv:2311.12793

[2] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality · 2023

[3] ImageNet: A Large-Scale Hierarchical Image Database · 2009

[4] Planting a SEED of Vision in Large Language Model

[5] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering · 2017

Formal links

3 machine-checked theorem links

Cited by

35 papers in Pith

UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Receipt and verification

First computed	2026-05-17T23:38:49.624534Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de

Aliases

arxiv: 2409.04429 · arxiv_version: 2409.04429v3 · doi: 10.48550/arxiv.2409.04429 · pith_short_12: S4RNNHS2J66L · pith_short_16: S4RNNHS2J66LO45J · pith_short_8: S4RNNHS2
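
The identifiers on this page appear to be related: Base32-encoding the canonical SHA-256 hash (RFC 4648 alphabet) and keeping the leading characters reproduces the 26-character Pith Number, and the `pith_short_*` aliases are prefixes of it. This relation is inferred from the values shown here, not from a published specification, and the helper name `pith_from_hash` is hypothetical. A minimal Python sketch under that assumption:

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier is derived. It is inferred
    from the values on this page, not taken from a specification.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


full = pith_from_hash(CANONICAL_HASH)
print(full)                                # S4RNNHS2J66LO45JQHFDNRTL7V

# The short aliases are prefixes of the full identifier.
for n in (8, 12, 16):
    print(f"pith_short_{n}:", pith_from_hash(CANONICAL_HASH, n))
```

If the scheme holds for other records, a mirror could cross-check an identifier against the canonical hash without contacting the resolver.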

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

# Fetch the record, extract canonical_record, re-serialize it as canonical JSON
# (sorted keys, compact separators, UTF-8), and print its SHA-256 digest.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/S4RNNHS2J66LO45JQHFDNRTL7V \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "8602611bef010053a6e1496c2659cf959a0e0989be198015128a4e05c92f1b2b",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-09-06T17:49:56Z",
    "title_canon_sha256": "8f09cf539f61129398f14c71a553e592ce0ea211a778d0bc648039e95e389af0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.04429",
    "kind": "arxiv",
    "version": 3
  }
}
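
The check under "Verify this Pith Number yourself" can also be run offline against the record printed above. The sketch below applies the same canonicalization as that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and compares the digest with the canonical hash listed on this page. The function name `canonical_sha256` is illustrative, and a match assumes the JSON shown here is identical to what the resolver serves.

```python
import hashlib
import json

# Canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "8602611bef010053a6e1496c2659cf959a0e0989be198015128a4e05c92f1b2b",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-09-06T17:49:56Z",
        "title_canon_sha256": "8f09cf539f61129398f14c71a553e592ce0ea211a778d0bc648039e95e389af0",
    },
    "schema_version": "1.0",
    "source": {"id": "2409.04429", "kind": "arxiv", "version": 3},
}

# Canonical hash as listed on this page.
EXPECTED = "9722d69e5a4fbcb773a981ca36c66bfd79ef74d71431dba298615260d16f17de"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


digest = canonical_sha256(RECORD)
print("computed:", digest)
print("matches listed hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the record's fields are written, which is what lets independent mirrors arrive at the same value.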