Pith Number

pith:UMTFWY7F

pith:2023:UMTFWY7FF7LPB4AG4LC3PPN4NN

not attested · not anchored · not stored · refs resolved

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Bin Zhu, Jiaxi Cui, Li Yuan, Munan Ning, Peng Jin, Yang Ye

By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.

arxiv:2311.10122 v3 · 2023-11-16 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{UMTFWY7FF7LPB4AG4LC3PPN4NN}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.

C2 weakest assumption

due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers.

C3 one-line summary

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

References

87 extracted · 87 resolved · 28 Pith anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model 2022

[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on 2021

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot lear 2020

[6] David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human langu 2011

[8] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 wi 2023

Formal links

1 machine-checked theorem link

Cited by

69 papers in Pith

AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

Gemini: A Family of Highly Capable Multimodal Models

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Open-Sora Plan: Open-Source Large Video Generation Model

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

Receipt and verification

First computed	2026-05-17T23:39:22.231807Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596

Aliases

arxiv: 2311.10122 · arxiv_version: 2311.10122v3 · doi: 10.48550/arxiv.2311.10122 · pith_short_12: UMTFWY7FF7LP · pith_short_16: UMTFWY7FF7LPB4AG · pith_short_8: UMTFWY7F
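
The three `pith_short_*` aliases listed here are prefixes of the full 26-character identifier. The record does not document how the identifier itself is derived. The values on this page are consistent with one reading: the identifier is the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The sketch below checks both relationships. The function names `short_alias` and `id_from_hash` are hypothetical, and the derivation is an inference from this single record, not a documented rule.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596"
FULL_ID = "UMTFWY7FF7LPB4AG4LC3PPN4NN"


def short_alias(full_id: str, length: int) -> str:
    """Return the leading `length` characters of the full identifier."""
    if not 0 < length <= len(full_id):
        raise ValueError("length must be between 1 and len(full_id)")
    return full_id[:length]


def id_from_hash(hex_digest: str, length: int = 26) -> str:
    """ASSUMED derivation: Base32-encode the digest bytes and truncate.

    Inferred from the values on this page; not confirmed by the schema.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


if __name__ == "__main__":
    for n in (8, 12, 16):
        print(f"pith_short_{n}: {short_alias(FULL_ID, n)}")
    print("derived id:", id_from_hash(CANONICAL_HASH))
```

If the assumed derivation holds for other records too, a short alias can be checked against a canonical hash without contacting the resolver.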

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "c80295a5762ecdeb6e65c5f49691842dea7b0fe27da82d7476d300d10c866324",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-16T10:59:44Z",
    "title_canon_sha256": "169b24f471d2208db1ce36173b5691902e0fd44518285d76760c7236864b0685"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.10122",
    "kind": "arxiv",
    "version": 3
  }
}
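
The hashing step of the verification command can also be run offline against the canonical record shown above. The sketch below mirrors the serialization in that command (`sort_keys=True`, compact separators, `ensure_ascii=False`, UTF-8 bytes) and compares the digest with the canonical hash. It assumes the record printed on this page is identical to what the resolver returns under `.canonical_record`; the function names `canonical_bytes` and `canonical_sha256` are hypothetical.

```python
import hashlib
import json

# Canonical record copied from this page (assumed identical to the served one).
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "c80295a5762ecdeb6e65c5f49691842dea7b0fe27da82d7476d300d10c866324",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-11-16T10:59:44Z",
        "title_canon_sha256": "169b24f471d2208db1ce36173b5691902e0fd44518285d76760c7236864b0685",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.10122", "kind": "arxiv", "version": 3},
}

# Canonical hash as listed on this page.
EXPECTED_HASH = "a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the shell command."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(CANONICAL_RECORD)
    print(digest)
    print("match" if digest == EXPECTED_HASH else "mismatch")
```

Sorting keys and fixing the separators makes the byte string independent of key order and whitespace in the input, which is what allows a mirror to recompute the same hash.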