Pith Number

pith:UMTFWY7F

pith:2023:UMTFWY7FF7LPB4AG4LC3PPN4NN
not attested · not anchored · not stored · refs resolved

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Bin Zhu, Jiaxi Cui, Li Yuan, Munan Ning, Peng Jin, Yang Ye

By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.

arxiv:2311.10122 v3 · 2023-11-16 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{UMTFWY7FF7LPB4AG4LC3PPN4NN}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.

C2 weakest assumption

due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers.

C3 one-line summary

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

References

87 extracted · 87 resolved · 28 Pith anchors


Formal links

1 machine-checked theorem link

Cited by

69 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:39:22.231807Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596

Aliases

arxiv: 2311.10122 · arxiv_version: 2311.10122v3 · doi: 10.48550/arxiv.2311.10122 · pith_short_12: UMTFWY7FF7LP · pith_short_16: UMTFWY7FF7LPB4AG · pith_short_8: UMTFWY7F
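The short aliases above are prefixes of the full Pith Number, and the full number itself appears to be the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. This is an observation drawn from the values on this page, not a documented rule, and the helper name `pith_number_from_hash` is hypothetical. A minimal sketch under that assumption:

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596"
PITH_NUMBER = "UMTFWY7FF7LPB4AG4LC3PPN4NN"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: Base32-encode the raw digest and keep a prefix."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


derived = pith_number_from_hash(CANONICAL_HASH)

# The pith_short_* aliases are taken here as prefixes of the derived number.
aliases = {n: derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)  # True
print(aliases[8], aliases[12], aliases[16])
```

If the assumption holds for other records too, a client could check that a Pith Number and its canonical hash agree without any network request.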
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "c80295a5762ecdeb6e65c5f49691842dea7b0fe27da82d7476d300d10c866324",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-16T10:59:44Z",
    "title_canon_sha256": "169b24f471d2208db1ce36173b5691902e0fd44518285d76760c7236864b0685"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.10122",
    "kind": "arxiv",
    "version": 3
  }
}
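The canonicalization used by the shell one-liner in the Agent API section (sorted keys, compact separators, UTF-8 bytes, SHA-256) can also be run offline against the record shown above. The sketch below is illustrative only: the function name `canonical_sha256` is hypothetical, and whether the digest equals the published canonical hash depends on the served record matching this JSON byte for byte after canonicalization, which is not verified here.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "c80295a5762ecdeb6e65c5f49691842dea7b0fe27da82d7476d300d10c866324",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-11-16T10:59:44Z",
        "title_canon_sha256": "169b24f471d2208db1ce36173b5691902e0fd44518285d76760c7236864b0685",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.10122", "kind": "arxiv", "version": 3},
}


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


digest = canonical_sha256(record)
print(digest)  # compare by hand with the "Canonical hash" value on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which fields appear in the source JSON, which is what lets independent mirrors recompute the same value.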