Pith Number

pith:ZOHLDRDH

pith:2023:ZOHLDRDHWC72JFJ6655JF6NPE5
not attested · not anchored · not stored · refs resolved

Language Is Not All You Need: Aligning Perception with Language Models

Barun Patra, Furu Wei, Johan Bjorck, Kriti Aggarwal, Lei Cui, Li Dong, Owais Khan Mohammed, Qiang Liu, Saksham Singhal, Shaohan Huang, Shuming Ma, Subhojit Som, Tengchao Lv, Vishrav Chaudhary, Wenhui Wang, Xia Song, Yaru Hao, Zewen Chi

Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.

arxiv:2302.14045 v2 · 2023-02-27 · cs.CL · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ZOHLDRDHWC72JFJ6655JF6NPE5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
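To illustrate why a deterministic merge lets every mirror arrive at the same state, here is a minimal Python sketch. It is not Pith's actual algorithm: the event fields (`seq`, `field`, `value`), the last-writer-wins rule, and the function names are hypothetical, chosen only to show the idea of ordering events by a content-derived key before applying them.

```python
import hashlib
import json
from itertools import permutations


def event_key(event):
    """Content-derived identifier: SHA-256 of the event's canonical JSON."""
    canon = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canon).hexdigest()


def merge(record, events):
    """Fold events into the record in an order that ignores arrival order.

    Hypothetical rule: sort by (seq, content hash), then last writer wins.
    """
    state = dict(record)
    unique = {event_key(e): e for e in events}  # duplicates collapse here
    for key in sorted(unique, key=lambda k: (unique[k]["seq"], k)):
        event = unique[key]
        state[event["field"]] = event["value"]
    return state


# Hypothetical record and events, for illustration only.
record = {"id": "2302.14045", "citations": "open"}
events = [
    {"seq": 1, "field": "citations", "value": "open"},
    {"seq": 2, "field": "citations", "value": "resolved"},
    {"seq": 1, "field": "replications", "value": "open"},
]

# Every arrival order produces the same merged state.
states = {
    json.dumps(merge(record, list(order)), sort_keys=True)
    for order in permutations(events)
}
print(len(states) == 1)
```

Because the ordering depends only on the events' own content, two mirrors that hold the same set of events compute the same state regardless of the order in which they received them.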

Claims

C1 · strongest claim

Without gradient updates or finetuning, Kosmos-1 is reported to perform well on language understanding and generation, OCR-free NLP (document images fed directly to the model), perception-language tasks (multimodal dialogue, image captioning, visual question answering), and vision tasks such as image recognition with descriptions.

C2 · weakest assumption

That web-scale multimodal corpora provide sufficient aligned signal for the model to acquire general cross-modal capabilities that transfer to held-out tasks without any task-specific adaptation.

C3 · one-line summary

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

References

33 extracted · 33 resolved · 13 Pith anchors

[1] Cm3: A causal masked multimodal model of the internet
[2] Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects · arXiv:1602.00753
[3] Language models are few-shot learners 2020
[4] BoolQ: Exploring the surprising difficulty of natural yes/no questions 2019
[5] PaLM: Scaling Language Modeling with Pathways · arXiv:2204.02311

Formal links

3 machine-checked theorem links

Cited by

30 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.584005Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3

Aliases

arxiv: 2302.14045 · arxiv_version: 2302.14045v2 · doi: 10.48550/arxiv.2302.14045 · pith_short_12: ZOHLDRDHWC72 · pith_short_16: ZOHLDRDHWC72JFJ6 · pith_short_8: ZOHLDRDH
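The values on this page suggest how the identifiers relate to each other: the 26-character Pith Number matches the first 26 characters of the Base32 encoding of the canonical hash, and each `pith_short_N` alias is an N-character prefix of the Pith Number. This relationship is inferred from the values shown here, not taken from Pith documentation, and the helper name `pith_from_hash` is hypothetical. A minimal Python sketch:

```python
import base64

CANONICAL_HASH = "cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3"
PITH_NUMBER = "ZOHLDRDHWC72JFJ6655JF6NPE5"


def pith_from_hash(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the first `length` chars.

    Assumption: standard RFC 4648 Base32 alphabet, truncated to 26 chars.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full == PITH_NUMBER)  # True for the values on this page
print(aliases)
```

If this relationship holds in general, a client could check that a Pith Number and a canonical hash belong together without contacting the server.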
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7993a0954152dcd545e045cc24f137dd82711ecda0cf6977f227365be35946f8",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-02-27T18:55:27Z",
    "title_canon_sha256": "24cf17c4ee445514c2840626d8355b45ecf0094c3de6a6b61d7d413ce01507b1"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2302.14045",
    "kind": "arxiv",
    "version": 2
  }
}
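The shell pipeline above can also be reproduced offline. The following Python sketch applies the same canonicalization (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to the record printed on this page and hashes it with SHA-256. The function names `canonical_bytes` and `canonical_hash` are hypothetical. Whether the digest equals the published hash depends on this page's JSON being exactly the record the API returns, so the script reports the comparison rather than assuming it.

```python
import hashlib
import json

RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "7993a0954152dcd545e045cc24f137dd82711ecda0cf6977f227365be35946f8",
    "cross_cats_sorted": ["cs.CV"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-02-27T18:55:27Z",
    "title_canon_sha256": "24cf17c4ee445514c2840626d8355b45ecf0094c3de6a6b61d7d413ce01507b1"
  },
  "schema_version": "1.0",
  "source": {"id": "2302.14045", "kind": "arxiv", "version": 2}
}
"""
EXPECTED = "cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3"


def canonical_bytes(record):
    """Serialize with sorted keys and no whitespace, as in the shell one-liner."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def canonical_hash(record):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_hash(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the serialization independent of key order and whitespace in the input, which is what allows two parties to compute the same hash from the same logical record.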