Pith Number

pith:ZOHLDRDH

pith:2023:ZOHLDRDHWC72JFJ6655JF6NPE5

not attested · not anchored · not stored · refs resolved

Language Is Not All You Need: Aligning Perception with Language Models

Barun Patra, Furu Wei, Johan Bjorck, Kriti Aggarwal, Lei Cui, Li Dong, Owais Khan Mohammed, Qiang Liu, Saksham Singhal, Shaohan Huang, Shuming Ma, Subhojit Som, Tengchao Lv, Vishrav Chaudhary, Wenhui Wang, Xia Song, Yaru Hao, Zewen Chi

Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.

arxiv:2302.14045 v2 · 2023-02-27 · cs.CL · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{ZOHLDRDHWC72JFJ6655JF6NPE5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
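
The idea of a deterministic merge can be illustrated with a minimal Python sketch. The page does not describe the actual algorithm or the event schema, so everything here is an assumption: the event fields `id`, `field`, and `value` are hypothetical, and the rule (deduplicate by id, apply in sorted id order, last write wins) is just one simple way to make the result independent of arrival order.

```python
from typing import Any


def merge_events(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state dict, independent of arrival order.

    Hypothetical schema: each event has a unique "id", a "field", and a "value".
    Events sharing an id are assumed to be identical copies.
    """
    # Deduplicate by event id so repeated deliveries do not matter.
    unique = {event["id"]: event for event in events}
    state: dict[str, Any] = {}
    # Apply events in one fixed total order (sorted by id) so that every
    # mirror computes the same state; the last write to a field wins.
    for event_id in sorted(unique):
        event = unique[event_id]
        state[event["field"]] = event["value"]
    return state


# Illustrative events only; these are not taken from any real bundle.
events = [
    {"id": "0002", "field": "status_b", "value": "pending"},
    {"id": "0001", "field": "status_a", "value": "open"},
    {"id": "0003", "field": "status_b", "value": "done"},
]

merged = merge_events(events)
# Same events, reversed and with one duplicate delivery.
merged_again = merge_events(list(reversed(events)) + events[:1])
print(merged == merged_again)
```

Because the fold order depends only on the event ids, two mirrors that hold the same set of events reach the same state regardless of the order in which they received them.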

Claims

C1 · strongest claim

Kosmos-1 achieves impressive performance on language understanding, generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and vision tasks such as image recognition with descriptions, all without gradient updates or finetuning.

C2 · weakest assumption

That web-scale multimodal corpora provide sufficient aligned signal for the model to acquire general cross-modal capabilities that transfer to held-out tasks without any task-specific adaptation.

C3 · one line summary

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

References

33 extracted · 33 resolved · 13 Pith anchors

[1] Cm3: A causal masked multimodal model of the internet

[2] Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects · arXiv:1602.00753

[3] Language models are few-shot learners · 2020

[4] BoolQ: Exploring the surprising difficulty of natural yes/no questions · 2019

[5] PaLM: Scaling Language Modeling with Pathways · arXiv:2204.02311

Formal links

3 machine-checked theorem links

Cited by

30 papers in Pith

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

Receipt and verification

First computed	2026-05-17T23:38:50.584005Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3

Aliases

arxiv: 2302.14045 · arxiv_version: 2302.14045v2 · doi: 10.48550/arxiv.2302.14045 · pith_short_12: ZOHLDRDHWC72 · pith_short_16: ZOHLDRDHWC72JFJ6 · pith_short_8: ZOHLDRDH
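
The relationship between the canonical hash, the Pith Number, and the short aliases can be checked with a small Python sketch. It rests on an observation about this record rather than on a documented specification: the 26-character Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the SHA-256 canonical hash, and each `pith_short_N` alias is the first N characters of the Pith Number. The helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3"
PITH_NUMBER = "ZOHLDRDHWC72JFJ6655JF6NPE5"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Return the first `length` characters of the base32 form of a hex digest."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Assumption being checked: Pith Number == base32(hash) truncated to 26 chars.
derived = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))
print(derived == PITH_NUMBER)  # True for this record

# The short aliases are plain prefixes of the Pith Number.
short_aliases = {n: PITH_NUMBER[:n] for n in (8, 12, 16)}
for n, alias in short_aliases.items():
    print(f"pith_short_{n}: {alias}")
```

Whether the same derivation holds for other records is not stated on this page, so treat it as a consistency check for this entry only.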

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "7993a0954152dcd545e045cc24f137dd82711ecda0cf6977f227365be35946f8",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-02-27T18:55:27Z",
    "title_canon_sha256": "24cf17c4ee445514c2840626d8355b45ecf0094c3de6a6b61d7d413ce01507b1"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2302.14045",
    "kind": "arxiv",
    "version": 2
  }
}
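
The canonicalization step of the verification command can also be written as a standalone Python sketch that works offline on the record shown above. It mirrors the `json.dumps` settings from the one-liner (sorted keys, compact separators, `ensure_ascii=False`); the function name `canonical_sha256` is hypothetical. The sketch does not assert that the digest equals the expected canonical hash, since that depends on the record matching the served `canonical_record` exactly.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record after serializing it in a canonical JSON form."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "7993a0954152dcd545e045cc24f137dd82711ecda0cf6977f227365be35946f8",
        "cross_cats_sorted": ["cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-02-27T18:55:27Z",
        "title_canon_sha256": "24cf17c4ee445514c2840626d8355b45ecf0094c3de6a6b61d7d413ce01507b1",
    },
    "schema_version": "1.0",
    "source": {"id": "2302.14045", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
expected = "cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3"
print(digest)
print("matches expected:", digest == expected)
```

Sorting keys and fixing the separators is what makes the hash reproducible: two parties that hold the same record but serialize it with different key order or whitespace still arrive at the same bytes before hashing.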