Pith Number

pith:ROYX62NC

pith:2022:ROYX62NCB72JQHDVNJK6MR3MCN

not attested · not anchored · not stored · refs resolved

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Adam Grycner, AJ Piergiovanni, Alexander Kolesnikov, Andreas Steiner, Anelia Angelova, Ashish Thapliyal, Basil Mustafa, Burcu Karagol Ayan, Carlos Riquelme, Chao Jia, Daniel Salz, Gaurav Mishra, Hassan Akbari, James Bradbury, Joan Puigcerver, Keran Rong, Linting Xue, Lucas Beyer, Mojtaba Seyedhosseini, Nan Ding, Neil Houlsby, Piotr Padlewski, Radu Soricut, Sebastian Goodman, Soravit Changpinyo, Weicheng Kuo, Xiaohua Zhai, Xiao Wang, Xi Chen

PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks.

arxiv:2209.06794 v4 · 2022-09-14 · cs.CV · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{ROYX62NCB72JQHDVNJK6MR3MCN}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
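The order-independence described above can be illustrated with a minimal sketch. The event shape (`id`, `at`, `field`, `value`), the last-writer-wins rule, and the function name `merge_state` are all assumptions made for illustration. The actual Pith bundle schema and merge algorithm are not specified on this page, and signature checking is left out entirely.

```python
from typing import Any


def merge_state(record: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state that does not depend on the order they arrived in.

    Hypothetical event shape (an assumption, not the Pith schema):
    {"id": str, "at": UTC ISO-8601 timestamp, "field": str, "value": Any}
    """
    fields: dict[str, Any] = {}
    seen: set[str] = set()
    # Sort on (timestamp, id) so every mirror applies events in the same order.
    # Same-format UTC ISO-8601 strings sort chronologically as plain strings.
    for event in sorted(events, key=lambda e: (e["at"], e["id"])):
        if event["id"] in seen:  # the same event fetched from two mirrors
            continue
        seen.add(event["id"])
        fields[event["field"]] = event["value"]  # assumed rule: last writer wins
    return {"record": record, "fields": fields}


# Made-up sample events, used only to exercise the function.
events = [
    {"id": "e1", "at": "2026-05-17T23:38:48Z", "field": "status", "value": "a"},
    {"id": "e2", "at": "2026-05-18T09:00:00Z", "field": "status", "value": "b"},
]
forward = merge_state({"id": "2209.06794"}, events)
shuffled = merge_state({"id": "2209.06794"}, list(reversed(events)) + events[:1])
print(forward == shuffled)
```

Sorting on a total order before folding is what lets two mirrors that received the events in different orders, or with duplicates, arrive at the same state.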

Claims

C1 strongest claim

PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

C2 weakest assumption

That joint scaling of the vision and language components on the new 10B multilingual dataset will produce the claimed performance gains without major issues from data quality, language imbalance, or overfitting.

C3 one-line summary

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

References

185 extracted · 185 resolved · 12 Pith anchors

[1] TallyQA: Answering complex counting questions 2019

[2] nocaps: Novel object captioning at scale 2019

[3] CrossVQA: Scalably generating benchmarks for systematically testing VQA generalization 2021

[5] On the cross-lingual transferability of monolingual representations 2020

[6] ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models 2019

Formal links

1 machine-checked theorem link

Cited by

34 papers in Pith

Gemini: A Family of Highly Capable Multimodal Models

Character-Centered Dialogue Generation from Scene-Level Prompts

Grounded Reinforcement Learning for Visual Reasoning

Controlla: Learning Controllability via Graph-Constrained Latent Geometry

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Receipt and verification

First computed	2026-05-17T23:38:48.354221Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a

Aliases

arxiv: 2209.06794 · arxiv_version: 2209.06794v4 · doi: 10.48550/arxiv.2209.06794 · pith_short_12: ROYX62NCB72J · pith_short_16: ROYX62NCB72JQHDV · pith_short_8: ROYX62NC
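The short aliases above appear to be plain prefixes of the 26-character identifier, and for this record the identifier itself matches the first 26 characters of the Base32 (RFC 4648) encoding of the canonical hash. The sketch below checks both observations using only values from this page. The derivation rule is inferred from this one record rather than taken from a Pith specification, and the function name `pith_id_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a"
PITH_ID = "ROYX62NCB72JQHDVNJK6MR3MCN"
SHORT_ALIASES = {8: "ROYX62NC", 12: "ROYX62NCB72J", 16: "ROYX62NCB72JQHDV"}


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex SHA-256 digest and keep the first `length` characters.

    Assumption: this rule is inferred from a single record, not a documented one.
    """
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")[:length]


derived = pith_id_from_hash(CANONICAL_HASH)
print(derived == PITH_ID)
for n, alias in SHORT_ALIASES.items():
    # Each short alias should be the first n characters of the full identifier.
    print(n, PITH_ID[:n] == alias)
```

If the rule holds in general, a short alias identifies a record only as long as no other record shares the same prefix, which is presumably why three lengths are listed.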

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/ROYX62NCB72JQHDVNJK6MR3MCN \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "2a50651767b6289fbf279711ac7379d502692af9f7b0932b728ccd5beb6987f9",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2022-09-14T17:24:07Z",
    "title_canon_sha256": "08a218ce080d71719c57e13038c8be63fb06bfe224ac17e3c8cb281586c15081"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2209.06794",
    "kind": "arxiv",
    "version": 4
  }
}
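
The shell pipeline under "Verify this Pith Number yourself" can also be written as a small Python function, which makes the canonicalization step easier to see: keys are sorted, separators carry no whitespace, and non-ASCII characters are kept as UTF-8. This is a sketch that mirrors the one-liner. The function name `canonical_sha256` is hypothetical, and whether the record as printed here is byte-identical to what the resolver serves is an assumption.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "2a50651767b6289fbf279711ac7379d502692af9f7b0932b728ccd5beb6987f9",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2022-09-14T17:24:07Z",
        "title_canon_sha256": "08a218ce080d71719c57e13038c8be63fb06bfe224ac17e3c8cb281586c15081",
    },
    "schema_version": "1.0",
    "source": {"id": "2209.06794", "kind": "arxiv", "version": 4},
}

# Compare the printed digest with the value under "Canonical hash".
print(canonical_sha256(record))
```

Because of `sort_keys=True`, the order in which keys appear in the input does not affect the digest, so a record fetched from any mirror should hash the same way as long as its contents are unchanged.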