Pith Number

pith:2KOQ3SMA

pith:2024:2KOQ3SMADRV4XAYY3DLSLG76FH
not attested · not anchored · not stored · refs resolved

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Alexander Toshev, Ankur Jain, Anton Belyi, Aonan Zhang, Bowen Zhang, Brandon McKinzie, Chong Wang, Dhruti Shah, Doug Kang, Floris Weers, Futang Peng, Guoli Yin, Haotian Zhang, Hongyu Hè, Jean-Philippe Fauconnier, Jianyu Wang, Karanjeet Singh, Mark Lee, Max Schwarzer, Nan Du, Peter Grasch, Philipp Dufter, Ruoming Pang, Sam Dodge, Sam Wiseman, Tao Lei, Tom Gunter, Xiang Kong, Xianzhi Du, Yinfei Yang, Zhe Gan, Zirui Wang

A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.

arxiv:2403.09611 v4 · 2024-03-14 · cs.CV · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{2KOQ3SMADRV4XAYY3DLSLG76FH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
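
To illustrate why a deterministic merge lets any mirror reach the same state, here is a minimal sketch. The page does not describe Pith's actual merge algorithm or bundle format, so everything here is an assumption: the `Event` shape, its field names, and the "sort, then last writer wins" rule are hypothetical, and the sample events are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle format is not specified here.
    event_id: str   # unique identifier, also used as a tie-breaker
    timestamp: str  # ISO 8601 UTC string in one fixed format, so text order is time order
    key: str        # which part of the record state the event sets
    value: str


def merge(events):
    """Fold events into a state dict in one canonical order.

    De-duplicating by event_id and sorting by (timestamp, event_id) makes the
    result independent of the order in which a mirror received the events.
    """
    unique = {e.event_id: e for e in events}
    state = {}
    for e in sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id)):
        state[e.key] = e.value  # last writer in canonical order wins
    return state


# Made-up events, purely for illustration.
events = [
    Event("e1", "2026-05-17T23:38:49Z", "citations", "open"),
    Event("e2", "2026-05-18T09:00:00Z", "citations", "closed"),
    Event("e3", "2026-05-18T09:00:00Z", "replications", "open"),
]

state_a = merge(events)
# A second mirror sees the same events reversed, plus one duplicate.
state_b = merge(list(reversed(events)) + events[:1])
print(state_a == state_b)
```

The design point is that the ordering comes from the events themselves rather than from arrival order, so two mirrors holding the same set of events cannot disagree.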

Claims

C1 strongest claim

For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks.

C2 weakest assumption

That the ablations performed are comprehensive enough to isolate the true importance of data composition and image encoder choices without confounding effects from untested interactions or hyperparameter choices.

C3 one-line summary

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

References

137 extracted · 137 resolved · 47 Pith anchors


Formal links

1 machine-checked theorem link

Cited by

26 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:49.147551Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

d29d0dc9801c6bcb8318d8d7259bfe29e407418722613cd139c0b9faa3e3b0fc

Aliases

arxiv: 2403.09611 · arxiv_version: 2403.09611v4 · doi: 10.48550/arxiv.2403.09611 · pith_short_12: 2KOQ3SMADRV4 · pith_short_16: 2KOQ3SMADRV4XAYY · pith_short_8: 2KOQ3SMA
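
The identifiers on this page appear to be related in a simple way, which the sketch below checks. For this record, the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each `pith_short_N` alias is the first N characters of the Pith Number. This is an observation from the values shown here, not a documented rule; treating it as the general derivation is an assumption.

```python
import base64

# Values copied from this record.
canonical_hash = "d29d0dc9801c6bcb8318d8d7259bfe29e407418722613cd139c0b9faa3e3b0fc"
full_id = "2KOQ3SMADRV4XAYY3DLSLG76FH"

# Base32-encode the raw hash bytes and keep as many characters as the ID has.
b32 = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
derived = b32[: len(full_id)]

# Short aliases as plain prefixes of the full identifier (assumed rule).
aliases = {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}

print(derived == full_id)  # True for this record
print(aliases)
```

If the relationship holds in general, a short alias can be checked against a known canonical hash without contacting the service.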
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: d29d0dc9801c6bcb8318d8d7259bfe29e407418722613cd139c0b9faa3e3b0fc
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "923a9976303f3c648273dba6d0d92803fad89135dc3e2e95942bff3913bb9ceb",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-03-14T17:51:32Z",
    "title_canon_sha256": "98612f0506b0805073aeaaeaf93f8af49f3f2ccba777087e6dd48a1edd8d0f0a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2403.09611",
    "kind": "arxiv",
    "version": 4
  }
}
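
The Python one-liner in the verification command above can be unpacked into a short script. This sketch applies the same canonicalization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the record shown on this page. The function name `canonical_sha256` is introduced here for illustration. The digest it prints should be compared with the Canonical hash above; whether it matches depends on the served record being identical to the copy shown here.

```python
import hashlib
import json


def canonical_sha256(record):
    """Serialize a record the way the one-liner does, then hash it with SHA-256."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "923a9976303f3c648273dba6d0d92803fad89135dc3e2e95942bff3913bb9ceb",
        "cross_cats_sorted": ["cs.CL", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-03-14T17:51:32Z",
        "title_canon_sha256": "98612f0506b0805073aeaaeaf93f8af49f3f2ccba777087e6dd48a1edd8d0f0a",
    },
    "schema_version": "1.0",
    "source": {"id": "2403.09611", "kind": "arxiv", "version": 4},
}

digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, reordering the fields in the JSON does not change the digest; changing any value does.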