Pith Number

pith:PJOWPOFF

pith:2024:PJOWPOFFWLEMB7JYLTWCWOGEE5
not attested · not anchored · not stored · refs resolved

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Hao Li, Lichao Sun, Li Yuan, Peng Jin, Yibing Song, Ziang Wu

By training on structured four-stage annotations, LLaVA-CoT lets vision-language models reason autonomously and outperform larger models with only 100k samples.

arxiv:2411.10440 v6 · 2024-11-15 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{PJOWPOFFWLEMB7JYLTWCWOGEE5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
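The page does not spell out the merge algorithm, so the following is only an illustrative sketch of what "deterministic" means here: any mirror that holds the same set of events should arrive at the same state, regardless of the order in which the events arrived. The event fields (`at`, `field`, `value`) and the last-writer-wins rule are assumptions made for the example, and signature checking is left out.

```python
import hashlib
import json


def event_key(event):
    """Total order for events: timestamp first, then a content hash as tie-breaker.

    Assumes timestamps are ISO 8601 strings in one fixed format, so that
    lexicographic order equals chronological order.
    """
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return (event["at"], hashlib.sha256(payload).hexdigest())


def merge(record, events):
    """Fold events into a copy of the record in a fixed, arrival-independent order."""
    # Keying by (timestamp, content hash) also drops exact duplicate events.
    unique = {event_key(e): e for e in events}
    state = dict(record)
    for key in sorted(unique):
        e = unique[key]
        state[e["field"]] = e["value"]  # hypothetical last-writer-wins rule
    return state


# Hypothetical record and events, not taken from a real bundle.
base = {"author_claim": "open", "citations": "open"}
events = [
    {"at": "2026-05-18T00:00:00Z", "field": "author_claim", "value": "claimed"},
    {"at": "2026-05-17T23:40:00Z", "field": "author_claim", "value": "open"},
    {"at": "2026-05-19T00:00:00Z", "field": "citations", "value": "listed"},
]

state_a = merge(base, events)
state_b = merge(base, list(reversed(events)) + events)  # reordered, with duplicates
```

Both calls yield the same state because the fold order depends only on the events themselves, never on how a mirror happened to receive them.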

Claims

C1 strongest claim

With only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.

C2 weakest assumption

That the human-provided structured reasoning annotations in the LLaVA-CoT-100k dataset faithfully capture effective multistage reasoning without introducing systematic biases or annotation artifacts that the model simply memorizes.

C3 one line summary

LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.

References

68 extracted · 68 resolved · 3 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

39 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:48.018188Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

7a5d67b8a5b2c8c0fd385cec2b38c4275c7481a89b5dba265e12bb5c41fff2e1

Aliases

arxiv: 2411.10440 · arxiv_version: 2411.10440v6 · doi: 10.48550/arxiv.2411.10440 · pith_short_12: PJOWPOFFWLEM · pith_short_16: PJOWPOFFWLEMB7JY · pith_short_8: PJOWPOFF
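The identifiers on this page appear to be related: the full Pith Number matches the first 26 characters of the Base32 encoding of the canonical hash, and each `pith_short_N` alias is its first N characters. This is inferred from the values shown here, not from a published specification, so treat the sketch below as an assumption. The function names are hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "7a5d67b8a5b2c8c0fd385cec2b38c4275c7481a89b5dba265e12bb5c41fff2e1"


def pith_number_from_hash(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this truncation rule is inferred from this one record.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number, sizes=(8, 12, 16)):
    """Derive the short aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in sizes}


pith_number = pith_number_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith_number)
print(pith_number)  # PJOWPOFFWLEMB7JYLTWCWOGEE5
print(aliases)
```

If the rule holds in general, a short alias can be checked against a full Pith Number with a prefix comparison. Confirming that an identifier belongs to a record still requires recomputing the canonical hash.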
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/PJOWPOFFWLEMB7JYLTWCWOGEE5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 7a5d67b8a5b2c8c0fd385cec2b38c4275c7481a89b5dba265e12bb5c41fff2e1
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "24b193d28ef5af944ab35cb2be4e90913f09b547ee2d6b7a86d57d3933323322",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-11-15T18:58:31Z",
    "title_canon_sha256": "bc7d3a69bb86e42ea12f690bae4d1046c5a3e7378c8f824482aa58f70d6e11b9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2411.10440",
    "kind": "arxiv",
    "version": 6
  }
}
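
The verification one-liner above can also be run offline against the record as printed here. The sketch below repeats the same canonicalization (sorted keys, compact separators, UTF-8) and hashes the result. The helper names are hypothetical, and it assumes the JSON shown on this page is byte-for-byte the record the API returns.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_TEXT = """
{
  "metadata": {
    "abstract_canon_sha256": "24b193d28ef5af944ab35cb2be4e90913f09b547ee2d6b7a86d57d3933323322",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-11-15T18:58:31Z",
    "title_canon_sha256": "bc7d3a69bb86e42ea12f690bae4d1046c5a3e7378c8f824482aa58f70d6e11b9"
  },
  "schema_version": "1.0",
  "source": {"id": "2411.10440", "kind": "arxiv", "version": 6}
}
"""


def canonical_bytes(record):
    """Serialize with sorted keys and no whitespace, as in the one-liner above."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_TEXT)
digest = canonical_hash(record)
print(digest)  # compare with the "Canonical hash" value listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the key order or indentation of the JSON that was fetched, only on its content.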