pith:PJOWPOFF
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
By training on structured four-stage annotations, LLaVA-CoT lets vision-language models reason autonomously and outperform larger models with only 100k samples.
arxiv:2411.10440 v6 · 2024-11-15 · cs.CV
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{PJOWPOFFWLEMB7JYLTWCWOGEE5}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
With only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses larger and even closed-source models such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
This claim assumes that the human-provided structured reasoning annotations in the LLaVA-CoT-100k dataset faithfully capture effective multistage reasoning, without introducing systematic biases or annotation artifacts that the model simply memorizes.
LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.
Receipt and verification
| First computed | 2026-05-17T23:38:48.018188Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
7a5d67b8a5b2c8c0fd385cec2b38c4275c7481a89b5dba265e12bb5c41fff2e1
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/PJOWPOFFWLEMB7JYLTWCWOGEE5 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 7a5d67b8a5b2c8c0fd385cec2b38c4275c7481a89b5dba265e12bb5c41fff2e1
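The hashing step in the one-liner above can also be written as a small standalone function. This is a minimal sketch, assuming the canonical form is exactly what the one-liner produces: keys sorted, compact separators, non-ASCII characters left unescaped, UTF-8 encoded. The names `canonical_sha256` and `verify` are illustrative and not part of any Pith tooling, and the sketch does no network access; it expects the `canonical_record` object to be already loaded as a Python dict.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Return the SHA-256 hex digest of a record's canonical JSON form.

    Mirrors the one-liner: sorted keys, compact separators,
    non-ASCII characters kept as-is, UTF-8 bytes.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify(record: dict, expected_hex: str) -> bool:
    """Compare the computed digest with the published canonical hash."""
    return canonical_sha256(record) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical toy record, used only to show that key order does not matter.
    a = {"source": {"id": "example", "version": 1}, "schema_version": "1.0"}
    b = {"schema_version": "1.0", "source": {"version": 1, "id": "example"}}
    print(canonical_sha256(a) == canonical_sha256(b))
```

Because the keys are sorted before hashing, two records with the same content but different key order yield the same digest, which is what makes the published hash reproducible from the JSON alone.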
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "24b193d28ef5af944ab35cb2be4e90913f09b547ee2d6b7a86d57d3933323322",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-11-15T18:58:31Z",
    "title_canon_sha256": "bc7d3a69bb86e42ea12f690bae4d1046c5a3e7378c8f824482aa58f70d6e11b9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2411.10440",
    "kind": "arxiv",
    "version": 6
  }
}