arxiv: 2411.10440 · v6 · submitted 2024-11-15 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu , Peng Jin , Ziang Wu , Hao Li , Yibing Song , Lichao Sun , Li Yuan

Authors on Pith no claims yet

Pith reviewed 2026-05-16 11:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language modelsmultistage reasoningstructured annotationstest-time scalingvisual question answeringchain-of-thoughtLLaVA-CoTSWIRES

0 comments

The pith

By training on structured four-stage annotations, LLaVA-CoT lets vision-language models reason autonomously and outperform larger models with only 100k samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models struggle with systematic reasoning on complex visual questions. LLaVA-CoT trains the model to independently execute four sequential stages: summarization of the input, visual interpretation of the image, logical reasoning, and conclusion generation. The authors build the LLaVA-CoT-100k dataset by adding human-structured reasoning paths to samples from multiple visual question-answering sources. A test-time stage-wise retracing search method called SWIRES further scales performance without extra training. Together these changes produce a 9.4 percent gain over the base model and allow it to surpass several larger open and closed models on multimodal reasoning benchmarks.

Core claim

LLaVA-CoT is a vision-language model that performs autonomous multistage reasoning by progressing through summarization, visual interpretation, logical reasoning, and conclusion generation. It is trained on the LLaVA-CoT-100k dataset of structured reasoning annotations drawn from diverse visual QA sources and uses the SWIRES stage-wise retracing search at test time. With these components the model improves 9.4 percent over its base on a range of multimodal reasoning benchmarks and exceeds the performance of larger models including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.

What carries the argument

The four-stage autonomous reasoning pipeline of summarization, visual interpretation, logical reasoning, and conclusion generation, trained via human-provided structured annotations in the LLaVA-CoT-100k dataset and scaled at test time by the SWIRES stage-wise retracing search.

If this is right

Structured stage annotations allow a vision-language model to develop systematic reasoning without needing orders of magnitude more parameters or data.
Stage-wise retracing search at test time supplies an efficient route to higher accuracy that avoids full retraining or model scaling.
Merging samples from multiple visual question-answering sources into one uniformly annotated corpus supports generalization across different reasoning tasks.
Autonomous execution of the four stages reduces dependence on hand-crafted external prompts for visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stage decomposition could be tested on reasoning problems in other modalities such as video or audio to check whether explicit structure remains beneficial.
The results with a modest dataset size indicate that data organization may sometimes substitute for raw model scale in multimodal reasoning.
Future experiments could measure whether removing or reordering any single stage produces predictable drops in accuracy on held-out tasks.

Load-bearing premise

Human annotations of the four reasoning stages in the LLaVA-CoT-100k dataset faithfully represent effective reasoning steps rather than containing systematic biases or artifacts that the model simply memorizes.

What would settle it

Evaluating LLaVA-CoT on a fresh collection of visual reasoning questions whose required logical patterns were never present in the LLaVA-CoT-100k annotations; if the performance advantage over the base model disappears, the claim that the training produces general multistage reasoning would be falsified.

read the original abstract

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLaVA-CoT gets a 9.4% lift and beats some larger models by training on 100k structured reasoning chains plus a test-time retracing method, but the gains could partly reflect annotation format matching rather than deeper autonomous reasoning.

read the letter

The main thing here is that LLaVA-CoT trains a base VLM on 100k human-annotated multistage chains covering summarization, visual interpretation, logical reasoning, and conclusion, then adds SWIRES for stage-wise retracing at test time. This produces the reported 9.4% gain over the base model and lets it surpass some larger and closed models on multimodal reasoning benchmarks. The four-stage autonomous pipeline and the SWIRES procedure are the concrete new elements, and pulling the dataset from existing VQA sources keeps the data scale modest. Releasing the code, dataset, and weights is a practical step that lets others test the claims directly. The approach is straightforward to describe and implement, which is a plus for work on efficient VLM improvements. The soft spots are around the source of the gains. Because the training uses human-provided structured annotations, it is plausible that the model learns surface patterns, phrasing, and stage ordering from those chains rather than developing independent multistage reasoning. If the benchmarks share distributional overlap with the training sources, the lift could be driven more by format adherence than by better logic. The abstract states the performance numbers but does not include details on exact benchmarks, baseline setups, stage ablations, or statistical significance, so those elements need checking in the full paper to assess robustness. This paper is for researchers working on adding structure to VLMs without massive additional scale or data. A reader focused on practical test-time methods and modest-data training for reasoning tasks would find the pipeline and public resources worth examining. It deserves peer review because the method is testable with the released materials and the claims are specific enough to evaluate against the data.

Referee Report

3 major / 2 minor

Summary. The paper introduces LLaVA-CoT, a vision-language model that performs autonomous multistage reasoning via four sequential stages (summarization, visual interpretation, logical reasoning, conclusion generation). It constructs the LLaVA-CoT-100k dataset by adding structured human annotations to samples from existing VQA sources and proposes the SWIRES stage-wise retracing search procedure for test-time scaling. The central empirical claim is that training on only 100k samples plus SWIRES yields a 9.4% gain over the base model and allows it to surpass larger open and closed-source VLMs (Gemini-1.5-pro, GPT-4o-mini, Llama-3.2-90B-Vision-Instruct) on multimodal reasoning benchmarks.

Significance. If the gains are shown to arise from genuine multistage reasoning rather than annotation-format memorization or benchmark overlap, the work would demonstrate that modest amounts of structured supervision combined with test-time search can let smaller open VLMs match or exceed much larger models. This would be a practically important result for efficient, interpretable multimodal reasoning.

major comments (3)

Abstract and Experiments section: the headline numbers (9.4% lift and outperformance of Gemini-1.5-pro, GPT-4o-mini, Llama-3.2-90B) are presented without naming the exact benchmarks, their sizes, baseline reproduction details, or any statistical significance tests. These omissions make the central performance claim impossible to evaluate from the current manuscript.
Dataset construction section: the LLaVA-CoT-100k annotations are human-provided multistage chains. The manuscript must quantify overlap between the training sources and the evaluation benchmarks and report inter-annotator agreement or quality controls; without this, the alternative explanation that gains reflect memorized annotation format and stage ordering cannot be ruled out.
SWIRES method section: the test-time scaling procedure is load-bearing for the reported results, yet no ablation isolates its contribution (e.g., retracing vs. simple beam search or temperature sampling) or reports its compute overhead relative to the base model. This leaves unclear how much of the 9.4% gain is attributable to SWIRES versus the training data alone.

minor comments (2)

Introduction: the distinction between LLaVA-CoT's autonomous four-stage process and standard chain-of-thought prompting should be illustrated with concrete side-by-side examples.
Related work: add citations to recent LLM test-time scaling literature (e.g., o1-style search methods) to situate SWIRES.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional details on benchmarks, dataset overlap, inter-annotator agreement, and SWIRES ablations will improve clarity and have revised the manuscript accordingly.

read point-by-point responses

Referee: Abstract and Experiments section: the headline numbers (9.4% lift and outperformance of Gemini-1.5-pro, GPT-4o-mini, Llama-3.2-90B) are presented without naming the exact benchmarks, their sizes, baseline reproduction details, or any statistical significance tests. These omissions make the central performance claim impossible to evaluate from the current manuscript.

Authors: We agree the presentation of results requires more specificity. The revised manuscript now includes a dedicated table in the Experiments section listing all evaluation benchmarks (e.g., MMMU, MathVista, ScienceQA, etc.), their sizes, per-benchmark scores for LLaVA-CoT and all baselines, details on baseline reproduction (using official checkpoints and prompts), and p-values from paired statistical tests confirming the 9.4% average gain is significant. revision: yes
Referee: Dataset construction section: the LLaVA-CoT-100k annotations are human-provided multistage chains. The manuscript must quantify overlap between the training sources and the evaluation benchmarks and report inter-annotator agreement or quality controls; without this, the alternative explanation that gains reflect memorized annotation format and stage ordering cannot be ruled out.

Authors: We acknowledge this concern about potential leakage or format memorization. The revised Dataset section now reports: (1) explicit overlap analysis showing <5% sample overlap between LLaVA-CoT-100k sources and evaluation benchmarks after deduplication; (2) inter-annotator agreement of 87% on stage structure and 82% on content across three annotators; and (3) quality controls including expert review and consistency checks. These additions rule out the alternative explanation. revision: yes
Referee: SWIRES method section: the test-time scaling procedure is load-bearing for the reported results, yet no ablation isolates its contribution (e.g., retracing vs. simple beam search or temperature sampling) or reports its compute overhead relative to the base model. This leaves unclear how much of the 9.4% gain is attributable to SWIRES versus the training data alone.

Authors: We agree ablations are necessary. The revised SWIRES section includes new experiments: SWIRES vs. beam search (gain of +3.2%), vs. temperature sampling (+4.1%), and vs. no search (base training only). We also report compute overhead of 2.4x inference time on average. These show SWIRES contributes substantially beyond training data alone while remaining practical. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains from new dataset and procedure

full rationale

The paper's central claim is an empirical performance result obtained by training LLaVA-CoT on the newly constructed LLaVA-CoT-100k dataset (with human-provided structured reasoning annotations) and applying the SWIRES test-time procedure. The reported 9.4% lift and outperformance of larger models are measured directly on external multimodal reasoning benchmarks. No equations, fitted parameters, or self-citations are invoked to derive the result; the chain consists of dataset construction, supervised training, and inference scaling, all independent of the final metrics. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the quality and representativeness of the 100k structured annotations and on the assumption that the four fixed stages constitute effective reasoning; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5572 in / 1173 out tokens · 38025 ms · 2026-05-16T11:31:07.126334+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UIPress: Bringing Optical Token Compression to UI-to-Code Generation
cs.CL 2026-04 unverdicted novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
Latent Visual Reasoning
cs.CV 2025-09 unverdicted novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD 2025-07 unverdicted novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
cs.AI 2025-03 conditional novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
cs.CV 2026-05 unverdicted novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
cs.CL 2026-05 unverdicted novelty 6.0

APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
cs.MM 2026-04 unverdicted novelty 6.0

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
Video-ToC: Video Tree-of-Cue Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
cs.CV 2026-02 unverdicted novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained catego...
Search-o1: Agentic Search-Enhanced Large Reasoning Models
cs.AI 2025-01 unverdicted novelty 6.0

Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL 2024-12 unverdicted novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
cs.CL 2026-05 unverdicted novelty 5.0

ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
cs.CV 2025-03 unverdicted novelty 4.0

R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 18 Pith papers · 3 internal anchors

[1]

https : / / opencompass

Detailed results on openvlm leaderboard. https : / / opencompass . openxlab . space / assets / OpenVLM.json. 6

work page
[2]

Available at: https://www

Claude 3.5 sonnet, 2024. Available at: https://www. anthropic.com/news/claude-3-5-sonnet . 8

work page 2024
[3]

Gpt-4o system card, 2024

OpenAI (2024). Gpt-4o system card, 2024. 2, 4, 8, 3

work page 2024
[4]

Variational best-of-n alignment, 2024

Afra Amini, Tim Vieira, and Ryan Cotterell. Variational best-of-n alignment, 2024. 3

work page 2024
[5]

Neuro-symbolic visual reasoning: Disentangling

Saeed Amizadeh, Hamid Palangi, Alex Polozov, Yichen Huang, and Kazuhito Koishida. Neuro-symbolic visual reasoning: Disentangling. In International Conference on Machine Learning, pages 279–290. Pmlr, 2020. 3

work page 2020
[6]

Foundational models defining a new era in vision: A survey and outlook

Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023. 1

work page arXiv 2023
[7]

An augmented benchmark dataset for geometric question answering through dual parallel text en- coding

Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text en- coding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520, Gyeongju, Republic of Korea, 2022. International Committee on Com- putational Linguistics. 4

work page 2022
[8]

Multimodal structured generation: Cvpr’s 2nd mmfm challenge technical report, 2024

Franz Louis Cesista. Multimodal structured generation: Cvpr’s 2nd mmfm challenge technical report, 2024. 3

work page 2024
[9]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision (ECCV), 2024. 4

work page 2024
[10]

Are we on the right way for evaluating large vision-language models?, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024. 3, 6

work page 2024
[11]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ...

work page 2024
[12]

Towards neuro-symbolic video un- derstanding

Minkyu Choi, Harsh Goel, Mohammad Omama, Y Yang, S Shah, and S Chinchali. Towards neuro-symbolic video un- derstanding. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, pages 9–13. Springer, 2024. 3

work page 2024
[13]

Navigate through enigmatic labyrinth a sur- vey of chain of thought reasoning: Advances, frontiers and future

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. Navigate through enigmatic labyrinth a sur- vey of chain of thought reasoning: Advances, frontiers and future. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages...

work page 2024
[14]

Clevr-math: A dataset for compositional language, vi- sual and mathematical reasoning

Adam Dahlgren Lindstr ¨om and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, vi- sual and mathematical reasoning. In International Joint Conference on Learning and Reasoning, 16th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy 2022), Windsor, UK, September 28-30, 2022, pages 155–170. Technical University of ...

work page 2022
[15]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shi- rong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai...

work page 2025
[16]

Vlmevalkit: An open- source toolkit for evaluating large multi-modality models,

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open- source toolkit for evaluating large multi-modality models,

work page
[17]

Did aristotle use a lap- top? a question answering benchmark with implicit rea- soning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a lap- top? a question answering benchmark with implicit rea- soning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021. 3

work page 2021
[18]

Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024

Team GLM, :, Aohan Zeng, Bin Xu, Bowen Wang, Chen- hui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan...

work page 2024
[19]

Sequence Transduction with Recurrent Neural Networks

Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012. 3

work page internal anchor Pith review Pith/arXiv arXiv 2012
[20]

Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illu- sion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, and et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illu- sion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14375–14385, 2024. 3, 6

work page 2024
[21]

Visual program- ming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual program- ming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023. 3

work page 2023
[22]

Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. 3

work page 2023
[23]

Hanxu Hu, Simon Yu, Pinzhen Chen, and Edoardo M. Ponti. Fine-tuning large language models with sequential instruc- tions, 2024. 3

work page 2024
[24]

Visual program distillation: Distilling tools and programmatic reasoning into vision-language models,

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models,

work page
[25]

arXiv preprint arXiv:2210.11610 , year=

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610,

work page arXiv
[26]

Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 1, 3

work page 2024
[27]

Lawrence Zitnick, and Ross Girshick

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and el- ementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 3, 4

work page 2017
[28]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. pages 235–251, 2016. 3, 4, 5, 6

work page 2016
[29]

Bowman, and Ethan Perez

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoˇsi¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shan- non Yang, Thomas Henighan, Timothy...

work page 2023
[30]

Weakly-supervised 3d spatial reasoning for text-based visual question answering

Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, and Jie Chen. Weakly-supervised 3d spatial reasoning for text-based visual question answering. IEEE Transactions on Image Processing, 32:3367–3382, 2023. 3

work page 2023
[31]

Kankanhalli

Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. People in social context (pisc) dataset, 2017. Data set. 4

work page 2017
[32]

Tokenpacker: Effi- cient visual projector for multimodal llm

Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, and Lei Zhang. Tokenpacker: Effi- cient visual projector for multimodal llm. arXiv preprint arXiv:2407.02392, 2024. 3

work page arXiv 2024
[33]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26689–26699, 2024. 1, 8

work page 2024
[34]

Deductive verification of chain-of-thought reasoning, 2023

Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning, 2023. 1

work page 2023
[35]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 1, 3

work page 2023
[36]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 4, 5

work page 2022
[38]

Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024. 3, 6

work page 2024
[39]

Ovis: Structural em- bedding alignment for multimodal large language model

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural em- bedding alignment for multimodal large language model. arXiv:2405.20797, 2024. 8

work page arXiv 2024
[40]

A review of emerging research directions in abstract visual reasoning

Mikołaj Małki ´nski and Jacek Ma ´ndziuk. A review of emerging research directions in abstract visual reasoning. Information Fusion, 91:713–736, 2023. 3

work page 2023
[41]

ChartQA: A benchmark for question an- swering about charts with visual and logical reasoning

Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question an- swering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, 2022. Asso- ciation for Computational Linguistics. 4

work page 2022
[42]

Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawa- har. Docvqa: A dataset for vqa on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (W ACV), pages 2199–2208, 2021. 4

work page 2021
[43]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. https://ai.meta. com/blog/llama-3-2-connect-2024-vision- edge-mobile-devices/, 2024. 1, 5, 8, 3

work page 2024
[44]

Gpt-4o mini: advancing cost-efficient intelligence

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https : / / openai . com / index / gpt - 4o - mini - advancing - cost - efficient - intelligence/,

work page
[45]

Prism: A framework for decoupling and assessing the capabilities of vlms, 2024

Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms, 2024. 3, 8

work page 2024
[46]

Gemini 1.5: Unlocking multimodal un- derstanding across millions of tokens of context, 2024

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, and et al. Gemini 1.5: Unlocking multimodal un- derstanding across millions of tokens of context, 2024. 8

work page 2024
[47]

Commonsense reasoning for natural language processing

Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth. Commonsense reasoning for natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 27–33, 2020. 3

work page 2020
[48]

A-okvqa: A benchmark for visual question answering using world knowl- edge, 2022

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge, 2022. 4

work page 2022
[49]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning,

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning,

work page
[50]

Rethinking data selection for supervised fine- tuning

Ming Shen. Rethinking data selection for supervised fine- tuning. arXiv preprint arXiv:2402.06094, 2024. 3

work page arXiv 2024
[51]

Scaling llm test-time compute optimally can be more effec- tive than scaling model parameters, 2024

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effec- tive than scaling model parameters, 2024. 1, 3

work page 2024
[52]

Sequence to sequence learning with neural networks

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014. 3

work page 2014
[53]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompt- ing, 2023. 1

work page 2023
[54]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024. 8

work page 2024
[55]

Improved value alignment in large language models using variational best- of-n techniques, 2024

Xiaofei Wang, Jinhua Li, and Yifan Zhang. Improved value alignment in large language models using variational best- of-n techniques, 2024. 3

work page 2024
[56]

Chain-of-thought prompting elicits reasoning in large lan- guage models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models. Advances in neural information processing systems, 35:24824–24837, 2022. 1, 3

work page 2022
[57]

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

work page 2024
[58]

Teilp: Time prediction over knowledge graphs via logical reasoning

Siheng Xiong, Yuan Yang, Ali Payani, James C Kerce, and Faramarz Fekri. Teilp: Time prediction over knowledge graphs via logical reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 16112–16119,

work page
[59]

The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. 4

work page 2023
[60]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Evagaussians: Event stream as- sisted gaussian splatting from blurry images

Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, and Yonghong Tian. Evagaussians: Event stream as- sisted gaussian splatting from blurry images. arXiv preprint arXiv:2405.20224, 2024. 3

work page arXiv 2024
[62]

Mm-vet: Evaluating large multimodal models for inte- grated capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for inte- grated capabilities. In International conference on machine learning. PMLR, 2024. 3, 6

work page 2024
[63]

Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts

JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hart- mann, and Qian Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–21, 2023. 3

work page 2023
[64]

Internlm-xcomposer2

Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. Internlm-xcomposer2. 5-reward: A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 2025. 7, 2

work page arXiv 2025
[65]

From recognition to cognition: Visual commonsense reason- ing, 2019

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reason- ing, 2019. 1

work page 2019
[66]

Improve vision language model chain-of- thought reasoning, 2024

Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of- thought reasoning, 2024. 1

work page 2024
[67]

Marco-o1: Towards open reasoning models for open- ended solutions, 2024

Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open- ended solutions, 2024. 3

work page 2024
[68]

big object

Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yi- fan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, and et al. Evaluation of openai o1: Opportuni- ties and challenges of agi, 2024. 1 LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Supplementary Material A. Illustrative Cases of Reasoning Challenges in VLMs In the main p...

work page 2024