hub Canonical reference

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun · 2024 · cs.CV · arXiv 2411.10440

Canonical reference. 90% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 baseline 1

citation-polarity summary

background 9 baseline 1

representative citing papers

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

cs.CL · 2026-04-10 · unverdicted · novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Latent Visual Reasoning

cs.CV · 2025-09-29 · unverdicted · novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

VGR: Visual Grounded Reasoning

cs.CV · 2025-06-13 · unverdicted · novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

Toward Generalizable Forgery Detection and Reasoning

cs.CV · 2025-03-27 · unverdicted · novelty 7.0

FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

cs.AI · 2025-03-17 · conditional · novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

cs.AI · 2025-03-07 · conditional · novelty 7.0

RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

CAVE is a GRPO-based process-reward method that improves VLMs on fragmented visual reasoning by crediting intermediate actions via belief update, evidence acquisition, and adaptive focus, shown on TRACER-Bench and public benchmarks.

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

cs.MM · 2026-04-25 · unverdicted · novelty 6.0

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

Video-ToC: Video Tree-of-Cue Reasoning

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

cs.CV · 2026-02-07 · unverdicted · novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

cs.CV · 2025-09-27 · unverdicted · novelty 6.0

DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

cs.CV · 2025-09-12 · unverdicted · novelty 6.0

LaV-CoT introduces a multi-stage visual CoT pipeline and GRPO training with language-consistency rewards, delivering up to 9.5% accuracy gains on multilingual VQA benchmarks over similar-sized open models.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Listener-Rewarded Thinking in VLMs for Image Preferences

cs.CV · 2025-06-28 · unverdicted · novelty 6.0

Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

citing papers explorer

Showing 5 of 5 citing papers after filters.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 42 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 35 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 44 · internal anchor
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model cs.AI · 2025-03-07 · conditional · none · ref 18 · internal anchor
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 78 · internal anchor
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer