MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed · 2023 · cs.CV · arXiv 2303.11381

Canonical reference. 88% of citing Pith papers cite this work as background.

56 Pith papers citing it

abstract

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
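
The prompt design described in the abstract (text descriptions, textualized spatial coordinates, and aligned file names) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in chosen for illustration: the `Detection` type, the `textualize` helper, the `(x1, y1, x2, y2)` box convention, and the output layout are assumptions, not the format used by the MM-REACT authors.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One hypothetical vision-expert output: a label plus a pixel box."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2); an assumed convention


def textualize(file_name: str, caption: str, detections: List[Detection]) -> str:
    """Serialize expert outputs into plain text that a language model can read.

    The file name stands in for the image itself, so later turns can refer
    back to the same visual input by name.
    """
    lines = [f"Image file: {file_name}", f"Caption: {caption}", "Objects:"]
    for det in detections:
        x1, y1, x2, y2 = det.box
        lines.append(f"- {det.label} at (x1={x1}, y1={y1}, x2={x2}, y2={y2})")
    return "\n".join(lines)


# Illustrative values only; no real model or image is involved.
prompt = textualize(
    "example.jpg",
    "a dog chasing a ball on grass",
    [Detection("dog", (34, 50, 210, 240)), Detection("ball", (250, 200, 290, 238))],
)
print(prompt)
```

In a sketch like this, the language model never sees pixels: it receives only the file name as a handle and the experts' outputs as text, which is the kind of association the abstract attributes to the prompt design.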

citation-role summary

background: 14 · baseline: 1 · method: 1

citation-polarity summary

background: 14 · baseline: 1 · use method: 1

representative citing papers

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

cs.AI · 2026-06-11 · unverdicted · novelty 7.0 · 2 refs

TerraBench is a new benchmark with 403 tasks across Earth-science domains that evaluates LLM agents on coordinating heterogeneous data using executable ReAct-style workflows and process-level metrics.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

cs.CV · 2026-05-11 · conditional · novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

cs.CV · 2025-06-26 · unverdicted · novelty 7.0

FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.

Deep Multimodal Learning with Missing Modality: A Survey

cs.CV · 2024-09-12 · unverdicted · novelty 7.0

This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

cs.CV · 2023-10-17 · accept · novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4-generated visual instruction data, reaching an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset and 92.53% accuracy on Science QA.

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

cs.CV · 2026-06-18 · unverdicted · novelty 6.0 · 2 refs

S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.

VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.

MedCTA: A Benchmark for Clinical Tool Agents

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

MedCTA is a new benchmark with 107 real-world tasks and process-aware metrics that shows frontier multimodal models remain brittle at autonomous tool selection, execution, and trajectory completion in clinical settings.

MUSE: A Unified Agentic Harness for MLLMs

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and verifier-guided repair without model retraining.

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

LDKE framework localizes fact-specific layers and disentangles inputs to improve generalization and locality in multimodal knowledge editing for MLLMs.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.

citing papers explorer

Showing 6 of 56 citing papers.

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

cs.CV · 2026-05-13 · unverdicted · none · ref 12 · internal anchor

An OCR-aware multilingual framework combining synthetic data generation, LoRA SFT, and visual CoT prompting improves text extraction and translation robustness in multimodal LLMs on degraded images.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

cs.CV · 2023-09-29 · conditional · none · ref 142 · internal anchor

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · none · ref 23 · internal anchor

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Materials Informatics Across the Length Scales

cond-mat.mtrl-sci · 2026-04-20 · unverdicted · none · ref 152 · internal anchor

A survey of data-driven methods for materials modeling at nanoscale, mesoscale, and micro-to-continuum scales that identifies established capabilities, data quality issues, and obstacles to cross-scale integration.

A Comprehensive Overview of Large Language Models

cs.CL · 2023-07-12 · unverdicted · none · ref 290 · internal anchor

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

cs.CV · 2026-03-31 · unreviewed · ref 37 · internal anchor
