hub Canonical reference

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed · 2023 · cs.CV · arXiv 2303.11381

Canonical reference. 88% of citing Pith papers cite this work as background.

56 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 56 citing papers arXiv PDF

abstract

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 14 baseline 1 method 1

citation-polarity summary

background 14 baseline 1 use method 1

representative citing papers

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

cs.AI · 2026-06-11 · unverdicted · novelty 7.0 · 2 refs

TerraBench is a new benchmark with 403 tasks across Earth-science domains that evaluates LLM agents on coordinating heterogeneous data using executable ReAct-style workflows and process-level metrics.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

cs.CV · 2026-05-11 · conditional · novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

cs.CV · 2025-06-26 · unverdicted · novelty 7.0

FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.

Deep Multimodal Learning with Missing Modality: A Survey

cs.CV · 2024-09-12 · unverdicted · novelty 7.0

This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

cs.CV · 2023-10-17 · accept · novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

cs.CV · 2026-06-18 · unverdicted · novelty 6.0 · 2 refs

S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.

VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.

MedCTA: A Benchmark for Clinical Tool Agents

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

MedCTA is a new benchmark with 107 real-world tasks and process-aware metrics that shows frontier multimodal models remain brittle at autonomous tool selection, execution, and trajectory completion in clinical settings.

MUSE: A Unified Agentic Harness for MLLMs

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and verifier-guided repair without model retraining.

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

LDKE framework localizes fact-specific layers and disentangles inputs to improve generalization and locality in multimodal knowledge editing for MLLMs.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.

citing papers explorer

Showing 50 of 56 citing papers.

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data? cs.AI · 2026-06-11 · unverdicted · none · ref 38 · 2 links · internal anchor
TerraBench is a new benchmark with 403 tasks across Earth-science domains that evaluates LLM agents on coordinating heterogeneous data using executable ReAct-style workflows and process-level metrics.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 82 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models cs.CV · 2026-05-25 · unverdicted · none · ref 46 · internal anchor
STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning cs.AI · 2026-05-25 · unverdicted · none · ref 49 · internal anchor
MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 41 · internal anchor
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation cs.CV · 2026-05-11 · conditional · none · ref 12 · internal anchor
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning cs.CV · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment cs.CL · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation cs.CV · 2026-04-20 · unverdicted · none · ref 139 · internal anchor
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation cs.CV · 2026-04-09 · unverdicted · none · ref 55 · internal anchor
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing cs.CV · 2025-06-26 · unverdicted · none · ref 44 · internal anchor
FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
Deep Multimodal Learning with Missing Modality: A Survey cs.CV · 2024-09-12 · unverdicted · none · ref 73 · internal anchor
This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 104 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models cs.CV · 2023-10-23 · unverdicted · none · ref 49 · internal anchor
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V cs.CV · 2023-10-17 · accept · none · ref 50 · internal anchor
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
VideoChat: Chat-Centric Video Understanding cs.CV · 2023-05-10 · conditional · none · ref 50 · internal anchor
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
Visual Instruction Tuning cs.CV · 2023-04-17 · unverdicted · none · ref 55 · internal anchor
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence cs.CV · 2026-06-18 · unverdicted · none · ref 34 · 2 links · internal anchor
S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.
VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers cs.CV · 2026-06-17 · unverdicted · none · ref 17 · internal anchor
VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.
MedCTA: A Benchmark for Clinical Tool Agents cs.CV · 2026-06-10 · unverdicted · none · ref 74 · internal anchor
MedCTA is a new benchmark with 107 real-world tasks and process-aware metrics that shows frontier multimodal models remain brittle at autonomous tool selection, execution, and trajectory completion in clinical settings.
MUSE: A Unified Agentic Harness for MLLMs cs.CV · 2026-06-02 · unverdicted · none · ref 54 · internal anchor
MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and verifier-guided repair without model retraining.
Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
LDKE framework localizes fact-specific layers and disentangles inputs to improve generalization and locality in multimodal knowledge editing for MLLMs.
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning cs.CV · 2026-05-27 · unverdicted · none · ref 66 · internal anchor
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.
REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization cs.CV · 2026-05-26 · unverdicted · none · ref 25 · internal anchor
REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.
InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward cs.CV · 2026-05-26 · unverdicted · none · ref 22 · internal anchor
InterSketch improves long-horizon visual-textual chain-of-thought in VLMs by dynamically generating and interleaving self-correcting visual sketches with text, using a synthesized dataset plus reflection in cold-start followed by stepwise-reward RL, and reports outperforming Gemini-3-Pro on benchmar
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation cs.CV · 2026-05-16 · unverdicted · none · ref 13 · 2 links · internal anchor
TRACE improves multi-video event understanding by grounding evidence in structured timelines before visual reasoning, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR 2026.
PresentAgent-2: Towards Generalist Multimodal Presentation Agents cs.CV · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 58 · internal anchor
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion cs.CV · 2026-05-04 · unverdicted · none · ref 57 · internal anchor
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates cs.CV · 2026-04-17 · unverdicted · none · ref 38 · internal anchor
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithmetic, and proposes a Table Router Pipeline combining VLMs with rule-based execution.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 40 · internal anchor
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning cs.CL · 2026-03-01 · unverdicted · none · ref 33 · internal anchor
Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing cs.CV · 2025-06-11 · unverdicted · none · ref 72 · internal anchor
VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought cs.CL · 2025-01-13 · unverdicted · none · ref 24 · internal anchor
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
TempCompass: Do Video LLMs Really Understand Videos? cs.CV · 2024-03-01 · unverdicted · none · ref 128 · internal anchor
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception cs.CL · 2024-01-29 · conditional · none · ref 5 · internal anchor
Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on the introduced Mobile-Eval benchmark.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 75 · internal anchor
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
A Survey on Large Language Model based Autonomous Agents cs.AI · 2023-08-22 · accept · none · ref 77 · internal anchor
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models cs.CL · 2023-05-23 · conditional · none · ref 3 · internal anchor
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
ChemCrow: Augmenting large-language models with chemistry tools physics.chem-ph · 2023-04-11 · conditional · none · ref 51 · internal anchor
ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents cs.AI · 2026-06-02 · unverdicted · none · ref 62 · internal anchor
ToolGate is a pre-call controller that reduces token usage in tool-augmented VLM agents to 64-69% of baseline while preserving or slightly improving accuracy on benchmarks.
Grounded 3D-Aware Spatial Vision-Language Modeling cs.CV · 2026-05-28 · unverdicted · none · ref 95 · internal anchor
GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.
When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness? cs.CV · 2026-05-27 · unverdicted · none · ref 11 · internal anchor
Explicit image-tool interaction in VLMs cuts multimodal jailbreak ASR by ~30% on average; the effect is attributed to a safety-relevant shift in hidden representations rather than image semantics or text traces.
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration cs.CV · 2026-05-01 · unverdicted · none · ref 26 · internal anchor
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People cs.AI · 2026-04-27 · unverdicted · none · ref 30 · internal anchor
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle cs.CV · 2025-11-21 · unverdicted · none · ref 32 · 2 links · internal anchor
FireScope trains a VLM on US data to output wildfire risk rasters with reasoning traces and shows improved cross-continental performance on European events compared with prior approaches.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 167 · internal anchor
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models cs.CV · 2023-11-13 · unverdicted · none · ref 39 · internal anchor
SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model cs.CV · 2023-04-28 · conditional · none · ref 69 · internal anchor
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer