VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Bang Yang; Chris Kelly; Cindy Yang; Deshun Yang; Jiayin Hu; Luhui Hu; Yuexian Zou; Yu Tian; Zaoshan Huang; Zihao Li

arxiv: 2403.09027 · v1 · pith:ISWRIHHVnew · submitted 2024-03-14 · 💻 cs.CV

keywords: models, foundation, visiongpt, understanding, vision-language, visual, framework, generalized

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-source or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) using LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals that call suitable foundation models; (2) automatically integrating multi-source outputs from foundation models and generating comprehensive responses for users; (3) adapting to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available.

Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model
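The plan-then-dispatch pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Action` type, the `plan` and `run` functions, the `TOOLS` registry, and both stub "models" are hypothetical names introduced here, and the planner uses simple keyword rules as a stand-in for an LLM such as LLaMA-2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    """One step of a plan: which tool to call and on which input."""
    tool: str
    argument: str


# Hypothetical stand-ins for foundation models; real ones would run inference.
def detect_stub(image_path: str) -> str:
    return f"detections for {image_path}"


def caption_stub(image_path: str) -> str:
    return f"caption for {image_path}"


# Registry mapping a tool name to the callable that implements it (assumed design).
TOOLS: Dict[str, Callable[[str], str]] = {
    "detect": detect_stub,
    "caption": caption_stub,
}


def plan(request: str, image_path: str) -> List[Action]:
    """Stand-in for the LLM pivot: keyword rules replace a language model."""
    text = request.lower()
    actions: List[Action] = []
    if "find" in text or "detect" in text:
        actions.append(Action("detect", image_path))
    if "describe" in text or "caption" in text:
        actions.append(Action("caption", image_path))
    return actions


def run(request: str, image_path: str) -> str:
    """Execute every planned action and merge the outputs into one response."""
    actions = plan(request, image_path)
    outputs = [f"[{a.tool}] {TOOLS[a.tool](a.argument)}" for a in actions]
    return "\n".join(outputs) if outputs else "no suitable tool found"


if __name__ == "__main__":
    print(run("Describe the scene and detect objects", "cat.png"))
```

Keeping the planner separate from the tool registry means a new model can be added by registering one more callable, which is one plausible way to read the abstract's claim of adaptability across applications.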

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
cs.CV 2026-03 unverdicted novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
cs.CL 2026-05 unverdicted novelty 6.0

Introduces ProcedureVQA benchmark and Chain-of-Procedure framework that improves VLM next-step prediction in procedures by up to 13% over baselines.

GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing
cs.MA 2026-04 unverdicted novelty 6.0

GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.

DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation
cs.CV 2026-04 unverdicted novelty 5.0

DREAM introduces a two-stage adaptive multi-modal fusion framework that reaches BLEU-4 of 0.241 on DeepEyeNet for retinal image report generation and generalizes to ROCO.