VisionGPT: Vision-language understanding agent using generalized multimodal framework

2024 · arXiv 2403.09027

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

cs.MA · 2026-04-06 · unverdicted · novelty 6.0

GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.

DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation

cs.CV · 2026-04-19 · unverdicted · novelty 5.0

DREAM introduces a two-stage adaptive multi-modal fusion framework that reaches BLEU-4 of 0.241 on DeepEyeNet for retinal image report generation and generalizes to ROCO.

citing papers explorer

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

cs.CV · 2026-03-28 · unverdicted · none · ref 40

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

cs.MA · 2026-04-06 · unverdicted · none · ref 17

GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.

DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation

cs.CV · 2026-04-19 · unverdicted · none · ref 6

DREAM introduces a two-stage adaptive multi-modal fusion framework that reaches BLEU-4 of 0.241 on DeepEyeNet for retinal image report generation and generalizes to ROCO.
