super hub Canonical reference

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Mohamed Elhoseiny, Xiang Li, Xiaoqian Shen · 2023 · cs.CV · arXiv 2304.10592

Canonical reference. 85% of citing Pith papers cite this work as background.

241 Pith papers citing it

Background 85% of classified citations

open full Pith review browse 241 citing papers more from Deyao Zhu arXiv PDF

abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 46 baseline 3 method 3 dataset 1

citation-polarity summary

background 45 baseline 3 use method 3 support 1 use dataset 1

claims ledger

abstract The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using

authors

Deyao Zhu Jun Chen Mohamed Elhoseiny Xiang Li Xiaoqian Shen

co-cited works

representative citing papers

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PRCR enables replay-free visual revisiting in interleaved multimodal reasoning by storing raw visual KV caches with spatial coordinates and rebinding keys to position-compatible coordinates, matching replay performance while cutting computation by orders of magnitude.

Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

Anchored Privacy Drifting (APD) replaces privacy-sensitive visual elements with semantically equivalent alternatives while anchoring context, evaluated on the new AdaptShield benchmark with reported gains of 10.4% and 8.5% across four MLLM families.

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

cs.CV · 2026-06-01 · conditional · novelty 7.0

EvoCut is a training-free visual token compression technique that identifies important tokens via multi-layer evolution deviation, retaining 11.1% tokens with 94.4% average performance preserved on LLaVA-1.5-7B.

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

The study links three LVLM architectural dimensions to three hallucination types via a new benchmark, finding that language foundation quality reduces co-occurrence errors, visual encoder strength reduces similarity errors, alignment reduces uncertainty errors, and joint visual-alignment improvement

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

OZ-TAL: Online Zero-Shot Temporal Action Localization

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks with a new 75K-pair PolarVQA benchmark.

Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.

ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

VIDA provides 2,500 visually-dependent ambiguous translation examples and span-level disambiguation metrics; CoT-SFT on LVLMs improves out-of-distribution performance over standard SFT.

VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot transfer.

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

Chain of Evidence introduces a retriever-agnostic visual attribution method for iRAG that reasons over document screenshots with VLMs to output precise bounding boxes, outperforming text baselines on Wiki-CoE and SlideVQA.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.

citing papers explorer

Showing 50 of 195 citing papers after filters.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI cs.CL · 2023-11-27 · unverdicted · none · ref 97 · internal anchor
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning cs.CV · 2026-06-25 · unverdicted · none · ref 42 · internal anchor
PRCR enables replay-free visual revisiting in interleaved multimodal reasoning by storing raw visual KV caches with spatial coordinates and rebinding keys to position-compatible coordinates, matching replay performance while cutting computation by orders of magnitude.
Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs cs.CV · 2026-06-05 · unverdicted · none · ref 53 · internal anchor
Anchored Privacy Drifting (APD) replaces privacy-sensitive visual elements with semantically equivalent alternatives while anchoring context, evaluated on the new AdaptShield benchmark with reported gains of 10.4% and 8.5% across four MLLM families.
P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization cs.CV · 2026-06-02 · unverdicted · none · ref 123 · internal anchor
P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.
AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs cs.CV · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.
What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness cs.CV · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
The study links three LVLM architectural dimensions to three hallucination types via a new benchmark, finding that language foundation quality reduces co-occurrence errors, visual encoder strength reduces similarity errors, alignment reduces uncertainty errors, and joint visual-alignment improvement
Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence cs.CV · 2026-05-25 · unverdicted · none · ref 11 · internal anchor
GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.
Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment cs.AI · 2026-05-22 · unverdicted · none · ref 52 · internal anchor
Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing cs.CV · 2026-05-14 · unverdicted · none · ref 63 · internal anchor
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction cs.CV · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 61 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
OZ-TAL: Online Zero-Shot Temporal Action Localization cs.CV · 2026-05-11 · unverdicted · none · ref 44 · internal anchor
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning cs.CV · 2026-05-09 · unverdicted · none · ref 33 · internal anchor
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 34 · 2 links · internal anchor
PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks with a new 75K-pair PolarVQA benchmark.
Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection cs.CV · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.
ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation cs.CL · 2026-05-03 · unverdicted · none · ref 59 · internal anchor
VIDA provides 2,500 visually-dependent ambiguous translation examples and span-level disambiguation metrics; CoT-SFT on LVLMs improves out-of-distribution performance over standard SFT.
VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection cs.CV · 2026-05-02 · unverdicted · none · ref 47 · internal anchor
VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot transfer.
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation cs.CV · 2026-05-02 · unverdicted · none · ref 70 · 2 links · internal anchor
Chain of Evidence introduces a retriever-agnostic visual attribution method for iRAG that reasons over document screenshots with VLMs to output precise bounding boxes, outperforming text baselines on Wiki-CoE and SlideVQA.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models cs.CV · 2026-04-27 · unverdicted · none · ref 20 · internal anchor
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety cs.CR · 2026-04-21 · unverdicted · none · ref 7 · internal anchor
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation cs.CV · 2026-04-20 · unverdicted · none · ref 13 · internal anchor
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts cs.CL · 2026-04-20 · unverdicted · none · ref 73 · internal anchor
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 61 · internal anchor
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
Skill-Conditioned Visual Geolocation for Vision-Language Models cs.CV · 2026-04-10 · unverdicted · none · ref 48 · 2 links · internal anchor
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 63 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models cs.LG · 2026-04-03 · unverdicted · none · ref 44 · internal anchor
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
Topo-R1: Detecting Topological Anomalies via Vision-Language Models cs.CV · 2026-03-13 · unverdicted · none · ref 104 · internal anchor
Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.
SpatialMosaic: A Multiview VLM Dataset for Partial Visibility cs.CV · 2025-12-29 · unverdicted · none · ref 49 · internal anchor
SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models cs.CV · 2025-12-01 · unverdicted · none · ref 70 · internal anchor
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting cs.CV · 2025-08-06 · unverdicted · none · ref 45 · internal anchor
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models cs.CV · 2025-06-10 · unverdicted · none · ref 118 · internal anchor
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 89 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks cs.CV · 2024-12-23 · unverdicted · none · ref 63 · internal anchor
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
Visual Adversarial Attack on Vision-Language Models for Autonomous Driving cs.CV · 2024-11-27 · unverdicted · none · ref 68 · internal anchor
ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achieve SOTA attack effectiveness across benchmarks and real-world tests.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 95 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 52 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs cs.CV · 2024-06-24 · unverdicted · none · ref 157 · internal anchor
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models cs.CV · 2024-06-14 · unverdicted · none · ref 29 · internal anchor
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models cs.CV · 2024-06-13 · unverdicted · none · ref 106 · internal anchor
MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.
3D-VLA: A 3D Vision-Language-Action Generative World Model cs.CV · 2024-03-14 · unverdicted · none · ref 62 · internal anchor
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models cs.CV · 2023-10-23 · unverdicted · none · ref 63 · internal anchor
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension cs.CL · 2023-07-30 · unverdicted · none · ref 7 · internal anchor
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning cs.CV · 2026-07-01 · unverdicted · none · ref 71 · internal anchor
StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.
HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding cs.CV · 2026-07-01 · unverdicted · none · ref 84 · internal anchor
HyFL-CLIP distills Euclidean CLIP alignment into hyperbolic space using cross-manifold similarity and Einstein midpoint aggregation to capture hierarchical part-whole relations, achieving up to 19.5% gains in long-text retrieval under perturbations.
Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs cs.CV · 2026-07-01 · unverdicted · none · ref 53 · internal anchor
Splash partitions MLLM parameters into dormant and critical subspaces via significance quantification, updating only the dormant subspace for tactile alignment while preserving general capabilities and achieving SOTA on visuo-tactile benchmarks.
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context cs.CV · 2026-06-29 · unverdicted · none · ref 47 · internal anchor
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference cs.AI · 2026-06-25 · unverdicted · none · ref 5 · internal anchor
TOPS formulates visual token pruning as constructing Token Optimal Preservation Sets using three information-theoretic principles and demonstrates superior performance on MLLM benchmarks.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning cs.CV · 2026-06-19 · unverdicted · none · ref 294 · internal anchor
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval cs.IR · 2026-06-18 · unverdicted · none · ref 77 · internal anchor
ELVA applies ranking-driven RLVR to multimodal retrieval to reduce grain blindness in contrastive learning, reporting SOTA results and a 13.1% gain on the new MRBench benchmark.

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer