VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
hub Mixed citations
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Mixed citation behavior. Most common role is background (60%).
abstract
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
citing papers explorer
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
-
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
-
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
-
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
-
Semantic Generative Tuning for Unified Multimodal Models
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
-
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
PhotoFramer: Multi-modal Image Composition Instruction
PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.
-
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
-
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.