arxiv: 2601.21406 · v3 · pith:4YYJD5IFnew · submitted 2026-01-29 · 💻 cs.CV · cs.LG

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

classification 💻 cs.CV cs.LG
keywords generation, understanding, UMMs, enhances, unified, visual, capabilities, diverse

Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
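
The training objective described in the abstract, auxiliary generation of pixel, depth, and segmentation representations alongside a standard understanding objective, can be pictured as a weighted sum of per-task losses. The following is only a minimal sketch under stated assumptions: the abstract does not say how the terms are balanced or which loss functions are used, so `LossWeights`, `combined_loss`, and the default weights of 1.0 are hypothetical names and values chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

# The three auxiliary generation targets named in the abstract.
AUX_TASKS = ("pixel", "depth", "segmentation")


@dataclass
class LossWeights:
    # Hypothetical weights: the abstract does not state how terms are balanced.
    understanding: float = 1.0
    aux: dict = field(default_factory=lambda: {task: 1.0 for task in AUX_TASKS})


def combined_loss(understanding_loss, aux_losses, weights=None):
    """Return the weighted sum of the understanding loss and the auxiliary losses.

    `aux_losses` maps each name in AUX_TASKS to a scalar loss value that is
    assumed to have been computed elsewhere by the model.
    """
    weights = weights or LossWeights()
    missing = [task for task in AUX_TASKS if task not in aux_losses]
    if missing:
        raise KeyError(f"missing auxiliary losses: {missing}")
    total = weights.understanding * understanding_loss
    for task in AUX_TASKS:
        total += weights.aux[task] * aux_losses[task]
    return total


# Illustrative scalar values only; these are not results from the paper.
example = combined_loss(
    0.5, {"pixel": 0.25, "depth": 0.125, "segmentation": 0.125}
)
```

Setting every auxiliary weight to zero in this sketch recovers understanding-only post-training, which is the natural baseline for judging whether the generation tasks help.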

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Semantic Generative Tuning for Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

  2. LatentUMM: Dual Latent Alignment for Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

  3. Latent Action Control for Reasoning-Guided Unified Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.

  4. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.