
Unified multimodal discrete diffusion

2025 · arXiv 2503.20853

Canonical reference. 13 Pith papers cite this work; of the 5 citations classified by role, 4 (80%) cite it as background.


citation-role summary

background 4 · baseline 1

citation-polarity summary

background 4 · baseline 1

representative citing papers

AsyncPatch Diffusion: spatially-flexible image generation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AsyncPatch Diffusion introduces asynchronous per-region noise levels in diffusion models, proves a valid ELBO, and uses a controlled sampler to support spatially adaptive generation and native inpainting.

Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

cs.LG · 2025-09-26 · conditional · novelty 7.0

Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

cs.CL · 2026-05-14 · unverdicted · novelty 6.0 · 3 refs

Manta-LM approximates the HJB equation via flow matching in latent control space to realize closed-loop optimal control for language generation.

dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

DVD applies discrete diffusion directly to voxel occupancy for 3D generation, uncertainty estimation via entropy, and single-round editing via block perturbation fine-tuning.

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

cs.RO · 2025-11-18 · unverdicted · novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

cs.LG · 2025-05-29 · unverdicted · novelty 6.0

Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

cs.LG · 2025-05-22 · conditional · novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

Bridging Video Understanding and Generation in a Unified Framework

cs.CV · 2026-06-30 · unverdicted · novelty 5.0

Vega unifies video understanding and generation via a shared vocabulary and a hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

cs.CV · 2026-06-05 · unverdicted · novelty 5.0

TBD-VLA partitions action sequences into temporal blocks and performs masked discrete diffusion within blocks and autoregressive generation across blocks, unifying parallel decoding with temporal coherence in discrete VLA models.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications

eess.SP · 2025-11-11 · unverdicted · novelty 3.0

The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.

citing papers explorer

Showing 6 of 6 citing papers after filters.

AsyncPatch Diffusion: spatially-flexible image generation

cs.CV · 2026-06-05 · unverdicted · none · ref 46

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

cs.CV · 2026-05-08 · unverdicted · none · ref 27 · 2 links

Bridging Video Understanding and Generation in a Unified Framework

cs.CV · 2026-06-30 · unverdicted · none · ref 50

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

cs.CV · 2026-06-05 · unverdicted · none · ref 15

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · none · ref 88

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · none · ref 98
