{"total":12,"items":[{"citing_arxiv_id":"2606.31326","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging Video Understanding and Generation in a Unified Framework","primary_cat":"cs.CV","submitted_at":"2026-06-30T08:29:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07079","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AsyncPatch Diffusion: spatially-flexible image generation","primary_cat":"cs.CV","submitted_at":"2026-06-05T09:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AsyncPatch Diffusion introduces asynchronous per-region noise levels in diffusion models, proves a valid ELBO, and uses a controlled sampler to support spatially adaptive generation and native inpainting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21272","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion 
model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[86] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020. [87] Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, and Katerina Fragkiadaki. Unified multimodal discrete diffusion. arXiv preprint arXiv:2503.20853, 2025. [88] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. [89] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403."},{"citing_arxiv_id":"2605.14531","ref_index":42,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space","primary_cat":"cs.CL","submitted_at":"2026-05-14T08:13:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Manta-LM approximates the HJB equation via flow matching in latent control space to realize closed-loop optimal control for language generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09291","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T03:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior 
GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07971","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DVD: Discrete Voxel Diffusion for 3D Generation and Editing","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:32:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DVD applies discrete diffusion directly to voxel occupancy for 3D generation, uncertainty estimation via entropy, and single-round editing via block perturbation fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"state diffusion models (USDMs) and mask diffusion models (MDMs). USDMs have a uniform prior distribution across all possible states and all tokens, while MDMs mask every position with a special MASK token. Discrete diffusion models have been developed and studied in various tasks, such as text generation [25], image generation [26], multimodal modeling [27], protein generation [24], and pose estimation [ 28], etc. For applications of discrete diffusion in 3D tasks, Song et al. 
[29] investigated mesh generation using discrete diffusion models, and TD3D [ 30] leveraged discrete diffusion models for shape generation in a quantized latent space, scaffold diffusion [31] used DDMs to generate labels for given voxels."},{"citing_arxiv_id":"2511.14148","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-11-18T05:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.08416","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications","primary_cat":"eess.SP","submitted_at":"2025-11-11T16:27:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"These three categories extend diffusion models to accommodate diverse modalities, domains, and tasks. 
Category # Related Work Venue Illustration Links MODALITY EXPANSION 1 MonoFormer [112] arXiv'24 Language VisionDiffusion Model Tokenization Alignment Expansion Tokenization 2 Diffusion Forcing [113] NeurIPS'24 3 Show-o [114] ICLR'25 4 Transfusion [115] ICLR'25 5 UniDisc [116] arXiv'25 DOMAIN ADAPTATION 1 DSB [117] NeurIPS'21 Human Tiger Food Cat Adaptation Source Domain Target Domain 2 Composable Diffusion [118] ECCV'22 3 DreamBooth [119] CVPR'23 4 I2SB [120] ICML'23 5 P2P-Bridge [121] ECCV'24 TASK GENERALIZATION 1 Diffuser [122] ICML'22 Multi-Task Learning Policy Diffusion Process New-Task Generalization 2 Diffusion Policy [123] RSS'23"},{"citing_arxiv_id":"2509.21912","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching","primary_cat":"cs.LG","submitted_at":"2025-09-26T05:51:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15564","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Show-o2: Improved Native Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-06-18T15:39:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Assembling Tailored Models Paradigm Chameleon [102]✓ ✓ARShow-o [128]✓ ✓AR + Diff.Transfusion 
[147]✓ ✓AR + Diff.VILA-U [123]✓ ✓ ✓AREmu3 [114]✓ ✓ ✓ARMonoFormer [146]✓ ✓AR + Diff.Dual-Diffusion [63]✓ ✓Diff.SynerGen-VL [58]✓ ✓ARMMAR [134]✓ ✓AR + MARMUSE-VL [129]✓ ✓AROrthus [53]✓ ✓AR + Diff.Liquid [118]✓ ✓ARLlamaFusion [95]✓ ✓AR + Diff.UGen [99]✓ ✓ARUniDisc [98]✓ ✓Diff.UniToken [50]✓ ✓ARHarmon [122]✓ ✓AR+MARDualToken [96]✓ ✓ARUniTok [77]✓ ✓ARSelftok [110]✓ ✓ARMuddit [94]✓ ✓Diff.MMaDA [135]✓ ✓Diff.HaploOmni [124]✓ ✓ ✓AR + Diff.TokLIP [68]✓ ✓ARShow-o2 (Ours) ✓ ✓ ✓ AR + Diff. Janus-Series [26, 27, 79]✓ ✓AR (+Diff.)V ARGPT [148]✓ ✓ARUnidFluid [38]✓ ✓AR + MAROmniMamba [149]✓ ✓ARMogao [65]✓ ✓AR + Diff.BAGEL [32]✓ ✓ ✓AR + Diff."},{"citing_arxiv_id":"2505.23606","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model","primary_cat":"cs.LG","submitted_at":"2025-05-29T16:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16933","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning","primary_cat":"cs.LG","submitted_at":"2025-05-22T17:23:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive 
architecture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent attempts to incorporate diffusion models [22-26] into MLLMs have predominantly adopted one of two strategies: either leveraging autoregressive models to provide strong language modeling capabilities [27-32], or employing discrete diffusion-based approaches with limited language modeling capacity, which consequently leads to suboptimal performance [33, 34]. Encouragingly, recent advances in discrete diffusion models [25, 26, 35-43] have shown promising potential to overcome these limitations. In particular, LLaDA [42] has demonstrated performance competitive with LLaMA3-8B-Instruct [18] through large-scale pre-training and SFT, while retaining favorable scaling properties. Nevertheless, while LLaDA has shown remarkable progress in language"}],"limit":50,"offset":0}