{"total":26,"items":[{"citing_arxiv_id":"2605.23163","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving","primary_cat":"cs.CL","submitted_at":"2026-05-22T02:31:32+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16941","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers","primary_cat":"cs.CL","submitted_at":"2026-05-16T11:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diffusion LLMs can act as their own efficiency teachers by using revokable parallel decoding to identify reliable token orders and then distilling those orders into the model parameters for faster inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14531","ref_index":45,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space","primary_cat":"cs.CL","submitted_at":"2026-05-14T08:13:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11726","ref_index":68,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Block-R1: Rethinking the Role of Block 
Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T08:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to fewer tokens, leading to a finer generation structure close to auto-regressive paradigm. Therefore, the block size c determines the structural granularity of dLLM generation. 2.2 Reinforcement Learning for dLLMs Reinforcement Learning functions as a powerful post-training mechanism to enhance dLLM reasoning capabilities by optimising trajectory-level rewards [68, 63]. Recent studies apply policy optimisation algorithms, particularly Group Relative Policy Optimisation (GRPO) [49, 74, 53, 45, 20, 75, 58, 39], to dLLM domains. 
Generally, the diffusion-based GRPO objective [74] can be formulated as: $J_{\\mathrm{GRPO}}(\\theta, c) = \\mathbb{E}\\left[ \\frac{1}{GL} \\sum_{g=1}^{G} \\sum_{i=1}^{L} \\min\\left( r_g^i(\\theta) \\hat{A}_g, \\mathrm{clip}\\left(r_g^i(\\theta), 1-\\epsilon, 1+\\epsilon\\right) \\hat{A}_g \\right) - \\beta D_{\\mathrm{KL}}\\left(\\pi_\\theta^{(c)} \\| \\pi_{\\mathrm{ref}}^{(c)}\\right) \\right]$"},{"citing_arxiv_id":"2605.11400","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning","primary_cat":"cs.MM","submitted_at":"2026-05-12T01:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10218","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Relative Score Policy Optimization for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09302","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Discrete Langevin-Inspired Posterior Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-10T03:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ΔLPS is a gradient-guided discrete 
posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"samplers can provide a promising path toward practical and general posterior samplers over a wide class of inverse problems over discrete representations. 2 Preliminaries and Related Work Discrete diffusion models have recently emerged as a powerful class of generative models for discrete data, including text [23, 51, 22, 6], code [14, 10], vector-quantized images [44, 19], audio [43], protein, and molecular generation [42, 38, 20, 46]. These models use a forward Markov corruption process over categorical states $z_0 \\in \\{1, \\ldots, K\\}^L$ and learn a reverse denoising model. In D3PM [1], the forward process is specified by transition matrices $Q_t \\in \\mathbb{R}^{K \\times K}$, where $Q_t^{ij}$ denotes the probability of transitioning from token i to token j at time t."},{"citing_arxiv_id":"2605.02263","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-04T06:17:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08144","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training","primary_cat":"cs.LG","submitted_at":"2026-05-02T19:43:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NoiseRater meta-learns instance-level importance scores for noise in 
diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4524-4533, 2020. [51] J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A. J. Borst, R. J. Ragotte, L. F. Milles, et al. De novo design of protein structure and function with rfdiffusion.Nature, 620(7976):1089-1100, 2023. 13 [52] L. Yang, Y . Tian, B. Li, X. Zhang, K. Shen, Y . Tong, and M. Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025. [53] J. Yoon, H. Cho, D. Baek, Y . Bengio, and S. Ahn. Monte carlo tree diffusion for system 2 planning.arXiv preprint arXiv:2502.07202, 2025. [54] Z. Zhou, S. Shao, L. Bai, S. Zhang, Z. Xu, B."},{"citing_arxiv_id":"2604.27720","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation","primary_cat":"cs.AI","submitted_at":"2026-04-30T11:11:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auditing five frontier VLMs reveals severe grounding failures (max 0.23 IoU, 19.1% Acc@0.5) and format collapse (up to 99% parse failure) in medical VQA; fine-tuning yields 85.5% SLAKE recall but perception remains the primary trustworthiness issue.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24763","ref_index":48,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and 
Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22152","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model","primary_cat":"cs.RO","submitted_at":"2026-04-24T01:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"insets in the top corners of each frame display the synchronized wrist views. 3.2.1 Action Control via Unified Token Sequence To ensure action controllability, we integrate actions directly into a unified discrete token space, rather than treating them as auxiliary conditions. Specifically, we employ specialized tokenizers to map heterogeneous modalities into discrete codes: MAGVIT-v2 [45] for RGB observations, LLaDA [31] for language, and FAST [32] for continuous action chunks $a_t$. By serializing these codes into a single flattened sequence, the transformer can model the joint distribution of actions and observations. 
Through self-attention, each visual token directly attends to action tokens, enabling fine-grained control at the token level."},{"citing_arxiv_id":"2604.21904","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:49:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InICLR, 2025. 2 [67] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai- generated image detection.ICLR, 2025. 3, 5, 6 [68] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Mul- timodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025. 2 [69] Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, et al. 
Loki: A comprehensive synthetic data detection benchmark using large multimodal models."},{"citing_arxiv_id":"2604.17068","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stability-Weighted Decoding for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-18T17:04:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"$p(v, x_t \\mid x_{t+1}) \\log \\frac{p(v \\mid x_t)}{p(v \\mid x_{t+1})}$ (12) Using Bayes' theorem, we substitute $p(v \\mid x_t) = \\frac{p(v, x_t \\mid x_{t+1})}{p(x_t \\mid x_{t+1})}$ into the logarithm term: $\\log \\frac{p(v \\mid x_t)}{p(v \\mid x_{t+1})} = \\log \\frac{p(v, x_t \\mid x_{t+1})}{p(x_t \\mid x_{t+1}) p(v \\mid x_{t+1})}$ (13) Substituting this back into the summation: $\\mathbb{E}[D^{(i)}_{\\mathrm{temp}}] = \\sum_{x_t} \\sum_{v} p(v, x_t \\mid x_{t+1}) \\log \\frac{p(v, x_t \\mid x_{t+1})}{p(v \\mid x_{t+1}) p(x_t \\mid x_{t+1})}$ (14) $\\equiv I(x_0^i; x_t \\mid x_{t+1})$. (15) A.3. The Sensitivity-Dependency Bound We now relate this information gain to the token's total dependency on the unknown contexts. Theorem A.2. The expected temporal instability of token $x_0^i$ is a strict lower bound on its mutual information with the total masked context $U_{t+1}$. 
That is, high instability implies high dependency on the remaining unknowns."},{"citing_arxiv_id":"2604.16514","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-15T09:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"InternVL3.5 [19] 4B 57.4 38.2 2236 66.7 65.6 80.6 86.2 InternVL3.5 [19] 8B 57.2 41.0 2359 63.1 66.3 82.1 87.0 Diffusion Vision-Language Models LLaDA-V [26] 8B 48.8 35.4 1998 63.4 60.4 77.8 78.2 Dream-VL [25] 7B 51.6 25.0 2179 67.7 59.9 80.4 86.2 LaviDa [11] 8B 44.2 28.6 1711 40.3 47.0 70.1 64.6 SDAR-VL [5] 8B 44.0 28.2 2142 66.1 53.3 79.6 82.4 MMaDA [23] 8B 30.2 21.5 1287 28.2 25.7 54.9 43.2 Dimple-VL [27] 7B 46.4 24.1 1924 51.9 47.7 74.2 58.4 BARD-VL Converted from Qwen3-VL BARD-VL(𝐵=32)2B 42.0 27.9 2045 64.6 53.1 72.6 76.8 BARD-VL(𝐵=32)4B 53.0 34.2 2305 71.9 63.6 82.8 80.2 BARD-VL(𝐵=4)8B 54.6 37.6 2393 70.7 65.0 83.2 84.6 MMStar), and document understanding (AI2D and ChartQA). 
All benchmark evaluations are conducted with VLMEvalKit, the open-"},{"citing_arxiv_id":"2604.10784","ref_index":29,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training","primary_cat":"cs.AI","submitted_at":"2026-04-12T19:19:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08302","ref_index":90,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DMax: Aggressive Parallel Decoding for dLLMs","primary_cat":"cs.LG","submitted_at":"2026-04-09T14:35:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[88] Chen Xu and Dawei Yang. Dllmquant: Quantizing diffusion-based large language models.arXiv preprint arXiv:2508.14090, 2025. [89] Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, et al. Lopa: Scaling dllm inference via lookahead parallel decoding.arXiv preprint arXiv:2512.16229, 2025. [90] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025. 
[91] Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, and Lingpeng Kong. Dream-vl & dream-vla: Open vision-language and vision-language-action models with"},{"citing_arxiv_id":"2604.05497","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-07T06:41:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MMBench includes diverse splits to evaluate a model's general visual perception and visual reasoning capabilities. V* Bench is designed to focus on evaluating the model's performance in high-resolution visual question answering tasks. Models and Baselines. Our experiments are conducted based on two dMLLMs with strong reasoning capabilities: LaViDa-llada-reason [16] and MMaDA-8B-MixCoT [40]. Both models are additionally fine-tuned to generate rationales and are structured to generate progressive responses through an unmasking-based generation process. For a rigorous baseline setup, we include three remasking strategies: Low-confidence (Low-conf), Entropy, and Margin. 
We also incorporate two CoT-based methods, CCoT [24] and DD- CoT [49], which have demonstrated strong performance"},{"citing_arxiv_id":"2603.12554","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages","primary_cat":"cs.LG","submitted_at":"2026-03-13T01:38:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18176","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improving Sampling for Masked Diffusion Models via Information Gain","primary_cat":"cs.CL","submitted_at":"2026-02-20T12:26:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.19433","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-22T14:31:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via 
O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14067","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed","primary_cat":"cs.CL","submitted_at":"2025-12-16T04:12:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13030","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Motus: A Unified Latent Action World Model","primary_cat":"cs.CV","submitted_at":"2025-12-15T06:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ternet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025. 3 [51] Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy condition- ing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6960-6970, 2025. 3 [52] Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. 
Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025. 3 [53] Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators."},{"citing_arxiv_id":"2509.21912","ref_index":83,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching","primary_cat":"cs.LG","submitted_at":"2025-09-26T05:51:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20863","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2025-09-25T07:55:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GIFT weights tokens by entropy during fine-tuning of diffusion language models and reports better performance than standard SFT on reasoning benchmarks across multiple settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15564","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Show-o2: Improved Native Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-06-18T15:39:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Show-o2 unifies text, image, and video understanding and generation in a 
single autoregressive-plus-flow-matching model built on 3D causal VAE representations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"VILA-U [123]✓ ✓ ✓AREmu3 [114]✓ ✓ ✓ARMonoFormer [146]✓ ✓AR + Diff.Dual-Diffusion [63]✓ ✓Diff.SynerGen-VL [58]✓ ✓ARMMAR [134]✓ ✓AR + MARMUSE-VL [129]✓ ✓AROrthus [53]✓ ✓AR + Diff.Liquid [118]✓ ✓ARLlamaFusion [95]✓ ✓AR + Diff.UGen [99]✓ ✓ARUniDisc [98]✓ ✓Diff.UniToken [50]✓ ✓ARHarmon [122]✓ ✓AR+MARDualToken [96]✓ ✓ARUniTok [77]✓ ✓ARSelftok [110]✓ ✓ARMuddit [94]✓ ✓Diff.MMaDA [135]✓ ✓Diff.HaploOmni [124]✓ ✓ ✓AR + Diff.TokLIP [68]✓ ✓ARShow-o2 (Ours) ✓ ✓ ✓ AR + Diff. Janus-Series [26, 27, 79]✓ ✓AR (+Diff.)V ARGPT [148]✓ ✓ARUnidFluid [38]✓ ✓AR + MAROmniMamba [149]✓ ✓ARMogao [65]✓ ✓AR + Diff.BAGEL [32]✓ ✓ ✓AR + Diff.Fudoki [112]✓ ✓Diff.UniGen [104]✓ ✓AR + Diff. NExT-GPT [120]✓ ✓ ✓AR + Diff.CoDI [101]✓ ✓ ✓AR + Diff.DreamLLM [36]✓ ✓AR + Diff."}],"limit":50,"offset":0}