{"total":23,"items":[{"citing_arxiv_id":"2605.22507","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Generative Modeling by Value-Driven Transport","primary_cat":"cs.LG","submitted_at":"2026-05-21T13:57:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A control-theoretic linear program yields value-driven transport policies for generative modeling with straight paths and simulation-free training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18052","ref_index":95,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Efficient 3D Content Reconstruction and Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:41:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing regions via generative 3D inpainting, 3D-SceneDreamer[312] targets text-driven 3D-consistent scene generation, and RealmDreamer[220] combines inpainting with depth diffusion for text-driven scenes. Recent Gaussian-splatting-based methods scale scene generation and interactivity: LucidDreamer[42] and Text2Immersion[170] generate 3D Gaussian scenes from text, BloomScene[95] proposes a lightweight structured Gaussian splatting formulation for cross-modal scene generation, and WonderWorld[300] generates interactive scenes from a single image. Efficiency-focused systems such as Bolt3D[235] and WonderTurbo[164] emphasize fast scene generation, while SynCity[60] explores training-free 3D world generation. 
2.1.4 Feed-forward reconstruction"},{"citing_arxiv_id":"2605.13054","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-13T06:23:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02657","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation","primary_cat":"cs.LG","submitted_at":"2026-05-04T14:38:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CARD uses radix decomposition to enable autoregressive modeling of molecular coordinates as a zero-free-energy reference distribution, delivering classical accuracy for absolute free energy on unseen systems at ~40x speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02152","ref_index":27,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking","primary_cat":"cs.CV","submitted_at":"2026-05-04T02:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for 
selective high-resolution denoising.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15311","ref_index":14,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12932","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Turbulent pair dispersion with Stochastic Generative Diffusion Models","primary_cat":"physics.flu-dyn","submitted_at":"2026-04-14T16:20:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion models generate joint pairs of Lagrangian trajectories that reproduce turbulent pair separation statistics, including deviations from Richardson scaling, while preserving single-particle properties.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.13942","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Frozen Forecasting: A Unified Evaluation","primary_cat":"cs.CV","submitted_at":"2025-07-18T14:14:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new evaluation framework using latent diffusion on frozen vision backbones shows video-pretrained models consistently 
outperform image-based ones in forecasting entire trajectories across abstraction levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.18601","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"BulletGen: Improving 4D Reconstruction with Bullet-Time Generation","primary_cat":"cs.GR","submitted_at":"2025-06-23T13:03:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BulletGen enhances 4D dynamic scene reconstruction from monocular videos by supervising Gaussian optimization with diffusion-generated frames aligned at a bullet-time step, achieving SOTA on novel-view synthesis and tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15442","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material","primary_cat":"cs.CV","submitted_at":"2025-06-18T13:14:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Hunyuan3D 2.1 is a two-part system with DiT for shape generation and Paint for texture synthesis that produces high-fidelity 3D assets with PBR materials.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.08809","ref_index":7,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Training-Free Inference for High-Resolution Sinogram 
Completion","primary_cat":"cs.CV","submitted_at":"2025-06-10T13:59:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.05398","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching","primary_cat":"cs.GR","submitted_at":"2025-06-03T20:04:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"2ndMatch finetunes pruned diffusion models via second-order Jacobian matching inspired by Finite-Time Lyapunov Exponents to reduce the quality gap with dense models on image generation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.00721","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Common Inpainted Objects In-N-Out of Context","primary_cat":"cs.CV","submitted_at":"2025-05-31T21:42:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"COinCO is a new dataset of inpainted COCO images with in- and out-of-context objects, enabling context reasoning, object prediction from scenes, and improved fake image detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.00433","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis","primary_cat":"cs.CV","submitted_at":"2025-05-31T07:28:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent Wavelet Diffusion uses wavelet energy map 
masking and a scale-consistent VAE to improve detail fidelity in 2K-4K image generation without extra inference overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.18780","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion","primary_cat":"cs.RO","submitted_at":"2025-05-24T16:33:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamPolicy integrates an autoregressive diffusion world model with policy learning to produce a single scalable policy that generalizes to unseen composite terrains for humanoid locomotion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07818","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DanceGRPO: Unleashing GRPO on Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"1 [25], CLIP score [26], VideoAlign [14], and GenEval [27], in visual generation tasks. Notably, DanceGRPO also enables models to learn the denoising trajectory in Best-of-N inference scaling. We also make some initial attempts to enable models to capture the distribution of binary (0/1) reward models, showing its ability to capture sparse, thresholding feedback. 2 Approach 2.1 Preliminary Diffusion Model [ 1]. 
A diffusion process gradually destroys an observed datapoint x over timestep t, by mixing data with noise, and the forward process of the diffusion model can be defined as: z_t = α_t x + σ_t ϵ, where ϵ ∼ N(0, I), (1) and α_t and σ_t denote the noise schedule. The noise schedule is designed in a way such that z_0 is close to clean data and z_1 is close to Gaussian noise."},{"citing_arxiv_id":"2505.05472","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","primary_cat":"cs.CV","submitted_at":"2025-05-08T17:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.17761","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Step1X-Edit: A Practical Framework for General Image Editing","primary_cat":"cs.CV","submitted_at":"2025-04-24T17:25:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, due to reliance on discrete tokens and sequence length constraints, AR models often struggle to produce high-resolution and photorealistic results, especially in complex scenes. 
Diffusion models have become the dominant approach for high-fidelity image synthesis, offering strong capabilities in photorealism, structural consistency, and diversity. Beginning with DDPM [19] and DDIM [51], and further advanced by Latent Diffusion [47, 42], diffusion models operate in latent spaces for improved scalability. With the introduction of DiT architectures [41], diffusion models have made significant strides in generalization, image quality, and knowledge capacity, becoming the predominant architecture in modern image generation [1, 5, 7]."},{"citing_arxiv_id":"2504.13074","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkyReels-V2: Infinite-length Film Generative Model","primary_cat":"cs.CV","submitted_at":"2025-04-17T16:37:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 [5] demonstrate progressively diminishing quality disparities with their proprietary counterparts. These improvements stem from multi-faceted innovations: architectural transitions from U-Net [22] to DiT [23] or MMDiT [24] structures, enhanced VAE implementations [25, 26, 27, 28, 29, 18, 5], upgraded text encoders [30, 31, 18, 32], and paradigm shifts from DDPM [33, 34] to flow matching [35, 24] optimization. Concurrently, refined data processing pipelines and advancements in video captioning capabilities (GPT-4o [36], Qwen2.5-VL [8], Gemini 2.5 [37], Tarsier2 [38], etc.) have significantly contributed to quality enhancements. 
The frontier of research now extends to novel integrations of reinforcement learning, hybrid autoregressive-diffusion approaches, and long-form video generation"},{"citing_arxiv_id":"2503.10631","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2025-03-13T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"While such methods enable generalized manipulation skills [10], they quantize continuous actions into discrete bins by adding new embeddings into the vocabulary in large language models (LLMs), which disrupts the continuity of action pose and hinders precise control [16]. On the other hand, building on the success of diffusion models in content generation [ 17, 18, 19, 20], diffusion policies have been introduced in robotic imitation learning [21, 22, 23, 24, 25, 26]. Recent diffusion-based VLA methods [13, 14, 16, 12] incorporate a diffusion head after the VLM, leveraging probabilistic noise-denoising for action prediction. 
While these methods enable precise manipulation, the diffusion head operates independently of the VLM and lacks internet-scale pretraining."},{"citing_arxiv_id":"2501.12202","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation","primary_cat":"cs.CV","submitted_at":"2025-01-21T15:16:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Hunyuan3D 2.0 scales flow-based diffusion transformers and texture synthesis models to generate high-resolution textured 3D assets that outperform prior state-of-the-art in geometry, alignment, and texture quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"demanding high expertise and proficiency in digital content creation software. As a result, the automated generation of high-resolution digital 3D assets has emerged as one of the most exciting and sought-after topics in recent years. Despite the importance of automated 3D generation and rapid development in image and video generation fueled by the rise of diffusion models [ 33, 74, 24, 50, 43], the field of 3D generation appears to be relatively stagnant in the era of large models and big data, with only a handful of works making gradual progress [111, 118, 49]. 
Building on the 3DShape2Vectset [111], Michelangelo [118] and CLAY [113] gradually enhance shape generation performance, where CLAY is the first work to demonstrate the unprecedented potential of diffusion models in 3D asset generation."},{"citing_arxiv_id":"2412.03134","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Probabilistic Formulation of Offset Noise in Diffusion Models","primary_cat":"stat.ML","submitted_at":"2024-12-04T08:57:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A diffusion model variant that adds structured non-zero-mean noise via modified forward/reverse processes, yielding an ELBO loss analogous to offset noise but with time-dependent coefficients, and showing gains on synthetic high-dimensional data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.06158","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2024-10-08T16:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"to evaluate the performance under the condition of data scarcity, we train GR-2 using approximately 1/8 of the full dataset, which corresponds to around 50 trajectories per task. To enable better generalization to unseen scenarios, we perform data augmentation during fine-tuning by adding new objects into the scene and/or changing the background. 
To insert new objects into the scene, a diffusion model [22] is trained with a combination of a self-collected object dataset and the Open Images dataset [23]. This model enables us to insert a specific object in a designated region. For changing the background, we utilize Segment Anything Model (SAM) [24] to extract regions corresponding to the background. Finally, we employ a video generation model [25] that conditions on the original video and the inpainted frame to produce"}],"limit":50,"offset":0}