{"total":11,"items":[{"citing_arxiv_id":"2605.21661","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Hierarchical Variational Policies for Reward-Guided Diffusion","primary_cat":"cs.LG","submitted_at":"2026-05-20T19:13:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20910","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:55:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlowLong generates videos several times longer than native model windows by blending adjacent predictions with Tweedie matching to enforce manifold and temporal consistency while using stochastic noise injection early and deterministic sampling later.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16003","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-15T14:33:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15592","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Efficient Image Synthesis with Sphere Latent Encoder","primary_cat":"cs.CV","submitted_at":"2026-05-15T04:03:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11347","ref_index":35,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Gradient-Free Noise Optimization for Reward Alignment in Generative Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T00:05:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07288","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:54:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-training success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07253","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903-15935, 2023. [45] Katherine Xu, Lingzhi Zhang, and Jianbo Shi. Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3024-3034. IEEE, 2025. [46] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613-6623, 2024. [47] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, In So Kweon, and Junmo Kim."},{"citing_arxiv_id":"2605.06376","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Continuous-Time Distribution Matching for Few-Step Diffusion Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:56:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05975","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems","primary_cat":"cs.LG","submitted_at":"2026-05-07T10:20:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Distilled one-step consistency model from optimal-transport flow-matching teacher reconstructs high-fidelity dynamical system flows from low-fidelity data with 12x speedup, half the parameters, and 23.1% better SSIM than scratch-trained baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25427","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14430","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference","primary_cat":"cs.CV","submitted_at":"2024-05-23T11:00:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}