Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
hub
High-resolution image synthesis with latent diffusion models
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 11representative citing papers
IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming prior scaling strategies.
A time-reversed reconstruction method couples visual language models with constrained diffusion to generate past scene frames from current thermal traces in controlled scenarios.
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
A video-trained large vision model achieves competitive zero-shot performance on organ segmentation, denoising, super-resolution, and 4D CT motion prediction in medical imaging, outperforming some specialized baselines on patient data from 122 cases.
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
A latent diffusion model jointly synthesizes MRI volumes and mixed-type tabular clinical data in a shared space via cross-attention and separate decoders after VAE fusion.
InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretrained diffusion models.
Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.
citing papers explorer
-
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
-
Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming prior scaling strategies.
-
See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models
A time-reversed reconstruction method couples visual language models with constrained diffusion to generate past scene frames from current thermal traces in controlled scenarios.
-
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
-
Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
A video-trained large vision model achieves competitive zero-shot performance on organ segmentation, denoising, super-resolution, and 4D CT motion prediction in medical imaging, outperforming some specialized baselines on patient data from 122 cases.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention
A latent diffusion model jointly synthesizes MRI volumes and mixed-type tabular clinical data in a shared space via cross-attention and separate decoders after VAE fusion.
-
InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretrained diffusion models.
-
Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.