In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Canonical reference. 82% of citing Pith papers cite this work as background.
claims ledger
- background globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining geometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], address this by leveraging strong 2D visual priors. How
- background language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords: Prompt inversion · Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual content creation, enabling users to generate photorealistic images from natural-langua
- background This physics-based reference ˆImv v guarantees global illumination consistency across views but lacks photorealistic high-frequency details (e.g. specularities, sky textures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light. We refine ˆImv v with IC-Light [48], a relighting diffusion model adapted from Stable Diffusion [25]. While IC-Light produces photorealistic lighting effects, applying it independently per view breaks multi-view consist
- background Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [20]. Comme
- background To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation (Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim
- background Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated regions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advances in generative modeling, including GANs [4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn the mapping
citing papers explorer
-
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
-
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration
DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
-
SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability
SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering
Fake3DGS benchmark shows state-of-the-art 2D fake detectors fail on 3D-manipulated Gaussian Splatting images while a new multi-view coherence method improves detection.
-
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset
A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.
-
Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch
A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
Training-Free Refinement of Flow Matching with Divergence-based Sampling
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
-
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
-
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.
-
HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
-
MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
-
DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.
-
StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation
StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.
-
Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution
DS-DiT decouples low-resolution and reference interactions in a siamese diffusion transformer and adds a patch-level weights module plus autoguidance to improve reference-based super-resolution for remote sensing images.
-
The Learnability Gap in Medical Latent Diffusion
Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Diffusion Model as a Generalist Segmentation Learner
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
-
Optimizing Diffusion Priors in Image Reconstruction from a Single Observation
Combining diffusion priors as a product-of-experts and optimizing exponents via Bayesian evidence maximization enables prior tuning from one observation in inverse imaging problems.
-
FluSplat: Sparse-View 3D Editing without Test-Time Optimization
FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.
-
AlloSR²: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows
AlloSR² rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
-
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
-
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
-
GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts
GIF fuses geometrical image features and logical graph topology in a conditional diffusion model to generate high-quality IR drop images for chip layouts, outperforming prior ML methods on CircuitNet-N28 with SSIM 0.78, Pearson 0.95, PSNR 21.77, and NMAE 0.026.
-
What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction
A Dual-UNet diffusion model for virtual garment reconstruction from clothed images sets new benchmarks on VITON-HD and DressCode by optimizing Stable Diffusion variants, mask conditioning, and auxiliary losses.
-
GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis
GroundingAnomaly uses a Spatial Conditioning Module and Gated Self-Attention in a frozen diffusion U-Net to synthesize spatially accurate few-shot anomalies, reaching SOTA on MVTec AD and VisA for detection, segmentation, and instance detection.
-
Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling
SUMI distills photon-counting CT quality into routine chest CT by learning to reverse clinically validated acquisition degradations, yielding 15-20% gains in image metrics, better radiologist utility, and up to 15% higher lesion detection sensitivity.
-
Generative Photomosaic with Structure-Aligned and Personalized Diffusion
The paper presents the first generative photomosaic framework that synthesizes tiles via structure-aligned diffusion models and few-shot personalization instead of color-based matching from large tile collections.
-
DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity and stability than prior methods.
-
Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection
Five universal physical descriptors including Laplacian variance, Sobel statistics, and residual noise variance, when integrated as text encodings with CLIP, achieve up to 99.8% accuracy detecting synthetic images across GAN and diffusion model datasets.
-
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
SpectralSplat disentangles appearance from geometry in feed-forward 3D Gaussian Splatting by factoring color into base and adapted streams conditioned on DINOv2 embeddings, trained on paired data from a hybrid relighting pipeline.
-
Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models
CARPA generates anatomically faithful synthetic chest X-rays with controlled clinical concept insertions and deletions to expand training coverage and improve model precision, calibration, and reliability on real benchmarks.
-
Dual-End Consistency Model
DE-CM reaches state-of-the-art one-step FID of 1.70 on ImageNet 256x256 by decomposing PF-ODE trajectories into three critical sub-trajectories and using flow matching plus N2N mapping for stability.
-
InstantID: Zero-shot Identity-Preserving Generation in Seconds
InstantID enables zero-shot identity-preserving image generation from one facial image via a novel IdentityNet that combines strong semantic and weak spatial conditioning with text prompts in diffusion models.
-
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights with adaptive splines.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
Structured State-Space Regularization for Generation-Friendly Image Tokenization
Structured state-space regularization induces spectral structure in image tokenizer latent spaces via an SSM-derived objective, improving generative performance with minimal reconstruction loss.
-
SHIFT: Steering Hidden Intermediates in Flow Transformers
SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.
-
EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience
EEG2Vision reconstructs images from EEG using diffusion models plus LLM-guided boosting, with reconstruction quality holding up reasonably as electrode count drops from 128 to 24 channels.
-
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes
GSAM applies random cropping to enable variable input sizes for efficient SAM fine-tuning, claiming lower compute with comparable or higher accuracy on varied datasets.