In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Andreas Blattmann, Bjorn Ommer, Dominik Lorenz, Patrick Esser, Robin Rombach · 2022 · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · DOI 10.1109/cvpr52688.2022.01042

Canonical reference. 82% of citing Pith papers cite this work as background.

46 Pith papers citing it

13.8k external citations · external index

Background 82% of classified citations

citation-role summary

background: 9 · method: 2
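
The 82% background share quoted near the top of this page appears consistent with these role counts. The sketch below checks the arithmetic, assuming the "classified citations" are exactly the 9 background and 2 method citations listed here (the page does not state this explicitly, and the variable names are illustrative only).

```python
# Counts copied from the citation-role summary.
# Assumption: these two roles make up the full set of classified citations.
role_counts = {"background": 9, "method": 2}

classified_total = sum(role_counts.values())
background_share = role_counts["background"] / classified_total

print(f"classified citations: {classified_total}")
print(f"background share: {background_share:.1%}")
```

Under that assumption the share is 9 of 11, which rounds to the 82% reported above.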

citation-polarity summary

background: 9 · use method: 2

claims ledger

background globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining geometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], address this by leveraging strong 2D visual priors. How
background language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords: Prompt inversion · Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual content creation, enabling users to generate photorealistic images from natural-langua
background This physics-based reference $\hat{I}^{mv}_{v}$ guarantees global illumination consistency across views but lacks photorealistic high-frequency details (e.g. specularities, sky textures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light. We refine $\hat{I}^{mv}_{v}$ with IC-Light [48], a relighting diffusion model adapted from Stable Diffusion [25]. While IC-Light produces photorealistic lighting effects, applying it independently per view breaks multi-view consist
background Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [20]. Comme
background To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation (Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim
background Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated regions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advances in generative modeling, including GANs [4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn the mapping

authors

Andreas Blattmann, Bjorn Ommer, Dominik Lorenz, Patrick Esser, Robin Rombach

representative citing papers

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

cs.CV · 2026-05-15 · conditional · novelty 7.0

SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering

cs.CV · 2026-04-30 · unverdicted · novelty 7.0

Fake3DGS benchmark shows state-of-the-art 2D fake detectors fail on 3D-manipulated Gaussian Splatting images while a new multi-view coherence method improves detection.

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.

Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.

Novel View Synthesis as Video Completion

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

Training-Free Refinement of Flow Matching with Divergence-based Sampling

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.

PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.

HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane

cs.CV · 2026-03-20 · unverdicted · novelty 7.0

MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.

DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

cs.CV · 2026-03-09 · unverdicted · novelty 7.0

DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

DS-DiT decouples low-resolution and reference interactions in a siamese diffusion transformer and adds a patch-level weights module plus autoguidance to improve reference-based super-resolution for remote sensing images.

The Learnability Gap in Medical Latent Diffusion

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

Diffusion Model as a Generalist Segmentation Learner

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

Optimizing Diffusion Priors in Image Reconstruction from a Single Observation

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Combining diffusion priors as a product-of-experts and optimizing exponents via Bayesian evidence maximization enables prior tuning from one observation in inverse imaging problems.

FluSplat: Sparse-View 3D Editing without Test-Time Optimization

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.

Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Allo{SR}^2 rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.

citing papers explorer

Showing 46 of 46 citing papers.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts cs.CV · 2026-05-12 · unverdicted · none · ref 48 · 2 links
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration cs.CV · 2026-05-20 · unverdicted · none · ref 62
DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability cs.CV · 2026-05-15 · conditional · none · ref 43
SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing cs.CV · 2026-05-14 · unverdicted · none · ref 37
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering cs.CV · 2026-04-30 · unverdicted · none · ref 44
Fake3DGS benchmark shows state-of-the-art 2D fake detectors fail on 3D-manipulated Gaussian Splatting images while a new multi-view coherence method improves detection.
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback cs.CV · 2026-04-22 · unverdicted · none · ref 31
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning cs.LG · 2026-04-21 · unverdicted · none · ref 37
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset cs.CV · 2026-04-17 · unverdicted · none · ref 35
A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.
Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch cs.CV · 2026-04-10 · unverdicted · none · ref 44
A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
Novel View Synthesis as Video Completion cs.CV · 2026-04-09 · unverdicted · none · ref 32
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
Training-Free Refinement of Flow Matching with Divergence-based Sampling cs.CV · 2026-04-06 · unverdicted · none · ref 30
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space cs.LG · 2026-04-03 · unverdicted · none · ref 35
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator cs.CV · 2026-04-03 · unverdicted · none · ref 35
CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.
HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits cs.CV · 2026-04-03 · unverdicted · none · ref 23
HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane cs.CV · 2026-03-20 · unverdicted · none · ref 37
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation cs.CV · 2026-03-09 · unverdicted · none · ref 51
DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.
StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation cs.CV · 2026-05-20 · unverdicted · none · ref 60
StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.
Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution cs.CV · 2026-05-18 · unverdicted · none · ref 26
DS-DiT decouples low-resolution and reference interactions in a siamese diffusion transformer and adds a patch-level weights module plus autoguidance to improve reference-based super-resolution for remote sensing images.
The Learnability Gap in Medical Latent Diffusion cs.CV · 2026-05-16 · unverdicted · none · ref 33
Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity cs.CV · 2026-05-14 · unverdicted · none · ref 42
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Diffusion Model as a Generalist Segmentation Learner cs.CV · 2026-04-27 · unverdicted · none · ref 68
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
Optimizing Diffusion Priors in Image Reconstruction from a Single Observation cs.CV · 2026-04-22 · unverdicted · none · ref 25
Combining diffusion priors as a product-of-experts and optimizing exponents via Bayesian evidence maximization enables prior tuning from one observation in inverse imaging problems.
FluSplat: Sparse-View 3D Editing without Test-Time Optimization cs.CV · 2026-04-21 · unverdicted · none · ref 37
FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.
Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows cs.CV · 2026-04-21 · unverdicted · none · ref 32
Allo{SR}^2 rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 37
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation cs.CV · 2026-04-15 · unverdicted · none · ref 40
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer cs.CV · 2026-04-15 · unverdicted · none · ref 24
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment cs.CV · 2026-04-12 · unverdicted · none · ref 43
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 35
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts cs.CV · 2026-04-11 · unverdicted · none · ref 19
GIF fuses geometrical image features and logical graph topology in a conditional diffusion model to generate high-quality IR drop images for chip layouts, outperforming prior ML methods on CircuitNet-N28 with SSIM 0.78, Pearson 0.95, PSNR 21.77, and NMAE 0.026.
What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction cs.CV · 2026-04-09 · accept · none · ref 26
A Dual-UNet diffusion model for virtual garment reconstruction from clothed images sets new benchmarks on VITON-HD and DressCode by optimizing Stable Diffusion variants, mask conditioning, and auxiliary losses.
GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis cs.CV · 2026-04-09 · unverdicted · none · ref 28
GroundingAnomaly uses a Spatial Conditioning Module and Gated Self-Attention in a frozen diffusion U-Net to synthesize spatially accurate few-shot anomalies, reaching SOTA on MVTec AD and VisA for detection, segmentation, and instance detection.
Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling cs.CV · 2026-04-08 · unverdicted · none · ref 27
SUMI distills photon-counting CT quality into routine chest CT by learning to reverse clinically validated acquisition degradations, yielding 15-20% gains in image metrics, better radiologist utility, and up to 15% higher lesion detection sensitivity.
Generative Phomosaic with Structure-Aligned and Personalized Diffusion cs.CV · 2026-04-08 · unverdicted · none · ref 25
The paper presents the first generative photomosaic framework that synthesizes tiles via structure-aligned diffusion models and few-shot personalization instead of color-based matching from large tile collections.
DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models cs.CV · 2026-04-07 · unverdicted · none · ref 74
DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity and stability than prior methods.
Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection cs.CV · 2026-04-06 · unverdicted · none · ref 22
Five universal physical descriptors including Laplacian variance, Sobel statistics, and residual noise variance, when integrated as text encodings with CLIP, achieve up to 99.8% accuracy detecting synthetic images across GAN and diffusion model datasets.
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes cs.CV · 2026-04-03 · unverdicted · none · ref 25
SpectralSplat disentangles appearance from geometry in feed-forward 3D Gaussian Splatting by factoring color into base and adapted streams conditioned on DINOv2 embeddings, trained on paired data from a hybrid relighting pipeline.
Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models cs.CV · 2026-03-16 · unverdicted · none · ref 26
CARPA generates anatomically faithful synthetic chest X-rays with controlled clinical concept insertions and deletions to expand training coverage and improve model precision, calibration, and reliability on real benchmarks.
Dual-End Consistency Model cs.CV · 2026-02-11 · unverdicted · none · ref 38
DE-CM reaches state-of-the-art one-step FID of 1.70 on ImageNet 256x256 by decomposing PF-ODE trajectories into three critical sub-trajectories and using flow matching plus N2N mapping for stability.
InstantID: Zero-shot Identity-Preserving Generation in Seconds cs.CV · 2024-01-15 · unverdicted · none · ref 16
InstantID enables zero-shot identity-preserving image generation from one facial image via a novel IdentityNet that combines strong semantic and weak spatial conditioning with text prompts in diffusion models.
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation cs.CV · 2026-04-29 · unverdicted · none · ref 23
DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights with adaptive splines.
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation cs.RO · 2026-04-13 · unverdicted · none · ref 39
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
Structured State-Space Regularization for Generation-Friendly Image Tokenization cs.CV · 2026-04-13 · unverdicted · none · ref 51 · 2 links
Structured state-space regularization induces spectral structure in image tokenizer latent spaces via an SSM-derived objective, improving generative performance with minimal reconstruction loss.
SHIFT: Steering Hidden Intermediates in Flow Transformers cs.CV · 2026-04-10 · unverdicted · none · ref 28
SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.
EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience cs.CV · 2026-04-09 · unverdicted · none · ref 37
EEG2Vision reconstructs images from EEG using diffusion models plus LLM-guided boosting, with reconstruction quality holding up reasonably as electrode count drops from 128 to 24 channels.
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes cs.CV · 2024-08-22 · unverdicted · none · ref 24
GSAM applies random cropping to enable variable input sizes for efficient SAM fine-tuning, claiming lower compute with comparable or higher accuracy on varied datasets.
