BrainCause recovers known visual localizations and finds new candidate representations by validating causal specificity via counterfactual stimuli and encoding models, showing activation alone produces many false positives.
hub Mixed citations
Instructpix2pix: Learning to follow image editing instructions
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.
SafeMark integrates a thresholded watermark-decoding loss into diffusion editors to enable text-guided edits that preserve embedded watermarks with high bit accuracy.
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.
SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.
citing papers explorer
-
From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain
BrainCause recovers known visual localizations and finds new candidate representations by validating causal specificity via counterfactual stimuli and encoding models, showing activation alone produces many false positives.
-
StreamingEffect: Real-Time Human-Centric Video Effect Generation
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes
ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
-
TextSculptor: Training and Benchmarking Scene Text Editing
TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.
-
Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing
SafeMark integrates a thresholded watermark-decoding loss into diffusion editors to enable text-guided edits that preserve embedded watermarks with high bit accuracy.
-
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
-
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
-
One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration
Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.
-
SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob
-
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
-
Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.
- Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm