neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
hub Canonical reference
Diffedit: Diffusion-based semantic image editing with mask guidance
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
RDSplat is the first 3D Gaussian Splatting watermarking method that maintains 0.701 bit accuracy against both 2D and 3D diffusion editing by embedding only in low-frequency primitives selected via FAPS.
Factored Classifier-Free Guidance enables per-attribute control in classifier-free guidance for diffusion models to produce more sound counterfactuals.
UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
GeoEdit introduces a Lift-Manipulate-Render-Denoise pipeline with dual-branch denoising and variance-homogeneous injection for 3D-consistent object editing in single photos.
MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
Empirical studies reveal instruction-level generalization as the main bottleneck in scribble-guided editing; three strategies (curriculum, multi-task mosaicking, edit-focused loss) achieve SOTA on VIBE benchmark.
ACE-LoRA introduces adaptive orthogonal decoupling and rank-invariant compression for continual image editing in diffusion models, plus the CIE-Bench benchmark.
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.
UniCodec uses LLM-driven semantic disentanglement at the encoder and diffusion-based compositional generation at the decoder to enable one codec for both human perception and machine vision tasks without task-specific retraining.
An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
A score-based method is introduced to guide optimization in geometric view diffusion models toward correct viewpoints, improving convergence and sample efficiency over naive multistart strategies.
citing papers explorer
-
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
-
Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation
LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.
-
Functionalization via Structure Completion and Motion Rectification
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
-
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
-
RDSplat: Robust Watermarking for 3D Gaussian Splatting Against 2D and 3D Diffusion Editing
RDSplat is the first 3D Gaussian Splatting watermarking method that maintains 0.701 bit accuracy against both 2D and 3D diffusion editing by embedding only in low-frequency primitives selected via FAPS.
-
Factored Classifier-Free Guidance
Factored Classifier-Free Guidance enables per-attribute control in classifier-free guidance for diffusion models to produce more sound counterfactuals.
-
UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
-
GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising
GeoEdit introduces a Lift-Manipulate-Render-Denoise pipeline with dual-branch denoising and variance-homogeneous injection for 3D-consistent object editing in single photos.
-
MirrorPPR: Exemplar-Based Portrait Photo Retouching
MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.
-
SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
-
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
-
Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking
Empirical studies reveal instruction-level generalization as the main bottleneck in scribble-guided editing; three strategies (curriculum, multi-task mosaicking, edit-focused loss) achieve SOTA on VIBE benchmark.
-
ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing
ACE-LoRA introduces adaptive orthogonal decoupling and rank-invariant compression for continual image editing in diffusion models, plus the CIE-Bench benchmark.
-
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
-
Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
-
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
-
FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.
-
Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion
UniCodec uses LLM-driven semantic disentanglement at the encoder and diffusion-based compositional generation at the decoder to enable one codec for both human perception and machine vision tasks without task-specific retraining.
-
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
-
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
-
DiffMagicFace: Identity Consistent Facial Editing of Real Videos
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
-
Landscape-Awareness for Geometric View Diffusion Model
A score-based method is introduced to guide optimization in geometric view diffusion models toward correct viewpoints, improving convergence and sample efficiency over naive multistart strategies.
-
Towards Robust Sequential Decomposition for Complex Image Editing
Develops a synthetic data pipeline for training sequential decomposition in generative image editing, showing robust gains with complexity and sim-to-real transfer via co-training.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
- Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions
- CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator