
InstructPix2Pix: Learning to Follow Image Editing Instructions

Tim Brooks, Aleksander Holynski, Alexei A Efros · 2022 · arXiv 2211.09800

Canonical reference. Background accounts for 71% of classified citations (5 of 7) from citing Pith papers.

23 Pith papers citing it


citation-role summary

background 5 · baseline 1 · method 1

citation-polarity summary

background 5 · baseline 1 · use method 1

representative citing papers

RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

RS-Gen proposes a plug-and-play agentic framework with a closed-loop reasoning mechanism that augments base image models to achieve SOTA results on WISE Verified and RISEBench.

V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

V2V-Bench is a new 11-dimension benchmark for video-to-video generation that achieves 0.905 Spearman correlation with human judgments on six V2V-specific dimensions.

Towards Characterizing Scientific Image Utility and Upgradability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

cs.CV · 2026-03-15 · unverdicted · novelty 7.0

Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.

Delta Rectified Flow Sampling for Text-to-Image Editing

cs.CV · 2025-09-01 · unverdicted · novelty 7.0

DRFS is a new inversion-free editing technique for rectified flow models that models source-target velocity discrepancies and applies a time-dependent shift to improve fidelity and unify prior methods like DDS and FlowEdit.

IdeaBlocks: Expressing and Reusing Divergent Intents for Graphic Design Exploration using Generative AI

cs.HC · 2025-07-29 · unverdicted · novelty 7.0

IdeaBlocks modularizes divergent intents into Exploration Blocks with multi-level reuse options, enabling 2.13 times more images explored and 12.5% greater visual diversity than baseline in a comparative user study.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

cs.CV · 2023-03-08 · accept · novelty 7.0

Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

Adding Conditional Control to Text-to-Image Diffusion Models

cs.CV · 2023-02-10 · conditional · novelty 7.0

ControlNet adds spatial conditioning controls to pretrained text-to-image diffusion models via zero convolutions for stable fine-tuning on small or large datasets.

Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems

cs.MA · 2026-06-03 · unverdicted · novelty 6.0

OCL is a governance layer for LLM agents that cuts unsafe executions from 88% to near-zero and raises valid success from 12% to 96% in adversarial buyer-seller negotiations across frontier LLMs.

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

cs.HC · 2026-05-26 · unverdicted · novelty 6.0

Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

SimInsert is a training-free video object insertion technique that decouples the task into single-frame editing and semantic motion description, using image-to-video diffusion models with non-invasive guidance to achieve spatio-temporal coherence.

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.

Stylistic Attribute Control in Latent Diffusion Models

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional loss plus geometric priors to preserve correct component relationships.

Scaling Robot Learning with Semantically Imagined Experience

cs.RO · 2023-02-22 · unverdicted · novelty 6.0

Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

cs.CV · 2026-06-01 · unverdicted · novelty 5.0 · 2 refs

ROGLE introduces automated pseudo region-sentence pairs via RSM and multi-granular learning to boost fine-grained alignment in text-based person search, plus the P-VLG benchmark with over 100k annotated regions.

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

cs.CV · 2025-01-21 · unverdicted · novelty 4.0

Hunyuan3D 2.0 scales flow-based diffusion transformers and texture synthesis models to generate high-resolution textured 3D assets that outperform prior state-of-the-art in geometry, alignment, and texture quality.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

cs.RO · 2026-04-04

citing papers explorer


RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation cs.CV · 2026-06-22 · unverdicted · none · ref 21
V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation cs.CV · 2026-06-04 · unverdicted · none · ref 1
Towards Characterizing Scientific Image Utility and Upgradability cs.CV · 2026-06-02 · unverdicted · none · ref 3
Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 103
Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models cs.CV · 2026-03-15 · unverdicted · none · ref 2
Delta Rectified Flow Sampling for Text-to-Image Editing cs.CV · 2025-09-01 · unverdicted · none · ref 4
IdeaBlocks: Expressing and Reusing Divergent Intents for Graphic Design Exploration using Generative AI cs.HC · 2025-07-29 · unverdicted · none · ref 8
Visual Instruction Tuning cs.CV · 2023-04-17 · unverdicted · none · ref 6
Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems cs.MA · 2026-06-03 · unverdicted · none · ref 114
A Systematic Study of Behavioral Cloning for Scientific Data Annotation cs.HC · 2026-05-26 · unverdicted · none · ref 252
SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion cs.CV · 2026-05-22 · unverdicted · none · ref 5
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation cs.CV · 2026-05-20 · unverdicted · none · ref 3
Stylistic Attribute Control in Latent Diffusion Models cs.CV · 2026-05-04 · unverdicted · none · ref 1
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning cs.CV · 2026-05-01 · unverdicted · none · ref 4
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios cs.CV · 2026-04-15 · unverdicted · none · ref 3
Scaling Robot Learning with Semantically Imagined Experience cs.RO · 2023-02-22 · unverdicted · none · ref 64
ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search cs.CV · 2026-06-01 · unverdicted · none · ref 135 · 2 links
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing cs.CV · 2026-05-16 · unverdicted · none · ref 4
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation cs.CV · 2025-01-21 · unverdicted · none · ref 6
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 102
