Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Chen Change Loy; Jingbo Gong; Ming-Ming Cheng; Qibin Hou; Rui Zhao; Yikai Wang; Yuhao Wan; Yushi Lan; Ziheng Ouyang

arXiv: 2606.06601 · v1 · pith:7CSO5PM6new · submitted 2026-06-04 · 💻 cs.CV · cs.AI · cs.LG

Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

Pith reviewed 2026-06-28 02:12 UTC · model grok-4.3

keywords: object insertion · 3D-aware generation · diffusion models · pose control · image compositing · decomposed guidance · visual proxies

The pith

Decomposing insertion conditions into separate appearance, geometry, and context pathways enables controllable 3D object insertion without feature entanglement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DIRECT, a diffusion-based framework for inserting a reference object into a background image while allowing explicit user control over the object's 3D pose. It splits the conditioning signals into three parts—appearance details taken from the reference, geometry taken from a user-adjusted 3D proxy, and scene context taken from the target background—and feeds each through its own dedicated injection route. The separation is intended to stop the signals from mixing so that the inserted object keeps its original look, obeys the chosen pose, and still matches the surrounding image. An automated pipeline is also described for building more varied training examples. If the separation works, object insertion gains practical 3D controllability that earlier 2D inpainting methods lack.

Core claim

DIRECT decomposes the insertion conditions into appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background; by injecting them through separate pathways, the method avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene.

What carries the argument

Decomposed injection of appearance guidance, geometry guidance from the 3D proxy, and context guidance through separate pathways inside the diffusion model.
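The Figure 2 caption describes the three conditions as being injected through decomposed LoRA pathways. As a rough illustration only, the sketch below gives each guidance type its own low-rank adapter on top of one shared, frozen projection. The layer sizes, the `adapters` dictionary, and the `project` helper are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 16, 4  # hypothetical sizes, chosen only for the demo

# Frozen base projection shared by every token stream (a stand-in for one
# linear layer of the backbone; the real architecture is not specified here).
W_base = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)


def make_lora(rank, d_model, rng):
    """One low-rank adapter: delta_W = B @ A, with B initialised to zero."""
    A = rng.normal(size=(rank, d_model)) / np.sqrt(d_model)
    B = np.zeros((d_model, rank))
    return {"A": A, "B": B}


# One adapter per guidance type named in the text.
adapters = {name: make_lora(rank, d_model, rng)
            for name in ("appearance", "geometry", "context")}


def project(tokens, pathway=None):
    """Apply the shared base weight, plus the pathway's own LoRA delta if given."""
    out = tokens @ W_base.T
    if pathway is not None:
        lora = adapters[pathway]
        out = out + (tokens @ lora["A"].T) @ lora["B"].T
    return out


# Changing the geometry adapter leaves the appearance pathway untouched.
x = rng.normal(size=(5, d_model))
before = project(x, "appearance")
adapters["geometry"]["B"] = rng.normal(size=(d_model, rank))
after = project(x, "appearance")
print(np.allclose(before, after))
```

This only shows that the adapter parameters are separate at a single layer. It says nothing about whether the streams stay separate once they interact in later attention layers, which is the open question raised in the editorial analysis.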

If this is right

Interactive pose manipulation becomes possible alongside high-fidelity 2D synthesis.
The inserted object preserves reference appearance while following the specified pose and adapting to the scene.
Geometric controllability and visual quality both improve over prior 2D inpainting approaches.
An automated data construction pipeline increases training diversity and quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation idea might apply to other conditional image tasks that need independent control of identity, layout, and environment.
If the 3D proxy is replaced by a text-described pose, the method could extend to language-driven insertion.
Failure modes on real photographs with complex lighting would indicate where the proxy-to-image transfer still needs refinement.

Load-bearing premise

Sending the three different signals down separate pathways is enough to stop them from mixing and to let each factor be controlled on its own.

What would settle it

A test case in which the output either changes the reference object's visual details, deviates from the supplied 3D pose, or fails to match background lighting and shadows would show the separate pathways do not deliver the claimed independent control.
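One way such a test could be scored is sketched below, under stated assumptions. The pose check uses the standard geodesic angle between two rotation matrices; the appearance and context scores are treated as externally supplied numbers in [0, 1], and every threshold is a hypothetical placeholder rather than a value from the paper.

```python
import numpy as np


def rot_y(deg):
    """Rotation about the vertical (y) axis by `deg` degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def geodesic_deg(R_a, R_b):
    """Angle in degrees of the relative rotation R_a^T R_b."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.rad2deg(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def violations(appearance_sim, pose_err_deg, context_score,
               sim_min=0.8, pose_max_deg=10.0, context_min=0.8):
    """Return which of the three factors fail; thresholds are illustrative."""
    failed = []
    if appearance_sim < sim_min:
        failed.append("appearance")
    if pose_err_deg > pose_max_deg:
        failed.append("pose")
    if context_score < context_min:
        failed.append("context")
    return failed


# Example: the estimated output pose is 25 degrees away from the requested one.
err = geodesic_deg(rot_y(0.0), rot_y(25.0))
print(violations(appearance_sim=0.9, pose_err_deg=err, context_score=0.9))
```

A non-empty result on any single factor is the kind of counterexample the paragraph above describes. How the output pose and the two scores would be estimated from a generated image is left open here.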

Figures

Figures reproduced from arXiv: 2606.06601 by Chen Change Loy, Jingbo Gong, Ming-Ming Cheng, Qibin Hou, Rui Zhao, Yikai Wang, Yuhao Wan, Yushi Lan, Ziheng Ouyang.

**Figure 1.** Pose-controllable object insertion. (a) Existing pipelines have difficulty placing the reference object in a reasonable and user-specified pose within the background image, even when using a strong 2D generative model such as Nano Banana Pro (Google, 2025) or a 3D-aware editing model such as Object3DIT (Michel et al., 2023). In contrast, our framework inserts the object with precise pose control and bett…

**Figure 2.** Illustration of our framework. The generation process is controlled by three types of conditions: appearance guidance from the original reference object, geometry guidance from the rendered image with the user-specified pose, and context guidance from global features of the background image. These conditions are injected through decomposed LoRA pathways to reduce interference. The standard masked backgroun…

**Figure 3.** Geometric semantic ambiguity. Standard spatial signals, such as depth and normal maps, fail to distinguish the orientation of symmetric objects, whereas our RGB geometric condition explicitly preserves semantic pose.

**Figure 4.** Appearance fidelity gap. Current image-to-3D models suffer from severe texture degradation. Relying solely on the rendered proxy can lead to blurry outputs, motivating the re-injection of the original reference. 3D Visual Proxy Lifting. The reference object image is 2D, while user interaction is more intuitive when the object can be directly translated and rotated in 3D space. In contrast, standard 2D dif…

**Figure 5.** Overview of geometric alignment pipeline. Given a target image I_gt, we estimate the rendering pose of the 3D proxy P such that its projection matches the target object. The pose-aligned rendering is then used as the geometric condition I_geo for training. training, the precise mask is replaced with a random real-object mask sampled from an external dataset (Wang et al., 2025b). This prevents the model from …

**Figure 6.** Qualitative Comparison. We compare our method against Object3DIT (Michel et al., 2023) and TRELLIS (Xiang et al., 2025). Our method achieves superior identity preservation and background consistency, avoiding the appearance artifacts observed in TRELLIS and the geometric distortions in Object3DIT. IA denotes InsertAnything (Song et al., 2026). pose while maintaining realistic scene integration. 4.2. Qualit…

**Figure 7.** Large pose-change examples. Representative cases show substantial pose variations between the reference object and target pose. These examples require synthesis of largely unseen object views from limited reference appearance, including large rotations, top-view to side-view transformation, and near 180° viewpoint changes. Our method preserves object identity while following the specified pose. view datase…

**Figure 8.** Comparison of geometry guidance signals. Top row: Reference object, RGB/normal guidance at 0°, and RGB/normal guidance at 180°. Bottom row: Background image and the four corresponding generation results. For the symmetric road sign, the normal maps are invariant to the 180° rotation, leading to semantic ambiguity and orientation errors in the normal-based results. In contrast, our RGB proxy provides sem…

**Figure 10.** Robustness to degraded 3D proxies. In an extreme object insertion case with rich textual details on the object surface, the 3D proxy suffers from significant quality degradation. In contrast, our model inserts precise, legible details. Panels: Reference, Rendered 3D proxy, Background, Result.

**Figure 11.** Failure case. The upstream model incorrectly reconstructs the rectangular reference as a square proxy. Our model strictly follows this distorted geometric condition, resulting in an incorrect aspect ratio in the final output. Appendices E–G provide additional analyses on latency, proxy-scene misalignment, and complex environments. 5. Conclusions In this work, we present DIRECT, a framework for posecontr…

**Figure 12.** Overview of the Interactive Inference Pipeline. First, the reference image is lifted into a 3D proxy. Users then manipulate the proxy over the background canvas via a visual gizmo to determine the target 6-DoF pose. Finally, the system automatically renders the necessary conditions to guide our generative framework, yielding a high-fidelity composite image that respects the user-specified pose. C. Interac…

**Figure 13.** Qualitative comparison with intrinsic-guided compositing. The intrinsic-guided compositing baseline provides strong geometric adherence, but struggles to preserve fine-grained reference appearance and overall image realism. In contrast, our method simultaneously achieves pose control, identity preservation, and realistic scene integration. E. Inference Latency and Memory Overhead Since our framework intro…

**Figure 14.** Sensitivity to 3D proxy-scene misalignment. We show representative cases where the user-specified 3D proxy is mildly misaligned with the target scene. In the first example, the proxy is placed slightly above the ground. In the second example, the proxy is not perfectly aligned with the supporting surface. Despite these mild proxy-scene placement errors, our method produces natural insertion results, sugge…

**Figure 15.** Performance in complex environments. We show representative examples involving occlusion, lighting, and reflection. For occlusion, a pen is inserted into a pen holder, where the generated result exhibits a plausible depth relationship between the pen and the holder structure. For lighting, a car is inserted into a scene with strong directional illumination, and the model generates a plausible shadow consi…

**Figure 16.** Visual Demonstrations. We showcase our model's capability to insert various objects into complex real-world backgrounds with high visual fidelity. The results show that our method supports explicit pose control (e.g., varying angles and orientations) while strictly preserving the identity and texture details of the reference objects.
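The ambiguity described in the captions of Figures 3 and 8 can be reproduced in a toy setting. The sketch below models a thin, two-sided rectangular sign seen face-on and compares a camera-space normal map with an RGB rendering at 0° and 180° yaw. The scene, the textures, and the `render` helper are all hypothetical simplifications and not the paper's renderer.

```python
import numpy as np

H, W = 8, 12  # toy image size


def rot_y(deg):
    """Rotation about the vertical (y) axis by `deg` degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Object-space normals of the two faces; the camera sits on the +z side.
FACES = {"front": np.array([0.0, 0.0, 1.0]),
         "back": np.array([0.0, 0.0, -1.0])}

# Made-up textures: the front carries a left-to-right gradient, the back is grey.
TEXTURES = {"front": np.tile(np.linspace(0.0, 1.0, W), (H, 1)),
            "back": np.full((H, W), 0.5)}


def render(yaw_deg):
    """Return (normal_map, rgb) for a face-on view; only meant for 0 and 180 degrees."""
    R = rot_y(yaw_deg)
    # The visible face is the one whose rotated normal points toward the camera.
    name = max(FACES, key=lambda k: (R @ FACES[k])[2])
    n_cam = R @ FACES[name]
    normal_map = np.broadcast_to(n_cam, (H, W, 3)).copy()
    return normal_map, TEXTURES[name]


n0, rgb0 = render(0.0)
n180, rgb180 = render(180.0)
print("normal maps match:", np.allclose(n0, n180))
print("RGB renders match:", np.allclose(rgb0, rgb180))
```

In this toy case the two normal maps coincide while the RGB renderings differ, which is the sense in which a normal-based condition cannot tell the front of a symmetric object from its back.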

Original abstract

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DIRECT adds explicit 3D pose control to diffusion object insertion by splitting conditions into separate appearance, geometry-proxy, and context pathways, but the abstract gives no metrics or ablations to show the split actually works.

The paper's main move is to take the usual diffusion inpainting setup for pasting a reference object into a scene and split the conditioning into three streams: one carrying the reference appearance, one carrying geometry from a user-adjusted 3D proxy, and one carrying background context. These are injected through separate pathways so the model can keep the object's look, follow the new pose, and blend with the scene. That decomposition is the concrete difference from the 2D-only methods cited in the abstract. They also describe an automated pipeline for building more varied training pairs, which is a practical detail that could help data quality.

The architecture description is clear enough on paper. The claim is that keeping the signals apart prevents the usual feature mixing that happens when everything is jammed into one conditioning channel.

The soft spot is the missing evidence. The abstract states that the method outperforms prior work on controllability and quality, yet it supplies no numbers, no ablation results, and no details on how pose accuracy or visual fidelity were measured. Without those, it is impossible to tell whether the separate pathways deliver independent control or whether the gains come from the data pipeline or training tricks. A related worry is that the signals may still mix inside the shared UNet layers, which is worth checking against the actual implementation: if there is no extra loss or normalization to enforce separation, the observed behavior could be data-driven rather than architecture-driven.

This is the sort of paper that would interest people building controllable editing tools inside diffusion models. A reader who wants to add pose knobs to an existing inpainting pipeline could pull the three-component breakdown and test it directly.

I would send it for peer review. The problem is well-posed and the proposed fix is straightforward to evaluate once the experiments are on the table.

Referee Report

2 major / 1 minor

Summary. The paper proposes DIRECT, a diffusion-based framework for object insertion that decomposes conditions into appearance guidance from the reference, geometry guidance from a user-adjusted 3D proxy, and context guidance from the target scene. These are injected via separate pathways into the model to avoid feature entanglement, enabling simultaneous preservation of appearance, adherence to specified 3D pose, and adaptation to the background. An automated data construction pipeline is introduced to enhance training data, and experiments claim superior geometric controllability and visual quality over prior methods.

Significance. If the disentanglement via separate pathways holds and is validated quantitatively, the work would advance controllable object insertion beyond 2D inpainting, offering practical 3D pose manipulation useful for scene editing and AR applications. The data pipeline could also support future reproducibility in diffusion-based editing tasks.

major comments (2)

[Abstract] The claim of outperformance in geometric controllability and visual quality is stated without any quantitative metrics, ablation studies, or implementation details, preventing assessment of the central claims about independent control and quality gains.
[§3, method description] The assertion that routing appearance, geometry-from-3D-proxy, and context through separate pathways avoids feature entanglement lacks any explicit mechanism (e.g., orthogonal losses, pathway-specific normalization, or attention isolation) to prevent mixing in the shared UNet backbone and cross-attention layers; this makes the independent control claim vulnerable to the possibility that results stem from data biases rather than the decomposition.
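To make the "attention isolation" option concrete, here is a minimal sketch of one form it could take: a boolean attention mask in which image tokens may attend to every stream, while each condition stream attends only to itself. This is a hypothetical construction for illustration; nothing in the material reviewed here indicates that the paper uses such a mask, and the function names and token counts are invented.

```python
import numpy as np


def isolation_mask(n_img, n_app, n_geo, n_ctx):
    """Boolean mask of shape (N, N); True means the query may attend to the key."""
    sizes = [n_img, n_app, n_geo, n_ctx]  # stream 0 is the image stream
    labels = np.concatenate([np.full(n, i) for i, n in enumerate(sizes)])
    same_stream = labels[:, None] == labels[None, :]
    query_is_image = (labels == 0)[:, None]
    return same_stream | query_is_image


def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed positions forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
mask = isolation_mask(n_img=4, n_app=2, n_geo=2, n_ctx=1)
attn = masked_softmax(rng.normal(size=mask.shape), mask)

# Appearance queries (rows 4-5) put no weight on geometry keys (columns 6-7).
print(attn[4:6, 6:8])
```

With a mask of this kind, the condition streams cannot read from one another directly, although information can still reach the image tokens from all three. Whether that is enough to count as disentanglement is exactly what an ablation would need to show.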

minor comments (1)

[Abstract] The abstract mentions an automated data construction pipeline but provides no details on its steps or how it improves diversity/quality; this should be expanded in the main text for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of our claims.

Referee: [Abstract] The claim of outperformance in geometric controllability and visual quality is stated without any quantitative metrics, ablation studies, or implementation details, preventing assessment of the central claims about independent control and quality gains.

Authors: The abstract is intentionally concise, while the full manuscript provides quantitative metrics, ablations, and implementation details in the experiments section. To better support the claims within the abstract itself, we will revise it to include brief references to key quantitative results on geometric controllability and visual quality. Revision: yes.

Referee: [§3, method description] The assertion that routing appearance, geometry-from-3D-proxy, and context through separate pathways avoids feature entanglement lacks any explicit mechanism (e.g., orthogonal losses, pathway-specific normalization, or attention isolation) to prevent mixing in the shared UNet backbone and cross-attention layers; this makes the independent control claim vulnerable to the possibility that results stem from data biases rather than the decomposition.

Authors: The decomposition is implemented via distinct conditioning pathways for each guidance type into the shared UNet. We acknowledge that the current description does not include additional explicit mechanisms such as orthogonal losses to further enforce separation. We will revise §3 to provide a more detailed account of the injection process and add an ablation comparing separate versus joint conditioning to empirically support that the observed control arises from the decomposition. Revision: yes.

Circularity Check

0 steps flagged

No circularity: architectural decomposition presented without self-referential reductions or fitted predictions

The provided abstract and description contain no equations, fitted parameters, or self-citations that bear on the central claim. The method is described as a decomposition into three guidance components injected via separate pathways; this is an architectural proposal whose validity is asserted to be shown by experiments rather than derived by construction from its own inputs. No load-bearing step reduces to a self-definition, renamed known result, or author-prior ansatz. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard diffusion-model assumptions not detailed here.

Reference graph

Works this paper leans on

53 extracted references · 2 canonical work pages · 1 internal anchor

[1] Paint by example: Exemplar-based image editing with diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[2] Learning transferable visual models from natural language supervision. International Conference on Machine Learning.
[3] Xiang, Jianfeng and Lv, Zelong and Xu, Sicheng and Deng, Yu and Wang, Ruicheng and Zhang, Bowen and Chen, Dong and Tong, Xin and Yang, Jiaolong. Structured
[4] The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[5] Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[6] Flow Matching for Generative Modeling. International Conference on Learning Representations.
[7] Scaling rectified flow transformers for high-resolution image synthesis. International Conference on Machine Learning.
[8] Objectstitch: Object compositing with diffusion model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[9] Imprint: Generative object compositing by learning identity-preserving representation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[10] Anydoor: Zero-shot object-level image customization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[11] Song, Wensong and Jiang, Hong and Yang, Zongxing and Cheng, Zheqiao and Quan, Ruijie and Yang, Yi. Insert Anything: Image insertion via in-context editing in
[12] Leftrefill: Filling right canvas based on left reference through generalized text-to-image diffusion model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[13] Michel, Oscar and Bhattad, Anand and VanderBilt, Eli and Krishna, Ranjay and Kembhavi, Aniruddha and Gupta, Tanmay. Object 3dit: Language-guided
[14] Wu, Ziyi and Rubanova, Yulia and Kabra, Rishabh and Hudson, Drew A and Gilitschenski, Igor and Aytar, Yusuf and Van Steenkiste, Sjoerd and Allen, Kelsey R and Kipf, Thomas. Neural assets:
[15] Pandey, Karran and Guerrero, Paul and Gadelha, Matheus and Hold-Geoffroy, Yannick and Singh, Karan and Mitra, Niloy J. Diffusion handles enabling
[16] Geodiffuser: Geometry-based image editing with diffusion models. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[17] Yenphraphai, Jiraphon and Pan, Xichen and Liu, Sainan and Panozzo, Daniele and Xie, Saining. Image sculpting: Precise object editing with
[18] Zerocomp: Zero-shot object compositing from image intrinsics via diffusion. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[19] Ge, Yunhao and Yu, Hong-Xing and Zhao, Cheng and Guo, Yuliang and Huang, Xinyu and Ren, Liu and Itti, Laurent and Wu, Jiajun.
[20] 2024.
[21] Bilateral Reference for High-Resolution Dichotomous Image Segmentation. CAAI Artificial Intelligence Research.
[22] Yi, Brent and Kim, Chung Min and Kerr, Justin and Wu, Gina and Feng, Rebecca and Zhang, Anthony and Kulhanek, Jonas and Choi, Hongsuk and Ma, Yi and Tancik, Matthew and Kanazawa, Angjoo. Viser: Imperative, web-based
[23] Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
[24] Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and others.
[25] Mvimgnet: A large-scale dataset of multi-view images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[26] Shuai Bai and Yuxuan Cai and Ruizhe Chen and Keqin Chen and Xionghui Chen and Zesen Cheng and Lianghao Deng and Wei Ding and Chang Gao and Chunjiang Ge and Wenbin Ge and Zhifang Guo and Qidong Huang and Jie Huang and Fei Huang and Binyuan Hui and Shutong Jiang and Zhaohai Li and Mingsheng Li and Mei Li and Kaixin Li and Zicheng Lin and Junyang Lin and Xue...
[27] Carion, Nicolas and Gustafson, Laura and Hu, Yuan-Ting and Debnath, Shoubhik and Hu, Ronghang and Suris, Didac and Ryali, Chaitanya and Alwala, Kalyan Vasudev and Khedr, Haitham and Huang, Andrew and others.
[28] Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[29] High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[30] Poole, Ben and Jain, Ajay and Barron, Jonathan T and Mildenhall, Ben.
[31] Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi.
[32] Mildenhall, Ben and Srinivasan, Pratul P and Tancik, Matthew and Barron, Jonathan T and Ramamoorthi, Ravi and Ng, Ren. 2021.
[33] Hong, Yicong and Zhang, Kai and Gu, Jiuxiang and Bi, Sai and Zhou, Yang and Liu, Difan and Liu, Feng and Sunkavalli, Kalyan and Bui, Trung and Tan, Hao.
[34] Tang, Jiaxiang and Chen, Zhaoxi and Chen, Xiaokang and Wang, Tengfei and Zeng, Gang and Liu, Ziwei. Lgm: Large multi-view gaussian model for high-resolution
[35] Lan, Yushi and Zhou, Shangchen and Lyu, Zhaoyang and Hong, Fangzhou and Yang, Shuai and Dai, Bo and Pan, Xingang and Loy, Chen Change. Gaussiananything: Interactive point cloud flow matching for
[36] Lai, Zeqiang and Zhao, Yunfei and Liu, Haolin and Zhao, Zibo and Lin, Qingxiang and Shi, Huiwen and Yang, Xianghui and Yang, Mingxin and Yang, Shuhui and Feng, Yifei and others.
[37] Repositioning the Subject within Image. 2024.
[38] Segment anything. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[39] Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David.
[40] Wu, Chenfei and Li, Jiahao and Zhou, Jingren and Lin, Junyang and Gao, Kaiyuan and Yan, Kun and Yin, Sheng-ming and Bai, Shuai and Xu, Xiao and Chen, Yilei and others.
[41] Reizenstein, Jeremy and Shapovalov, Roman and Henzler, Philipp and Sbordone, Luca and Labatut, Patrick and Novotny, David. Common Objects in
[42] Cheng, Yen-Chi and Singh, Krishna Kumar and Yoon, Jae Shin and Schwing, Alexander and Gui, Liang-Yan and Gadelha, Matheus and Guerrero, Paul and Zhao, Nanxuan.
[43] Liu, Yuhao and Wang, Tengfei and Liu, Fang and Wang, Zhenwei and Lau, Rynson WH. Shape-for-motion: Precise and consistent video editing with
[44] Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo
[45] Classifier-Free Diffusion Guidance. Neural Information Processing Systems Workshop on Deep Generative Models and Downstream Applications.
[46] Ominicontrol: Minimal and universal control for diffusion transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[47] Easycontrol: Adding efficient and flexible control for diffusion transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[48] The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment. arXiv preprint arXiv:2511.20614.
[49] Nano Banana Pro.
[50] Leroy, Vincent and Cabon, Yohann and Revaud, J. Grounding image matching in. European Conference on Computer Vision.
[51] Wang, Jianyi and Chan, Kelvin CK and Loy, Chen Change. Exploring
[52] Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels. arXiv preprint arXiv:2312.17090.
[53] Accommodation in computer vision. 1971.


Geodiffuser: Geometry-based image editing with diffusion models , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[17] [17]

Image sculpting: Precise object editing with

Yenphraphai, Jiraphon and Pan, Xichen and Liu, Sainan and Panozzo, Daniele and Xie, Saining , booktitle=. Image sculpting: Precise object editing with

[18] [18]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Zerocomp: Zero-shot object compositing from image intrinsics via diffusion , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[19] [19]

Ge, Yunhao and Yu, Hong-Xing and Zhao, Cheng and Guo, Yuliang and Huang, Xinyu and Ren, Liu and Itti, Laurent and Wu, Jiajun , journal=

[20] [20]

2024 , howpublished=

2024

[21] [21]

CAAI Artificial Intelligence Research , year=

Bilateral Reference for High-Resolution Dichotomous Image Segmentation , author=. CAAI Artificial Intelligence Research , year=

[22] [22]

Viser: Imperative, web-based

Yi, Brent and Kim, Chung Min and Kerr, Justin and Wu, Gina and Feng, Rebecca and Zhang, Anthony and Kulhanek, Jonas and Choi, Hongsuk and Ma, Yi and Tancik, Matthew and Kanazawa, Angjoo , journal=. Viser: Imperative, web-based

[23] [23]

Neurocomputing , volume=

Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

2024

[24] [24]

Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and others , journal=

[25] [25]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mvimgnet: A large-scale dataset of multi-view images , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[26] [26]

Shuai Bai and Yuxuan Cai and Ruizhe Chen and Keqin Chen and Xionghui Chen and Zesen Cheng and Lianghao Deng and Wei Ding and Chang Gao and Chunjiang Ge and Wenbin Ge and Zhifang Guo and Qidong Huang and Jie Huang and Fei Huang and Binyuan Hui and Shutong Jiang and Zhaohai Li and Mingsheng Li and Mei Li and Kaixin Li and Zicheng Lin and Junyang Lin and Xue...

[27] [27]

Carion, Nicolas and Gustafson, Laura and Hu, Yuan-Ting and Debnath, Shoubhik and Hu, Ronghang and Suris, Didac and Ryali, Chaitanya and Alwala, Kalyan Vasudev and Khedr, Haitham and Huang, Andrew and others , journal=

[28] [28]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[29] [29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[30] [30]

Poole, Ben and Jain, Ajay and Barron, Jonathan T and Mildenhall, Ben , booktitle=

[31] [31]

Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi , booktitle=

[32] [32]

2021 , publisher=

Mildenhall, Ben and Srinivasan, Pratul P and Tancik, Matthew and Barron, Jonathan T and Ramamoorthi, Ravi and Ng, Ren , journal=. 2021 , publisher=

2021

[33] [33]

Hong, Yicong and Zhang, Kai and Gu, Jiuxiang and Bi, Sai and Zhou, Yang and Liu, Difan and Liu, Feng and Sunkavalli, Kalyan and Bui, Trung and Tan, Hao , booktitle=

[34] [34]

Lgm: Large multi-view gaussian model for high-resolution

Tang, Jiaxiang and Chen, Zhaoxi and Chen, Xiaokang and Wang, Tengfei and Zeng, Gang and Liu, Ziwei , booktitle=. Lgm: Large multi-view gaussian model for high-resolution

[35] [35]

Gaussiananything: Interactive point cloud flow matching for

Lan, Yushi and Zhou, Shangchen and Lyu, Zhaoyang and Hong, Fangzhou and Yang, Shuai and Dai, Bo and Pan, Xingang and Loy, Chen Change , booktitle=. Gaussiananything: Interactive point cloud flow matching for

[36] [36]

Lai, Zeqiang and Zhao, Yunfei and Liu, Haolin and Zhao, Zibo and Lin, Qingxiang and Shi, Huiwen and Yang, Xianghui and Yang, Mingxin and Yang, Shuhui and Feng, Yifei and others , journal=

[37] [37]

2024 , journal=

Repositioning the Subject within Image , author=. 2024 , journal=

2024

[38] [38]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Segment anything , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[39] [39]

Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David , booktitle=

[40] [40]

Wu, Chenfei and Li, Jiahao and Zhou, Jingren and Lin, Junyang and Gao, Kaiyuan and Yan, Kun and Yin, Sheng-ming and Bai, Shuai and Xu, Xiao and Chen, Yilei and others , journal=

[41] [41]

Common Objects in

Reizenstein, Jeremy and Shapovalov, Roman and Henzler, Philipp and Sbordone, Luca and Labatut, Patrick and Novotny, David , booktitle=. Common Objects in

[42] [42]

Cheng, Yen-Chi and Singh, Krishna Kumar and Yoon, Jae Shin and Schwing, Alexander and Gui, Liang-Yan and Gadelha, Matheus and Guerrero, Paul and Zhao, Nanxuan , booktitle=

[43] [43]

Shape-for-motion: Precise and consistent video editing with

Liu, Yuhao and Wang, Tengfei and Liu, Fang and Wang, Zhenwei and Lau, Rynson WH , booktitle=. Shape-for-motion: Precise and consistent video editing with

[44] [44]

Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo

[45] [45]

Neural Information Processing Systems Workshop on Deep Generative Models and Downstream Applications , year=

Classifier-Free Diffusion Guidance , author=. Neural Information Processing Systems Workshop on Deep Generative Models and Downstream Applications , year=

[46] [46]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Ominicontrol: Minimal and universal control for diffusion transformer , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[47] [47]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Easycontrol: Adding efficient and flexible control for diffusion transformer , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[48] [48]

arXiv preprint arXiv:2511.20614 , year=

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment , author=. arXiv preprint arXiv:2511.20614 , year=

work page arXiv

[49] [49]

Nano Banana Pro , howpublished =

[50] [50]

Grounding image matching in

Leroy, Vincent and Cabon, Yohann and Revaud, J. Grounding image matching in. European Conference on Computer Vision , pages=

[51] [51]

Exploring

Wang, Jianyi and Chan, Kelvin CK and Loy, Chen Change , booktitle=. Exploring

[52] [52]

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels , author=. arXiv preprint arXiv:2312.17090 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

1971 , publisher=

Accommodation in computer vision , author=. 1971 , publisher=

1971