InsertAnywhere: Geometrically Grounded and Optics-Aware Video Object Insertion

Dongjin Kim; Hoiyeong Jin; Huijin Choi; Hyeonji Kim; Hyojin Jang; Jaegul Choo; Jeongho Kim; Junha Hyung; Kinam Kim

arxiv: 2512.17504 · v2 · pith:V2E7VD6Lnew · submitted 2025-12-19 · 💻 cs.CV · cs.AI

keywords: object, video, geometrically, insertanywhere, optical, optics-aware, effects, framework

Abstract

Recent advances in diffusion models have enabled impressive video editing capabilities, yet production-grade Video Object Insertion (VOI) remains challenging due to inadequate 4D scene understanding and a lack of proper optical interactions, such as shadows and reflections. To address these limitations, we present InsertAnywhere, a comprehensive VOI framework that achieves geometrically grounded object placement and optics-aware video synthesis. Our approach first leverages a 4D-aware mask generation module that allows users to anchor an object's 3D pose in a single frame. The framework automatically propagates this placement across the video, accurately handling local scene dynamics and occlusions. To synthesize realistic physical lighting interactions, we introduce Optics-Aware Representation Alignment, a novel strategy that utilizes an extended mask to guide feature extraction, enabling optical effects to seamlessly extend beyond the inserted object's boundary. Finally, to overcome the lack of training data for such phenomena, we construct and open-source ROSE++, a specialized quadruplet dataset tailored for the supervised learning of optical effects. Extensive experiments demonstrate that InsertAnywhere produces geometrically plausible and photometrically realistic insertions in complex real-world scenarios, significantly outperforming existing research and commercial generative tools.
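The abstract does not say how the extended mask is constructed. As a rough illustration only, the sketch below assumes it can be approximated by dilating the object's binary mask by a fixed pixel radius, so that the region where shadows or reflections may fall is covered in addition to the object itself. The names `extend_mask` and `effect_band` and the `radius` parameter are hypothetical and do not come from the paper.

```python
from typing import List

Mask = List[List[bool]]  # row-major binary mask for one frame


def extend_mask(mask: Mask, radius: int) -> Mask:
    """Dilate a binary mask with a square (2*radius+1) neighbourhood.

    Assumption: plain dilation stands in for whatever extension rule
    the paper actually uses; this is not the authors' method.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Mark every pixel within `radius` of an object pixel,
            # clamped to the frame borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = True
    return out


def effect_band(mask: Mask, radius: int) -> Mask:
    """Pixels in the extended mask that lie outside the object mask."""
    extended = extend_mask(mask, radius)
    return [
        [ext and not obj for ext, obj in zip(ext_row, obj_row)]
        for ext_row, obj_row in zip(extended, mask)
    ]
```

Under this assumption, the band returned by `effect_band` is the area outside the object's boundary where optical effects could be synthesized. A real system would likely shape that region from scene geometry and lighting rather than a fixed radius.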

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 7.0

Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 7.0

Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.
AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance
cs.GR 2026-05 unverdicted novelty 6.0

AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and text...
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 6.0

Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Eulerian motion guidance with bidirectional geometric consistency to improve training speed and temporal quality in diffusion-based image animation.
Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework
cs.CV 2026-05 unverdicted novelty 5.0

Smart-Insertion-V is a dual-stream closed-loop framework with Dual-World-View RoPE and a Decoupled Guidance Module that inserts reference objects into videos while achieving stylistic harmony despite domain gaps.
Controllable Video Object Insertion via Multiview Priors
cs.CV 2026-04 unverdicted novelty 5.0

A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.