arxiv: 2603.03143 · v2 · pith:CFPRAWPFnew · submitted 2026-03-03 · 💻 cs.CV · cs.AI

Edit in 2D, Verify in 3D: Reinforcement Learning for Multi-view Consistent Scene Editing

classification 💻 cs.CV cs.AI
keywords editing · multi-view · consistency · priors · challenging · consistent · 3D-consistent · data

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, multi-view consistency remains challenging in edited results, and the extreme scarcity of paired 3D-consistent editing data makes supervised fine-tuning (SFT) impractical, despite its effectiveness for editing tasks. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images into it, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
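The reward idea in the abstract (confidence maps and pose estimation errors used as RL reward signals) can be illustrated with a minimal sketch. This is not the authors' implementation and does not call VGGT: the function name `consistency_reward`, the weights `w_conf` and `w_pose`, the nested-list data layout, and the choice of a simple weighted difference of means are all assumptions made for illustration.

```python
from statistics import fmean


def consistency_reward(confidence_maps, pose_errors, w_conf=1.0, w_pose=1.0):
    """Hypothetical scalar reward for one set of edited views.

    confidence_maps: one 2D grid (list of rows of floats) per edited view,
        standing in for the per-pixel confidence a 3D model might output.
    pose_errors: one non-negative float per view, standing in for the error
        between the estimated and the reference camera pose.
    The weighting scheme is an assumption, not taken from the paper.
    """
    if not confidence_maps or not pose_errors:
        raise ValueError("need at least one confidence map and one pose error")

    # Average confidence inside each view, then across views.
    per_view_conf = [fmean(v for row in conf for v in row) for conf in confidence_maps]
    mean_conf = fmean(per_view_conf)

    # Average pose error across views; larger error should lower the reward.
    mean_pose_err = fmean(pose_errors)

    return w_conf * mean_conf - w_pose * mean_pose_err


# Toy inputs: two 2x2 "confidence maps" and two pose errors per edit.
consistent = consistency_reward(
    [[[0.9, 0.8], [0.9, 0.8]], [[0.85, 0.85], [0.85, 0.85]]],
    [0.05, 0.05],
)
inconsistent = consistency_reward(
    [[[0.3, 0.2], [0.4, 0.3]], [[0.2, 0.2], [0.3, 0.3]]],
    [0.4, 0.6],
)
```

Under these assumptions, an edit whose views receive high confidence and low pose error scores higher than one whose views disagree, which is the ordering an RL optimizer would need in order to favor 3D-consistent edits.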

This paper has not been read by Pith yet.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

    cs.CV 2026-06 unverdicted novelty 7.0

    GeM-NR performs multi-view consistent nonrigid editing by aligning depth-derived point clouds between edited and unedited scenes then refining projections conditioned on the original query view.

  2. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  3. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly ...

  4. IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

    cs.CV 2026-05 unverdicted novelty 5.0

    IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localizatio...