pith. machine review for the scientific record.

arxiv: 2604.17801 · v2 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene editing · view consistency · text-driven editing · cross-view dependencies · dual-path mechanism · structural correspondence · semantic continuity · multi-view dataset

The pith

Recasting 3D scene editing as joint distribution modeling across viewpoints with a dual-path consistency mechanism produces precise multi-view consistent text-driven edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome cross-view inconsistency, the main bottleneck in text-driven 3D scene editing pipelines that render images, edit them in 2D, and optimize the 3D representation. It establishes that the task requires explicit modeling of a joint distribution across viewpoints and that separating structural correspondence from semantic continuity into dedicated paths, plus training on a new paired multi-view dataset, yields more reliable results than methods relying on inference-time synchronization. A sympathetic reader would care because inconsistent perspectives limit the usefulness of 3D editing in design, visualization, and immersive applications where mismatched views break coherence.
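The render-edit-optimize pipeline the paper critiques can be summarized in a few lines. The sketch below is an editorial illustration, not the paper's code: render, edit_2d, and fit_3d are hypothetical stand-ins for a NeRF/3DGS renderer, a 2D text-driven editor, and the 3D optimization step. The point it makes is that each view is edited with no knowledge of the others, which is where the inconsistency enters.

    from typing import Callable, Sequence

    def render_edit_optimize(scene, cameras: Sequence, prompt: str,
                             render: Callable, edit_2d: Callable,
                             fit_3d: Callable):
        """Skeleton of the standard pipeline; all three callables are
        hypothetical stand-ins, not the paper's components."""
        # Render the current 3D representation from every viewpoint.
        rendered = [render(scene, cam) for cam in cameras]
        # Edit each rendered image independently with a 2D text-driven editor.
        # No cross-view dependency is modeled at this step, so edits can drift
        # apart in structure and semantics from view to view.
        edited = [edit_2d(img, prompt) for img in rendered]
        # Optimize the 3D representation to agree with the (possibly
        # inconsistent) per-view edits.
        return fit_3d(scene, cameras, edited)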

Core claim

We recast multi-view consistent 3D editing from a distributional perspective: 3D scene editing essentially requires a joint distribution modeling across viewpoints. Based on this insight, we propose a view-consistent 3D editing framework that explicitly introduces cross-view dependencies into the editing process. Furthermore, motivated by the observation that structural correspondence and semantic continuity rely on different cross-view cues, we introduce a dual-path consistency mechanism consisting of projection-guided structural guidance and patch-level semantic propagation for effective cross-view editing. Further, we construct a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes.

What carries the argument

dual-path consistency mechanism of projection-guided structural guidance and patch-level semantic propagation that handles different cross-view cues separately
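To make the two paths concrete, here is a speculative sketch of what each path could compute; it is an editorial reading, not the paper's formulation. Assumptions: per-view depth maps and camera matrices are available for the projection path, and patch embeddings from a DINO-style encoder feed the semantic path.

    import torch
    import torch.nn.functional as F

    def warp_to_target(ref_feat, tgt_depth, K, T_tgt_to_ref):
        """Projection path (assumed form): backward-warp features of an edited
        reference view into the target view using the target's depth and the
        relative camera pose, anchoring edits to shared 3D structure.
        ref_feat: (C, H, W), tgt_depth: (H, W), K: (3, 3), T_tgt_to_ref: (4, 4)."""
        C, H, W = ref_feat.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3, H, W)
        rays = torch.linalg.inv(K) @ pix.reshape(3, -1)                   # unit-depth rays
        pts_tgt = rays * tgt_depth.reshape(1, -1)                         # target-frame 3D points
        pts_h = torch.cat([pts_tgt, torch.ones(1, H * W)], dim=0)         # homogeneous coords
        pts_ref = (T_tgt_to_ref @ pts_h)[:3]                              # reference-frame points
        uv = K @ pts_ref
        uv = uv[:2] / uv[2:].clamp(min=1e-6)                              # pixel coords in reference view
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                            uv[1] / (H - 1) * 2 - 1], dim=-1).reshape(1, H, W, 2)
        return F.grid_sample(ref_feat[None], grid, align_corners=True)[0]

    def propagate_patch_semantics(tgt_patches, ref_patches, temperature=0.07):
        """Semantic path (assumed form): soft patch-level attention that pulls
        the reference view's edit semantics into the target view without
        requiring exact geometric overlap.
        tgt_patches: (Nt, D), ref_patches: (Nr, D) patch embeddings."""
        q = F.normalize(tgt_patches, dim=-1)
        k = F.normalize(ref_patches, dim=-1)
        attn = F.softmax(q @ k.T / temperature, dim=-1)                   # (Nt, Nr)
        return attn @ ref_patches                                         # (Nt, D)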

If this is right

  • Superior editing performance with precise and consistent views for complex scenes compared to prior render-edit-optimize methods.
  • Reduced reliance on inference-time synchronization improves robustness and generalization.
  • Dedicated paths allow structural correspondence and semantic continuity to be maintained using their respective cross-view cues.
  • The paired multi-view editing dataset supplies reliable supervision for learning cross-view consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The distributional framing could be tested on other multi-view tasks such as consistent novel-view synthesis from edits.
  • The separation of structural and semantic paths may suggest similar decompositions for consistency problems in video or 4D generation.
  • If the paired dataset generalizes, it could support supervised training of additional cross-view editing models beyond the current framework.

Load-bearing premise

The dual-path mechanism of projection-guided structural guidance and patch-level semantic propagation, combined with supervision from the new paired dataset, will produce reliable cross-view consistency without limiting generalization or introducing new artifacts in real-world scenes.

What would settle it

Edited multi-view images of a complex real-world scene that exhibit visible structural mismatches or semantic drift between viewpoints would show that the dual-path approach has not delivered the claimed consistency.
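One way to operationalize that test is sketched below, under the same assumptions as the dual-path sketch above (known depth and poses, warp_to_target as defined there): warp the edit of view A into view B and score disagreement only where view B has valid depth. High error on real scenes would be the failure signature described above; the occlusion handling here is deliberately crude, and this is an editorial probe rather than the paper's metric.

    def cross_view_edit_error(edit_a, edit_b, depth_b, K, T_b_to_a):
        """Hedged consistency probe, not the paper's evaluation protocol.
        edit_a, edit_b: (3, H, W) edited renders of views A and B;
        depth_b: (H, W) depth from view B; K: (3, 3); T_b_to_a: (4, 4) B-to-A pose."""
        warped_a = warp_to_target(edit_a, depth_b, K, T_b_to_a)   # A's edit re-seen from B
        valid = (depth_b > 0).float()                             # crude visibility mask (no occlusion test)
        err = (warped_a - edit_b).abs().mean(dim=0)               # per-pixel L1 disagreement
        return (err * valid).sum() / valid.sum().clamp(min=1.0)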

Figures

Figures reproduced from arXiv: 2604.17801 by Bi'an Du, Junyi Yao, Pufan Li, Shenghe Zheng, Wei Hu.

Figure 1: (a) Cross-view discrepancies in per-view edits, high…
Figure 2: Overview of the proposed framework. Left: two-stage training under a unified architecture, where Stage 1 trains the…
Figure 3: Consistency-aware multi-view editing dataset construction. Left: view pairs with limited viewpoint difference are…
Figure 4: Qualitative comparison with state-of-the-art methods under various editing prompts. The leftmost column shows source…
Figure 5: Visual comparison between Direct Editing and our…
Figure 6: Visual comparison between w/o structural guidance and…
Figure 7: Visual comparison between w/o semantic propagation…
read the original abstract

Text-driven 3D scene editing has recently attracted increasing attention. Most existing methods follow a render-edit-optimize pipeline, where multi-view images are rendered from a 3D scene, edited with 2D image editors, and then used to optimize the underlying 3D representation. However, cross-view inconsistency remains a major bottleneck. Although recent methods introduce geometric cues, cross-view interactions, or video priors to mitigate this issue, they still largely rely on inference-time synchronization and thus remain limited in robustness and generalization. In this work, we recast multi-view consistent 3D editing from a distributional perspective: 3D scene editing essentially requires a joint distribution modeling across viewpoints. Based on this insight, we propose a view-consistent 3D editing framework that explicitly introduces cross-view dependencies into the editing process. Furthermore, motivated by the observation that structural correspondence and semantic continuity rely on different cross-view cues, we introduce a dual-path consistency mechanism consisting of projection-guided structural guidance and patch-level semantic propagation for effective cross-view editing. Further, we construct a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes. Extensive experiments demonstrate that our method achieves superior editing performance with precise and consistent views for complex scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript recasts text-driven 3D scene editing as joint distribution modeling across viewpoints to overcome cross-view inconsistency in the standard render-edit-optimize pipeline. It introduces a dual-path consistency mechanism (projection-guided structural guidance combined with patch-level semantic propagation) and constructs a paired multi-view editing dataset to provide supervision for learning cross-view dependencies. The central claim is that this framework delivers superior editing performance with precise and consistent multi-view results on complex scenes.

Significance. If the experimental claims hold, the distributional perspective and dual-path mechanism could provide a more robust, training-based alternative to inference-time synchronization techniques for enforcing view consistency. The new paired dataset would also supply a concrete resource for future work on supervised cross-view editing, with potential impact on applications requiring coherent 3D content generation.

major comments (2)
  1. [Method and Experiments] The central performance claim depends on the dual-path mechanism producing reliable cross-view consistency that generalizes beyond the training distribution. The manuscript provides no explicit cross-dataset validation or testing on real-world captures outside the custom paired set (see Method description and Experiments section), which directly bears on whether the distributional insight mitigates inconsistency in complex scenes or merely fits the training data.
  2. [Abstract and Experiments] The abstract asserts that 'extensive experiments demonstrate superior editing performance with precise and consistent views,' yet the provided text contains no quantitative metrics, baseline comparisons, ablation results, or error analysis to support this. Without these details, the superiority and consistency claims cannot be verified as load-bearing evidence.
minor comments (2)
  1. [Title] The title contains a spelling error ('Correspondense' should be 'Correspondence').
  2. [Method] Notation for the dual-path components (projection-guided structural guidance and patch-level semantic propagation) is introduced at a high level; explicit equations or pseudocode would clarify how the two paths interact during training and inference.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our distributional perspective and dual-path framework.

read point-by-point responses
  1. Referee: [Method and Experiments] The central performance claim depends on the dual-path mechanism producing reliable cross-view consistency that generalizes beyond the training distribution. The manuscript provides no explicit cross-dataset validation or testing on real-world captures outside the custom paired set (see Method description and Experiments section), which directly bears on whether the distributional insight mitigates inconsistency in complex scenes or merely fits the training data.

    Authors: We appreciate the referee's emphasis on the need to demonstrate generalization. Our paired multi-view editing dataset was constructed from diverse 3D scenes and editing instructions specifically to encourage learning of cross-view dependencies that are not tied to a narrow distribution. The dual-path mechanism relies on projection-based structural cues and patch-level semantic propagation, which draw from general geometric and semantic principles. That said, we agree that explicit testing on external real-world captures would provide stronger evidence that the approach mitigates inconsistency rather than overfitting the training set. In the revised manuscript, we will add experiments on additional real-world multi-view datasets to evaluate generalization. revision: yes

  2. Referee: [Abstract and Experiments] The abstract asserts that 'extensive experiments demonstrate superior editing performance with precise and consistent views,' yet the provided text contains no quantitative metrics, baseline comparisons, ablation results, or error analysis to support this. Without these details, the superiority and consistency claims cannot be verified as load-bearing evidence.

    Authors: We apologize for any lack of prominence in the experimental details. The full manuscript contains a dedicated Experiments section that reports quantitative metrics on editing quality and multi-view consistency, direct comparisons against relevant baselines, ablation studies isolating the contributions of the structural and semantic paths, and error analysis across scene types. The abstract is written as a concise summary of these results. To make the claims more immediately verifiable, we will revise the abstract to include a brief reference to key quantitative outcomes and ensure the Experiments section is clearly signposted from the abstract. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper introduces a new view-consistent 3D editing framework by recasting the problem as joint distribution modeling across viewpoints, proposing a dual-path mechanism (projection-guided structural guidance plus patch-level semantic propagation), and constructing a new paired multi-view editing dataset for supervision. No load-bearing steps reduce by construction to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. The abstract and method framing present an independent proposal whose claims rest on experimental validation rather than tautological equivalence to inputs. This matches the reader's assessment of no evident circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the framework appears to rely on standard neural network components for 3D representations and editing without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5531 in / 1138 out tokens · 60261 ms · 2026-05-10T05:44:54.651388+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 12 canonical work pages · 7 internal anchors
