
arxiv: 2605.16859 · v1 · pith:CCEIAIEUnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

Pith reviewed 2026-05-19 20:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: 3D change detection · point cloud registration · training-free pipeline · multi-temporal reconstruction · static background isolation · Sim(3) alignment · visual geometry foundation model

The pith

VGGT-CD registers multi-temporal point clouds in two steps: it first aligns sparse keyframes into one metric space, then purifies the dense reconstructions so that only the static background drives the final alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training-free pipeline that turns independent per-epoch reconstructions from a visual-geometry model into aligned 3D change maps. It first performs joint sparse-keyframe inference to remove scale ambiguity and produce an initial Sim(3) prior. The second stage then removes points belonging to physical changes so that a closed-form centroid alignment on the remaining static correspondences can refine translation while locking scale and rotation. A residual self-check is used to ensure the refinement step never worsens the initial prior. On an 11-scene benchmark the method cuts absolute trajectory error by 44 percent outdoors and 59 percent indoors while running more than six times faster than prior approaches.

Core claim

By decoupling registration from dynamic interference through a coarse sparse-keyframe stage that establishes a unified metric space followed by a fine stage that isolates static-background correspondences and performs closed-form centroid alignment with a residual self-check, the pipeline produces non-degrading refinements and high-purity 3D change maps without any task-specific training.

What carries the argument

Two-stage registration: coarse sparse keyframe joint inference for an initial Sim(3) prior, followed by dense-reconstruction purification that isolates static-background correspondences for closed-form centroid alignment with residual self-check.
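The page does not say how the initial Sim(3) prior is solved. One standard closed-form option is Umeyama's least-squares method, sketched below under the assumption that corresponding 3D points (for example, camera centres of shared keyframes expressed in both frames) are available. The function name `umeyama_sim3` and its inputs are illustrative, not taken from the paper.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares Sim(3) (s, R, t) such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (assumed inputs,
    N >= 3 and not all collinear).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: force det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

With noise-free correspondences the function recovers the generating scale, rotation, and translation exactly; with noisy ones it returns the least-squares fit, which is why the fine stage still has something to refine.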

If this is right

  • Multi-view images captured at different times can be turned directly into metric 3D change maps without retraining any model.
  • Registration completes more than six times faster than prior approaches, plausibly helped by a fine stage that solves for translation in closed form rather than iteratively.
  • The residual self-check provides a mathematical guarantee that the fine stage never degrades the coarse-stage prior.
  • High-purity change maps become available for urban monitoring and autonomous driving without requiring paired training data for each new scene.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coarse-to-fine separation could be applied to other dense reconstruction models that output per-epoch point clouds.
  • If the static-background isolation step were made probabilistic, the method might extend to scenes with moving objects that occupy the same location across epochs.
  • The closed-form centroid step suggests that once scale and rotation are fixed, translation refinement reduces to a simple average of residuals on trusted points.
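The last point can be made concrete with a minimal sketch. With scale s and rotation R locked, the least-squares translation is the difference of centroids, which equals the prior translation minus the mean residual. The form of the self-check is an assumption here: the page does not specify it, so the sketch compares median residual norms, a statistic the mean-based update does not minimise by construction. The name `refine_translation` is hypothetical.

```python
import numpy as np

def refine_translation(src, dst, s, R, t_prior):
    """Refine only t on static correspondences; s and R stay locked.

    src, dst: (N, 3) corresponding static points (epoch 1 -> epoch 2).
    Returns (t, accepted); accepted is False when the prior is kept.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mapped = s * src @ R.T  # apply the locked scale and rotation

    # Residuals under the prior; their mean is the closed-form correction.
    res_prior = mapped + t_prior - dst
    t_new = t_prior - res_prior.mean(axis=0)

    # Self-check (assumed form): accept only if the median residual norm
    # does not increase; otherwise fall back to the prior.
    err_prior = np.median(np.linalg.norm(res_prior, axis=1))
    err_new = np.median(np.linalg.norm(mapped + t_new - dst, axis=1))
    if err_new <= err_prior:
        return t_new, True
    return t_prior, False
```

The fallback branch is what makes the refinement non-degrading with respect to the checked statistic: whatever the update does, the returned translation is never worse than the prior on that measure.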

Load-bearing premise

The fine stage can always separate static background points from points belonging to actual scene changes so that alignment on the remaining points improves rather than harms the initial estimate.
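How the separation is performed is not specified on this page; the simulated rebuttal only mentions residual errors under the initial prior and a threshold. The sketch below is one plausible reading, assuming nearest-neighbour association and a fixed threshold `tau`. Both choices, and the name `isolate_static`, are hypothetical.

```python
import numpy as np

def isolate_static(src, dst, s, R, t, tau):
    """Return index arrays (i, j) of putative static correspondences.

    src: (N, 3) epoch-1 points, dst: (M, 3) epoch-2 points.
    tau: residual threshold in the units of dst (hypothetical parameter).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mapped = s * src @ R.T + t  # epoch-1 points in the epoch-2 frame

    # Brute-force nearest neighbour; a KD-tree would replace this at scale.
    d2 = ((mapped[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    residual = np.sqrt(d2[np.arange(len(src)), nn])

    # Points whose nearest neighbour is far away are treated as changed.
    keep = residual < tau
    return np.flatnonzero(keep), nn[keep]
```

A rule of this kind makes the premise testable: if changed geometry lands within `tau` of unrelated static geometry, it survives the filter and biases the subsequent alignment.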

What would settle it

A test sequence in which the fine-stage isolation step leaves more than a small fraction of changed points in the static set, causing the refined translation to increase rather than decrease absolute trajectory error compared with the coarse prior.
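Such a test reduces to comparing trajectory error before and after refinement. The sketch below uses the common definition of absolute trajectory error as the RMSE of camera-position differences, assuming both trajectories are already in one frame; how the paper computes ATE is not stated on this page, and the function names are illustrative.

```python
import numpy as np

def ate_rmse(est_positions, gt_positions):
    """RMSE of camera-position errors for trajectories in a common frame."""
    est = np.asarray(est_positions, dtype=float)
    gt = np.asarray(gt_positions, dtype=float)
    return float(np.sqrt(((est - gt) ** 2).sum(axis=1).mean()))

def refinement_degrades(cams, gt, t_coarse, t_fine):
    """True if the fine-stage translation gives a higher ATE than the coarse one.

    cams: (N, 3) camera positions with the locked scale and rotation
    already applied, so only the translation differs between stages.
    """
    cams = np.asarray(cams, dtype=float)
    return ate_rmse(cams + t_fine, gt) > ate_rmse(cams + t_coarse, gt)
```

A single scene for which `refinement_degrades` returns True would be the counterexample described above, since it would show the self-check passing on the selected points while the trajectory error grows.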

Figures

Figures reproduced from arXiv: 2605.16859 by Qiang Li, Qi Wang, Songhua Li, Wei Zhang, and Yihang Wu (all Northwestern Polytechnical University, Xi'an, China).

Figure 1: Given bi-temporal multi-view images (doors closed vs. trunk open), independent reconstruction produces two point clouds in separate coordinate frames. Naive overlay without alignment exposes severe scale ambiguity and edge-flying noise, rendering the two epochs entirely incomparable (left). RANSAC + Scale-ICP fails to resolve the scale discrepancy, resulting in registration failure and false-positive-domin…

Figure 2: Overall architecture of the proposed VGGT-CD pipeline. Given unposed image sets from two temporal states (T1 and T2), our training-free system operates in a decoupled, coarse-to-fine manner. (Top) Coarse Stage: A sparse subset of keyframes undergoes joint inference to establish a unified metric space. By aligning the implicitly reconstructed camera frustums, we extract a reliable global prior, rigidly loc…

Figure 3: Visualization of predicted camera trajectories versus ground truth on representative scenes. Each subplot…

Figure 4: Qualitative comparison on representative scenes.

Figure 5: Effect of keyframe budget K on the coarse stage. (a) ATE saturates around K=5; (b) GPU memory grows super-linearly with K; (c) computation time scales linearly. Orange markers highlight our default K=5.

Effect of keyframe budget (K). We study the trade-off between the number of keyframes K used in the coarse stage and the resulting accuracy, memory cost, and computation time…
Original abstract

3D change detection from multi-view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per-epoch reconstruction encounters fundamental obstacles: unpredictable inter-epoch scale ambiguity, registration-change paradox where scene changes corrupt alignment, and pervasive edge-flying noise. To address these challenges, we present VGGT-CD, a training-free pipeline decoupling cross-temporal registration from dynamic-change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static-background correspondences. A closed-form centroid alignment refines the translation while locking scale and rotation, using a residual self-check to mathematically guarantee non-degradation. Evaluated on an 11-scene benchmark from the World Across Time dataset, VGGT-CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high-purity 3D change maps without task-specific training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VGGT-CD, a training-free pipeline for robust 3D registration in change detection tasks. It leverages the VGGT visual geometry foundation model to generate dense point clouds from unposed multi-view images. The method consists of a Coarse Stage that uses sparse keyframe joint inference to establish a unified metric space and an initial Sim(3) prior, and a Fine Stage that purifies dense reconstructions by isolating static-background correspondences, followed by a closed-form centroid alignment to refine translation while locking scale and rotation, with a residual self-check to ensure non-degradation. On an 11-scene benchmark from the World Across Time dataset, it claims to reduce Absolute Trajectory Error by 44% outdoors and 59% indoors, complete registration over 6 times faster, and produce high-purity 3D change maps without task-specific training.

Significance. If the results hold, this work is significant for providing an efficient, training-free solution to 3D change detection that avoids the pitfalls of independent per-epoch reconstructions. The use of closed-form centroid alignment and a residual self-check for mathematical non-degradation is a notable strength, enhancing reproducibility and computational efficiency. This could have practical impact in applications like urban monitoring and autonomous driving by enabling high-purity change maps from multi-view images.

major comments (2)
  1. Fine Stage: The isolation of static-background correspondences from dynamic-change interference is invoked to resolve the registration-change paradox (abstract), but no explicit algorithm, threshold, invariance property, or robust selection mechanism is specified. This is load-bearing for the claim that the subsequent closed-form centroid alignment (with scale/rotation locked and residual self-check) yields a non-degrading refinement, as the self-check operates after selection and could fail if the subset is small or biased under high change ratios.
  2. Evaluation section: The reported ATE reductions (44% outdoors, 59% indoors) on the 11-scene benchmark lack error bars, exact data-exclusion rules, and full derivation steps for the metrics. This prevents verification of the quantitative claims and the cross-scene consistency asserted in the abstract.
minor comments (1)
  1. Abstract: The claim of 'high-purity 3D change maps' is not accompanied by a definition or quantification of the purity metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: Fine Stage: The isolation of static-background correspondences from dynamic-change interference is invoked to resolve the registration-change paradox (abstract), but no explicit algorithm, threshold, invariance property, or robust selection mechanism is specified. This is load-bearing for the claim that the subsequent closed-form centroid alignment (with scale/rotation locked and residual self-check) yields a non-degrading refinement, as the self-check operates after selection and could fail if the subset is small or biased under high change ratios.

    Authors: We agree that the Fine Stage description would benefit from greater explicitness. Section 3.2 describes the purification step: residual errors computed under the initial Sim(3) prior are used to isolate static correspondences, which then feed the closed-form centroid alignment and the residual self-check. To address the concern directly, we will add an algorithm box with the precise selection procedure, the threshold criterion, and a short invariance argument (static points remain consistent under the locked scale/rotation). We will also include a brief analysis of performance under high change ratios to show the self-check remains effective even when the static subset is reduced. revision: yes

  2. Referee: Evaluation section: The reported ATE reductions (44% outdoors, 59% indoors) on the 11-scene benchmark lack error bars, exact data-exclusion rules, and full derivation steps for the metrics. This prevents verification of the quantitative claims and the cross-scene consistency asserted in the abstract.

    Authors: We accept this point. The current evaluation reports aggregate ATE reductions on the 11 scenes but does not include per-scene variance or explicit exclusion criteria. In the revision we will add error bars (standard deviation across scenes), state the exact data-exclusion rules applied, and append the metric derivation steps (including how ATE is computed from the aligned trajectories) to the evaluation section and supplementary material. These additions will make the 44% / 59% figures and the cross-scene consistency fully verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent geometric operations

Full rationale

The VGGT-CD pipeline is self-contained: the coarse stage uses sparse keyframe joint inference to produce a unified metric space and an initial Sim(3) prior, while the fine stage applies closed-form centroid alignment on isolated static correspondences with a residual self-check. These operations are defined via standard rigid-body geometry and do not reduce the reported ATE improvements or the non-degradation guarantee to any fitted parameter or self-citation within the same derivation. The isolation step is an assumption, but it does not create a definitional equivalence between inputs and outputs. The external VGGT foundation model and the benchmark evaluation provide independent grounding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that VGGT produces usable dense point clouds and that static-background points can be separated from dynamic ones without introducing new fitted parameters beyond those already present in the foundation model.

axioms (2)
  • domain assumption: VGGT rapidly produces dense point clouds from unposed images
    Invoked in the opening paragraph as the starting point for both coarse and fine stages.
  • domain assumption: Static-background correspondences can be isolated from dynamic-change interference
    Stated when the fine stage is described as purifying dense reconstructions.


