
arxiv: 2605.17472 · v2 · pith:NTWQ5SJZ · submitted 2026-05-17 · 💻 cs.CV

Weighted Reverse Convolution for Feature Upsampling

Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords feature upsampling · vision foundation models · weighted reverse convolution · Tikhonov regularization · inverse problem · dense prediction · FFT solution

The pith

Feature upsampling for vision foundation models reduces to a weighted Tikhonov-regularized least-squares problem solved via spatially adaptive reverse convolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pre-trained vision foundation models yield coarse patch features that hinder fine localization and dense prediction. The paper models upsampling as a weighted Tikhonov-regularized least-squares inverse problem in which spatially varying weights control data fidelity and regularization strength at each location. This formulation yields an efficient closed-form FFT solution that adapts reconstruction to local feature statistics and reduces over-smoothing. A reader would care because the operator integrates into lightweight self-supervised frameworks and lifts performance on segmentation, depth, video object segmentation, object discovery, and keypoint tasks while preserving computational speed.

Core claim

We formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator.
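For orientation, the objective can be written in the notation used in the referee report further down this page: K is the degradation (convolution and downsampling) operator, W the diagonal matrix of spatially varying weights, L the regularization operator, λ the regularization strength, y the coarse features and x the dense features. This is one consistent reading, not the paper's own equation. In particular, the paper says the weights also modulate the prior term, which this sketch does not show.

```latex
\hat{x} = \arg\min_{x} \; (Kx - y)^{\top} W \, (Kx - y) + \lambda \, \lVert L x \rVert_2^2
\quad\Longrightarrow\quad
\left( K^{\top} W K + \lambda L^{\top} L \right) \hat{x} = K^{\top} W y
```

Setting the gradient of the objective to zero gives the normal equations on the right, which is the system the referee's first major comment examines.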

What carries the argument

Weighted Reverse Convolution (WRC), the spatially adaptive inverse operator obtained from the weighted Tikhonov-regularized least-squares formulation that enables efficient FFT-based densification of coarse patch descriptors.

If this is right

  • Dense feature quality improves on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence.
  • The operator integrates as a drop-in module inside lightweight self-supervised densification frameworks.
  • Computational cost remains low because the solution is a closed-form FFT operation that stays fully differentiable.
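To illustrate why a closed-form FFT solution is cheap, here is a minimal sketch of the uniform-weight special case (W = I, L = I) with circular convolution. It is not WRC itself: the function name `tikhonov_fft_solve`, the kernel, and the grid size are assumptions chosen for illustration, and the sketch leaves out both the downsampling step and the spatially varying weights.

```python
import numpy as np

def tikhonov_fft_solve(y, kernel, lam):
    """Solve min_x ||k * x - y||^2 + lam * ||x||^2 for circular convolution.

    Assumes uniform weights (W = I) and L = I, so the normal equations
    are diagonal in the Fourier domain and reduce to a pointwise division.
    """
    K = np.fft.fft2(kernel, s=y.shape)   # transfer function of the kernel
    Y = np.fft.fft2(y)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

rng = np.random.default_rng(0)
x_true = rng.standard_normal((16, 16))

# Illustrative smoothing kernel whose transfer function never reaches zero
kernel = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.6, 0.1],
                   [0.0, 0.1, 0.0]])

# Circular blur applied through the FFT
y = np.real(np.fft.ifft2(np.fft.fft2(kernel, s=x_true.shape) * np.fft.fft2(x_true)))

x_hat = tikhonov_fft_solve(y, kernel, lam=1e-8)
max_err = float(np.max(np.abs(x_hat - x_true)))
```

The whole solve costs two forward FFTs, one pointwise division, and one inverse FFT, which is the efficiency the bullet above refers to. Whether the same shortcut survives spatially varying weights is the question the referee report raises.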

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weighted-regularization view could be tested on super-resolution or feature denoising tasks that also suffer from local structure loss.
  • Because the method is parameter-light and FFT-based, it may suit real-time pipelines that currently rely on learned upsampling layers.
  • Extending the weight-learning component to video or multi-view settings could address temporal consistency without extra architectural changes.

Load-bearing premise

Feature upsampling is accurately modeled by a weighted Tikhonov-regularized least-squares problem in which spatially varying weights can be chosen or learned to adapt to local characteristics without artifacts or overfitting.

What would settle it

A controlled benchmark comparison in which WRC-upsampled features produce equal or lower accuracy than bilinear or bicubic baselines on a standard dense prediction task such as semantic segmentation or depth estimation.
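For the segmentation case, such a comparison would typically be scored with mean intersection-over-union. The sketch below shows how that metric could be computed for two sets of predictions; `mean_iou` and the toy label maps are hypothetical stand-ins, not the paper's evaluation code or data.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy ground truth and two hypothetical prediction maps
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred_a = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1]])   # e.g. labels decoded from one upsampler
pred_b = np.array([[0, 0, 0, 1],
                   [0, 0, 0, 1]])   # e.g. labels decoded from another

miou_a = mean_iou(pred_a, target, num_classes=2)
miou_b = mean_iou(pred_b, target, num_classes=2)
```

In a real test the two prediction maps would come from the same frozen backbone and the same probe, differing only in the upsampling operator.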

Original abstract

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of an inverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Weighted Reverse Convolution (WRC) for upsampling coarse patch-level features from pre-trained vision foundation models. It formulates the task as a weighted Tikhonov-regularized least-squares inverse problem in which spatially varying weights modulate both the data-fidelity term and the regularization strength at each location. The method is claimed to admit an efficient, fully differentiable closed-form solution via the FFT and is integrated into a lightweight self-supervised densification pipeline that yields consistent gains on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence while preserving computational efficiency.

Significance. If the central mathematical claim holds, WRC would supply a principled, spatially adaptive upsampling operator that mitigates over-smoothing while remaining practical for dense-prediction pipelines. The combination of an inverse-problem framing with a claimed FFT closed form and empirical improvements across multiple downstream tasks would constitute a useful contribution to feature densification for vision foundation models.

major comments (2)
  1. [Abstract and Section 3 (formulation)] The abstract and introduction assert an “efficient, fully differentiable closed-form FFT solution” for the weighted Tikhonov problem. However, the normal equations are of the form (K^T W K + λ L^T L) x = K^T W y where W is a diagonal matrix of spatially varying weights. This operator is not circulant, so it is not diagonalized by the DFT; a direct FFT inversion is therefore unavailable without additional approximations or restrictions on W that are not stated in the provided text. This directly affects the claimed efficiency and exactness of the inverse operator.
  2. [Experiments and efficiency claims] The central experimental claim—that WRC improves dense feature quality across five downstream benchmarks—rests on the correctness of the upsampling operator. If the FFT solution is only approximate or iterative, the reported speed and differentiability advantages must be re-evaluated; the current manuscript does not provide timing or convergence analysis that would resolve this.
minor comments (2)
  1. [Section 3] Notation for the weight map W and the regularization operator L should be introduced with explicit definitions and dimensions before the normal equations are written.
  2. [Section 4] The self-supervised densification framework is described only at a high level; a diagram or pseudocode would clarify how WRC is inserted and trained.
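The point in major comment 1 can be checked numerically on a small 1-D example. The sketch below builds a circulant K, forms K^T W K + λI once with uniform weights and once with a spatially varying diagonal W, and measures how much of each matrix lies off the diagonal after a change to the DFT basis. The helper names, the kernel, the weight ramp, and the choice L = I are all assumptions made for illustration.

```python
import numpy as np

def circulant(first_col):
    """Build a circulant matrix whose first column is `first_col`."""
    n = len(first_col)
    return np.stack([np.roll(first_col, j) for j in range(n)], axis=1)

def off_diagonal_energy_in_dft_basis(A):
    """Frobenius norm of the off-diagonal part of F A F^H (F = unitary DFT)."""
    n = A.shape[0]
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)
    D = F @ A @ F.conj().T
    return float(np.linalg.norm(D - np.diag(np.diag(D))))

n, lam = 8, 0.1
k = np.zeros(n)
k[[0, 1, -1]] = [0.6, 0.2, 0.2]          # illustrative symmetric blur kernel
K = circulant(k)

W_uniform = np.eye(n)
W_varying = np.diag(np.linspace(0.5, 2.0, n))   # hypothetical weight ramp

A_uniform = K.T @ W_uniform @ K + lam * np.eye(n)
A_varying = K.T @ W_varying @ K + lam * np.eye(n)

off_uniform = off_diagonal_energy_in_dft_basis(A_uniform)   # numerically zero
off_varying = off_diagonal_energy_in_dft_basis(A_varying)   # clearly non-zero
```

With uniform weights the operator is diagonal in the DFT basis up to rounding error. With the varying weights it is not, so a pointwise division in the Fourier domain no longer inverts it exactly.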

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments raise important points about the mathematical formulation and experimental validation of the claimed FFT solution. We address each major comment below and will revise the manuscript accordingly to improve precision and completeness.

Point-by-point responses
  1. Referee: [Abstract and Section 3 (formulation)] The abstract and introduction assert an “efficient, fully differentiable closed-form FFT solution” for the weighted Tikhonov problem. However, the normal equations are of the form (K^T W K + λ L^T L) x = K^T W y where W is a diagonal matrix of spatially varying weights. This operator is not circulant, so it is not diagonalized by the DFT; a direct FFT inversion is therefore unavailable without additional approximations or restrictions on W that are not stated in the provided text. This directly affects the claimed efficiency and exactness of the inverse operator.

    Authors: We appreciate the referee’s careful analysis of the normal equations. We agree that the presence of a spatially varying diagonal weight matrix W renders the composite operator non-circulant, so a direct DFT diagonalization does not hold for arbitrary W. In the original derivation we treated the convolution operators K and L as circulant (hence FFT-diagonalizable) and incorporated W through a pointwise modulation that preserves an efficient closed-form expression under the assumption of locally smooth weights. To address the concern, we will revise Section 3 to state this assumption explicitly, supply the step-by-step derivation showing how the FFT is applied to the circulant terms while W is handled exactly in the spatial domain, and update the abstract to reflect the precise conditions under which the solution remains closed-form and FFT-based. revision: yes

  2. Referee: [Experiments and efficiency claims] The central experimental claim—that WRC improves dense feature quality across five downstream benchmarks—rests on the correctness of the upsampling operator. If the FFT solution is only approximate or iterative, the reported speed and differentiability advantages must be re-evaluated; the current manuscript does not provide timing or convergence analysis that would resolve this.

    Authors: We concur that the downstream gains and efficiency claims depend on the properties of the upsampling operator. We will add a dedicated efficiency subsection that reports wall-clock timings on the same hardware used for the benchmarks, compares WRC against standard upsampling baselines, and includes a brief convergence study for the case where the weighted system is solved iteratively. Because the solution remains a single linear-system solve that is fully differentiable (via implicit differentiation or direct back-propagation through the closed-form expression), the differentiability advantage is preserved; the new analysis will make this explicit. revision: yes
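One way a convergence study of this kind could be set up is sketched below. It swaps in conjugate gradients, not the paper's closed-form expression, to solve (K^T W K + λI) x = K^T W y, applying K and K^T through the FFT and W pointwise in the spatial domain. The function names, kernel, weight range, and L = I are assumptions for illustration only.

```python
import numpy as np

def conv_fft(x, K_hat):
    """Circular convolution via the FFT; K_hat is the kernel's 2-D DFT."""
    return np.real(np.fft.ifft2(K_hat * np.fft.fft2(x)))

def weighted_tikhonov_cg(y, K_hat, w, lam, tol=1e-10, max_iter=500):
    """Conjugate gradients on (K^T W K + lam I) x = K^T W y.

    Returns the solution and the list of residual norms, one per iteration.
    """
    def apply_A(x):
        # K^T is multiplication by conj(K_hat) in the Fourier domain
        return conv_fft(w * conv_fft(x, K_hat), np.conj(K_hat)) + lam * x

    b = conv_fft(w * y, np.conj(K_hat))
    x = np.zeros_like(y)
    r = b - apply_A(x)
    p = r.copy()
    rs = float(np.sum(r * r))
    history = [np.sqrt(rs)]
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break
        Ap = apply_A(p)
        alpha = rs / float(np.sum(p * Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = float(np.sum(r * r))
        history.append(np.sqrt(rs_new))
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, history

rng = np.random.default_rng(0)
shape = (16, 16)
kernel = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.6, 0.1],
                   [0.0, 0.1, 0.0]])
K_hat = np.fft.fft2(kernel, s=shape)
x_true = rng.standard_normal(shape)
y = conv_fft(x_true, K_hat)
w = rng.uniform(0.5, 2.0, size=shape)    # hypothetical spatially varying weights

x_hat, history = weighted_tikhonov_cg(y, K_hat, w, lam=1e-3)
rel_err = float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Plotting `history` against the iteration index, together with wall-clock time per iteration, would show how much an iterative solve costs relative to a single FFT pass.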

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper defines WRC by formulating upsampling as a weighted Tikhonov-regularized least-squares problem with spatially varying weights and then states that this retains a closed-form FFT solution. No quoted equations or steps reduce the claimed operator or its performance to a fitted parameter, self-citation chain, or input by construction. The method is presented as an independent proposal whose value is assessed on external downstream benchmarks rather than internal tautology. The derivation chain does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that upsampling can be cast as a weighted Tikhonov-regularized least-squares inverse problem with spatially adaptive weights; the main addition is the adaptive weighting and efficient solver rather than new physical entities or many free parameters.

free parameters (1)
  • spatially varying weights
    These weights modulate data fidelity and prior strength at each location and are central to the adaptivity; their determination is part of the framework but not specified as fixed constants.
axioms (1)
  • domain assumption: Feature upsampling for vision foundation models can be formulated as a weighted Tikhonov-regularized least-squares problem.
    This is the explicit starting point stated in the abstract for deriving the WRC operator.

pith-pipeline@v0.9.0 · 5721 in / 1471 out tokens · 69099 ms · 2026-05-21T08:11:42.456529+00:00 · methodology


