Weighted Reverse Convolution for Feature Upsampling
Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3
The pith
Feature upsampling for vision foundation models reduces to a weighted Tikhonov-regularized least-squares problem solved via spatially adaptive reverse convolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator.
What carries the argument
Weighted Reverse Convolution (WRC), the spatially adaptive inverse operator obtained from the weighted Tikhonov-regularized least-squares formulation that enables efficient FFT-based densification of coarse patch descriptors.
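To make the "closed-form FFT" idea concrete, the following is a minimal sketch of the unweighted, stride-1 special case only: with a circular convolution kernel and a scalar regularization weight, the Tikhonov problem min_X ‖Y − K⊛X‖² + λ‖X − X0‖² is diagonalized by the DFT and solved with one pointwise division. This is not the paper's WRC operator (it has no spatial weights and no downsampling); the names `kernel_to_otf` and `tikhonov_fft`, the kernel, and the value of λ are illustrative assumptions.

```python
import numpy as np

def kernel_to_otf(kernel, shape):
    """Zero-pad a small kernel to `shape`, centre it at the origin, and take its 2-D FFT."""
    kh, kw = kernel.shape
    padded = np.zeros(shape)
    padded[:kh, :kw] = kernel
    # Shift so the kernel centre sits at index (0, 0) under circular boundary conditions.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)

def tikhonov_fft(y, kernel, x0, lam):
    """Closed-form minimizer of ||y - k (*) x||^2 + lam * ||x - x0||^2.

    Assumes circular convolution, a scalar lam > 0, and no downsampling
    (an illustrative special case, not WRC itself).
    """
    otf = kernel_to_otf(kernel, y.shape)
    numerator = np.conj(otf) * np.fft.fft2(y) + lam * np.fft.fft2(x0)
    denominator = np.abs(otf) ** 2 + lam
    return np.real(np.fft.ifft2(numerator / denominator))
```

The division is valid at every frequency because the denominator is at least λ. Once a per-pixel weight map multiplies the residual, this pointwise division is no longer the exact solution, which is the point the referee report raises.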
If this is right
- Dense feature quality improves on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence.
- The operator integrates as a drop-in module inside lightweight self-supervised densification frameworks.
- Computational cost remains low because the solution is a closed-form FFT operation that stays fully differentiable.
Where Pith is reading between the lines
- The same weighted-regularization view could be tested on super-resolution or feature denoising tasks that also suffer from local structure loss.
- Because the method is parameter-light and FFT-based, it may suit real-time pipelines that currently rely on learned upsampling layers.
- Extending the weight-learning component to video or multi-view settings could address temporal consistency without extra architectural changes.
Load-bearing premise
Feature upsampling is accurately modeled by a weighted Tikhonov-regularized least-squares problem in which spatially varying weights can be chosen or learned to adapt to local characteristics without artifacts or overfitting.
What would settle it
A controlled benchmark comparison in which WRC-upsampled features produce equal or lower accuracy than bilinear or bicubic baselines on a standard dense prediction task such as semantic segmentation or depth estimation.
Original abstract
Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of inverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Weighted Reverse Convolution (WRC) for upsampling coarse patch-level features from pre-trained vision foundation models. It formulates the task as a weighted Tikhonov-regularized least-squares inverse problem in which spatially varying weights modulate both the data-fidelity term and the regularization strength at each location. The method is claimed to admit an efficient, fully differentiable closed-form solution via the FFT and is integrated into a lightweight self-supervised densification pipeline that yields consistent gains on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence while preserving computational efficiency.
Significance. If the central mathematical claim holds, WRC would supply a principled, spatially adaptive upsampling operator that mitigates over-smoothing while remaining practical for dense-prediction pipelines. The combination of an inverse-problem framing with a claimed FFT closed form and empirical improvements across multiple downstream tasks would constitute a useful contribution to feature densification for vision foundation models.
major comments (2)
- [Abstract and Section 3 (formulation)] The abstract and introduction assert an “efficient, fully differentiable closed-form FFT solution” for the weighted Tikhonov problem. However, the normal equations are of the form (K^T W K + λ L^T L) x = K^T W y where W is a diagonal matrix of spatially varying weights. This operator is not circulant, so it is not diagonalized by the DFT; a direct FFT inversion is therefore unavailable without additional approximations or restrictions on W that are not stated in the provided text. This directly affects the claimed efficiency and exactness of the inverse operator.
- [Experiments and efficiency claims] The central experimental claim—that WRC improves dense feature quality across five downstream benchmarks—rests on the correctness of the upsampling operator. If the FFT solution is only approximate or iterative, the reported speed and differentiability advantages must be re-evaluated; the current manuscript does not provide timing or convergence analysis that would resolve this.
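The first major comment can be checked numerically on a small 1-D example. The sketch below builds a circulant blur matrix K, forms K^T W K + λI for a constant and for a spatially varying diagonal W, and measures how much of each operator remains off the diagonal after conjugation by the DFT matrix. Taking the regularizer L as the identity, the kernel, the problem size, and the helper names are all assumptions made for illustration; none of this comes from the paper.

```python
import numpy as np

def circulant(first_col):
    """Dense circulant matrix whose j-th column is first_col rolled down by j."""
    n = len(first_col)
    return np.stack([np.roll(first_col, j) for j in range(n)], axis=1)

def offdiag_energy_in_fourier(A):
    """Relative Frobenius norm of the off-diagonal part of F A F^{-1}.

    This is (numerically) zero exactly when the DFT diagonalizes A.
    """
    n = A.shape[0]
    F = np.fft.fft(np.eye(n), axis=0)      # DFT matrix
    D = F @ A @ np.conj(F).T / n           # F A F^{-1}, since F^{-1} = conj(F).T / n
    off = D - np.diag(np.diag(D))
    return np.linalg.norm(off) / np.linalg.norm(D)

n, lam = 16, 0.1
k = np.zeros(n)
k[[0, 1, -1]] = [0.5, 0.25, 0.25]          # small symmetric blur kernel (assumed)
K = circulant(k)

rng = np.random.default_rng(0)
w_const = np.full(n, 0.7)                  # constant weights
w_vary = rng.uniform(0.1, 1.0, n)          # spatially varying weights

# Assumption: L = I, so the regularizer contributes lam * I.
A_const = K.T @ np.diag(w_const) @ K + lam * np.eye(n)
A_vary = K.T @ np.diag(w_vary) @ K + lam * np.eye(n)

print("constant W:", offdiag_energy_in_fourier(A_const))
print("varying  W:", offdiag_energy_in_fourier(A_vary))
```

With constant weights the off-diagonal energy is at rounding level, while with varying weights it is not, so a single pointwise division in the Fourier domain cannot invert the weighted operator exactly without further structure on W.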
minor comments (2)
- [Section 3] Notation for the weight map W and the regularization operator L should be introduced with explicit definitions and dimensions before the normal equations are written.
- [Section 4] The self-supervised densification framework is described only at a high level; a diagram or pseudocode would clarify how WRC is inserted and trained.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments raise important points about the mathematical formulation and experimental validation of the claimed FFT solution. We address each major comment below and will revise the manuscript accordingly to improve precision and completeness.
Point-by-point responses
-
Referee: [Abstract and Section 3 (formulation)] The abstract and introduction assert an “efficient, fully differentiable closed-form FFT solution” for the weighted Tikhonov problem. However, the normal equations are of the form (K^T W K + λ L^T L) x = K^T W y where W is a diagonal matrix of spatially varying weights. This operator is not circulant, so it is not diagonalized by the DFT; a direct FFT inversion is therefore unavailable without additional approximations or restrictions on W that are not stated in the provided text. This directly affects the claimed efficiency and exactness of the inverse operator.
Authors: We appreciate the referee’s careful analysis of the normal equations. We agree that the presence of a spatially varying diagonal weight matrix W renders the composite operator non-circulant, so a direct DFT diagonalization does not hold for arbitrary W. In the original derivation we treated the convolution operators K and L as circulant (hence FFT-diagonalizable) and incorporated W through a pointwise modulation that preserves an efficient closed-form expression under the assumption of locally smooth weights. To address the concern, we will revise Section 3 to state this assumption explicitly, supply the step-by-step derivation showing how the FFT is applied to the circulant terms while W is handled exactly in the spatial domain, and update the abstract to reflect the precise conditions under which the solution remains closed-form and FFT-based.
revision: yes
-
Referee: [Experiments and efficiency claims] The central experimental claim—that WRC improves dense feature quality across five downstream benchmarks—rests on the correctness of the upsampling operator. If the FFT solution is only approximate or iterative, the reported speed and differentiability advantages must be re-evaluated; the current manuscript does not provide timing or convergence analysis that would resolve this.
Authors: We concur that the downstream gains and efficiency claims depend on the properties of the upsampling operator. We will add a dedicated efficiency subsection that reports wall-clock timings on the same hardware used for the benchmarks, compares WRC against standard upsampling baselines, and includes a brief convergence study when the weighted system is solved. Because the solution remains a single linear-system solve that is fully differentiable (via implicit differentiation or direct back-propagation through the closed-form expression), the differentiability advantage is preserved; the new analysis will make this explicit.
revision: yes
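One way to read "the FFT is applied to the circulant terms while W is handled exactly in the spatial domain" is as a matrix-free iterative solve, which is also where a convergence study would apply. The sketch below uses plain conjugate gradients on (K^T W K + λI) x = K^T W y, applying K and its adjoint by FFT and W by pointwise multiplication. Conjugate gradients is a swapped-in technique chosen for illustration, not the paper's derivation; taking L as the identity, the absence of downsampling, and the function name `solve_weighted_tikhonov_cg` are assumptions.

```python
import numpy as np

def solve_weighted_tikhonov_cg(y, otf, w, lam, tol=1e-8, max_iter=500):
    """Conjugate gradients for (K^T W K + lam * I) x = K^T W y.

    otf : fft2 of the zero-padded real kernel (K is a circular convolution).
    w   : strictly positive per-pixel weight map, applied in the spatial domain.
    Returns the solution and the number of iterations used.
    Assumes L = I and no downsampling (illustrative only).
    """
    K = lambda v: np.real(np.fft.ifft2(otf * np.fft.fft2(v)))
    Kt = lambda v: np.real(np.fft.ifft2(np.conj(otf) * np.fft.fft2(v)))
    A = lambda v: Kt(w * K(v)) + lam * v   # symmetric positive definite

    b = Kt(w * y)
    x = np.zeros_like(y)
    r = b - A(x)
    p = r.copy()
    rs = np.sum(r * r)
    b_norm = np.sqrt(np.sum(b * b))
    for it in range(1, max_iter + 1):
        Ap = A(p)
        alpha = rs / np.sum(p * Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.sum(r * r)
        if np.sqrt(rs_new) <= tol * b_norm:   # relative residual test
            return x, it
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter
```

Each iteration costs two forward and two inverse FFTs plus pointwise work, so the per-iteration cost matches the unweighted closed form; the total cost depends on the iteration count, which is the quantity a timing and convergence analysis would need to report.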
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper defines WRC by formulating upsampling as a weighted Tikhonov-regularized least-squares problem with spatially varying weights and then states that this retains a closed-form FFT solution. No quoted equations or steps reduce the claimed operator or its performance to a fitted parameter, self-citation chain, or input by construction. The method is presented as an independent proposal whose value is assessed on external downstream benchmarks rather than internal tautology. The derivation chain does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- spatially varying weights
axioms (1)
- domain assumption: Feature upsampling for vision foundation models can be formulated as a weighted Tikhonov-regularized least-squares problem.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location... retains an efficient, fully differentiable closed-form FFT solution
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: costAlphaLog_high_calibrated_iff (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
X* = arg min_X ‖W(Y − (X ⊗ K)↓_s)‖_F^2 + ‖Wλ(X − X_0)‖_F^2
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.