Weighted Reverse Convolution for Feature Upsampling

Kai Zhang; Lei Zhang; Wentong Li; Zhiyuan Qi; Zichen Zhao

arxiv: 2605.17472 · v2 · pith:NTWQ5SJZnew · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords feature upsampling · vision foundation models · weighted reverse convolution · Tikhonov regularization · inverse problem · dense prediction · FFT solution

The pith

Feature upsampling for vision foundation models reduces to a weighted Tikhonov-regularized least-squares problem solved via spatially adaptive reverse convolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pre-trained vision foundation models yield coarse patch features that hinder fine localization and dense prediction. The paper models upsampling as a weighted Tikhonov-regularized least-squares inverse problem in which spatially varying weights control data fidelity and regularization strength at each location. This formulation yields an efficient closed-form FFT solution that adapts reconstruction to local feature statistics and reduces over-smoothing. A reader would care because the operator integrates into lightweight self-supervised frameworks and lifts performance on segmentation, depth, video object segmentation, object discovery, and keypoint tasks while preserving computational speed.

Core claim

We formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator.

What carries the argument

Weighted Reverse Convolution (WRC), the spatially adaptive inverse operator obtained from the weighted Tikhonov-regularized least-squares formulation that enables efficient FFT-based densification of coarse patch descriptors.
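For orientation, the sketch below shows the classical constant-weight case that the "closed-form FFT" language refers to. It is not the authors' WRC operator: it assumes circular convolution, omits the downsampling step, and uses a single scalar regularization weight `lam` instead of spatially varying weights. The names `tikhonov_fft`, `normal_eq_residual`, the box-blur kernel, and the zero prior are all hypothetical choices made for illustration.

```python
import numpy as np

def tikhonov_fft(y, k, x0, lam):
    """Minimize ||k (*) x - y||^2 + lam * ||x - x0||^2 under circular convolution.

    A scalar lam keeps the normal equations circulant, so every Fourier
    coefficient can be solved for independently.
    """
    K = np.fft.fft2(k, s=y.shape)          # kernel zero-padded to the signal size
    num = np.conj(K) * np.fft.fft2(y) + lam * np.fft.fft2(x0)
    den = np.abs(K) ** 2 + lam
    return np.real(np.fft.ifft2(num / den))

def normal_eq_residual(x, y, k, x0, lam):
    """Residual of (K^T K + lam I) x = K^T y + lam x0, evaluated via the FFT."""
    K = np.fft.fft2(k, s=y.shape)
    lhs = (np.abs(K) ** 2 + lam) * np.fft.fft2(x)
    rhs = np.conj(K) * np.fft.fft2(y) + lam * np.fft.fft2(x0)
    return np.real(np.fft.ifft2(lhs - rhs))

rng = np.random.default_rng(0)
y = rng.standard_normal((16, 16))          # stand-in for one coarse feature channel
k = np.ones((3, 3)) / 9.0                  # hypothetical box-blur kernel
x0 = np.zeros_like(y)                      # hypothetical prior estimate X_0
x_hat = tikhonov_fft(y, k, x0, lam=1e-2)

# The residual of the normal equations should be at round-off level.
print(np.abs(normal_eq_residual(x_hat, y, k, x0, 1e-2)).max())
```

The whole solve is one forward FFT, a pointwise division, and one inverse FFT. That is what makes the constant-weight case cheap and differentiable, and it is the property that has to survive once the weights vary spatially.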

If this is right

Dense feature quality improves on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence.
The operator integrates as a drop-in module inside lightweight self-supervised densification frameworks.
Computational cost remains low because the solution is a closed-form FFT operation that stays fully differentiable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same weighted-regularization view could be tested on super-resolution or feature denoising tasks that also suffer from local structure loss.
Because the method is parameter-light and FFT-based, it may suit real-time pipelines that currently rely on learned upsampling layers.
Extending the weight-learning component to video or multi-view settings could address temporal consistency without extra architectural changes.

Load-bearing premise

Feature upsampling is accurately modeled by a weighted Tikhonov-regularized least-squares problem in which spatially varying weights can be chosen or learned to adapt to local characteristics without artifacts or overfitting.

What would settle it

A controlled benchmark comparison in which WRC-upsampled features produce equal or lower accuracy than bilinear or bicubic baselines on a standard dense prediction task such as semantic segmentation or depth estimation.

Original abstract

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of inverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WRC offers an adaptive upsampler for VFM features via weighted Tikhonov regularization, but the claimed closed-form FFT solution with spatially varying weights is the part that needs verification.

The main takeaway is that this work proposes Weighted Reverse Convolution as a spatially adaptive upsampler for coarse VFM features, cast as a weighted Tikhonov-regularized least-squares inverse problem with a claimed FFT closed-form solution.

The paper does a solid job on the application side. They integrate WRC into a lightweight self-supervised densification framework and report consistent gains on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence. Keeping computational efficiency high is a plus for real use with foundation models.

The soft spot sits in the central technical claim. Spatially varying weights on the fidelity and prior terms make the normal equations non-stationary. That breaks the circulant structure that lets you solve the system with a single FFT multiply. The abstract still asserts an efficient fully differentiable closed-form FFT solution. If the full paper has an exact derivation that preserves this, fine. If it relies on an approximation or a special case for the weights, that needs to be stated clearly because it affects both the efficiency and the exactness of the operator.

The rest of the setup looks standard. No major issues with how they position it against interpolation or learned upsamplers. This is for readers who work on dense prediction pipelines and want a more structure-preserving upsampler than the usual choices. It could be worth referee time to verify the math and the experimental controls. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes Weighted Reverse Convolution (WRC) for upsampling coarse patch-level features from pre-trained vision foundation models. It formulates the task as a weighted Tikhonov-regularized least-squares inverse problem in which spatially varying weights modulate both the data-fidelity term and the regularization strength at each location. The method is claimed to admit an efficient, fully differentiable closed-form solution via the FFT and is integrated into a lightweight self-supervised densification pipeline that yields consistent gains on segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence while preserving computational efficiency.

Significance. If the central mathematical claim holds, WRC would supply a principled, spatially adaptive upsampling operator that mitigates over-smoothing while remaining practical for dense-prediction pipelines. The combination of an inverse-problem framing with a claimed FFT closed form and empirical improvements across multiple downstream tasks would constitute a useful contribution to feature densification for vision foundation models.

major comments (2)

[Abstract and Section 3 (formulation)] The abstract and introduction assert an “efficient, fully differentiable closed-form FFT solution” for the weighted Tikhonov problem. However, the normal equations are of the form (K^T W K + λ L^T L) x = K^T W y where W is a diagonal matrix of spatially varying weights. This operator is not circulant, so it is not diagonalized by the DFT; a direct FFT inversion is therefore unavailable without additional approximations or restrictions on W that are not stated in the provided text. This directly affects the claimed efficiency and exactness of the inverse operator.
[Experiments and efficiency claims] The central experimental claim—that WRC improves dense feature quality across five downstream benchmarks—rests on the correctness of the upsampling operator. If the FFT solution is only approximate or iterative, the reported speed and differentiability advantages must be re-evaluated; the current manuscript does not provide timing or convergence analysis that would resolve this.
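The first major comment rests on a linear-algebra fact that is easy to check numerically. The following is a minimal 1-D sketch, assuming circular boundary conditions; the blur `K`, the first-difference operator `L`, the weight values, and `lam` are hypothetical stand-ins, not quantities from the paper. It expresses the normal-equation matrix in the Fourier basis and measures how far it is from diagonal.

```python
import numpy as np

n = 8

def circulant(c):
    """Circulant matrix whose first column is c (circular convolution with c)."""
    return np.stack([np.roll(c, j) for j in range(len(c))], axis=1)

K = circulant(np.array([0.5, 0.25, 0, 0, 0, 0, 0, 0.25]))   # hypothetical blur
L = circulant(np.array([1.0, -1.0, 0, 0, 0, 0, 0, 0]))      # first difference
F = np.fft.fft(np.eye(n)) / np.sqrt(n)                       # unitary DFT matrix

def offdiag_in_fourier(A):
    """Largest off-diagonal magnitude of F A F^H."""
    B = F @ A @ F.conj().T
    return np.max(np.abs(B - np.diag(np.diag(B))))

lam = 0.1
W_const = 2.0 * np.eye(n)                     # constant weights
W_vary = np.diag(np.linspace(0.5, 2.0, n))    # spatially varying weights

A_const = K.T @ W_const @ K + lam * L.T @ L
A_vary = K.T @ W_vary @ K + lam * L.T @ L

print("constant W:", offdiag_in_fourier(A_const))  # round-off level: DFT diagonalizes it
print("varying  W:", offdiag_in_fourier(A_vary))   # clearly non-zero: not diagonalized
```

With constant weights the operator stays circulant and a per-frequency division solves it. With a non-constant diagonal `W`, the Fourier-domain matrix acquires off-diagonal entries, which is exactly why the referee asks what restriction or approximation makes the claimed closed form valid.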

minor comments (2)

[Section 3] Notation for the weight map W and the regularization operator L should be introduced with explicit definitions and dimensions before the normal equations are written.
[Section 4] The self-supervised densification framework is described only at a high level; a diagram or pseudocode would clarify how WRC is inserted and trained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments raise important points about the mathematical formulation and experimental validation of the claimed FFT solution. We address each major comment below and will revise the manuscript accordingly to improve precision and completeness.

Referee: [Abstract and Section 3 (formulation)] The abstract and introduction assert an “efficient, fully differentiable closed-form FFT solution” for the weighted Tikhonov problem. However, the normal equations are of the form (K^T W K + λ L^T L) x = K^T W y where W is a diagonal matrix of spatially varying weights. This operator is not circulant, so it is not diagonalized by the DFT; a direct FFT inversion is therefore unavailable without additional approximations or restrictions on W that are not stated in the provided text. This directly affects the claimed efficiency and exactness of the inverse operator.

Authors: We appreciate the referee’s careful analysis of the normal equations. We agree that the presence of a spatially varying diagonal weight matrix W renders the composite operator non-circulant, so a direct DFT diagonalization does not hold for arbitrary W. In the original derivation we treated the convolution operators K and L as circulant (hence FFT-diagonalizable) and incorporated W through a pointwise modulation that preserves an efficient closed-form expression under the assumption of locally smooth weights. To address the concern, we will revise Section 3 to state this assumption explicitly, supply the step-by-step derivation showing how the FFT is applied to the circulant terms while W is handled exactly in the spatial domain, and update the abstract to reflect the precise conditions under which the solution remains closed-form and FFT-based. revision: yes
Referee: [Experiments and efficiency claims] The central experimental claim—that WRC improves dense feature quality across five downstream benchmarks—rests on the correctness of the upsampling operator. If the FFT solution is only approximate or iterative, the reported speed and differentiability advantages must be re-evaluated; the current manuscript does not provide timing or convergence analysis that would resolve this.

Authors: We concur that the downstream gains and efficiency claims depend on the properties of the upsampling operator. We will add a dedicated efficiency subsection that reports wall-clock timings on the same hardware used for the benchmarks, compares WRC against standard upsampling baselines, and includes a brief convergence study when the weighted system is solved. Because the solution remains a single linear-system solve that is fully differentiable (via implicit differentiation or direct back-propagation through the closed-form expression), the differentiability advantage is preserved; the new analysis will make this explicit. revision: yes
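One way to read the rebuttal's description (FFT for the circulant terms, `W` applied pointwise in the spatial domain, plus a convergence study) is as an iterative solve of the weighted system. The sketch below is an assumption about what such a solver could look like, not the authors' algorithm: it runs plain conjugate gradient on the referee's normal equations in 1-D, with hypothetical `k`, `l`, `w`, and `lam`, and reports the iteration count and final residual.

```python
import numpy as np

def make_operator(k_hat, l_hat, w, lam):
    """Return A(v) = (K^T W K + lam L^T L) v and rhs(y) = K^T W y.

    K and L are circular convolutions applied through the FFT; w holds the
    diagonal of W and is applied pointwise in the spatial domain.
    """
    conv = lambda h, v: np.real(np.fft.ifft(h * np.fft.fft(v)))
    def A(v):
        fidelity = conv(np.conj(k_hat), w * conv(k_hat, v))
        prior = conv(np.abs(l_hat) ** 2, v)
        return fidelity + lam * prior
    def rhs(y):
        return conv(np.conj(k_hat), w * y)
    return A, rhs

def conjugate_gradient(A, b, iters=500, tol=1e-10):
    """Standard CG for a symmetric positive-definite operator A."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for it in range(1, iters + 1):
        Ap = A(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, it

n = 64
k = np.zeros(n); k[[0, 1, -1]] = [0.5, 0.25, 0.25]   # hypothetical blur kernel
l = np.zeros(n); l[[0, 1]] = [1.0, -1.0]             # first-difference prior
w = np.linspace(0.5, 2.0, n)                         # spatially varying weights
y = np.random.default_rng(0).standard_normal(n)

A, rhs = make_operator(np.fft.fft(k), np.fft.fft(l), w, lam=0.1)
b = rhs(y)
x, n_iter = conjugate_gradient(A, b)
residual = np.linalg.norm(A(x) - b)
print(n_iter, residual)
```

Each iteration costs a handful of FFTs, so this kind of solver can be fast, but it is iterative rather than closed-form. A timing and convergence table of the sort the authors promise would show whether the reported efficiency refers to something like this or to an exact single-pass expression.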

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines WRC by formulating upsampling as a weighted Tikhonov-regularized least-squares problem with spatially varying weights and then states that this retains a closed-form FFT solution. No quoted equations or steps reduce the claimed operator or its performance to a fitted parameter, self-citation chain, or input by construction. The method is presented as an independent proposal whose value is assessed on external downstream benchmarks rather than internal tautology. The derivation chain does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that upsampling can be cast as a weighted Tikhonov-regularized least-squares inverse problem with spatially adaptive weights; the main addition is the adaptive weighting and efficient solver rather than new physical entities or many free parameters.

free parameters (1)

spatially varying weights
These weights modulate data fidelity and prior strength at each location and are central to the adaptivity; their determination is part of the framework but not specified as fixed constants.

axioms (1)

domain assumption: Feature upsampling for vision foundation models can be formulated as a weighted Tikhonov-regularized least-squares problem.
This is the explicit starting point stated in the abstract for deriving the WRC operator.

pith-pipeline@v0.9.0 · 5721 in / 1471 out tokens · 69099 ms · 2026-05-21T08:11:42.456529+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location... retains an efficient, fully differentiable closed-form FFT solution
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

$X^{*} = \arg\min_{X} \, \| W (Y - (X \otimes K)\downarrow_{s}) \|_F^2 + \| W_{\lambda} (X - X_0) \|_F^2$
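To connect the quoted objective with the normal equations discussed in the referee report, here is a short derivation under stated assumptions. The vectorized symbols x, y, x_0, the operator A (circular convolution with K followed by s-fold downsampling), and the reading of the two weight maps as diagonal matrices W and W_λ are assumptions introduced for this sketch; the paper's own notation may differ.

```latex
% Assumptions: x, y, x_0 are vectorized X, Y, X_0; A = S_s C_K applies circular
% convolution with K (C_K) and then s-fold downsampling (S_s); W and W_\lambda
% act elementwise, i.e. as diagonal matrices.
\begin{aligned}
x^{*} &= \arg\min_{x}\; \| W (y - A x) \|_2^2 + \| W_{\lambda} (x - x_0) \|_2^2 \\
0 &= -2 A^{\top} W^{\top} W (y - A x^{*}) + 2 W_{\lambda}^{\top} W_{\lambda} (x^{*} - x_0) \\
\bigl( A^{\top} W^{2} A + W_{\lambda}^{2} \bigr) x^{*} &= A^{\top} W^{2} y + W_{\lambda}^{2} x_0
\end{aligned}
```

Compared with the referee's form, the squared weights here come from placing W inside the norm, and the prior term is an identity-type penalty toward x_0 rather than a generic operator L. When W and W_λ are scalar multiples of the identity, the left-hand operator depends only on C_K and S_s. With general diagonal weights it has the non-circulant structure that the first major comment questions.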

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 5 internal anchors

[1]

Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[2]

Mini-gemini: Mining the potential of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[3]

Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541, 2024

Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541, 2024

work page arXiv 2024
[4]

Osprey: Pixel understanding with visual instruction tuning

Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024

work page 2024
[5]

Perception encoder: The best visual embeddings are not at the output of the network. Advances in Neural Information Processing Systems, 38:60884–60937, 2026

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Bangalath, et al. Perception encoder: The best visual embeddings are not at the output of the network. Advances in Neural Information Processing Systems, 38:60884–60937, 2026

work page 2026
[6]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

work page 2021
[7]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

work page 2021
[10]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Denoising vision transformers

Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers. In European Conference on Computer Vision, pages 453–469. Springer, 2024

work page 2024
[12]

Vitar: Vision transformer with any resolution. arXiv preprint arXiv:2403.18361, 2024

Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Vitar: Vision transformer with any resolution. arXiv preprint arXiv:2403.18361, 2024

work page arXiv 2024
[13]

Featup: A model-agnostic framework for features at any resolution

Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. Featup: A model-agnostic framework for features at any resolution. In International Conference on Learning Representations, 2024.

Visual Computing Lab · The Hong Kong Polytechnic University

work page 2024
[14]

Loftup: Learning a coordinate-based feature upsampler for vision foundation models

Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, and Dan Zhang. Loftup: Learning a coordinate-based feature upsampler for vision foundation models. In IEEE/CVF International Conference on Computer Vision, pages 9913–9923, 2025

work page 2025
[15]

Anyup: Universal feature upsampling

Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. Anyup: Universal feature upsampling. In International Conference on Learning Representations, 2026

work page 2026
[16]

Jafar: Jack up any feature at any resolution

Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. Jafar: Jack up any feature at any resolution. In Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[17]

Lift: A surprisingly simple lightweight feature transform for dense vit descriptors

Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. Lift: A surprisingly simple lightweight feature transform for dense vit descriptors. In European Conference on Computer Vision, pages 110–128, 2024

work page 2024
[18]

Reverse convolution and its applications to image restoration

Xuhong Huang, Shiqi Liu, Kai Zhang, Ying Tai, Jian Yang, Hui Zeng, and Lei Zhang. Reverse convolution and its applications to image restoration. In IEEE/CVF International Conference on Computer Vision, pages 10507–10516, 2025

work page 2025
[19]

Vision transformers are circulant attention learners

Dongchen Han, Tianyu Li, Ziyi Wang, and Gao Huang. Vision transformers are circulant attention learners. In AAAI Conference on Artificial Intelligence, volume 40, pages 21549–21557, 2026

work page 2026
[20]

Tokenpacker: Efficient visual projector for multimodal llm. International Journal of Computer Vision, 133(10):6794–6812, 2025

Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm. International Journal of Computer Vision, 133(10):6794–6812, 2025

work page 2025
[21]

Carafe: Content-aware reassembly of features. 2019 IEEE/CVF International Conference on Computer Vision, pages 3007–3016, 2019

Jiaqi Wang, Kai Chen, Rui Xu, Ziwei Liu, Chen Change Loy, and Dahua Lin. Carafe: Content-aware reassembly of features. 2019 IEEE/CVF International Conference on Computer Vision, pages 3007–3016, 2019

work page 2019
[22]

Sapa: Similarity-aware point affiliation for feature upsampling. ArXiv, abs/2209.12866, 2022

Hao Lu, Wenze Liu, Zixuan Ye, Hongtao Fu, Yuliang Liu, and Zhiguo Cao. Sapa: Similarity-aware point affiliation for feature upsampling. ArXiv, abs/2209.12866, 2022

work page arXiv 2022
[23]

Learning to upsample by learning to sample. 2023 IEEE/CVF International Conference on Computer Vision, pages 6004–6014, 2023

Wenze Liu, Hao Lu, Hongtao Fu, and Zhiguo Cao. Learning to upsample by learning to sample. 2023 IEEE/CVF International Conference on Computer Vision, pages 6004–6014, 2023

work page 2023
[24]

Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, 1981

Robert Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, 1981

work page 1981
[25]

Learning deconvolution network for semantic segmentation

Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, pages 1520–1528, 2015

work page 2015
[26]

Joint bilateral upsampling. ACM Transactions on Graphics, 26(3):96–es, 2007

Johannes Kopf, Michael F Cohen, Dani Lischinski, and Matt Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics, 26(3):96–es, 2007

work page 2007
[27]

Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications

Norbert Wiener. Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. The MIT press, 1949

work page 1949
[28]

Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017

Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017

work page 2017
[29]

Deep generalized unfolding networks for image restoration

Chong Mou, Qian Wang, and Jian Zhang. Deep generalized unfolding networks for image restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17399–17410, 2022

work page 2022
[30]

Dinov2: Learning robust visual features without supervision, 2023

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand ...

work page 2023
[31]

Numerical methods for the approximate solution of ill-posed problems on compact sets

Andreĭ Nikolaevich Tikhonov, Alexander V Goncharsky, Vâčeslav Vasil’evič Stepanov, and Anatoliĭ Grigor’evich Yagola. Numerical methods for the approximate solution of ill-posed problems on compact sets. In Numerical Methods for the Solution of Ill-posed Problems, pages 65–79. Springer, 1995

work page 1995
[32]

Fast single image super-resolution using a new analytical solution for ℓ2-ℓ2 problems. IEEE Transactions on Image Processing, 25(8):3683–3697, 2016

Ningning Zhao, Qi Wei, Adrian Basarab, Nicolas Dobigeon, Denis Kouamé, and Jean-Yves Tourneret. Fast single image super-resolution using a new analytical solution for ℓ2-ℓ2 problems. IEEE Transactions on Image Processing, 25(8):3683–3697, 2016

work page 2016
[33]

Dynamic filter networks. Advances in Neural Information Processing Systems, 29, 2016

Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. Advances in Neural Information Processing Systems, 29, 2016.

work page 2016
[34]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019

Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019

work page arXiv 2019
[36]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15790–15801, 2023

work page 2023
[38]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014

work page 2014
[39]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, pages 696–712. Springer, 2022

work page 2022
[40]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

Appendix
In this appendix, we provide the following materials:
• A. BCCB Patterns in DINOv3
• B. Proof of Closed-form Solution for WRC
• C. Additio...

work page 2017

[1] [1]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023

[2] [2]

Mini-gemini: Mining the potential of multi-modality vision language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[3] [3]

Scalable pre-training of large autoregressive image models.arXiv preprint arXiv:2401.08541, 2024

Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models.arXiv preprint arXiv:2401.08541, 2024

work page arXiv 2024

[4] [4]

Osprey: Pixel understanding with visual instruction tuning

Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024

work page 2024

[5] [5]

Perception encoder: The best visual embeddings are not at the output of the network.Advances in Neural Information Processing Systems, 38:60884–60937, 2026

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Bangalath, et al. Perception encoder: The best visual embeddings are not at the output of the network.Advances in Neural Information Processing Systems, 38:60884–60937, 2026

work page 2026

[6] [6]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

work page 2021

[7] [7]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

[9] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[10] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[11] Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers. In European Conference on Computer Vision, pages 453–469. Springer, 2024

[12] Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Vitar: Vision transformer with any resolution. arXiv preprint arXiv:2403.18361, 2024

[13] Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. Featup: A model-agnostic framework for features at any resolution. In International Conference on Learning Representations, 2024

Visual Computing Lab · The Hong Kong Polytechnic University

[14] Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, and Dan Zhang. Loftup: Learning a coordinate-based feature upsampler for vision foundation models. In IEEE/CVF International Conference on Computer Vision, pages 9913–9923, 2025

[15] Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. Anyup: Universal feature upsampling. In International Conference on Learning Representations, 2026

[16] Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. Jafar: Jack up any feature at any resolution. In Annual Conference on Neural Information Processing Systems, 2025

[17] Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. Lift: A surprisingly simple lightweight feature transform for dense vit descriptors. In European Conference on Computer Vision, pages 110–128, 2024

[18] Xuhong Huang, Shiqi Liu, Kai Zhang, Ying Tai, Jian Yang, Hui Zeng, and Lei Zhang. Reverse convolution and its applications to image restoration. In IEEE/CVF International Conference on Computer Vision, pages 10507–10516, 2025

[19] Dongchen Han, Tianyu Li, Ziyi Wang, and Gao Huang. Vision transformers are circulant attention learners. In AAAI Conference on Artificial Intelligence, volume 40, pages 21549–21557, 2026

[20] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm. International Journal of Computer Vision, 133(10):6794–6812, 2025

[21] Jiaqi Wang, Kai Chen, Rui Xu, Ziwei Liu, Chen Change Loy, and Dahua Lin. Carafe: Content-aware reassembly of features. 2019 IEEE/CVF International Conference on Computer Vision, pages 3007–3016, 2019

[22] Hao Lu, Wenze Liu, Zixuan Ye, Hongtao Fu, Yuliang Liu, and Zhiguo Cao. Sapa: Similarity-aware point affiliation for feature upsampling. ArXiv, abs/2209.12866, 2022

[23] Wenze Liu, Hao Lu, Hongtao Fu, and Zhiguo Cao. Learning to upsample by learning to sample. 2023 IEEE/CVF International Conference on Computer Vision, pages 6004–6014, 2023

[24] Robert Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, 1981

[25] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, pages 1520–1528, 2015

[26] Johannes Kopf, Michael F Cohen, Dani Lischinski, and Matt Uyttendaele. Joint bilateral upsampling. ACM Transactions on Graphics, 26(3):96–es, 2007

[27] Norbert Wiener. Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. The MIT Press, 1949

[28] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017

[29] Chong Mou, Qian Wang, and Jian Zhang. Deep generalized unfolding networks for image restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17399–17410, 2022

[30] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand ...

[31] Andrei Nikolaevich Tikhonov, Alexander V Goncharsky, Vâčeslav Vasil’evič Stepanov, and Anatoliĭ Grigor’evich Yagola. Numerical methods for the approximate solution of ill-posed problems on compact sets. In Numerical Methods for the Solution of Ill-posed Problems, pages 65–79. Springer, 1995

[32] Ningning Zhao, Qi Wei, Adrian Basarab, Nicolas Dobigeon, Denis Kouamé, and Jean-Yves Tourneret. Fast single image super-resolution using a new analytical solution for ℓ2-ℓ2 problems. IEEE Transactions on Image Processing, 25(8):3683–3697, 2016

[33] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. Advances in Neural Information Processing Systems, 29, 2016

[34] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[35] Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019

[36] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017

[37] Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15790–15801, 2023

[38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014

[39] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, pages 696–712. Springer, 2022

[40] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2961–2969, 2017

Appendix

In this appendix, we provide the following materials:

• A. BCCB Patterns in DINOv3
• B. Proof of Closed-form Solution for WRC
• C. Additio...