pith. machine review for the scientific record.

arxiv: 2605.08000 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Rethinking Dense Optical Flow without Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords dense optical flow · single forward pass · foundation models · DINO-v2 · global matching · no test-time refinement · monocular depth priors · cross-dataset generalization

The pith

Powerful priors from frozen foundation models enable accurate dense optical flow in one forward pass without iterative refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that test-time scaling through recurrent refinement is not required for strong performance in dense optical flow estimation. It shows that semantic features from a frozen DINO-v2 model, fused with geometric cues from a monocular depth foundation model, can drive global matching to produce reliable correspondences directly. Under comparable training, this yields 2.81 EPE on Sintel Final and outperforms several recent refinement-based methods. A reader would care because it reframes efficiency as a matter of leveraging existing pretrained representations rather than adding inference steps. The result suggests that foundation models can substitute for the computational overhead of multi-step pipelines on challenging benchmarks.
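For readers outside the flow literature, endpoint error (EPE) is the metric behind the 2.81 figure: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. A minimal sketch of the standard definition (not code from the paper):

```python
import torch

def endpoint_error(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """Mean endpoint error (EPE) between predicted and ground-truth flow.

    Both tensors have shape (B, 2, H, W); the per-pixel error is the Euclidean
    distance between the predicted and true displacement vectors.
    """
    per_pixel = torch.norm(flow_pred - flow_gt, p=2, dim=1)  # (B, H, W)
    return per_pixel.mean()

# Toy check: a prediction off by one pixel horizontally everywhere has EPE = 1.0.
gt = torch.zeros(1, 2, 4, 4)
pred = gt.clone()
pred[:, 0] += 1.0
print(endpoint_error(pred, gt))  # tensor(1.)
```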

Core claim

We present a framework that estimates dense optical flow in a single forward pass by extracting visual semantic features from a frozen DINO-v2 backbone, combining them with geometric cues from a monocular depth foundation model, fusing the priors into a unified representation, and applying global matching to recover correspondences without recurrent updates or test-time optimization.
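As a reading aid, here is a minimal sketch of the kind of prior-fusion stage this claim describes. The stand-in encoders, feature dimensions, and the concatenation-plus-convolution fusion head are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FusedPriorEncoder(nn.Module):
    """Sketch of the prior-fusion stage: two frozen pretrained encoders
    (semantic and geometric) feed a small trainable fusion head."""

    def __init__(self, semantic_encoder: nn.Module, depth_encoder: nn.Module,
                 sem_dim: int, geo_dim: int, fused_dim: int = 256):
        super().__init__()
        self.semantic_encoder = semantic_encoder
        self.depth_encoder = depth_encoder
        for enc in (self.semantic_encoder, self.depth_encoder):
            enc.eval()
            for p in enc.parameters():
                p.requires_grad_(False)        # priors stay frozen; no fine-tuning
        self.fuse = nn.Sequential(             # the only trainable part of this sketch
            nn.Conv2d(sem_dim + geo_dim, fused_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            sem = self.semantic_encoder(image)  # stand-in for DINO-v2 patch features
            geo = self.depth_encoder(image)     # stand-in for monocular depth features
        return self.fuse(torch.cat([sem, geo], dim=1))

# Usage with stand-in encoders (real DINO-v2 / depth backbones would replace these):
sem_enc = nn.Conv2d(3, 384, kernel_size=8, stride=8)
geo_enc = nn.Conv2d(3, 64, kernel_size=8, stride=8)
encoder = FusedPriorEncoder(sem_enc, geo_enc, sem_dim=384, geo_dim=64)
frame1 = torch.randn(1, 3, 64, 64)
print(encoder(frame1).shape)  # torch.Size([1, 256, 8, 8])
```

In a full pipeline both frames would pass through this encoder, and the flow would then be recovered by the global matching step sketched below.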

What carries the argument

Fusion of frozen DINO-v2 semantic features with monocular depth geometric cues into a unified representation followed by global matching for dense correspondence estimation.
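The matching step can be read as a GMFlow-style soft argmax over an all-pairs correlation volume. The sketch below is a generic version of that formulation; the temperature, scaling, and coordinate conventions are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn.functional as F

def global_match(f1: torch.Tensor, f2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Global matching sketch: all-pairs feature correlation, softmax over the
    second frame, and flow as the expected displacement (soft argmax).

    f1, f2: (B, C, H, W) feature maps of the two frames.
    Returns: (B, 2, H, W) flow in pixels, (x, y) order.
    """
    b, c, h, w = f1.shape
    f1_flat = f1.flatten(2).transpose(1, 2)                              # (B, H*W, C)
    f2_flat = f2.flatten(2).transpose(1, 2)                              # (B, H*W, C)
    corr = torch.matmul(f1_flat, f2_flat.transpose(1, 2)) / (c ** 0.5)   # (B, H*W, H*W)
    prob = F.softmax(corr / temperature, dim=-1)                         # matching distribution

    # Grid of pixel coordinates in frame 2, shape (H*W, 2), (x, y) order.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=f1.dtype),
                            torch.arange(w, dtype=f1.dtype), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

    matched = torch.matmul(prob, grid)            # (B, H*W, 2) expected target coordinates
    flow = matched - grid.unsqueeze(0)            # displacement per source pixel
    return flow.transpose(1, 2).reshape(b, 2, h, w)

# Toy check: identical feature maps should give (near-)zero flow everywhere.
feat = torch.randn(1, 64, 8, 8)
print(global_match(feat, feat).abs().max())
```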

If this is right

  • Dense flow estimation becomes feasible with fixed, low inference cost independent of scene complexity (see the timing sketch after this list).
  • Training effort shifts from learning refinement dynamics to learning how to fuse complementary pretrained priors.
  • Cross-dataset generalization improves because the priors are not tuned to any single benchmark's error patterns.
  • Real-time applications on edge devices become viable without sacrificing benchmark accuracy.
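On the first point above, a toy timing harness (stand-in convolutional modules, not the paper's networks) illustrates the cost structure the claim depends on: a single forward pass has a fixed price, while an iterative pipeline pays that price once per refinement step.

```python
import time
import torch
import torch.nn as nn

def time_model(fn, *args, repeats: int = 10) -> float:
    """Median wall-clock seconds per call; a rough CPU-side proxy for inference cost."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        with torch.no_grad():
            fn(*args)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# Stand-ins only: the same head applied once vs. K times, mimicking K refinement
# steps. Channel sizes and K are arbitrary assumptions for illustration.
head = nn.Conv2d(64, 64, kernel_size=3, padding=1)
feat = torch.randn(1, 64, 128, 128)

single_pass = lambda x: head(x)

def refined(x, k: int = 12):      # cost grows linearly with the number of refinement steps
    for _ in range(k):
        x = head(x)
    return x

print("single pass:       ", time_model(single_pass, feat))
print("12-step refinement:", time_model(refined, feat))
```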

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same fusion-plus-global-matching pattern could be tested on related dense prediction tasks such as stereo disparity or scene flow.
  • If foundation priors continue to strengthen, the performance gap between single-pass and multi-step methods may widen further on future benchmarks.
  • The approach opens a route to parameter-efficient adaptation where only a lightweight fusion head is trained on top of frozen backbones.
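On the last point, a minimal sketch of that adaptation recipe under stated assumptions: both backbones are frozen stand-ins, and the optimizer is handed only the fusion head's parameters. Module shapes and the AdamW hyperparameters are illustrative, not settings from the paper.

```python
import torch
import torch.nn as nn

# Parameter-efficient adaptation sketch: freeze pretrained backbones, train only a
# lightweight fusion head. Backbone stand-ins and all hyperparameters are assumptions.
backbone_sem = nn.Conv2d(3, 384, kernel_size=8, stride=8)   # stand-in for a frozen DINO-v2
backbone_geo = nn.Conv2d(3, 64, kernel_size=8, stride=8)    # stand-in for a frozen depth model
for backbone in (backbone_sem, backbone_geo):
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

fusion_head = nn.Sequential(
    nn.Conv2d(384 + 64, 256, kernel_size=1),
    nn.GELU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
)

# Only the fusion head contributes trainable parameters.
all_params = (list(backbone_sem.parameters()) + list(backbone_geo.parameters())
              + list(fusion_head.parameters()))
trainable = [p for p in all_params if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters (fusion head only)")

optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
```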

Load-bearing premise

The semantic and geometric information already captured inside current foundation models is rich enough to produce accurate dense correspondences without iterative correction at inference time.

What would settle it

A controlled comparison on Sintel Final or KITTI in which the single-pass method records higher endpoint error than a refinement-based baseline trained under identical conditions on identical data.

Figures

Figures reproduced from arXiv: 2605.08000 by Praroop Chanda and Suryansh Kumar.

Figure 1: Overview. The conventional CNN-based feature encoder is replaced with DINOv2 [27] to provide semantically rich, large-scale self-supervised visual features, while the original transformer-based feature interaction, global matching, and flow propagation modules remain unchanged. In addition, monocular depth estimates from Depth Anything V2 [43] are introduced as a geometric prior to improve feature conditioning…
Figure 2: Qualitative comparison on Sintel (Final). This benchmark contains severe motion blur, illumination variation, and large displacements. Despite operating without iterative refinement, our method preserves sharp motion boundaries and produces accurate flow estimates, performing comparably to, and in some cases better than, refinement-based approaches.
Original abstract

Recent progress in dense optical flow has been driven by increasingly complex architectures and multi-step refinement for test-time scaling. While these approaches achieve strong benchmark performance, they also require substantial computation during inference. This raises a fundamental question: Is scaling test-time computation the only way to improve dense optical flow accuracy? We argue that it is not. Instead, powerful visual semantic and geometric priors encoded in modern foundation models can reduce, if not overcome, the need for computationally expensive iterative refinement at test-time. In this paper, we present a framework that estimates dense optical flow in a single forward pass, leveraging pretrained foundation representations, while avoiding iterative refinement and additional inference-time computation, thus offering an alternative to test-time scaling. Our method extracts visual semantic features from a frozen DINO-v2 backbone and combines them with geometric cues from a monocular depth foundation model. We fuse these complementary priors into a unified representation and apply a global matching formulation to estimate dense correspondences without recurrent updates or test-time optimization. Despite avoiding iterative refinement, our approach achieves strong cross-dataset generalization across challenging benchmarks. On Sintel Final, we obtain 2.81 EPE without refinement, significantly improving over state-of-the-art (SOTA) SEA-RAFT under comparable training conditions and outperforming RAFT, GMFlow (without refinement), and recent FlowSeek in the same setting. These results suggest that strong foundation priors can substitute for test-time scaling, offering a computationally efficient alternative to refinement-heavy pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a single-forward-pass dense optical flow method that extracts semantic features from a frozen DINO-v2 backbone, combines them with geometric cues from a monocular depth foundation model, fuses the priors into a unified representation, and applies a global matching formulation to estimate correspondences without recurrent refinement or test-time optimization. It reports 2.81 EPE on Sintel Final and claims to outperform SEA-RAFT (under comparable training), RAFT, GMFlow (no refinement), and FlowSeek.

Significance. If the reported numbers hold under fair and fully documented conditions, the result would show that strong priors from foundation models can substitute for iterative test-time scaling in optical flow, offering a computationally lighter alternative to refinement-heavy pipelines and potentially shifting emphasis toward pretrained representations for dense correspondence tasks.

major comments (2)
  1. [Abstract] The abstract states benchmark numbers (2.81 EPE on Sintel Final) and claims improvement over SEA-RAFT, RAFT, GMFlow, and FlowSeek, but supplies no training details, loss functions, fusion mechanism, or full experimental protocol. Without these, it is impossible to verify whether the central claim is supported by the proposed architecture or by differences in training setup.
  2. [Abstract] The statement that results are obtained 'under comparable training conditions' to SEA-RAFT is undefined; no specification is given of the datasets, splits, epochs, optimizer, or loss used for the baselines versus the proposed method, which is load-bearing for the cross-method comparison.
minor comments (1)
  1. [Abstract] The phrase 'global matching formulation' is used without even a one-sentence description or a pointer to the matching cost or correspondence solver.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater clarity in the abstract regarding experimental details and training conditions. We address each point below and will revise the manuscript to improve verifiability while preserving the core contribution of a single-pass framework that leverages foundation-model priors.

read point-by-point responses
  1. Referee: [Abstract] The abstract states benchmark numbers (2.81 EPE on Sintel Final) and claims improvement over SEA-RAFT, RAFT, GMFlow, and FlowSeek, but supplies no training details, loss functions, fusion mechanism, or full experimental protocol. Without these, it is impossible to verify whether the central claim is supported by the proposed architecture or by differences in training setup.

    Authors: We agree that the abstract is too concise and omits key details. The full manuscript describes the fusion of frozen DINO-v2 semantic features with monocular depth geometric cues, the global matching formulation, the training loss, and the complete experimental protocol. We will revise the abstract to briefly note the single-forward-pass design, the use of frozen foundation backbones, and the training regime, while explicitly directing readers to the Methods and Experiments sections for the full protocol. This change will make it clearer that the reported gains stem from the architecture rather than undisclosed training differences. revision: yes

  2. Referee: [Abstract] The statement that results are obtained 'under comparable training conditions' to SEA-RAFT is undefined; no specification is given of the datasets, splits, epochs, optimizer, or loss used for the baselines versus the proposed method, which is load-bearing for the cross-method comparison.

    Authors: We acknowledge that the phrase 'under comparable training conditions' requires explicit definition to support the comparison. Our experiments used the same primary training datasets (FlyingChairs, FlyingThings3D, Sintel, KITTI) and followed a similar multi-stage training schedule and optimizer as SEA-RAFT. We will revise the abstract to state this explicitly and add a concise comparison table or paragraph in the Experiments section listing datasets, epochs, and loss terms for our method versus the baselines. This will allow direct assessment of fairness. revision: yes
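To make the promised disclosure concrete, here is a sketch of the shape such a stage-by-stage listing could take; every value below is a placeholder for illustration, not a number taken from the paper or from SEA-RAFT.

```python
# Illustrative shape of the per-stage disclosure the rebuttal promises. All numbers
# below are placeholders, not values from the paper or from any baseline.
training_stages = [
    {"stage": "pretrain", "dataset": "FlyingChairs",   "steps": 100_000, "lr": 4e-4, "crop": (368, 496)},
    {"stage": "pretrain", "dataset": "FlyingThings3D", "steps": 100_000, "lr": 1e-4, "crop": (400, 720)},
    {"stage": "finetune", "dataset": "Sintel + KITTI", "steps": 100_000, "lr": 1e-4, "crop": (368, 768)},
]

def summarize(stages):
    """Print a compact schedule table so baselines and the proposed method can be compared line by line."""
    for s in stages:
        print(f"{s['stage']:<9} {s['dataset']:<16} steps={s['steps']:<7} lr={s['lr']:<8} crop={s['crop']}")

summarize(training_stages)
```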

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical architecture that fuses frozen DINO-v2 semantic features with monocular depth priors inside a global matching head to produce single-pass flow. No equations, fitted parameters, or self-citations reduce the reported EPE numbers or generalization claims to the inputs by construction. The central performance numbers are benchmark results obtained under stated training conditions; they do not arise from renaming or re-deriving the same quantities that were used to build the model. The evaluation therefore rests on external benchmarks rather than on quantities the model constructs for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that frozen foundation models already encode sufficiently rich semantic and geometric information for accurate correspondence without further adaptation or test-time optimization; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Pretrained foundation models such as DINO-v2 encode useful semantic priors for visual correspondence tasks.
    The method freezes DINO-v2 and relies on its features to avoid refinement.
  • domain assumption Monocular depth foundation models supply reliable geometric cues that complement semantic features for flow estimation.
    The paper combines these cues to enable single-pass global matching.

pith-pipeline@v0.9.0 · 5558 in / 1405 out tokens · 37404 ms · 2026-05-11T02:57:18.226361+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag legend
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  1. [1] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–. Springer-Verlag, 2012.
  2. [2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  3. [3] Weirong Chen, Suryansh Kumar, and Fisher Yu. Uncertainty-driven dense two-view structure from motion. IEEE Robotics and Automation Letters, 8(3):1763–1770, 2023.
  4. [4] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  5. [5] Chen Gao, Ayush Saraf, Jia-Bin Huang, and Johannes Kopf. Flow-edge guided video completion. In European Conference on Computer Vision, pages 713–729. Springer, 2020.
  6. [6] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  7. [7] Yuxiang Huang, Yuhao Chen, and John Zelek. Zero-shot monocular motion segmentation in the wild by combining deep learning with geometric motion model fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2733–2743, 2024.
  8. [8] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. FlowFormer: A transformer architecture for optical flow. In European Conference on Computer Vision, pages 668–.
  9. [9] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.
  10. [10] Azin Jahedi, Maximilian Luz, Marc Rivinius, Lukas Mehl, and Andrés Bruhn. MS-RAFT+: High resolution multi-scale RAFT. International Journal of Computer Vision, 132(5):1835–1856, 2024.
  11. [11] Nishant Jain, Suryansh Kumar, and Luc Van Gool. Enhanced stable view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13208–13217, 2023.
  12. [12] Nishant Jain, Suryansh Kumar, and Luc Van Gool. Learning robust multi-scale representation for neural radiance fields from unposed images. International Journal of Computer Vision, 132(4):1310–1335, 2024.
  13. [13] Daniel Kondermann, Rahul Nair, Katrin Honauer, Karsten Krispin, Jonas Andrulis, Alexander Brock, Burkhard Gussefeld, Mohsen Rahimimoghaddam, Sabine Hofmann, Claus Brenner, et al. The HCI benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Rec...
  14. [14] Suryansh Kumar. Jumping manifolds: Geometry aware dense non-rigid structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5346–5355, 2019.
  15. [15] Suryansh Kumar. Non-rigid structure from motion: Prior-free factorization method revisited. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 51–60, 2020.
  16. [16] Suryansh Kumar and Luc Van Gool. Organic priors in non-rigid structure from motion. In European Conference on Computer Vision, pages 71–88. Springer, 2022.
  17. [17] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. In Proceedings of the IEEE International Conference on Computer Vision, pages 4649–4657, 2017.
  18. [18] Suryansh Kumar, Anoop Cherian, Yuchao Dai, and Hongdong Li. Scalable dense non-rigid structure-from-motion: A Grassmannian perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 254–263, 2018.
  19. [19] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Superpixel soup: Monocular dense 3d reconstruction of a complex dynamic scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1705–1717, 2019.
  20. [20] Haofeng Li, Guanqi Chen, Guanbin Li, and Yizhou Yu. Motion guided attention for video salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7274–7283, 2019.
  21. [21] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. VA-DepthNet: A variational approach to single image depth prediction. In The Eleventh International Conference on Learning Representations.
  22. [22] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. Single image depth prediction made better: A multivariate Gaussian take. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17346–17356, 2023.
  23. [23] Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, and Ziwei Liu. Segment any point cloud sequences by distilling vision foundation models. Advances in Neural Information Processing Systems, 36:37193–37229, 2023.
  24. [24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  25. [25] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  26. [26] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  27. [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  28. [28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  29. [29] AJ Piergiovanni and Michael S Ryoo. Representation flow for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9945–9953, 2019.
  30. [30] Matteo Poggi and Fabio Tosi. FlowSeek: Optical flow made easier with depth foundation models and motion bases. In Proceedings of the International Conference on Computer Vision (ICCV), 2025.
  31. [31] Qiyang Qian, Hansheng Chen, Masayoshi Tomizuka, Kurt Keutzer, Qianqian Wang, and Chenfeng Xu. Bridging viewpoint gaps: Geometric reasoning boosts semantic correspondence. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11579–11589, 2025.
  32. [32] Shenhan Qian, Ganlin Zhang, Shangzhe Wu, and Daniel Cremers. Flow4r: Unifying 4d reconstruction and tracking with scene flow. arXiv preprint arXiv:2602.14021, 2026.
  33. [33] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and their principles. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2432–2439. IEEE, 2010.
  34. [34] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  35. [35] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Models matter, so does training: An empirical study of CNNs for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1408–1423, 2019.
  36. [36] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, and Wei Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1390–1399, 2018.
  37. [37] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision, pages 402–419. Springer, 2020.
  38. [38] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1384, 2013.
  39. [39] Yihan Wang, Lahav Lipson, and Jia Deng. SEA-RAFT: Simple, efficient, accurate RAFT for optical flow. In European Conference on Computer Vision, pages 36–54. Springer, 2024.
  40. [40] Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, and Yuanzhang Li. Boosting generative adversarial transferability with self-supervised vision transformer features. arXiv preprint arXiv:2506.21046, 2025.
  41. [41] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. GMFlow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8121–8130, 2022.
  42. [42] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.
  43. [43] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
  44. [44] Haojie Zhang, Yongyi Su, Xun Xu, and Kui Jia. Improving the generalization of segmentation foundation model under distribution shift via weakly supervised adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23385–23395, 2024.
  45. [45] Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Jiabo Huang, Vikash Sehwag, Vivek Sharma, Hirotaka Shinozaki, Felan Carlo Garcia, et al. Argus: A compact and versatile foundation model for vision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4418–4429, 2025.