pith. machine review for the scientific record.

arxiv: 2605.08000 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Rethinking Dense Optical Flow without Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords dense optical flow · single forward pass · foundation models · DINO-v2 · global matching · no test-time refinement · monocular depth priors · cross-dataset generalization

The pith

Powerful priors from frozen foundation models enable accurate dense optical flow in one forward pass without iterative refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that test-time scaling through recurrent refinement is not required for strong performance in dense optical flow estimation. It shows that semantic features from a frozen DINO-v2 model, fused with geometric cues from a monocular depth foundation model, can drive global matching to produce reliable correspondences directly. Under comparable training, this yields 2.81 EPE on Sintel Final and outperforms several recent refinement-based methods. A reader would care because it reframes efficiency as a matter of leveraging existing pretrained representations rather than adding inference steps. The result suggests that foundation models can substitute for the computational overhead of multi-step pipelines on challenging benchmarks.
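For readers outside the flow literature, endpoint error (EPE) is the metric behind the 2.81 figure: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. A minimal sketch of the standard definition (not code from the paper):

```python
import torch

def endpoint_error(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """Mean endpoint error (EPE) between predicted and ground-truth flow.

    Both tensors have shape (B, 2, H, W); the per-pixel error is the Euclidean
    distance between the predicted and true displacement vectors.
    """
    per_pixel = torch.norm(flow_pred - flow_gt, p=2, dim=1)  # (B, H, W)
    return per_pixel.mean()

# Toy check: a prediction off by one pixel horizontally everywhere has EPE = 1.0.
gt = torch.zeros(1, 2, 4, 4)
pred = gt.clone()
pred[:, 0] += 1.0
print(endpoint_error(pred, gt))  # tensor(1.)
```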

Core claim

We present a framework that estimates dense optical flow in a single forward pass by extracting visual semantic features from a frozen DINO-v2 backbone, combining them with geometric cues from a monocular depth foundation model, fusing the priors into a unified representation, and applying global matching to recover correspondences without recurrent updates or test-time optimization.
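As a reading aid, here is a minimal sketch of the kind of prior-fusion stage this claim describes. The stand-in encoders, feature dimensions, and the concatenation-plus-convolution fusion head are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FusedPriorEncoder(nn.Module):
    """Sketch of the prior-fusion stage: two frozen pretrained encoders
    (semantic and geometric) feed a small trainable fusion head."""

    def __init__(self, semantic_encoder: nn.Module, depth_encoder: nn.Module,
                 sem_dim: int, geo_dim: int, fused_dim: int = 256):
        super().__init__()
        self.semantic_encoder = semantic_encoder
        self.depth_encoder = depth_encoder
        for enc in (self.semantic_encoder, self.depth_encoder):
            enc.eval()
            for p in enc.parameters():
                p.requires_grad_(False)        # priors stay frozen; no fine-tuning
        self.fuse = nn.Sequential(             # the only trainable part of this sketch
            nn.Conv2d(sem_dim + geo_dim, fused_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            sem = self.semantic_encoder(image)  # stand-in for DINO-v2 patch features
            geo = self.depth_encoder(image)     # stand-in for monocular depth features
        return self.fuse(torch.cat([sem, geo], dim=1))

# Usage with stand-in encoders (real DINO-v2 / depth backbones would replace these):
sem_enc = nn.Conv2d(3, 384, kernel_size=8, stride=8)
geo_enc = nn.Conv2d(3, 64, kernel_size=8, stride=8)
encoder = FusedPriorEncoder(sem_enc, geo_enc, sem_dim=384, geo_dim=64)
frame1 = torch.randn(1, 3, 64, 64)
print(encoder(frame1).shape)  # torch.Size([1, 256, 8, 8])
```

In a full pipeline both frames would pass through this encoder, and the flow would then be recovered by the global matching step sketched below.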

What carries the argument

Fusion of frozen DINO-v2 semantic features with monocular depth geometric cues into a unified representation followed by global matching for dense correspondence estimation.
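The matching step can be read as a GMFlow-style soft argmax over an all-pairs correlation volume. The sketch below is a generic version of that formulation; the temperature, scaling, and coordinate conventions are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn.functional as F

def global_match(f1: torch.Tensor, f2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Global matching sketch: all-pairs feature correlation, softmax over the
    second frame, and flow as the expected displacement (soft argmax).

    f1, f2: (B, C, H, W) feature maps of the two frames.
    Returns: (B, 2, H, W) flow in pixels, (x, y) order.
    """
    b, c, h, w = f1.shape
    f1_flat = f1.flatten(2).transpose(1, 2)                              # (B, H*W, C)
    f2_flat = f2.flatten(2).transpose(1, 2)                              # (B, H*W, C)
    corr = torch.matmul(f1_flat, f2_flat.transpose(1, 2)) / (c ** 0.5)   # (B, H*W, H*W)
    prob = F.softmax(corr / temperature, dim=-1)                         # matching distribution

    # Grid of pixel coordinates in frame 2, shape (H*W, 2), (x, y) order.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=f1.dtype),
                            torch.arange(w, dtype=f1.dtype), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

    matched = torch.matmul(prob, grid)            # (B, H*W, 2) expected target coordinates
    flow = matched - grid.unsqueeze(0)            # displacement per source pixel
    return flow.transpose(1, 2).reshape(b, 2, h, w)

# Toy check: identical feature maps should give (near-)zero flow everywhere.
feat = torch.randn(1, 64, 8, 8)
print(global_match(feat, feat).abs().max())
```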

If this is right

  • Dense flow estimation becomes feasible with fixed, low inference cost independent of scene complexity (see the timing sketch after this list).
  • Training effort shifts from learning refinement dynamics to learning how to fuse complementary pretrained priors.
  • Cross-dataset generalization improves because the priors are not tuned to any single benchmark's error patterns.
  • Real-time applications on edge devices become viable without sacrificing benchmark accuracy.
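On the first point above, a toy timing harness (stand-in convolutional modules, not the paper's networks) illustrates the cost structure the claim depends on: a single forward pass has a fixed price, while an iterative pipeline pays that price once per refinement step.

```python
import time
import torch
import torch.nn as nn

def time_model(fn, *args, repeats: int = 10) -> float:
    """Median wall-clock seconds per call; a rough CPU-side proxy for inference cost."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        with torch.no_grad():
            fn(*args)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# Stand-ins only: the same head applied once vs. K times, mimicking K refinement
# steps. Channel sizes and K are arbitrary assumptions for illustration.
head = nn.Conv2d(64, 64, kernel_size=3, padding=1)
feat = torch.randn(1, 64, 128, 128)

single_pass = lambda x: head(x)

def refined(x, k: int = 12):      # cost grows linearly with the number of refinement steps
    for _ in range(k):
        x = head(x)
    return x

print("single pass:       ", time_model(single_pass, feat))
print("12-step refinement:", time_model(refined, feat))
```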

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same fusion-plus-global-matching pattern could be tested on related dense prediction tasks such as stereo disparity or scene flow.
  • If foundation priors continue to strengthen, the performance gap between single-pass and multi-step methods may widen further on future benchmarks.
  • The approach opens a route to parameter-efficient adaptation where only a lightweight fusion head is trained on top of frozen backbones.
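On the last point, a minimal sketch of that adaptation recipe under stated assumptions: both backbones are frozen stand-ins, and the optimizer is handed only the fusion head's parameters. Module shapes and the AdamW hyperparameters are illustrative, not settings from the paper.

```python
import torch
import torch.nn as nn

# Parameter-efficient adaptation sketch: freeze pretrained backbones, train only a
# lightweight fusion head. Backbone stand-ins and all hyperparameters are assumptions.
backbone_sem = nn.Conv2d(3, 384, kernel_size=8, stride=8)   # stand-in for a frozen DINO-v2
backbone_geo = nn.Conv2d(3, 64, kernel_size=8, stride=8)    # stand-in for a frozen depth model
for backbone in (backbone_sem, backbone_geo):
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

fusion_head = nn.Sequential(
    nn.Conv2d(384 + 64, 256, kernel_size=1),
    nn.GELU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
)

# Only the fusion head contributes trainable parameters.
all_params = (list(backbone_sem.parameters()) + list(backbone_geo.parameters())
              + list(fusion_head.parameters()))
trainable = [p for p in all_params if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters (fusion head only)")

optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
```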

Load-bearing premise

The semantic and geometric information already captured inside current foundation models is rich enough to produce accurate dense correspondences without iterative correction at inference time.

What would settle it

A controlled comparison on Sintel Final or KITTI in which the single-pass method records higher endpoint error than a refinement-based baseline trained under identical conditions on identical data.

Figures

Figures reproduced from arXiv: 2605.08000 by Praroop Chanda and Suryansh Kumar.

Figure 1: Overview. The conventional CNN-based feature encoder is replaced with DINOv2 [27] to provide semantically rich, large-scale self-supervised visual features, while the original transformer-based feature interaction, global matching, and flow propagation modules remain unchanged. In addition, monocular depth estimates from Depth Anything V2 [43] are introduced as a geometric prior to improve feature conditioning…
Figure 2: Qualitative comparison on Sintel (Final). This benchmark contains severe motion blur, illumination variation, and large displacements. Despite operating without iterative refinement, our method preserves sharp motion boundaries and produces accurate flow estimates, performing comparably to, and in some cases better than, refinement-based approaches.
Original abstract

Recent progress in dense optical flow has been driven by increasingly complex architectures and multi-step refinement for test-time scaling. While these approaches achieve strong benchmark performance, they also require substantial computation during inference. This raises a fundamental question: Is scaling test-time computation the only way to improve dense optical flow accuracy? We argue that it is not. Instead, powerful visual semantic and geometric priors encoded in modern foundation models can reduce, if not overcome, the need for computationally expensive iterative refinement at test-time. In this paper, we present a framework that estimates dense optical flow in a single forward pass, leveraging pretrained foundation representations, while avoiding iterative refinement and additional inference-time computation, thus offering an alternative to test-time scaling. Our method extracts visual semantic features from a frozen DINO-v2 backbone and combines them with geometric cues from a monocular depth foundation model. We fuse these complementary priors into a unified representation and apply a global matching formulation to estimate dense correspondences without recurrent updates or test-time optimization. Despite avoiding iterative refinement, our approach achieves strong cross-dataset generalization across challenging benchmarks. On Sintel Final, we obtain 2.81 EPE without refinement, significantly improving over state-of-the-art (SOTA) SEA-RAFT under comparable training conditions and outperforming RAFT, GMFlow (without refinement), and recent FlowSeek in the same setting. These results suggest that strong foundation priors can substitute for test-time scaling, offering a computationally efficient alternative to refinement-heavy pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a single-forward-pass dense optical flow method that extracts semantic features from a frozen DINO-v2 backbone, combines them with geometric cues from a monocular depth foundation model, fuses the priors into a unified representation, and applies a global matching formulation to estimate correspondences without recurrent refinement or test-time optimization. It reports 2.81 EPE on Sintel Final and claims to outperform SEA-RAFT (under comparable training), RAFT, GMFlow (no refinement), and FlowSeek.

Significance. If the reported numbers hold under fair and fully documented conditions, the result would show that strong priors from foundation models can substitute for iterative test-time scaling in optical flow, offering a computationally lighter alternative to refinement-heavy pipelines and potentially shifting emphasis toward pretrained representations for dense correspondence tasks.

major comments (2)
  1. [Abstract] The abstract states benchmark numbers (2.81 EPE on Sintel Final) and claims improvement over SEA-RAFT, RAFT, GMFlow, and FlowSeek, but supplies no training details, loss functions, fusion mechanism, or full experimental protocol. Without these, it is impossible to verify whether the central claim is supported by the proposed architecture or by differences in training setup.
  2. [Abstract] The statement that results are obtained 'under comparable training conditions' to SEA-RAFT is undefined; no specification is given of the datasets, splits, epochs, optimizer, or loss used for the baselines versus the proposed method, which is load-bearing for the cross-method comparison.
minor comments (1)
  1. [Abstract] The phrase 'global matching formulation' is used without even a one-sentence description or a pointer to the matching cost or correspondence solver.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater clarity in the abstract regarding experimental details and training conditions. We address each point below and will revise the manuscript to improve verifiability while preserving the core contribution of a single-pass framework that leverages foundation-model priors.

read point-by-point responses
  1. Referee: [Abstract] The abstract states benchmark numbers (2.81 EPE on Sintel Final) and claims improvement over SEA-RAFT, RAFT, GMFlow, and FlowSeek, but supplies no training details, loss functions, fusion mechanism, or full experimental protocol. Without these, it is impossible to verify whether the central claim is supported by the proposed architecture or by differences in training setup.

    Authors: We agree that the abstract is too concise and omits key details. The full manuscript describes the fusion of frozen DINO-v2 semantic features with monocular depth geometric cues, the global matching formulation, the training loss, and the complete experimental protocol. We will revise the abstract to briefly note the single-forward-pass design, the use of frozen foundation backbones, and the training regime, while explicitly directing readers to the Methods and Experiments sections for the full protocol. This change will make it clearer that the reported gains stem from the architecture rather than undisclosed training differences. revision: yes

  2. Referee: [Abstract] The statement that results are obtained 'under comparable training conditions' to SEA-RAFT is undefined; no specification is given of the datasets, splits, epochs, optimizer, or loss used for the baselines versus the proposed method, which is load-bearing for the cross-method comparison.

    Authors: We acknowledge that the phrase 'under comparable training conditions' requires explicit definition to support the comparison. Our experiments used the same primary training datasets (FlyingChairs, FlyingThings3D, Sintel, KITTI) and followed a similar multi-stage training schedule and optimizer as SEA-RAFT. We will revise the abstract to state this explicitly and add a concise comparison table or paragraph in the Experiments section listing datasets, epochs, and loss terms for our method versus the baselines. This will allow direct assessment of fairness. revision: yes
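To make the promised disclosure concrete, here is a sketch of the shape such a stage-by-stage listing could take; every value below is a placeholder for illustration, not a number taken from the paper or from SEA-RAFT.

```python
# Illustrative shape of the per-stage disclosure the rebuttal promises. All numbers
# below are placeholders, not values from the paper or from any baseline.
training_stages = [
    {"stage": "pretrain", "dataset": "FlyingChairs",   "steps": 100_000, "lr": 4e-4, "crop": (368, 496)},
    {"stage": "pretrain", "dataset": "FlyingThings3D", "steps": 100_000, "lr": 1e-4, "crop": (400, 720)},
    {"stage": "finetune", "dataset": "Sintel + KITTI", "steps": 100_000, "lr": 1e-4, "crop": (368, 768)},
]

def summarize(stages):
    """Print a compact schedule table so baselines and the proposed method can be compared line by line."""
    for s in stages:
        print(f"{s['stage']:<9} {s['dataset']:<16} steps={s['steps']:<7} lr={s['lr']:<8} crop={s['crop']}")

summarize(training_stages)
```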

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical architecture that fuses frozen DINO-v2 semantic features with monocular depth priors inside a global matching head to produce single-pass flow. No equations, fitted parameters, or self-citations reduce the reported EPE numbers or generalization claims to the inputs by construction. The central performance numbers are benchmark results obtained under stated training conditions; they do not arise from renaming or re-deriving the same quantities that were used to build the model. The evaluation therefore rests on external benchmarks rather than on quantities the model constructs for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that frozen foundation models already encode sufficiently rich semantic and geometric information for accurate correspondence without further adaptation or test-time optimization; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Pretrained foundation models such as DINO-v2 encode useful semantic priors for visual correspondence tasks.
    The method freezes DINO-v2 and relies on its features to avoid refinement.
  • domain assumption Monocular depth foundation models supply reliable geometric cues that complement semantic features for flow estimation.
    The paper combines these cues to enable single-pass global matching.

pith-pipeline@v0.9.0 · 5558 in / 1405 out tokens · 37404 ms · 2026-05-11T02:57:18.226361+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag legend
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  1. [1] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–. Springer-Verlag, 2012.
  2. [2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  3. [3] Weirong Chen, Suryansh Kumar, and Fisher Yu. Uncertainty-driven dense two-view structure from motion. IEEE Robotics and Automation Letters, 8(3):1763–1770, 2023.
  4. [4] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  5. [5] Chen Gao, Ayush Saraf, Jia-Bin Huang, and Johannes Kopf. Flow-edge guided video completion. In European Conference on Computer Vision, pages 713–729. Springer, 2020.
  6. [6] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  7. [7] Yuxiang Huang, Yuhao Chen, and John Zelek. Zero-shot monocular motion segmentation in the wild by combining deep learning with geometric motion model fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2733–2743, 2024.
  8. [8] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. FlowFormer: A transformer architecture for optical flow. In European Conference on Computer Vision, pages 668–.
  9. [9] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.
  10. [10] Azin Jahedi, Maximilian Luz, Marc Rivinius, Lukas Mehl, and Andrés Bruhn. MS-RAFT+: High resolution multi-scale RAFT. International Journal of Computer Vision, 132(5):1835–1856, 2024.
  11. [11] Nishant Jain, Suryansh Kumar, and Luc Van Gool. Enhanced stable view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13208–13217, 2023.
  12. [12] Nishant Jain, Suryansh Kumar, and Luc Van Gool. Learning robust multi-scale representation for neural radiance fields from unposed images. International Journal of Computer Vision, 132(4):1310–1335, 2024.
  13. [13] Daniel Kondermann, Rahul Nair, Katrin Honauer, Karsten Krispin, Jonas Andrulis, Alexander Brock, Burkhard Gussefeld, Mohsen Rahimimoghaddam, Sabine Hofmann, Claus Brenner, et al. The HCI benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Rec...
  14. [14] Suryansh Kumar. Jumping manifolds: Geometry aware dense non-rigid structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5346–5355, 2019.
  15. [15] Suryansh Kumar. Non-rigid structure from motion: Prior-free factorization method revisited. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 51–60, 2020.
  16. [16] Suryansh Kumar and Luc Van Gool. Organic priors in non-rigid structure from motion. In European Conference on Computer Vision, pages 71–88. Springer, 2022.
  17. [17] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. In Proceedings of the IEEE International Conference on Computer Vision, pages 4649–4657, 2017.
  18. [18] Suryansh Kumar, Anoop Cherian, Yuchao Dai, and Hongdong Li. Scalable dense non-rigid structure-from-motion: A Grassmannian perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 254–263, 2018.
  19. [19] Suryansh Kumar, Yuchao Dai, and Hongdong Li. Superpixel soup: Monocular dense 3d reconstruction of a complex dynamic scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1705–1717, 2019.
  20. [20] Haofeng Li, Guanqi Chen, Guanbin Li, and Yizhou Yu. Motion guided attention for video salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7274–7283, 2019.
  21. [21] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. VA-DepthNet: A variational approach to single image depth prediction. In The Eleventh International Conference on Learning Representations.
  22. [22] Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, and Luc Van Gool. Single image depth prediction made better: A multivariate Gaussian take. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17346–17356, 2023.
  23. [23] Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, and Ziwei Liu. Segment any point cloud sequences by distilling vision foundation models. Advances in Neural Information Processing Systems, 36:37193–37229, 2023.
  24. [24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  25. [25] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  26. [26] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  27. [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  28. [28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  29. [29] AJ Piergiovanni and Michael S Ryoo. Representation flow for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9945–9953, 2019.
  30. [30] Matteo Poggi and Fabio Tosi. FlowSeek: Optical flow made easier with depth foundation models and motion bases. In Proceedings of the International Conference on Computer Vision (ICCV), 2025.
  31. [31] Qiyang Qian, Hansheng Chen, Masayoshi Tomizuka, Kurt Keutzer, Qianqian Wang, and Chenfeng Xu. Bridging viewpoint gaps: Geometric reasoning boosts semantic correspondence. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11579–11589, 2025.
  32. [32] Shenhan Qian, Ganlin Zhang, Shangzhe Wu, and Daniel Cremers. Flow4r: Unifying 4d reconstruction and tracking with scene flow. arXiv preprint arXiv:2602.14021, 2026.
  33. [33] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and their principles. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2432–2439. IEEE, 2010.
  34. [34] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  35. [35] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Models matter, so does training: An empirical study of CNNs for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1408–1423, 2019.
  36. [36] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, and Wei Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1390–1399, 2018.
  37. [37] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision, pages 402–419. Springer, 2020.
  38. [38] Christoph Vogel, Konrad Schindler, and Stefan Roth. Piecewise rigid scene flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1384, 2013.
  39. [39] Yihan Wang, Lahav Lipson, and Jia Deng. SEA-RAFT: Simple, efficient, accurate RAFT for optical flow. In European Conference on Computer Vision, pages 36–54. Springer, 2024.
  40. [40] Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, and Yuanzhang Li. Boosting generative adversarial transferability with self-supervised vision transformer features. arXiv preprint arXiv:2506.21046, 2025.
  41. [41] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. GMFlow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8121–8130, 2022.
  42. [42] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.
  43. [43] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
  44. [44] Haojie Zhang, Yongyi Su, Xun Xu, and Kui Jia. Improving the generalization of segmentation foundation model under distribution shift via weakly supervised adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23385–23395, 2024.
  45. [45] Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Jiabo Huang, Vikash Sehwag, Vivek Sharma, Hirotaka Shinozaki, Felan Carlo Garcia, et al. Argus: A compact and versatile foundation model for vision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4418–4429, 2025.