RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow

Gang Chen; Wei-Shi Zheng; Xunpei Sun; Yi Chang; Zuoxun Hou

arxiv: 2604.19349 · v1 · submitted 2026-04-21 · 💻 cs.CV

RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow

Xunpei Sun , Zuoxun Hou , Yi Chang , Gang Chen , Wei-Shi Zheng This is my paper

Pith reviewed 2026-05-10 02:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords monocular scene flowself-supervised learningtemporal feature fusiongeometry-motion featureocclusion handlingdepth estimationKITTI benchmarkrecurrent updates

0 comments

The pith

Recurrent fusion of a Geometry-Motion Feature enables accurate self-supervised monocular scene flow estimation over multiple frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multi-frame approach to recover dense 3D motion from single-camera image sequences, moving beyond the common two-frame limitation. At its core is the Geometry-Motion Feature, which encodes joint motion and geometry information and receives recurrent updates to reason across time. Relative positional attention and an occlusion regularization module are added to maintain reliable propagation of motion cues into hidden areas. This design allows joint depth and flow prediction in a self-supervised manner. A reader would care because real-world videos contain frequent occlusions and extended motion that two-frame methods struggle to resolve.

Core claim

The authors claim that a self-supervised framework recurrently fuses temporal features through an iteratively updated Geometry-Motion Feature, augmented by relative positional attention for spatial priors and occlusion regularization to transfer motion from visible to ambiguous regions, thereby producing more accurate and robust monocular scene flow estimates than prior two-frame techniques.

What carries the argument

The Geometry-Motion Feature (GMF), a compact representation that encodes coupled motion and geometry cues and is iteratively updated across frames to support temporal reasoning.

If this is right

Joint depth and scene flow can be estimated from longer image sequences rather than isolated pairs.
Occlusion handling improves by propagating motion cues from visible to hidden areas via attention and regularization.
Self-supervised training becomes viable for multi-frame temporal modeling without requiring ground-truth labels.
Performance gains appear on standard benchmarks such as KITTI Scene Flow, with particular benefit in occluded regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The recurrent update pattern could be tested on other video tasks that require consistent 3D motion over time, such as object tracking in 3D.
Extending the framework to longer sequences or different camera motions might expose limits in how far temporal information can be propagated.
Combining the GMF with additional cues like semantic segmentation could further stabilize estimates in complex scenes.
The occlusion regularization approach may generalize to related problems in optical flow or stereo matching where visibility varies.

Load-bearing premise

That recurrent iterative updates to the Geometry-Motion Feature, together with relative positional attention and occlusion regularization, will propagate accurate motion information into occluded regions during self-supervised training without introducing systematic errors.

What would settle it

A head-to-head evaluation on the KITTI Scene Flow benchmark that shows no reduction in the SF-all metric or no gain in occluded-region accuracy relative to a two-frame baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.19349 by Gang Chen, Wei-Shi Zheng, Xunpei Sun, Yi Chang, Zuoxun Hou.

**Figure 2.** Figure 2: Proposed multi-frame architecture. The forward and backward scene flow branches run in parallel with shared weights. Disparity is derived from their [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Recurrent fusion module. This figure illustrates the iterative refinement process of forward estimation. Forward and backward Geometry-Motion [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of GMF fusion. We compare the feature similarity maps of: (a) the backward GMF, (b) the forward GMF, (c) the initial fused GMF [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Illustration of the occlusion regularization process. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results on the KITTI Scene Flow benchmark. We compare with Self-Mono-SF [ [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of Attention Maps and Feature Similarity. Top row: Input frames; Second and third rows: Results for Frame #17; Fourth and fifth [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Zero-shot generalization results on Cityscapes [ [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: A representative failure case on the KITTI Scene Flow Test [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Zero-shot generalization results on the Spring dataset [ [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative results from the ablation study illustrating the impact of our main contributions. Columns (a)–(e) depict results from different model [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

read the original abstract

Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at https://github.com/sunzunyi/RAFT-MSF-PlusPlus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAFT-MSF++ extends RAFT with a recurrent Geometry-Motion Feature, relative positional attention, and occlusion regularization for multi-frame self-supervised scene flow, claiming solid KITTI gains, but the abstract leaves the loss details and occlusion robustness unproven.

read the letter

The core advance is taking the RAFT iterative update and making it multi-frame by recurrently fusing a compact Geometry-Motion Feature across time, plus adding relative positional attention and an occlusion regularization term to push reliable motion into hidden areas. They report 24.14% SF-all on KITTI Scene Flow, a 31% relative lift over their baseline, and improved occluded-region behavior, with code released on GitHub. That directly tackles the two-frame limit that most prior monocular scene flow work still has, and the GMF design looks like a clean way to keep geometry and motion coupled without separate heads exploding in complexity. The temporal fusion idea is sensible for sequences where single pairs lose context. The occlusion module is the part that tries to make the recurrent updates robust where photometric losses go dark. The main soft spot is that the abstract gives no loss equations, no training schedule, no ablation numbers, and no error breakdown by occlusion mask. Without those, it is hard to tell whether the regularization actually infers correct 3D motion or just damps errors under the self-supervised photometric term. The stress-test worry about bias accumulation across iterations in independently moving objects is reasonable; photometric consistency supplies no signal inside occlusions, so any propagation step rests on continuity assumptions that may not hold. If the full paper has stratified results or an ablation removing the regularization term, that would settle it. This is aimed at people building self-supervised 3D perception stacks for driving or robotics who already know RAFT-style architectures. A reader who wants concrete multi-frame improvements on standard benchmarks will get something usable from the GMF and attention design. It deserves peer review because the problem is practical, the benchmark numbers are reported on a common dataset, and the code is public, even if the method section needs more equations and controls to be fully convincing.

Referee Report

3 major / 1 minor

Summary. The paper proposes RAFT-MSF++, a self-supervised multi-frame monocular scene flow method that recurrently fuses temporal Geometry-Motion Features (GMF) via relative positional attention and an occlusion regularization module to jointly estimate depth and scene flow. It reports achieving 24.14% SF-all on the KITTI Scene Flow benchmark (30.99% relative improvement over baseline) with claimed better robustness in occluded regions, and releases code.

Significance. If the performance claims and occlusion-handling mechanism hold under rigorous validation, the work would meaningfully advance self-supervised scene flow by extending temporal modeling beyond two-frame baselines, addressing a key limitation for applications like autonomous driving. The public code release is a clear strength supporting reproducibility.

major comments (3)

[§3] §3 (Method, GMF update and occlusion regularization): The claim that relative positional attention plus occlusion regularization enables reliable motion propagation from visible pixels rests on an unverified assumption; in self-supervised photometric training, occluded regions supply no direct gradient, so the recurrent GMF updates implicitly rely on motion continuity that may not hold for independently moving objects. No derivation, pseudocode, or error-propagation analysis is supplied to show the regularization damps biases rather than amplifying them across iterations.
[§4] §4 (Experiments, main results and ablations): The headline 24.14% SF-all and 30.99% improvement are presented without an ablation table isolating the occlusion regularization term or relative positional attention, nor error metrics stratified by occlusion masks. This omission is load-bearing because the central claim of improved occluded-region robustness cannot be attributed to the proposed components without these controls.
[Abstract and §4] Abstract and §4: Quantitative claims are given without the explicit loss definitions, GMF update equations, or training hyperparameters that would allow independent verification of the self-supervised procedure; the abstract supplies component names but no equations, preventing assessment of whether the reported gains are consistent with the training objective.

minor comments (1)

[Figures and Notation] Figure captions and notation: The distinction between the Geometry-Motion Feature and standard RAFT correlation features should be formalized with a compact equation to improve clarity for readers familiar with the baseline architecture.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and have revised the paper accordingly to provide additional analysis, ablations, and clarifications while maintaining the integrity of our contributions.

read point-by-point responses

Referee: [§3] The claim that relative positional attention plus occlusion regularization enables reliable motion propagation from visible pixels rests on an unverified assumption; in self-supervised photometric training, occluded regions supply no direct gradient, so the recurrent GMF updates implicitly rely on motion continuity that may not hold for independently moving objects. No derivation, pseudocode, or error-propagation analysis is supplied to show the regularization damps biases rather than amplifying them across iterations.

Authors: We appreciate this observation on the underlying assumptions. The occlusion regularization is designed to penalize photometric inconsistencies in occluded areas by leveraging motion estimates propagated via relative positional attention from visible regions. In the revised manuscript, we have added pseudocode for the GMF update and occlusion regularization procedure in Section 3, along with a mathematical formulation of the regularization term. We also include an empirical analysis with qualitative visualizations showing reduced error propagation in occluded regions. We acknowledge that a complete theoretical derivation of error damping under self-supervision is not provided, as it would require assumptions about object motion that do not always hold; however, the design choices are supported by the observed performance gains. revision: partial
Referee: [§4] The headline 24.14% SF-all and 30.99% improvement are presented without an ablation table isolating the occlusion regularization term or relative positional attention, nor error metrics stratified by occlusion masks. This omission is load-bearing because the central claim of improved occluded-region robustness cannot be attributed to the proposed components without these controls.

Authors: We agree that isolating component contributions and providing occlusion-stratified metrics is essential for validating the claims. The revised manuscript now includes a dedicated ablation table evaluating the individual and combined effects of relative positional attention and the occlusion regularization module on SF-all and EPE. We have also added error metrics stratified by occluded versus non-occluded regions, computed using the ground-truth occlusion masks available in the KITTI Scene Flow benchmark. These additions directly attribute the reported robustness improvements to the proposed modules. revision: yes
Referee: [Abstract and §4] Quantitative claims are given without the explicit loss definitions, GMF update equations, or training hyperparameters that would allow independent verification of the self-supervised procedure; the abstract supplies component names but no equations, preventing assessment of whether the reported gains are consistent with the training objective.

Authors: The loss definitions, GMF update equations, and training hyperparameters are fully detailed in Sections 3.2, 3.3, and 4.1. To improve clarity and verifiability, we have updated the abstract to reference the key equations and self-supervised objective. We have also inserted a summary table of all loss weights and hyperparameters in the experiments section, enabling straightforward reproduction and assessment of consistency between the reported gains and the training procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture and empirical results are independent of inputs

full rationale

The paper describes a proposed recurrent architecture (GMF with relative positional attention and occlusion regularization) trained under standard self-supervised photometric losses on KITTI. No equations, fitted parameters, or self-citations are shown that would make any reported metric (e.g., SF-all) equivalent to the training objective by construction. The performance numbers are presented as experimental outcomes on a held-out benchmark, not as algebraic identities or reparameterizations of the loss. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or additional invented entities are stated beyond the core GMF concept.

invented entities (1)

Geometry-Motion Feature (GMF) no independent evidence
purpose: Compact encoding of coupled motion and geometry cues that is iteratively updated for temporal reasoning
Described as central to the approach in the abstract.

pith-pipeline@v0.9.0 · 5497 in / 1265 out tokens · 44143 ms · 2026-05-10T02:39:53.355520+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

[1]

Three-dimensional scene flow,

S. Vedula, P. Rander, R. Collins, and T. Kanade, “Three-dimensional scene flow,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 475–480, 2005

work page 2005
[2]

Difflow3d: Toward robust uncertainty-aware scene flow estimation with iterative diffusion-based refinement,

J. Liu, G. Wang, W. Ye, C. Jiang, J. Han, Z. Liu, G. Zhang, D. Du, and H. Wang, “Difflow3d: Toward robust uncertainty-aware scene flow estimation with iterative diffusion-based refinement,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15 109–15 119

work page 2024
[3]

Deep rigid instance scene flow,

W.-C. Ma, S. Wang, R. Hu, Y . Xiong, and R. Urtasun, “Deep rigid instance scene flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019
[4]

Raft-3d: Scene flow using rigid-motion embed- dings,

Z. Teed and J. Deng, “Raft-3d: Scene flow using rigid-motion embed- dings,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8375–8384

work page 2021
[5]

Flownet3d: Learning scene flow in 3d point clouds,

X. Liu, C. R. Qi, and L. J. Guibas, “Flownet3d: Learning scene flow in 3d point clouds,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 529–537

work page 2019
[6]

Self-supervised monocular scene flow estimation,

J. Hur and S. Roth, “Self-supervised monocular scene flow estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7396–7405

work page 2020
[7]

Raft-msf: Self-supervised monocular scene flow using recurrent optimizer,

B. Bayramli, J. Hur, and H. Lu, “Raft-msf: Self-supervised monocular scene flow using recurrent optimizer,”International Journal of Computer Vision, vol. 131, no. 11, pp. 2757–2769, 2023

work page 2023
[8]

Emr-msf: Self-supervised recurrent monoc- ular scene flow exploiting ego-motion rigidity,

Z. Jiang and M. Okutomi, “Emr-msf: Self-supervised recurrent monoc- ular scene flow exploiting ego-motion rigidity,” inProceedings of the 10 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 69–78

work page 2023
[9]

Self-supervised multi-frame monocular scene flow,

J. Hur and S. Roth, “Self-supervised multi-frame monocular scene flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2684–2694

work page 2021
[10]

M-fuse: Multi-frame fusion for scene flow estimation,

L. Mehl, A. Jahedi, J. Schmalfuss, and A. Bruhn, “M-fuse: Multi-frame fusion for scene flow estimation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2020–2029

work page 2023
[11]

3d scene flow estimation with a piecewise rigid scene model,

C. V ogel, K. Schindler, and S. Roth, “3d scene flow estimation with a piecewise rigid scene model,”International Journal of Computer Vision, vol. 115, pp. 1–28, 2015

work page 2015
[12]

Rms-flownet++: Efficient and robust multi-scale scene flow estimation for large-scale point clouds,

R. Battrawy, R. Schuster, and D. Stricker, “Rms-flownet++: Efficient and robust multi-scale scene flow estimation for large-scale point clouds,” International Journal of Computer Vision, vol. 132, no. 10, pp. 4724– 4745, 2024

work page 2024
[13]

Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,

Y . Zou, Z. Luo, and J.-B. Huang, “Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 36–53

work page 2018
[14]

Geonet: Unsupervised learning of dense depth, optical flow and camera pose,

Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1983– 1992

work page 2018
[15]

Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,

D. Sun, X. Yang, M.-Y . Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8934–8943

work page 2018
[16]

Glofp- msf: Monocular scene flow estimation with global feature perception,

X. Xiang, Y . Cui, X. Wang, M. Zhai, and A. El Saddik, “Glofp- msf: Monocular scene flow estimation with global feature perception,” Multimedia Systems, vol. 30, no. 4, p. 227, 2024

work page 2024
[17]

Integrating semantic segmentation model for self-supervised scene flow estimation via cross task distil- lation,

B. Bayramli, Y . Ding, and H. Lu, “Integrating semantic segmentation model for self-supervised scene flow estimation via cross task distil- lation,” inProceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–8

work page 2024
[18]

Mamba-sf: Monocular scene flow learning with state space models,

Y . Chen, X. Xiang, X. Ben, I. Hassan, M. Zhai, L. Zhang, and X. Zhen, “Mamba-sf: Monocular scene flow learning with state space models,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), 2025, pp. 253–258

work page 2025
[19]

A deep temporal fusion frame- work for scene flow using a learnable motion model and occlusions,

R. Schuster, C. Unger, and D. Stricker, “A deep temporal fusion frame- work for scene flow using a learnable motion model and occlusions,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 247–255

work page 2021
[20]

A fusion approach for multi-frame optical flow estimation,

Z. Ren, O. Gallo, D. Sun, M.-H. Yang, E. B. Sudderth, and J. Kautz, “A fusion approach for multi-frame optical flow estimation,” in2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan. 2019, pp. 2077–2086

work page 2019
[21]

Unsupervised learning of multi-frame optical flow with occlusions,

J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 690–706

work page 2018
[22]

Selflow: Self-supervised learning of optical flow,

P. Liu, M. Lyu, I. King, and J. Xu, “Selflow: Self-supervised learning of optical flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4571–4580

work page 2019
[23]

Proflow: Learning to predict optical flow,

D. Maurer and A. Bruhn, “Proflow: Learning to predict optical flow,” arXiv preprint arXiv:1806.00800, 2018

work page arXiv 2018
[24]

Smurf: Self-teaching multi-frame unsupervised raft with full-image warping,

A. Stone, D. Maurer, A. Ayvaci, A. Angelova, and R. Jonschkowski, “Smurf: Self-teaching multi-frame unsupervised raft with full-image warping,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3887–3896

work page 2021
[25]

Raft: Recurrent all-pairs field transforms for op- tical flow,

Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for op- tical flow,” inProceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), Aug. 2021, pp. 4839–4843

work page 2021
[26]

Transflow: Transformer as flow learner,

Y . Lu, Q. Wang, S. Ma, T. Geng, Y . V . Chen, H. Chen, and D. Liu, “Transflow: Transformer as flow learner,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18 063–18 073

work page 2023
[27]

Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,

X. Shi, Z. Huang, W. Bian, D. Li, M. Zhang, K. C. Cheung, S. See, H. Qin, J. Dai, and H. Li, “Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 12 435–12 446

work page 2023
[29]

M2flow: A motion information fusion framework for enhanced unsupervised optical flow estimation in au- tonomous driving,

X. Sun, G. Chen, and Z. Hou, “M2flow: A motion information fusion framework for enhanced unsupervised optical flow estimation in au- tonomous driving,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, no. 7, 2025, pp. 7140–7148

work page 2025
[30]

Self-attention- based multiscale feature learning optical flow with occlusion feature map prediction,

C. Zhang, Z. Zhou, Z. Chen, W. Hu, M. Li, and S. Jiang, “Self-attention- based multiscale feature learning optical flow with occlusion feature map prediction,”IEEE Transactions on Multimedia, vol. 24, pp. 3340–3354, 2021

work page 2021
[31]

Apcaflow: All-pairs cost volume aggregation for optical flow estimation,

M. Feng, H. Jia, Z. Yan, and X. Yang, “Apcaflow: All-pairs cost volume aggregation for optical flow estimation,”IEEE Transactions on Multimedia, vol. 26, pp. 9060–9069, 2024

work page 2024
[32]

Unflow: Unsupervised learning of optical flow with a bidirectional census loss,

S. Meister, J. Hur, and S. Roth, “Unflow: Unsupervised learning of optical flow with a bidirectional census loss,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 32, no. 1, 2018

work page 2018
[33]

Occlusion aware unsupervised learning of optical flow,

Y . Wang, Y . Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu, “Occlusion aware unsupervised learning of optical flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4884–4893

work page 2018
[34]

Zero-shot monocular scene flow estimation in the wild,

Y . Liang, A. Badki, H. Su, J. Tompkin, and O. Gallo, “Zero-shot monocular scene flow estimation in the wild,” inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[35]

Learning to esti- mate hidden motions with global motion aggregation,

S. Jiang, D. Campbell, Y . Lu, H. Li, and R. Hartley, “Learning to esti- mate hidden motions with global motion aggregation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9772–9781

work page 2021
[36]

Digging into self-supervised monocular depth estimation,

C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 3828–3838

work page 2019
[37]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026

work page 2023
[38]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013

work page 2013
[39]

Unsupervised monocular depth estimation with left-right consistency,

C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 270–279

work page 2017
[40]

Object scene flow for autonomous vehicles,

M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

work page 2015
[41]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” International Conference on Learning Representations (ICLR), 2018

work page 2018
[42]

Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding,

C. Luo, Z. Yang, P. Wang, Y . Wang, W. Xu, R. Nevatia, and A. Yuille, “Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2624–2641, 2019

work page 2019
[43]

Mono-sf: Multi-view ge- ometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes,

F. Brickwedde, S. Abraham, and R. Mester, “Mono-sf: Multi-view ge- ometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 2780–2790

work page 2019
[44]

Self-distilled feature aggregation for self- supervised monocular depth estimation,

Z. Zhou and Q. Dong, “Self-distilled feature aggregation for self- supervised monocular depth estimation,” inProceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 709–726

work page 2022
[45]

Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,

X. Shi, Z. Huang, W. Bian, D. Li, M. Zhang, K. C. Cheung, S. See, H. Qin, J. Dai, and H. Li, “Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 12 469– 12 480

work page 2023
[46]

Memflow: Optical flow estimation and prediction with memory,

Q. Dong and Y . Fu, “Memflow: Optical flow estimation and prediction with memory,” inProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024, pp. 19 068–19 078

work page 2024
[47]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223

work page 2016
[48]

Nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “Nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 621–11 631

work page 2020
[49]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,

L. Mehl, J. Schmalfuss, A. Jahedi, Y . Nalivayko, and A. Bruhn, “Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 4981– 4991. 11 Reference Frame #31 Target Frame #31 (a) Attention Heatmap (b) Feature Si...

work page 2023
[50]

Final only

maintains consistent feature semantics across the object, accurately reflecting its rigid-body nature. In contrast, the two-frame baseline (Row 2, Col 2) loses the vehicle’s semantic integrity. Furthermore, the two- frame features often display spurious high correlations with incorrect, spatially distant regions (as evidenced in Row 4, Col 2), while faili...

work page arXiv 2023

[1] [1]

Three-dimensional scene flow,

S. Vedula, P. Rander, R. Collins, and T. Kanade, “Three-dimensional scene flow,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 475–480, 2005

work page 2005

[2] [2]

Difflow3d: Toward robust uncertainty-aware scene flow estimation with iterative diffusion-based refinement,

J. Liu, G. Wang, W. Ye, C. Jiang, J. Han, Z. Liu, G. Zhang, D. Du, and H. Wang, “Difflow3d: Toward robust uncertainty-aware scene flow estimation with iterative diffusion-based refinement,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15 109–15 119

work page 2024

[3] [3]

Deep rigid instance scene flow,

W.-C. Ma, S. Wang, R. Hu, Y . Xiong, and R. Urtasun, “Deep rigid instance scene flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019

[4] [4]

Raft-3d: Scene flow using rigid-motion embed- dings,

Z. Teed and J. Deng, “Raft-3d: Scene flow using rigid-motion embed- dings,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8375–8384

work page 2021

[5] [5]

Flownet3d: Learning scene flow in 3d point clouds,

X. Liu, C. R. Qi, and L. J. Guibas, “Flownet3d: Learning scene flow in 3d point clouds,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 529–537

work page 2019

[6] [6]

Self-supervised monocular scene flow estimation,

J. Hur and S. Roth, “Self-supervised monocular scene flow estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7396–7405

work page 2020

[7] [7]

Raft-msf: Self-supervised monocular scene flow using recurrent optimizer,

B. Bayramli, J. Hur, and H. Lu, “Raft-msf: Self-supervised monocular scene flow using recurrent optimizer,”International Journal of Computer Vision, vol. 131, no. 11, pp. 2757–2769, 2023

work page 2023

[8] [8]

Emr-msf: Self-supervised recurrent monoc- ular scene flow exploiting ego-motion rigidity,

Z. Jiang and M. Okutomi, “Emr-msf: Self-supervised recurrent monoc- ular scene flow exploiting ego-motion rigidity,” inProceedings of the 10 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 69–78

work page 2023

[9] [9]

Self-supervised multi-frame monocular scene flow,

J. Hur and S. Roth, “Self-supervised multi-frame monocular scene flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2684–2694

work page 2021

[10] [10]

M-fuse: Multi-frame fusion for scene flow estimation,

L. Mehl, A. Jahedi, J. Schmalfuss, and A. Bruhn, “M-fuse: Multi-frame fusion for scene flow estimation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2020–2029

work page 2023

[11] [11]

3d scene flow estimation with a piecewise rigid scene model,

C. V ogel, K. Schindler, and S. Roth, “3d scene flow estimation with a piecewise rigid scene model,”International Journal of Computer Vision, vol. 115, pp. 1–28, 2015

work page 2015

[12] [12]

Rms-flownet++: Efficient and robust multi-scale scene flow estimation for large-scale point clouds,

R. Battrawy, R. Schuster, and D. Stricker, “Rms-flownet++: Efficient and robust multi-scale scene flow estimation for large-scale point clouds,” International Journal of Computer Vision, vol. 132, no. 10, pp. 4724– 4745, 2024

work page 2024

[13] [13]

Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,

Y . Zou, Z. Luo, and J.-B. Huang, “Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 36–53

work page 2018

[14] [14]

Geonet: Unsupervised learning of dense depth, optical flow and camera pose,

Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1983– 1992

work page 2018

[15] [15]

Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,

D. Sun, X. Yang, M.-Y . Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8934–8943

work page 2018

[16] [16]

Glofp- msf: Monocular scene flow estimation with global feature perception,

X. Xiang, Y . Cui, X. Wang, M. Zhai, and A. El Saddik, “Glofp- msf: Monocular scene flow estimation with global feature perception,” Multimedia Systems, vol. 30, no. 4, p. 227, 2024

work page 2024

[17] [17]

Integrating semantic segmentation model for self-supervised scene flow estimation via cross task distil- lation,

B. Bayramli, Y . Ding, and H. Lu, “Integrating semantic segmentation model for self-supervised scene flow estimation via cross task distil- lation,” inProceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–8

work page 2024

[18] [18]

Mamba-sf: Monocular scene flow learning with state space models,

Y . Chen, X. Xiang, X. Ben, I. Hassan, M. Zhai, L. Zhang, and X. Zhen, “Mamba-sf: Monocular scene flow learning with state space models,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), 2025, pp. 253–258

work page 2025

[19] [19]

A deep temporal fusion frame- work for scene flow using a learnable motion model and occlusions,

R. Schuster, C. Unger, and D. Stricker, “A deep temporal fusion frame- work for scene flow using a learnable motion model and occlusions,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 247–255

work page 2021

[20] [20]

A fusion approach for multi-frame optical flow estimation,

Z. Ren, O. Gallo, D. Sun, M.-H. Yang, E. B. Sudderth, and J. Kautz, “A fusion approach for multi-frame optical flow estimation,” in2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan. 2019, pp. 2077–2086

work page 2019

[21] [21]

Unsupervised learning of multi-frame optical flow with occlusions,

J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 690–706

work page 2018

[22] [22]

Selflow: Self-supervised learning of optical flow,

P. Liu, M. Lyu, I. King, and J. Xu, “Selflow: Self-supervised learning of optical flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4571–4580

work page 2019

[23] [23]

Proflow: Learning to predict optical flow,

D. Maurer and A. Bruhn, “Proflow: Learning to predict optical flow,” arXiv preprint arXiv:1806.00800, 2018

work page arXiv 2018

[24] [24]

Smurf: Self-teaching multi-frame unsupervised raft with full-image warping,

A. Stone, D. Maurer, A. Ayvaci, A. Angelova, and R. Jonschkowski, “Smurf: Self-teaching multi-frame unsupervised raft with full-image warping,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3887–3896

work page 2021

[25] [25]

Raft: Recurrent all-pairs field transforms for op- tical flow,

Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for op- tical flow,” inProceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), Aug. 2021, pp. 4839–4843

work page 2021

[26] [26]

Transflow: Transformer as flow learner,

Y . Lu, Q. Wang, S. Ma, T. Geng, Y . V . Chen, H. Chen, and D. Liu, “Transflow: Transformer as flow learner,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18 063–18 073

work page 2023

[27] [27]

Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,

X. Shi, Z. Huang, W. Bian, D. Li, M. Zhang, K. C. Cheung, S. See, H. Qin, J. Dai, and H. Li, “Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 12 435–12 446

work page 2023

[28] [29]

M2flow: A motion information fusion framework for enhanced unsupervised optical flow estimation in au- tonomous driving,

X. Sun, G. Chen, and Z. Hou, “M2flow: A motion information fusion framework for enhanced unsupervised optical flow estimation in au- tonomous driving,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, no. 7, 2025, pp. 7140–7148

work page 2025

[29] [30]

Self-attention- based multiscale feature learning optical flow with occlusion feature map prediction,

C. Zhang, Z. Zhou, Z. Chen, W. Hu, M. Li, and S. Jiang, “Self-attention- based multiscale feature learning optical flow with occlusion feature map prediction,”IEEE Transactions on Multimedia, vol. 24, pp. 3340–3354, 2021

work page 2021

[30] [31]

Apcaflow: All-pairs cost volume aggregation for optical flow estimation,

M. Feng, H. Jia, Z. Yan, and X. Yang, “Apcaflow: All-pairs cost volume aggregation for optical flow estimation,”IEEE Transactions on Multimedia, vol. 26, pp. 9060–9069, 2024

work page 2024

[31] [32]

Unflow: Unsupervised learning of optical flow with a bidirectional census loss,

S. Meister, J. Hur, and S. Roth, “Unflow: Unsupervised learning of optical flow with a bidirectional census loss,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 32, no. 1, 2018

work page 2018

[32] [33]

Occlusion aware unsupervised learning of optical flow,

Y . Wang, Y . Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu, “Occlusion aware unsupervised learning of optical flow,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4884–4893

work page 2018

[33] [34]

Zero-shot monocular scene flow estimation in the wild,

Y . Liang, A. Badki, H. Su, J. Tompkin, and O. Gallo, “Zero-shot monocular scene flow estimation in the wild,” inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[34] [35]

Learning to esti- mate hidden motions with global motion aggregation,

S. Jiang, D. Campbell, Y . Lu, H. Li, and R. Hartley, “Learning to esti- mate hidden motions with global motion aggregation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9772–9781

work page 2021

[35] [36]

Digging into self-supervised monocular depth estimation,

C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 3828–3838

work page 2019

[36] [37]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026

work page 2023

[37] [38]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013

work page 2013

[38] [39]

Unsupervised monocular depth estimation with left-right consistency,

C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 270–279

work page 2017

[39] [40]

Object scene flow for autonomous vehicles,

M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

work page 2015

[40] [41]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” International Conference on Learning Representations (ICLR), 2018

work page 2018

[41] [42]

Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding,

C. Luo, Z. Yang, P. Wang, Y . Wang, W. Xu, R. Nevatia, and A. Yuille, “Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2624–2641, 2019

work page 2019

[42] [43]

Mono-sf: Multi-view ge- ometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes,

F. Brickwedde, S. Abraham, and R. Mester, “Mono-sf: Multi-view ge- ometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 2780–2790

work page 2019

[43] [44]

Self-distilled feature aggregation for self- supervised monocular depth estimation,

Z. Zhou and Q. Dong, “Self-distilled feature aggregation for self- supervised monocular depth estimation,” inProceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 709–726

work page 2022

[44] [45]

Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,

X. Shi, Z. Huang, W. Bian, D. Li, M. Zhang, K. C. Cheung, S. See, H. Qin, J. Dai, and H. Li, “Videoflow: Exploiting temporal cues for multi-frame optical flow estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 12 469– 12 480

work page 2023

[45] [46]

Memflow: Optical flow estimation and prediction with memory,

Q. Dong and Y . Fu, “Memflow: Optical flow estimation and prediction with memory,” inProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024, pp. 19 068–19 078

work page 2024

[46] [47]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223

work page 2016

[47] [48]

Nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “Nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 621–11 631

work page 2020

[48] [49]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,

L. Mehl, J. Schmalfuss, A. Jahedi, Y . Nalivayko, and A. Bruhn, “Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 4981– 4991. 11 Reference Frame #31 Target Frame #31 (a) Attention Heatmap (b) Feature Si...

work page 2023

[49] [50]

Final only

maintains consistent feature semantics across the object, accurately reflecting its rigid-body nature. In contrast, the two-frame baseline (Row 2, Col 2) loses the vehicle’s semantic integrity. Furthermore, the two- frame features often display spurious high correlations with incorrect, spatially distant regions (as evidenced in Row 4, Col 2), while faili...

work page arXiv 2023