Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

Jingyun Fu; Na Zhao; Zhiyu Xiang

arxiv: 2605.18507 · v3 · pith:S64BYLUGnew · submitted 2026-05-18 · 💻 cs.CV


Pith reviewed 2026-05-22 09:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D radar · scene flow · weakly supervised · cross-modal learning · self-supervised loss · autonomous driving · instance masks · odometry


The pith

A new framework estimates 4D radar scene flow with weak supervision from images and odometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a weakly supervised approach to 4D radar scene flow estimation that relies only on images and vehicle odometry instead of ground truth or dense LiDAR data. It develops an iterative framework that uses 2D tracking and segmentation to generate instance masks, back-projects them to 3D radar space for semantic guidance in self-supervised losses, and adds a rigid motion loss for static areas based on odometry. This addresses the poor performance of pure self-supervision on sparse radar data and the expense of LiDAR-dependent cross-modal methods. A sympathetic reader would care because it promises accurate scene flow with cheaper, weather-robust radar sensors for applications like autonomous driving.

Core claim

The paper claims that a task-specific iterative framework can learn radar scene flow under weak supervision. Off-the-shelf 2D tracking and segmentation produce tracked instance masks, which are back-projected into 3D space to give instance-level semantic guidance for the self-supervised losses. Vehicle odometry is combined with radar's intrinsic motion cues to form a rigid static loss for static regions. On the View-of-Delft dataset, the method is reported to outperform both cross-modal supervised and fully supervised methods.
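To make the rigid static term concrete, the flow that ego-motion alone induces on a static point can be computed from the relative pose. The following is a minimal sketch, not the authors' implementation: the 4x4 transform `T` (assumed to map frame-t sensor coordinates into frame t+1), the precomputed `static_mask`, the mean-Euclidean penalty, and all function names are assumptions made for illustration.

```python
def apply_transform(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def rigid_static_flow(T, points):
    """Flow that static points appear to have when only the ego vehicle moves."""
    flows = []
    for p in points:
        q = apply_transform(T, p)
        flows.append(tuple(qi - pi for qi, pi in zip(q, p)))
    return flows


def rigid_static_loss(pred_flows, T, points, static_mask):
    """Mean Euclidean gap between predicted and ego-induced flow on static points.

    static_mask is assumed to be given (hypothetical input); how static points
    are actually selected from radar motion cues is not specified here.
    """
    target = rigid_static_flow(T, points)
    errors = [
        sum((a - b) ** 2 for a, b in zip(f, g)) ** 0.5
        for f, g, is_static in zip(pred_flows, target, static_mask)
        if is_static
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

Under these assumptions, a vehicle driving forward by one metre along x makes every static point appear to move by one metre in the opposite direction, and the loss is zero only when the predicted flow on static points matches that displacement.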

What carries the argument

The iterative framework with two novel instance-aware self-supervised losses from back-projected 2D masks and a rigid static loss from odometry.

Load-bearing premise

Off-the-shelf 2D tracking and segmentation algorithms must produce instance masks that back-project accurately into 3D radar space to give reliable guidance.
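The premise above can be illustrated with a small sketch of how a 2D instance mask might be turned into per-point labels. This is an assumption-laden illustration rather than the paper's procedure: it assumes the radar points are already expressed in the camera frame, uses a plain pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and stores track IDs in a row-major `mask` where 0 means background.

```python
import math


def project_point(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point; None if it is not in front of the camera."""
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def label_points(points_cam, mask, fx, fy, cx, cy, invalid=-1):
    """Give each point the track ID stored at the pixel it projects to.

    Points behind the camera or outside the image receive `invalid`.
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points_cam:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None:
            labels.append(invalid)
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= row < height and 0 <= col < width:
            labels.append(mask[row][col])
        else:
            labels.append(invalid)
    return labels
```

The sketch deliberately ignores depth: a background point lying behind an object along the same camera ray would inherit that object's track ID. That kind of mis-assignment, together with calibration error, is what this premise is exposed to.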

What would settle it

Measuring scene flow errors on the VoD dataset and finding that the method does not achieve lower errors than current LiDAR-based or fully supervised baselines would falsify the performance claim.
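The error in question is typically the endpoint error (EPE), the mean Euclidean distance between predicted and reference flow vectors. A minimal sketch is given below; the function name and the plain list-of-tuples input format are assumptions, and the paper's additional metrics (such as the speed-normalized variant) are not reproduced.

```python
import math


def endpoint_error(pred_flows, ref_flows):
    """Mean Euclidean distance between predicted and reference 3D flow vectors."""
    if len(pred_flows) != len(ref_flows) or not pred_flows:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred_flows, ref_flows)) / len(pred_flows)
```

Lower values are better, so the performance claim amounts to the method's EPE being below that of each baseline when evaluated on the same split.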

Figures

Figures reproduced from arXiv: 2605.18507 by Jingyun Fu, Na Zhao, Zhiyu Xiang.

**Figure 1.** Comparison between existing self-supervised (SSF) and cross-modal supervised (CMS) radar scene flow estimation settings and our weakly supervised cross-modal learning setting. SF, FDS, and EM denote the predicted scene flow, foreground dynamic segmentation, and ego-motion, respectively. L_self denotes the self-supervised losses in (Ding et al., 2022); L_opt, L_mot, L_seg and L_ego are cross-modal losses proposed i…

**Figure 2.** Overall architecture of our proposed method. The process of the kth scene flow iteration is depicted on the left and the detailed loss formulation process in the training stage is given on the right. ⊕ represents concatenation. Flow leverages ball query grouping to correlate cross-frame features within a constrained spatial range and employs a GRU-based (Cho et al., 2014) recurrent update scheme for iterat…

**Figure 3.** An example of mismatched chamfer pairs. The top and bottom rows show two consecutive frames at time t and time t + 1, respectively. The smaller points are LiDAR points for auxiliary visualization, while the larger balls are radar points. P_t is first warped by the estimated scene flow and then used to calculate the chamfer loss with P_{t+1}. The orange circles indicate the current selected points for normal chamfer los…

**Figure 4.** An example of wrong KNN-based spatial flow smoothing. The figure shows the bird’s-eye view of a cyclist (green bounding box). Foreground radar points of the cyclist are painted in green. The orange circles highlight the current center points for KNN search, while the red dashed lines represent the confusion between foreground dynamic and background static points caused by the KNN-based flow smoothing in p…

**Figure 5.** Qualitative results on the VoD validation dataset. The first and third rows display two separate traffic scenes, while the second and fourth rows show the zoomed-in results of the regions highlighted with yellow rectangles. As shown in the legend, the direction and magnitude of scene flow vectors are encoded as hue and saturation, respectively.

**Figure 6.** A schematic diagram of obtaining 3D pointwise instance labels from 2D tracking and semantic segmentation results. The smaller points are LiDAR points for auxiliary visualization, while the larger balls are radar points. Each valid radar point associated with a certain instance is painted with the same color as its corresponding 2D tracking box and instance mask. Given the intrinsic matrix Γ_I of the camera…

**Figure 7.** Ablation on iteration steps K and ball query hyperparameters L and R. When L varies, R = 1 m; when R varies, L = 8.

**Figure 8.** Visualization of failure cases on the VoD validation set. Each row displays a driving scenario, and regions with large scene flow estimation errors are highlighted with yellow circles.
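The captions of Figures 3 and 4 describe nearest-neighbour matches that pair a moving object's points with nearby static background. The sketch below shows a chamfer distance on flow-warped points, with an optional restriction that a point may only match targets carrying the same instance label. The label restriction is an assumed, simplified reading of "instance-aware" for illustration only; the paper's actual loss formulation is not given in this page, and all names are hypothetical.

```python
import math


def warp(points, flows):
    """Move each point of frame t by its estimated flow vector."""
    return [tuple(c + f for c, f in zip(p, fl)) for p, fl in zip(points, flows)]


def one_way_chamfer(src, dst, src_labels=None, dst_labels=None):
    """Mean nearest-neighbour distance from src to dst.

    When labels are given, a source point only matches targets sharing its
    label; source points without any valid target are skipped.
    """
    total, count = 0.0, 0
    for i, s in enumerate(src):
        if src_labels is None:
            candidates = dst
        else:
            candidates = [d for d, lab in zip(dst, dst_labels) if lab == src_labels[i]]
        if not candidates:
            continue
        total += min(math.dist(s, d) for d in candidates)
        count += 1
    return total / count if count else 0.0


def chamfer(src, dst, src_labels=None, dst_labels=None):
    """Symmetric chamfer distance: src-to-dst term plus dst-to-src term."""
    return (one_way_chamfer(src, dst, src_labels, dst_labels)
            + one_way_chamfer(dst, src, dst_labels, src_labels))
```

With a moving point whose true correspondence is far away and a static point close by, the unrestricted version matches the moving point to the static one and reports a small distance even for a wrong flow, whereas the label-restricted version keeps the penalty large until the predicted flow reaches the correct instance.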

Original abstract

Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar's inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specifically, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar's intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a weakly supervised radar scene flow method using 2D image masks and odometry that claims to beat LiDAR-based and fully supervised baselines, but the back-projection step is a clear vulnerability.


The key takeaway is that this work introduces a weakly supervised framework for 4D radar scene flow estimation that uses 2D images and odometry to generate supervision signals, and it reports outperforming both LiDAR-dependent cross-modal methods and fully supervised approaches on the VoD dataset.

What is new here is the combination of an iterative task-specific setup with instance-aware self-supervised losses. The authors leverage off-the-shelf 2D tracking and segmentation to create tracked instance masks, back-project those into 3D radar space for semantic guidance on individual points, and pair that with a rigid static loss derived from vehicle odometry and radar motion cues. This differs from prior self-supervised radar methods and from those relying on dense LiDAR for pseudo-labels.

The paper does some things well. It addresses a real limitation in radar scene flow, where ground truth is hard to get, and it aims to lower the barrier by avoiding LiDAR altogether during training. The open-sourced code at the GitHub link is a plus for reproducibility. The focus on instance-level guidance makes sense for handling moving objects separately from static background.

There are soft spots, though. The whole approach depends on the back-projection of 2D masks providing accurate instance labels for the sparse, noisy radar points. Calibration drift, occlusions, or quantization errors could easily lead to misassignments, which would undermine the self-supervised losses and make the performance gains less reliable. The abstract highlights superior results but does not mention detailed ablations on mask quality or sensitivity to these issues, so the evidence for the central claim feels incomplete at this stage.

This paper would appeal to people working on multi-sensor fusion and perception for autonomous vehicles, particularly those exploring radar as a primary or supplementary sensor. Readers looking for practical ways to do scene flow with reduced supervision would find the loss formulations and experimental setup useful. It deserves a serious referee because the task is relevant to real-world applications and the method brings a fresh angle on cross-modal weak supervision. I would recommend putting it through peer review. The results are interesting enough to merit closer examination of the experiments and potential failure cases around the projection step.

Referee Report

2 major / 2 minor

Summary. The paper proposes a weakly supervised iterative framework for 4D radar scene flow estimation that relies solely on 2D images and vehicle odometry for auxiliary supervision during training. It introduces two novel instance-aware self-supervised losses obtained by back-projecting tracked instance masks from off-the-shelf 2D tracking and segmentation algorithms into 3D radar space, plus a rigid static loss that combines odometry with radar's intrinsic motion cues. Extensive experiments on the View-of-Delft (VoD) dataset are reported to show that the method surpasses both state-of-the-art cross-modal supervised approaches (which use 3D multi-object tracking on dense LiDAR) and existing fully supervised scene flow methods; the code is open-sourced.

Significance. If the central claims hold, the work would be significant for radar-based perception in autonomous driving, as it reduces dependence on costly LiDAR sensors while still achieving competitive or superior scene flow accuracy through weak cross-modal cues. The open-sourced code is a clear strength that aids reproducibility and allows the community to verify the instance-aware losses. However, the overall significance is limited by the lack of detailed validation for the core supervision mechanism.

major comments (2)

[§3.2] Instance-aware self-supervised losses: The performance advantage over dense-LiDAR pseudo-labeling methods rests on the claim that back-projected 2D instance masks provide reliable semantic guidance for sparse, noisy radar points. No quantitative analysis of projection accuracy, sensitivity to calibration error, or occlusion handling is presented; if mis-assignments are systematic, the two losses would inject label noise that undermines the headline comparison.
[§4] Experiments on VoD: The claim that the method outperforms both cross-modal LiDAR-based and fully supervised baselines is load-bearing, yet the reported results lack error bars, statistical significance tests, ablation studies isolating the contribution of each loss term, and explicit data-exclusion criteria. Without these, it is impossible to determine whether the gains are robust or sensitive to post-hoc loss-weight choices.

minor comments (2)

[Abstract] The abstract refers to a 'task-specific iterative framework' without indicating how many iterations are used or what convergence criterion is applied; a short clarifying sentence would improve readability.
[§3] Notation for radar points, instance labels, and static/dynamic masks should be introduced once and used consistently; occasional redefinition of symbols appears in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next manuscript version.


Referee: [§3.2] Instance-aware self-supervised losses: The performance advantage over dense-LiDAR pseudo-labeling methods rests on the claim that back-projected 2D instance masks provide reliable semantic guidance for sparse, noisy radar points. No quantitative analysis of projection accuracy, sensitivity to calibration error, or occlusion handling is presented; if mis-assignments are systematic, the two losses would inject label noise that undermines the headline comparison.

Authors: We agree that a direct quantitative assessment of the back-projection step would strengthen the justification for the instance-aware losses. The current manuscript relies on end-to-end performance gains over LiDAR-based baselines as indirect validation. In the revision we will add a dedicated analysis: overlap metrics between back-projected 2D masks and available LiDAR instance annotations on the VoD validation set, a sensitivity study under small synthetic calibration perturbations, and a brief discussion of occlusion handling (2D masks are generated only for visible objects and radar points falling outside projected masks are excluded from the loss). revision: yes
Referee: [§4] Experiments on VoD: The claim that the method outperforms both cross-modal LiDAR-based and fully supervised baselines is load-bearing, yet the reported results lack error bars, statistical significance tests, ablation studies isolating the contribution of each loss term, and explicit data-exclusion criteria. Without these, it is impossible to determine whether the gains are robust or sensitive to post-hoc loss-weight choices.

Authors: We acknowledge that the experimental presentation would be more convincing with additional statistical controls and transparency. The revised manuscript will include: error bars obtained from three independent training runs with different random seeds; ablation tables that isolate each loss component (the two instance-aware losses and the rigid static loss); statistical significance tests (paired t-test and Wilcoxon signed-rank test) against the main baselines; and an explicit statement of the data splits and exclusion rules, which follow the official VoD train/val/test partitions without further post-hoc filtering. revision: yes
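The statistical controls named in this response can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function names are hypothetical, the pairing unit (per seed or per evaluation frame) is not specified in the response, and the sketch stops at the t statistic, leaving p-values and the Wilcoxon signed-rank test to a statistics package.

```python
import math
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation, e.g. of one metric across training seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (equal length, at least two pairs)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("paired differences have zero variance")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
```

With only three seeds, a paired test across seeds has two degrees of freedom and little power, so pairing over evaluation frames or sequences may be the more informative choice.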

Circularity Check

0 steps flagged

No significant circularity: supervision derives from independent external sources


The paper constructs its instance-aware self-supervised losses and rigid static loss using off-the-shelf 2D tracking/segmentation algorithms plus vehicle odometry as auxiliary inputs. These are external, pre-existing components not derived from or fitted to the radar scene flow outputs themselves. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citation chains appear in the provided description, and the central performance claims rest on empirical comparison against external benchmarks rather than definitional equivalence. The derivation chain remains self-contained against independent data sources.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on assumptions about the reliability of external 2D algorithms and odometry rather than many fitted parameters or new invented entities; one domain assumption is key.

free parameters (1)

loss balancing weights
Hyperparameters likely used to combine instance-aware and rigid static losses, though not detailed in abstract.

axioms (1)

Domain assumption: Off-the-shelf 2D tracking and segmentation algorithms produce tracked instance masks accurate enough for reliable 3D back-projection and semantic guidance.
Directly invoked when establishing the two novel instance-aware self-supervised losses.

pith-pipeline@v0.9.0 · 5798 in / 1459 out tokens · 38291 ms · 2026-05-22T09:34:00.262646+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

[1]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

[2]

YOLOv11: An Overview of the Key Architectural Enhancements

Khanam, R. and Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725.

[3]

SSF: Sparse Long-Range Scene Flow for Autonomous Driving

Khoche, A., Zhang, Q., Sanchez, L. P., Asefaw, A., Mansouri, S. S., and Jensfelt, P. SSF: Sparse long-range scene flow for autonomous driving. arXiv preprint arXiv:2501.17821.

[4]

MambaFlow: A Novel and Flow-Guided State Space Model for Scene Flow Estimation

Luo, J., Cheng, J., Tang, X., Zhang, Q., Xue, B., and Fan, R. MambaFlow: A novel and flow-guided state space model for scene flow estimation. arXiv preprint arXiv:2502.16907.

[5]

SAM 2: Segment Anything in Images and Videos

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, pp. 652–660, 2017a.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. NIPS, 30, 2017b.
Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., K...

[6]

Neural Eulerian Scene Flow Fields

Vedder, K., Peri, N., Khatri, I., Li, S., Eaton, E., Kocamaz, M., Wang, Y., Yu, Z., Ramanan, D., and Pehserl, J. Neural Eulerian scene flow fields. arXiv preprint arXiv:2410.02031.

[7]

TARS: Traffic-Aware Radar Scene Flow Estimation

Wu, J., Braun, M., Spata, D., and Rottmann, M. TARS: Traffic-aware radar scene flow estimation. arXiv preprint arXiv:2503.10210.

[8]

PointPWC-Net: Cost Volume on Point Clouds for (Self-)Supervised Scene Flow Estimation

Wu, W., Wang, Z. Y., Li, Z., Liu, W., and Fuxin, L. PointPWC-Net: Cost volume on point clouds for (self-)supervised scene flow estimation. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pp. 88–107. Springer.
[9]

… is collected with a Toyota Prius 2013 platform, which is equipped with a 3D LiDAR, a stereo camera, and a 4D ZF FRGen21 radar in complex urban traffic environments under normal weather conditions. With an annotation frequency of 10 Hz, the VoD dataset consists of 123,106 annotated 3D bounding boxes for both moving and static objects, including 26,587 p...

[10]

For a point P in the tracking box at t0, its rigid scene flow F_r is calculated as follows:

$F_r = \frac{1}{\Delta t}\,(T \otimes P - P)$ (11)

where ⊗ denotes matrix multiplication. Note that the produced rigid scene flow vector is an approximation of the actual scene flow ground truth, which may not be accurate for non-rigid objects...

[11]

… to generate tracked 2D bounding boxes. Inspired by recent advances in Vision Foundation Models (VFMs) (Zou et al., 2023; Kirillov et al., 2023; Ravi et al., 2024), the 2D tracking results can be further refined into more detailed instance-level masks that are aligned with object boundaries. In this way, final instance segmentation masks with track ID can...

[12]

C. Additional Experiments and Analysis

C.1. Three-way Endpoint Error Evaluation and Runtime Comparison

To comprehensively evaluate the performance of state-of-the-art scene flow estimation methods and our IterFlow, we adopt the three-way Endpoint Error (3-way EPE) for additional evaluation, which is often used in LiDAR-based cases (Jund et al., 2021; Z...

[13]

It can be seen that the runtime required for our IterFlow† is similar to that of CMFlow (Ding et al., 2023), and both …

[14]

… Fully ✗ 0.2015 0.2140 0.3056 0.2067 0.0922 77.2
DeFlow (Zhang et al., 2024a) Fully ✗ 0.1802 0.2015 0.2422 0.1966 0.1020 44.8
PointPWC (Wu et al., …

[15]

… Self ✓ 0.3560 0.4314 0.4203 0.4333 0.2145 52.1
FlowStep3D (Kittenplon et al., 2021) Self ✓ 0.2296 0.2607 0.3116 0.2562 0.1210 58.9
radar-based RaFlow (Ding et al., …

[16]

… Cross ✗ 0.1455 0.1600 0.2073 0.1566 0.0727 29.1
IterFlow† (ours) Cross ✓ 0.1223 0.1139 0.2129 0.1049 0.0491 33.0
IterFlow (ours) Cross ✓ 0.1156 0.1045 0.2058 0.0952 0.0458 60.7

Table 6. Per-category Performance Disparities for FD Objects on VoD validation set. Speed Normalized EPE (↓).
Columns: Cat. · Method · Sup. · Mean · Car · O. V. · Pd. · W. V
LiDAR-based Flow4D (Kim et al., …

[17]

… Fully 1.1973 1.1842 1.1279 1.2416 1.2356
DeFlow (Zhang et al., 2024a) Fully 1.1508 0.9907 1.3374 1.2063 1.0690
PointPWC (Wu et al., …

[18]

… to ensure that objects with different speeds are fairly evaluated. The dynamic normalized EPE is a ratio as the EPE has been normalized by speed. As shown in Table 6, fine-grained analysis is conducted on individual classes, including Car, Other Vehicles (O. V.), Pedestrian ...
