arxiv: 2605.16922 · v1 · pith:Q4DVLKYKnew · submitted 2026-05-16 · 💻 cs.CV

Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation

Pith reviewed 2026-05-19 21:05 UTC · model grok-4.3

keywords LiDAR scene flow · point tracking · static-dynamic classification · self-supervised learning · motion compensation · ego-motion · autonomous driving

The pith

Dense image trajectories from point tracking refine static-dynamic labels for LiDAR scene flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that sparse geometric observations produce unreliable static-dynamic labels in self-supervised LiDAR scene flow, and that dense motion cues extracted via image-based point tracking can correct this. TrackCue generates anchored image trajectories, compensates for ego-motion in the image plane, and lifts the resulting motion signals back to LiDAR points to produce cleaner labels. A sympathetic reader would care because accurate per-point 3D motion is required for autonomous driving to distinguish moving objects from background despite occlusions and point sparsity.

Core claim

TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. It presents a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. Visual motion cue lifting then associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement.
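The comparison between tracked and ego-induced trajectories can be illustrated with a minimal sketch. It assumes a pinhole camera, a known rigid camera motion between two frames, and a tracked pixel position supplied by some point tracker. The function names, intrinsics, and the 1 m forward motion are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)

def apply_rigid(rotation, translation, point):
    """Apply p' = R p + t, with R as a 3x3 nested list and t as a 3-tuple."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def ego_compensated_residual(point_cam_t, tracked_px_t1, rotation, translation, intrinsics):
    """Pixel gap between the tracked position and the ego-induced rigid position."""
    rigid_px = project(apply_rigid(rotation, translation, point_cam_t), *intrinsics)
    return math.hypot(tracked_px_t1[0] - rigid_px[0], tracked_px_t1[1] - rigid_px[1])

# Hypothetical setup: the camera moves 1 m forward between frames, no rotation.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
EGO_TRANSLATION = (0.0, 0.0, -1.0)           # static points move 1 m closer in the camera frame
INTRINSICS = (1000.0, 1000.0, 640.0, 360.0)  # fx, fy, cx, cy (assumed values)
point = (2.0, 0.0, 20.0)                     # LiDAR point expressed in the camera frame at time t

# A static point is tracked exactly where ego-motion alone would place it.
static_px = project((2.0, 0.0, 19.0), *INTRINSICS)
residual_static = ego_compensated_residual(point, static_px, IDENTITY, EGO_TRANSLATION, INTRINSICS)

# A moving point is tracked somewhere else, leaving a non-zero residual.
residual_moving = ego_compensated_residual(point, (760.0, 360.0), IDENTITY, EGO_TRANSLATION, INTRINSICS)
```

In this toy case the static point leaves a residual of zero, while the moving point leaves a residual of roughly 14.7 px, which is the kind of signal the compensation step is meant to isolate.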

What carries the argument

Visual motion cue lifting that associates ego-compensated image trajectories with LiDAR points to refine static-dynamic labels.
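One way such a lifting step could work is sketched below. The summary does not specify the association rule, so this is an assumption: each trajectory is taken to be anchored to one LiDAR point index, a pixel-residual threshold decides the label, and points whose track was lost keep their geometric label. The function name and the 3 px threshold are hypothetical.

```python
def refine_labels(geometric_labels, residuals_px, threshold_px=3.0):
    """Refine static-dynamic labels with ego-compensated image residuals.

    geometric_labels: list of bools, True meaning "dynamic" (geometric baseline).
    residuals_px: dict mapping a LiDAR point index to its residual in pixels,
                  or to None when the tracker lost the point.
    """
    refined = list(geometric_labels)
    for idx, residual in residuals_px.items():
        if residual is None:
            continue  # no visual evidence: keep the geometric label
        refined[idx] = residual > threshold_px
    return refined

geometric = [True, False, False, True]
residuals = {0: 0.4, 1: 9.2, 3: None}  # point 2 is not visible in the image
refined = refine_labels(geometric, residuals)
```

Here point 0 is relabeled static, point 1 is relabeled dynamic, and points 2 and 3 keep their geometric labels because no usable trajectory exists for them.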

Load-bearing premise

Dense image-space trajectories from point tracking can be accurately associated with and lifted to corresponding LiDAR points without introducing new errors from calibration, viewpoint differences, or tracking failures in occluded regions.
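To give a sense of scale for the calibration concern, the following sketch estimates how far a projected LiDAR point moves in the image when the camera-LiDAR extrinsics carry a small yaw error. The pinhole model, the focal length of 1000 px, and the 0.1 degree error are assumptions for illustration only.

```python
import math

def pixel_shift_from_yaw_error(point_cam, yaw_error_deg, fx, cx):
    """Horizontal pixel shift caused by rotating the point about the camera y-axis."""
    x, _, z = point_cam
    theta = math.radians(yaw_error_deg)
    x_rot = x * math.cos(theta) + z * math.sin(theta)
    z_rot = -x * math.sin(theta) + z * math.cos(theta)
    return (fx * x_rot / z_rot + cx) - (fx * x / z + cx)

# Point on the optical axis, 20 m ahead, with a 0.1 degree yaw miscalibration.
shift_px = pixel_shift_from_yaw_error((0.0, 0.0, 20.0), 0.1, fx=1000.0, cx=640.0)
```

Under these assumptions the shift is about 1.7 px, and for a point on the optical axis it does not depend on depth. Whether an offset of that size matters depends on the residual threshold used to separate static from dynamic points.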

What would settle it

An ablation on standard benchmarks such as KITTI where TrackCue produces no measurable increase in dynamic-label precision or reduction in scene-flow end-point error compared with the geometric baseline.
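The two quantities named in this test can be computed as in the sketch below: precision, recall, and F1 for the dynamic class, and mean end-point error as the average Euclidean distance between predicted and reference flow vectors. The function names and the toy inputs are illustrative and not drawn from the paper.

```python
import math

def precision_recall_f1(pred_dynamic, true_dynamic):
    """Precision, recall, and F1 for the dynamic class (True = dynamic)."""
    pairs = list(zip(pred_dynamic, true_dynamic))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_end_point_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and reference 3D flow vectors."""
    return sum(math.dist(p, t) for p, t in zip(pred_flow, true_flow)) / len(pred_flow)

# Toy inputs: four points with labels, two points with flow vectors.
precision, recall, f1 = precision_recall_f1(
    [True, True, False, True], [True, False, False, True]
)
epe = mean_end_point_error(
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
)
```

With these toy inputs the precision is 2/3, the recall is 1, the F1 score is 0.8, and the mean end-point error is 2.5 (in the units of the flow vectors).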

Figures

Figures reproduced from arXiv: 2605.16922 by Gyeongrok Oh, Hyung-gun Chi, Hyunju Ryu, Jonghyun Choi, Jong Wook Kim, Sangpil Kim, SeungHyeon Kim, Seungryong Kim, Youngdong Jang.

Figure 1: Analysis of dynamic auto-label quality for self-supervised LiDAR scene flow estimation.
Figure 2: Overview of the TrackCue framework. decomposed into ego-motion flow $F^{\mathrm{ego}}_t$ and residual flow $\Delta F_t$ as follows: $F_t = F^{\mathrm{ego}}_t + \Delta F_t$ (1). While $F^{\mathrm{ego}}$ can be derived from vehicle odometry, the network is trained to estimate $\Delta F_t$. Static-Dynamic Self-Supervision. To derive supervision without labeled data, self-supervised approaches first divide points into static and dynamic sets, denoted as $P^s_t$ and $P^d_t$, re…
Figure 3: Illustration of our motion compensation.
Figure 4: We visualize the flow estimation results in terms of two complementary perspectives.
Figure 5: Qualitative comparison of dynamic awareness map between DUFOMap [
Figure 6: Step-by-step visualization of TrackCue.
Figure 7: Step-by-step visualization of TrackCue.
Figure 8: Qualitative comparison with SeFlow [8] for scene flow estimation.
Figure 9: Qualitative comparison with SeFlow++ [7] for scene flow estimation.
Figure 10: Qualitative comparison with TeFlow [19] for scene flow estimation.
Original abstract

LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TrackCue, a framework for self-supervised LiDAR scene flow estimation that leverages dense image-space trajectories from point tracking. It projects LiDAR points into the image domain, applies ego-compensated tracking to isolate true object motion from ego-induced apparent motion, and lifts the resulting refined static-dynamic labels back to the LiDAR points. The central claim is that this produces more accurate dynamic labels (higher precision and F1) than sparse geometric methods, yielding improved scene flow performance.

Significance. If the image-to-LiDAR lifting step can be validated as introducing negligible additional noise, the approach offers a practical way to obtain denser and more reliable supervision signals for scene flow, addressing sparsity and occlusion issues common in LiDAR data for autonomous driving. The cross-modal use of point tracking is a clear strength and could generalize to other 3D perception tasks.

major comments (2)
  1. [§3.3] §3.3 (Visual motion cue lifting): the description of associating ego-compensated image trajectories with LiDAR points provides no quantitative bound or ablation on projection/association errors arising from calibration drift, parallax between sensors, or tracking failures on occluded dynamic objects. This is load-bearing for the claim of strictly more reliable labels than geometric baselines, as any systematic mismatch would directly degrade the static-dynamic classification precision reported in the experiments.
  2. [§4] §4 (Experiments): the reported gains in dynamic label precision and F1, and downstream scene flow metrics, are presented without baselines explicitly listed, ablation isolating the lifting component, error bars across runs, or analysis of how post-hoc threshold choices affect results. This prevents verification that the improvements are robust rather than sensitive to implementation details.
minor comments (2)
  1. [§3.1] Clarify the exact point-tracking backbone (e.g., which off-the-shelf method is used) and any fine-tuning performed, as this affects reproducibility.
  2. [Figure 3] Figure 3 or equivalent qualitative results: include side-by-side comparisons of labels before and after lifting on sequences with heavy occlusion to illustrate the claimed robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and outline the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [§3.3] §3.3 (Visual motion cue lifting): the description of associating ego-compensated image trajectories with LiDAR points provides no quantitative bound or ablation on projection/association errors arising from calibration drift, parallax between sensors, or tracking failures on occluded dynamic objects. This is load-bearing for the claim of strictly more reliable labels than geometric baselines, as any systematic mismatch would directly degrade the static-dynamic classification precision reported in the experiments.

    Authors: We agree that the current description in §3.3 would be strengthened by quantitative analysis of potential error sources. The manuscript explains the association process but does not report explicit bounds or ablations on calibration drift, parallax, or tracking failures. In the revised version we will add a dedicated ablation that perturbs the camera-LiDAR extrinsics within realistic ranges, measures the resulting drop in dynamic-label precision and F1, and compares against the geometric baseline under the same perturbations. We will also report the fraction of tracked points that fall on occluded dynamic objects and show how the ego-compensated comparison reduces false positives relative to raw geometric methods. These additions will provide the requested bounds and directly support the reliability claim. revision: yes

  2. Referee: [§4] §4 (Experiments): the reported gains in dynamic label precision and F1, and downstream scene flow metrics, are presented without baselines explicitly listed, ablation isolating the lifting component, error bars across runs, or analysis of how post-hoc threshold choices affect results. This prevents verification that the improvements are robust rather than sensitive to implementation details.

    Authors: We concur that clearer experimental reporting is needed. While the manuscript states that TrackCue improves precision, F1, and scene-flow metrics, §4 does not enumerate all baselines, isolate the lifting step, or include variability measures. In the revision we will (1) list every baseline with its exact configuration, (2) add an ablation that disables only the visual-motion-cue-lifting module while keeping all other components fixed, (3) report mean and standard deviation over five independent runs with different random seeds, and (4) include a sensitivity plot showing precision/F1 as the static-dynamic threshold varies. These changes will demonstrate that the observed gains are robust and not artifacts of particular implementation choices. revision: yes
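The variability and sensitivity reporting proposed in the second response could be organized along the lines of the sketch below. All numbers are synthetic placeholders chosen to make the arithmetic easy to follow; they are not results from the paper, and the function names are hypothetical.

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def threshold_sweep(residuals_px, true_dynamic, thresholds):
    """Precision and F1 of the dynamic class as the residual threshold varies."""
    rows = []
    for thr in thresholds:
        pred = [r > thr for r in residuals_px]
        tp = sum(1 for p, t in zip(pred, true_dynamic) if p and t)
        fp = sum(1 for p, t in zip(pred, true_dynamic) if p and not t)
        fn = sum(1 for p, t in zip(pred, true_dynamic) if not p and t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((thr, precision, f1))
    return rows

# Synthetic placeholder scores for five seeds (not real measurements).
mean_f1, std_f1 = summarize_runs([0.70, 0.72, 0.71, 0.69, 0.73])

# Synthetic residuals (pixels) and reference labels for four points.
sweep = threshold_sweep([0.5, 1.5, 4.0, 8.0], [False, False, True, True], [1.0, 3.0, 5.0])
```

On this toy data a threshold of 1 px admits a false positive (F1 of 0.8), 3 px separates the classes cleanly (F1 of 1.0), and 5 px misses a dynamic point (F1 of about 0.67), which is the kind of curve a sensitivity plot would display.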

Circularity Check

0 steps flagged

No significant circularity; method introduces independent visual motion cue lifting

full rationale

The derivation chain relies on projecting LiDAR points to image space, applying point tracking to obtain dense trajectories, comparing against ego-induced rigid motion in the image plane to isolate object motion, and lifting refined static-dynamic labels back to LiDAR. This sequence is presented as a geometric procedure using external point trackers and calibration, without reducing to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations whose validity depends on the current paper. The abstract and described steps treat the motion compensation and lifting as independent operations that can be evaluated against ground-truth labels, keeping the central claim self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard assumptions in multi-modal tracking and scene flow literature.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

  1. [1]

    Neural scene flow prior. Advances in Neural Information Processing Systems, 34:7838–7851, 2021

    Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey. Neural scene flow prior. Advances in Neural Information Processing Systems, 34:7838–7851, 2021

  2. [2]

    Fast neural scene flow

    Xueqian Li, Jianqiao Zheng, Francesco Ferroni, Jhony Kaesemodel Pontes, and Simon Lucey. Fast neural scene flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9878–9890, 2023

  3. [3]

    Uniflow: Towards zero-shot lidar scene flow for autonomous vehicles via cross-domain generalization, 2025

    Siyi Li, Qingwen Zhang, Ishan Khatri, Kyle Vedder, Deva Ramanan, and Neehar Peri. Uniflow: Towards zero-shot lidar scene flow for autonomous vehicles via cross-domain generalization, 2025

  4. [4]

    Deltaflow: An efficient multi-frame scene flow estimation method

    Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, and Patric Jensfelt. Deltaflow: An efficient multi-frame scene flow estimation method. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  5. [5]

    Neural eulerian scene flow fields

    Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kemal Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, and Joachim Pehserl. Neural eulerian scene flow fields. 2025

  6. [6]

    Floxels: Fast unsupervised voxel based scene flow estimation

    David T Hoffmann, Syed Haseeb Raza, Hanqiu Jiang, Denis Tananaev, Steffen Klingenhoefer, and Martin Meinke. Floxels: Fast unsupervised voxel based scene flow estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22328–22337, 2025

  7. [7]

    HiMo: High-speed objects motion compensation in point cloud. IEEE Transactions on Robotics, 41:5896–5911, 2025

    Qingwen Zhang, Ajinkya Khoche, Yi Yang, Li Ling, Sina Sharif Mansouri, Olov Andersson, and Patric Jensfelt. HiMo: High-speed objects motion compensation in point cloud. IEEE Transactions on Robotics, 41:5896–5911, 2025

  8. [8]

    SeFlow: A self-supervised scene flow method in autonomous driving

    Qingwen Zhang, Yi Yang, Peizheng Li, Olov Andersson, and Patric Jensfelt. SeFlow: A self-supervised scene flow method in autonomous driving. In European Conference on Computer Vision (ECCV), pages 353–369. Springer, 2024

  9. [9]

    Dufomap: Efficient dynamic awareness mapping. IEEE Robotics and Automation Letters, 9(6):5038–5045, 2024

    Daniel Duberg, Qingwen Zhang, Mingkai Jia, and Patric Jensfelt. Dufomap: Efficient dynamic awareness mapping. IEEE Robotics and Automation Letters, 9(6):5038–5045, 2024

  10. [10]

    Voteflow: Enforcing local rigidity in self-supervised scene flow

    Yancong Lin, Shiming Wang, Liangliang Nan, Julian Kooij, and Holger Caesar. Voteflow: Enforcing local rigidity in self-supervised scene flow. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17155–17164, 2025

  11. [11]

    ZeroFlow: Fast Zero Label Scene Flow via Distillation

    Kyle Vedder, Neehar Peri, Nathaniel Chodosh, Ishan Khatri, Eric Eaton, Dinesh Jayaraman, Yang Liu, Deva Ramanan, and James Hays. ZeroFlow: Fast Zero Label Scene Flow via Distillation. International Conference on Learning Representations (ICLR), 2024

  12. [12]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6013–6022, 2025

  13. [13]

    Alltracker: Efficient dense point tracking at high resolution

    Adam W Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Suya You, et al. Alltracker: Efficient dense point tracking at high resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5253–5262, 2025

  14. [14]

    Tapnext: Tracking any point (tap) as next token prediction

    Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, and Ross Goroshin. Tapnext: Tracking any point (tap) as next token prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9693–9703, October 2025

  15. [15]

    Scalable scene flow from point clouds in the real world. IEEE Robotics and Automation Letters, 7(2):1589–1596, 2021

    Philipp Jund, Chris Sweeney, Nichola Abdo, Zhifeng Chen, and Jonathon Shlens. Scalable scene flow from point clouds in the real world. IEEE Robotics and Automation Letters, 7(2):1589–1596, 2021

  16. [16]

    I can’t believe it’s not scene flow! In European Conference on Computer Vision, pages 242–257

    Ishan Khatri, Kyle Vedder, Neehar Peri, Deva Ramanan, and James Hays. I can’t believe it’s not scene flow! In European Conference on Computer Vision, pages 242–257. Springer, 2024

  17. [17]

    Ssf: Sparse long-range scene flow for autonomous driving

    Ajinkya Khoche, Qingwen Zhang, Laura Pereira Sánchez, Aron Asefaw, Sina Sharif Mansouri, and Patric Jensfelt. Ssf: Sparse long-range scene flow for autonomous driving. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2025

  18. [18]

    Flow4d: Leveraging 4d voxel network for lidar scene flow estimation. IEEE Robotics and Automation Letters, 2025

    Jaeyeul Kim, Jungwan Woo, Ukcheol Shin, Jean Oh, and Sunghoon Im. Flow4d: Leveraging 4d voxel network for lidar scene flow estimation. IEEE Robotics and Automation Letters, 2025

  19. [19]

    TeFlow: Enabling multi-frame supervision for self-supervised feed-forward scene flow estimation

    Qingwen Zhang, Chenhan Jiang, Xiaomeng Zhu, Yunqi Miao, Yushan Zhang, Olov Andersson, and Patric Jensfelt. TeFlow: Enabling multi-frame supervision for self-supervised feed-forward scene flow estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026

  20. [20]

    ICP-Flow: Lidar scene flow estimation with icp

    Yancong Lin and Holger Caesar. ICP-Flow: Lidar scene flow estimation with icp. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15501–15511, June 2024

  21. [21]

    Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation

    Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In 2023 IEEE international conference on robotics and automation (ICRA), pages 2774–2781. IEEE, 2023

  22. [22]

    Bevdilation: Lidar-centric multi-modal fusion for 3d object detection

    Guowen Zhang, Chenhang He, Liyi Chen, and Lei Zhang. Bevdilation: Lidar-centric multi-modal fusion for 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12448–12456, 2026

  23. [23]

    Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation

    Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, and Liwei Wang. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6792–6802, 2023

  24. [24]

    Detecting as labeling: Rethinking lidar-camera fusion in 3d object detection

    Junjie Huang, Yun Ye, Zhujin Liang, Yi Shan, and Dalong Du. Detecting as labeling: Rethinking lidar-camera fusion in 3d object detection. In European Conference on Computer Vision, pages 439–455. Springer, 2024

  25. [25]

    Cmda: Cross-modal and domain adversarial adaptation for lidar-based 3d object detection

    Gyusam Chang, Wonseok Roh, Sujin Jang, Dongwook Lee, Daehyun Ji, Gyeongrok Oh, Jinsun Park, Jinkyu Kim, and Sangpil Kim. Cmda: Cross-modal and domain adversarial adaptation for lidar-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 972–980, 2024

  26. [26]

    Learning modality-agnostic representation for semantic segmentation from any modalities

    Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. Learning modality-agnostic representation for semantic segmentation from any modalities. In European Conference on Computer Vision, pages 146–165. Springer, 2024

  27. [27]

    Delivering arbitrary-modal semantic segmentation

    Jiaming Zhang, Ruiping Liu, Hao Shi, Kailun Yang, Simon Reiß, Kunyu Peng, Haodong Fu, Kaiwei Wang, and Rainer Stiefelhagen. Delivering arbitrary-modal semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1136–1147, 2023

  28. [28]

    Jingyi Pan, Zipeng Wang, and Lin Wang. Co-occ: Coupling explicit feature fusion with volume rendering regularization for multi-modal 3d semantic occupancy prediction. IEEE Robotics and Automation Letters, 9(6):5687–5694, 2024

  29. [29]

    Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception

    Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17850–17859, 2023

  30. [30]

    Image-to-lidar self-supervised distillation for autonomous driving data

    Corentin Sautier, Gilles Puy, Spyros Gidaris, Alexandre Boulch, Andrei Bursuc, and Renaud Marlet. Image-to-lidar self-supervised distillation for autonomous driving data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9891–9901, 2022

  31. [31]

    Beyond one shot, beyond one perspective: Cross-view and long-horizon distillation for better lidar representations

    Xiang Xu, Lingdong Kong, Song Wang, Chuanwei Zhou, and Qingshan Liu. Beyond one shot, beyond one perspective: Cross-view and long-horizon distillation for better lidar representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 25506–25518, October 2025

  32. [32]

    Segment any point cloud sequences by distilling vision foundation models

    Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, and Ziwei Liu. Segment any point cloud sequences by distilling vision foundation models. Advances in Neural Information Processing Systems, 36:37193–37229, 2023

  33. [33]

    Camliflow: bidirectional camera-lidar fusion for joint optical flow and scene flow estimation

    Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, Wenjie Li, and Lijun Chen. Camliflow: bidirectional camera-lidar fusion for joint optical flow and scene flow estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5791–5801, 2022

  34. [34]

    Particle video revisited: Tracking through occlusions using point trajectories

    Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. In European Conference on Computer Vision, pages 59–75. Springer, 2022

  35. [35]

    Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022

    Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022

  36. [36]

    TAPIR: Tracking any point with per-frame initialization and temporal refinement

    Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. TAPIR: Tracking any point with per-frame initialization and temporal refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10061–10072, 2023

  37. [37]

    RoboTAP: Tracking arbitrary points for few-shot visual imitation. International Conference on Robotics and Automation, pages 5397–5403, 2024

    Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, and Jon Scholz. RoboTAP: Tracking arbitrary points for few-shot visual imitation. International Conference on Robotics and Automation, pages 5397–5403, 2024

  38. [38]

    BootsTAP: Bootstrapped training for tracking-any-point. Asian Conference on Computer Vision, 2024

    Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, and Andrew Zisserman. BootsTAP: Bootstrapped training for tracking-any-point. Asian Conference on Computer Vision, 2024

  39. [39]

    TAPVid-3D: A benchmark for tracking any point in 3D

    Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Doersch. TAPVid-3D: A benchmark for tracking any point in 3D. Advances in Neural Information Processing Systems, 2024

  40. [40]

    Cotracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In European conference on computer vision, pages 18–35. Springer, 2024

  41. [41]

    Local all-pair correspondence for point tracking

    Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, and Joon-Young Lee. Local all-pair correspondence for point tracking. In European conference on computer vision, pages 306–325. Springer, 2024

  42. [42]

    Vggsfm: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21686–21697, 2024

  43. [43]

    Dense optical tracking: Connecting the dots

    Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Dense optical tracking: Connecting the dots. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197, 2024

  44. [44]

    Segment any motion in videos

    Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3406–3416, 2025

  45. [45]

    Segment anything meets point tracking

    Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. In Proceedings of the Winter Conference on Applications of Computer Vision, pages 9284–9293, 2025

  46. [46]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Neural Information Processing Systems Track on D...

  47. [47]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  48. [48]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  49. [49]

    Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions. Advances in Neural Information Processing Systems, 37:62062–62082, 2024

    Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich Voll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions. Advances in Neural Information Processing Systems, 37:62062–62082, 2024

  50. [50]

    Aevascenes: A dataset and benchmark for fmcw lidar perception, 2025

    Gautham Narayan Narasimhan, Heethesh Vhavle, Kumar Bhargav Vishvanatha, and James Reuther. Aevascenes: A dataset and benchmark for fmcw lidar perception, 2025

  51. [51]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  52. [52]

    Deflow: Decoder of scene flow network in autonomous driving

    Qingwen Zhang, Yi Yang, Heng Fang, Ruoyu Geng, and Patric Jensfelt. Deflow: Decoder of scene flow network in autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2105–2111. IEEE, 2024

  53. [53]

    Density-based clustering based on hierarchical density estimates

    Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining, pages 160–172. Springer, 2013.

Appendix A Experiment settings

A.1 Dataset Details

In this section, we provide additional details on the Argoverse 2 leaderboard [3] test set, which co...