CylindTrack: Depth-Aware Cylindrical Motion Modeling for Panoramic Multi-Object Tracking

Buyin Deng; Fei Cheng; Hang Zheng; Kailun Yang; Kai Luo; Liming Yin; Lingxin Huang; Xinqi Liu

arXiv: 2606.30097 · v1 · pith:FGWO67XJnew · submitted 2026-06-29 · cs.CV · cs.RO · eess.IV

Buyin Deng, Kai Luo, Lingxin Huang, Xinqi Liu, Fei Cheng, Hang Zheng, Liming Yin, Kailun Yang

Pith reviewed 2026-06-30 06:52 UTC · model grok-4.3

classification cs.CV · cs.RO · eess.IV

keywords multi-object tracking · panoramic video · cylindrical motion model · depth-aware tracking · spherical attention · trajectory consistency · 360-degree scenes

The pith

CylindTrack models panoramic tracking by converting horizontal motion to continuous angular states on a cylinder and filtering depth at the trajectory level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that equirectangular panoramic videos break standard planar motion assumptions because the horizontal domain wraps periodically and depth cues vary sharply with scale. It shows that lifting motion into a cylindrical angular space, while promoting isolated depth readings into temporally consistent trajectory states, restores reliable association across the 0/360 seam. The central claim is that joint depth-temporal and panoramic-topology modeling produces fewer identity switches and longer continuous tracks than methods that treat each frame independently. A reader would care because panoramic cameras are already common in robotics and surveillance, where losing track of nearby objects is costly. If the claim holds, trackers can operate directly on 360-degree input without first cropping to perspective views.
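The seam problem can be made concrete with a small sketch. This is not the paper's association code; it is a toy comparison, assuming axis-aligned boxes given as `(x0, y0, x1, y1)` in pixels, a hypothetical image width of 1000 px, and a track box that is allowed to extend past the right edge. The helper names `planar_iou` and `wrapped_iou` are illustrative only.

```python
def interval_overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def planar_iou(box_a, box_b):
    """Standard IoU that treats the image as a flat, non-periodic plane."""
    iw = interval_overlap(box_a[0], box_a[2], box_b[0], box_b[2])
    ih = interval_overlap(box_a[1], box_a[3], box_b[1], box_b[3])
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def wrapped_iou(box_a, box_b, width):
    """IoU on a horizontally periodic image: also try box_b shifted by one width."""
    best = 0.0
    for shift in (-width, 0, width):
        shifted = (box_b[0] + shift, box_b[1], box_b[2] + shift, box_b[3])
        best = max(best, planar_iou(box_a, shifted))
    return best


WIDTH = 1000                        # hypothetical panorama width in pixels
track_box = (980, 100, 1020, 200)   # predicted box that has crossed the right edge
det_box = (0, 100, 30, 200)         # detection of the same object at the left edge

print(planar_iou(track_box, det_box))
print(wrapped_iou(track_box, det_box, WIDTH))
```

With these made-up boxes the planar IoU is zero, so a planar tracker would see no overlap at all, while the wrap-aware variant recovers an overlap of 0.4.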

Core claim

CylindTrack introduces Depth-Temporal Trajectory Modeling to raise frame-wise depth to a filtered trajectory state, Spherical Spatio-Temporal Consistency Learning that uses a Temporal Mixer and Spherical Geometry-aware Attention to enforce coherence, and a Topology-Aware Cylindrical Motion Model that performs seam-consistent prediction in the periodic angular domain. These components together improve identity preservation and trajectory continuity when objects cross the seam or change apparent size.

What carries the argument

The Topology-Aware Cylindrical Motion Model, which replaces planar Euclidean motion with continuous angular prediction on a cylinder so that association remains consistent across the periodic horizontal boundary.
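A minimal sketch of what prediction in a periodic angular domain can look like follows. It assumes the horizontal pixel coordinate maps linearly to azimuth and uses a plain constant-angular-velocity step; the paper's actual TCMM state and filter are not specified in the text reviewed here, and all function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def pixel_to_angle(x, width):
    """Map a horizontal pixel coordinate to an azimuth in [0, 2*pi)."""
    return ((x / width) * TWO_PI) % TWO_PI


def wrap_to_pi(delta):
    """Wrap an angular difference into [-pi, pi)."""
    return (delta + math.pi) % TWO_PI - math.pi


def predict_angle(theta, omega, dt=1.0):
    """Constant-angular-velocity prediction that stays on the circle."""
    return (theta + omega * dt) % TWO_PI


def angular_residual(theta_meas, theta_pred):
    """Signed shortest-path difference between measurement and prediction."""
    return wrap_to_pi(theta_meas - theta_pred)


WIDTH = 3600                                # hypothetical width: 10 px per degree
theta_prev = pixel_to_angle(3590, WIDTH)    # object just left of the seam
theta_now = pixel_to_angle(10, WIDTH)       # same object just after crossing it

planar_dx = 10 - 3590                       # planar view: a jump of -3580 px
residual = angular_residual(theta_now, theta_prev)

print(planar_dx)
print(math.degrees(residual))
```

In pixel space the object appears to jump almost the full image width, whereas the wrapped angular residual is a small positive step of about two degrees, which is what a motion model needs to see.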

If this is right

Association near the seam becomes reliable without special seam-handling rules.
Depth observations become usable even when they fluctuate strongly between frames.
Trajectory continuity improves for objects that remain visible for many frames in large-FoV scenes.
The same depth-temporal pipeline can be applied to any tracker that already produces per-frame depth estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cylindrical representation could be extended to other sensors whose output is periodic, such as certain lidar scan patterns.
A natural next test would be whether the same motion model reduces drift when panoramic video is used for visual odometry.
If depth comes from a noisy monocular estimator rather than stereo, the temporal mixer may need retuning to avoid propagating errors.

Load-bearing premise

Frame-wise depth measurements in wide panoramic scenes can be turned into stable trajectory states by temporal mixing and spherical attention without creating new errors.
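To illustrate what "turning frame-wise depth into a trajectory state" can mean in the simplest case, here is a sketch that swaps in a scalar random-walk Kalman filter for the paper's learned components. It is a stand-in, not DTM or SSTC; the class name, the noise values, and the depth readings are all made up for illustration.

```python
class DepthTrackState:
    """Scalar Kalman filter over one track's depth (random-walk motion model)."""

    def __init__(self, depth, var=1.0, process_var=0.01, meas_var=0.25):
        self.depth = depth              # current depth estimate
        self.var = var                  # variance of that estimate
        self.process_var = process_var  # how much true depth may drift per frame
        self.meas_var = meas_var        # assumed noise of one depth reading

    def update(self, measured_depth):
        # Predict: depth is assumed unchanged, but uncertainty grows.
        prior_var = self.var + self.process_var
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = prior_var / (prior_var + self.meas_var)
        self.depth += gain * (measured_depth - self.depth)
        self.var = (1.0 - gain) * prior_var
        return self.depth


# Hypothetical noisy per-frame depth readings for one object (arbitrary units).
raw = [5.0, 5.4, 4.7, 5.2, 4.9, 5.1]

state = DepthTrackState(raw[0])
filtered = [raw[0]] + [state.update(z) for z in raw[1:]]

raw_spread = max(raw) - min(raw)
filtered_spread = max(filtered) - min(filtered)
print(raw_spread, filtered_spread)
```

The filtered sequence fluctuates less than the raw one, which is the benefit the premise relies on. The risk named in the premise is also visible: a biased reading is blended into the state rather than rejected, so filtering alone cannot guarantee that no new errors are introduced.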

What would settle it

The claim would be undermined if, on a panoramic test sequence containing multiple objects crossing the 0/360 seam, a version of the tracker with the cylindrical motion model removed produced the same number of identity switches as the full model, or fewer.
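The comparison described here needs a way to count identity switches. The sketch below is a simplified counter, not the full CLEAR MOT or HOTA protocol: it assumes ground-truth objects have already been matched to predicted tracks in every frame, and the two toy sequences are invented purely to show the mechanics.

```python
def count_identity_switches(frames):
    """Count how often a ground-truth object changes its predicted track id.

    frames: list of dicts, one per frame, mapping ground-truth id -> track id.
    """
    last_seen = {}
    switches = 0
    for assignment in frames:
        for gt_id, track_id in assignment.items():
            if gt_id in last_seen and last_seen[gt_id] != track_id:
                switches += 1
            last_seen[gt_id] = track_id
    return switches


# Made-up outputs for a sequence in which object "a" crosses the seam at frame 2.
full_model = [{"a": 1, "b": 2}, {"a": 1, "b": 2}, {"a": 1, "b": 2}]
ablated = [{"a": 1, "b": 2}, {"a": 3, "b": 2}, {"a": 3, "b": 2}]

print(count_identity_switches(full_model))
print(count_identity_switches(ablated))
```

In a real test the two lists would come from running the full tracker and the ablated tracker on the same seam-crossing sequences, and the counts would be compared directly.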

Figures

Figures reproduced from arXiv: 2606.30097 by Buyin Deng, Fei Cheng, Hang Zheng, Kailun Yang, Kai Luo, Liming Yin, Lingxin Huang, Xinqi Liu.

**Figure 1.** Overview of our proposed CylindTrack framework for panoramic multi-object tracking. (a)–(b) CylindTrack extends …

**Figure 2.** Visualization of boundary-crossing challenges in …

**Figure 4.** Performance for depth-only trajectory association

**Figure 5.** Overview of the proposed Spherical Spatio-Temporal Consistency Learning (SSTC) method for the depth-enhanced detector. SSTC improves depth-aware instance representations by combining query-based Temporal Mixer with Spherical Geometry-aware Attention (SGA). The query-based Temporal Mixer performs local temporal scale alignment of depth-query representations within each video batch, reducing frame-to-frame d…

**Figure 6.** Visualization of the proposed cylindrical lifting.

**Figure 7.** Qualitative visualization of CylindTrack on the QuadTrack […

**Figure 8.** Qualitative visualization of CylindTrack under heavy occlusion on the QuadTrack (a) and JRDB (b/c) test sets […

**Figure 9.** Sensitivity analysis of CylindTrackTBD association hyperparameters and BCIC boundary-window ablation on QuadTrack test set. Panels (a)–(e) show the sensitivity to λz, τm, [τl, τh], λθ, and the Kalman-filter noise scale, respectively. Solid curves report HOTA, IDF1, and AssA, shaded bars indicate absolute gains over DepTR-MOT, and dashed lines mark the selected/default settings. Panel (f) reports the BCIC …

Original abstract

Multi-Object Tracking (MOT) is a core capability for embodied perception, and panoramic cameras are attractive for embodied systems because their 360° field of view reduces blind spots and keeps surrounding targets observable for longer durations. However, panoramic MOT is not a straightforward extension of perspective MOT. In equirectangular panoramic videos, the horizontal image domain is periodic rather than Euclidean, which breaks planar motion assumptions and makes IoU-based association unreliable near the 0°/360° seam. Meanwhile, large-FoV scenes often contain more objects, stronger scale variation, and more frequent interactions, making online association particularly sensitive to unstable frame-wise depth cues. To address these issues, we propose CylindTrack, a depth-aware cylindrical tracking-by-detection framework for panoramic MOT. CylindTrack first introduces Depth-Temporal Trajectory Modeling (DTM), which promotes instance depth from an isolated frame-wise cue to a temporally filtered trajectory-level state. To improve the reliability of depth observations, we further develop Spherical Spatio-Temporal Consistency Learning (SSTC), which combines a Temporal Mixer and Spherical Geometry-aware Attention to enhance temporal coherence and panoramic geometric alignment in depth-aware representations. Finally, we design a Topology-Aware Cylindrical Motion Model (TCMM) that lifts horizontal motion into a continuous angular state space and performs seam-consistent motion prediction and association in the periodic panoramic domain. By jointly modeling trajectory-level depth consistency and panoramic topology, CylindTrack improves identity preservation and trajectory continuity in challenging panoramic scenes. The source code will be released at https://github.com/warriordby/CylindTrack.
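How trajectory-level depth and angular position might be combined during association can be sketched as follows. This is an assumption-laden toy, not the released CylindTrack code: tracks and detections are dictionaries with a hypothetical azimuth `theta` (radians) and `depth`, the cost weights `w_theta` and `w_depth` are arbitrary, and a brute-force search over permutations stands in for the Hungarian method that trackers of this kind typically use.

```python
import itertools
import math

TWO_PI = 2.0 * math.pi


def wrap_to_pi(delta):
    """Wrap an angular difference into [-pi, pi)."""
    return (delta + math.pi) % TWO_PI - math.pi


def pair_cost(track, det, w_theta=1.0, w_depth=0.5):
    """Weighted sum of wrapped angular distance and absolute depth difference."""
    d_theta = abs(wrap_to_pi(det["theta"] - track["theta"]))
    d_depth = abs(det["depth"] - track["depth"])
    return w_theta * d_theta + w_depth * d_depth


def associate(tracks, dets):
    """Brute-force minimum-cost one-to-one matching.

    Only suitable for tiny examples; requires len(tracks) <= len(dets).
    Returns a list of (track_index, detection_index) pairs.
    """
    best_cost, best_perm = math.inf, ()
    for perm in itertools.permutations(range(len(dets)), len(tracks)):
        cost = sum(pair_cost(tracks[i], dets[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))


# Made-up scene: track 0 sits just before the seam, track 1 just after it.
tracks = [{"theta": 6.25, "depth": 3.0}, {"theta": 0.05, "depth": 8.0}]
dets = [{"theta": 0.02, "depth": 3.1}, {"theta": 0.08, "depth": 7.9}]

matches = associate(tracks, dets)
print(matches)
```

In this example both detections are equally close in angle to track 1, so angle alone cannot decide; the depth term breaks the tie, and the wrapped angular distance lets track 0 match a detection on the other side of the seam.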

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CylindTrack targets panoramic MOT with depth-temporal modeling and cylindrical motion, but the abstract supplies no results to back the claims.

The core of this paper is a tracking-by-detection framework called CylindTrack built for equirectangular panoramic video. It adds three pieces: Depth-Temporal Trajectory Modeling to turn per-frame depth into a filtered trajectory state, Spherical Spatio-Temporal Consistency Learning that mixes temporal information with spherical geometry attention, and a Topology-Aware Cylindrical Motion Model that works in angular space to keep predictions consistent across the 0/360 seam.

These adaptations address real issues that standard perspective trackers hit in 360-degree scenes, such as broken IoU near the seam, large scale changes, and shaky depth estimates. The motivation section lays out why panoramic MOT is not just a drop-in extension, and the components are described at a level that shows the authors thought through the periodic domain and the need for trajectory-level depth.

The main limitation is that the abstract contains no experiments, ablations, or numbers. We get the claim that joint depth consistency and topology modeling improves identity preservation, but nothing shows whether the Temporal Mixer or the geometry attention actually delivers that or whether they create new instabilities. The weakest link remains the step that promotes unstable frame-wise depth to reliable trajectory states; without data it is just an assumption.

This work is aimed at people building tracking for robotics or embodied systems that use 360 cameras. A reader already working on panoramic vision or domain-adapted MOT would get value from the problem framing and the named modules.

It is worth sending for peer review. The problem is practical, the proposed fixes are specific rather than generic, and the full paper may contain the missing experiments that would let referees judge whether the approach holds up.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CylindTrack, a depth-aware cylindrical tracking-by-detection framework for panoramic multi-object tracking. It identifies challenges in equirectangular panoramic videos including the periodic (non-Euclidean) horizontal domain that breaks planar motion assumptions and makes IoU association unreliable near the seam, plus unstable frame-wise depth cues in large-FoV scenes with many objects and interactions. The framework introduces Depth-Temporal Trajectory Modeling (DTM) to promote instance depth from frame-wise cues to temporally filtered trajectory-level states, Spherical Spatio-Temporal Consistency Learning (SSTC) that combines a Temporal Mixer and Spherical Geometry-aware Attention for temporal coherence and panoramic geometric alignment, and Topology-Aware Cylindrical Motion Model (TCMM) that lifts horizontal motion into a continuous angular state space for seam-consistent prediction and association. The central claim is that jointly modeling trajectory-level depth consistency and panoramic topology improves identity preservation and trajectory continuity; source code release is promised.

Significance. If validated, the work would address genuine, under-served challenges in panoramic MOT for embodied perception by replacing planar assumptions with cylindrical and spherical geometry. The explicit commitment to release source code is a positive factor for reproducibility. However, the manuscript supplies no quantitative results, ablations, baseline comparisons, or failure-case analysis, so the significance remains prospective rather than demonstrated.

major comments (2)

[Abstract] The central claim that DTM + SSTC + TCMM jointly improve identity preservation and trajectory continuity is unsupported by any experimental results, ablation studies, or quantitative evidence. This is load-bearing because the abstract states the motivation and high-level components but provides no validation that the components deliver the claimed gains.
[Abstract] The assumption that frame-wise depth cues can be reliably promoted to temporally filtered trajectory states via the Temporal Mixer and Spherical Geometry-aware Attention without introducing new instabilities is stated but receives no supporting analysis, stability discussion, or empirical check.

minor comments (1)

[Abstract] The descriptions of DTM, SSTC, and TCMM remain high-level; concrete architectural diagrams, loss formulations, or pseudocode would improve clarity even before experiments are added.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for empirical grounding. We agree that the central claims require supporting quantitative evidence, ablations, and stability analysis, which are absent from the current manuscript. We will perform a major revision to incorporate these elements.

Referee: [Abstract] The central claim that DTM + SSTC + TCMM jointly improve identity preservation and trajectory continuity is unsupported by any experimental results, ablation studies, or quantitative evidence. This is load-bearing because the abstract states the motivation and high-level components but provides no validation that the components deliver the claimed gains.

Authors: We acknowledge that the abstract currently states the intended benefits of the joint modeling without referencing supporting results. This is a valid observation given the manuscript's current content. In revision we will either qualify the abstract wording or add concise references to the quantitative gains once the experimental section is expanded with results, ablations, and baseline comparisons. Revision: yes.

Referee: [Abstract] The assumption that frame-wise depth cues can be reliably promoted to temporally filtered trajectory states via the Temporal Mixer and Spherical Geometry-aware Attention without introducing new instabilities is stated but receives no supporting analysis, stability discussion, or empirical check.

Authors: We agree that the manuscript lacks any stability analysis or empirical verification for the depth promotion step. In the revised version we will add a dedicated discussion of stability properties together with targeted ablations or checks demonstrating that the Temporal Mixer and Spherical Geometry-aware Attention do not introduce new instabilities. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity identified

The provided abstract and context describe a proposed framework (DTM, SSTC, TCMM) for panoramic MOT without any equations, fitted parameters, or self-citations that reduce claimed improvements to inputs by construction. No load-bearing derivation steps are present that match the enumerated circularity patterns. The central claim of joint modeling for better identity preservation is presented as a novel architectural combination rather than a self-referential prediction or renamed known result. This is the most common honest finding for a methods-description paper lacking quantitative derivations in the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on a domain assumption about panoramic image geometry and standard computer vision practices; no free parameters, new physical entities, or ad-hoc axioms are introduced.

axioms (1)

Domain assumption: The horizontal image domain in equirectangular panoramic videos is periodic rather than Euclidean, breaking planar motion assumptions and making IoU-based association unreliable near the 0/360 seam.
Directly stated in the abstract as the reason standard MOT methods fail.

Reference graph

Works this paper leans on

58 extracted references · 9 canonical work pages · 1 internal anchor

[1] R. Martín-Martín et al., “JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6748–6765, 2023.
[2] K. Luo et al., “Omnidirectional multi-object tracking,” in Proc. CVPR, 2025, pp. 21959–21969.
[3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proc. ICIP, 2016, pp. 3464–3468.
[4] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proc. ICIP, 2017, pp. 3645–3649.
[5] M. Yang et al., “Hybrid-SORT: Weak cues matter for online multi-object tracking,” in Proc. AAAI, 2024, pp. 6504–6512.
[6] W. Lv, Y. Huang, N. Zhang, R.-S. Lin, M. Han, and D. Zeng, “DiffMOT: A real-time diffusion-based multiple object tracker with non-linear prediction,” in Proc. CVPR, 2024, pp. 19321–19330.
[7] N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “BoT-SORT: Robust associations multi-pedestrian tracking,” arXiv:2206.14651, 2022.
[8] A. Shenoi et al., “JRMOT: A real-time 3D multi-object tracker and a new large-scale dataset,” in Proc. IROS, 2020, pp. 10335–10342.
[9] B. Deng, L. Huang, K. Luo, F. Teng, and K. Yang, “DepTR-MOT: Unveiling the potential of depth-informed trajectory refinement for multi-object tracking,” arXiv:2509.17323, 2025.
[10] Z. Liu, X. Wang, C. Wang, W. Liu, and X. Bai, “SparseTrack: Multi-object tracking by performing scene decomposition based on pseudo-depth,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4870–4882, 2025.
[11] Z. Cui, T. Xu, Z. Tang, X.-j. Wu, and J. Kittler, “DepthSort: Multi-object tracking optimization for unreliable detection with depth information,” IEEE Signal Processing Letters, 2026.
[12] S. Gao, K. Yang, H. Shi, K. Wang, and J. Bai, “Review on panoramic imaging and its applications in scene understanding,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–34, 2022.
[13] H. Shi et al., “PanoFlow: Learning 360° optical flow for surrounding temporal understanding,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 5, pp. 5570–5585, 2023.
[14] J. Yang, C. Lin, L. Nie, Y. Tang, and Y. Zhao, “Towards oriented multi-object tracking for fisheye images: Dataset and framework,” IEEE Transactions on Circuits and Systems for Video Technology, 2026.
[15] K.-C. Liu, Y.-T. Shen, and L.-G. Chen, “Simple online and realtime tracking with spherical panoramic camera,” in Proc. ICCE, 2018, pp. 1–6.
[16] Y. He, W. Yu, J. Han, X. Wei, X. Hong, and Y. Gong, “Know your surroundings: Panoramic multi-object tracking by multimodality collaboration,” in Proc. CVPRW, 2021, pp. 2963–2974.
[17] Y. Zhang et al., “ByteTrack: Multi-object tracking by associating every detection box,” in Proc. ECCV, 2022, pp. 1–21.
[18] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, “Observation-centric SORT: Rethinking SORT for robust multi-object tracking,” in Proc. CVPR, 2023, pp. 9686–9696.
[19] Y. Cao et al., “Occlusion-aware seamless segmentation,” in Proc. ECCV, 2024, pp. 129–147.
[20] Z. Liu et al., “S3KF: Spherical state-space Kalman filtering for panoramic 3D multi-object tracking,” arXiv:2603.27534, 2026.
[21] J. Wu and Y. Liu, “DepthMOT: Depth cues lead to a strong multi-object tracker,” arXiv:2404.05518, 2024.
[22] Y. Wang, D. Zhang, R. Li, Z. Zheng, and M. Li, “PD-SORT: Occlusion-robust multi-object tracking using pseudo-depth cues,” IEEE Transactions on Consumer Electronics, vol. 71, no. 1, pp. 165–177, 2025.
[23] H. Ai, Z. Cao, and L. Wang, “A survey of representation learning, optimization strategies, and applications for omnidirectional vision,” International Journal of Computer Vision, vol. 133, no. 8, pp. 4973–5012, 2025.
[24] X. Lin et al., “One flight over the gap: A survey from perspective to panoramic vision,” arXiv:2509.04444, 2025.
[25] D. Zhong et al., “OmniSAM: Omnidirectional segment anything model for UDA in panoramic semantic segmentation,” in Proc. ICCV, 2025, pp. 23892–23901.
[26] Y. Chang, Z. Cao, X. Zheng, X. Mi, and Z. Dong, “Denoise and align: Towards source-free UDA for robust panoramic semantic segmentation,” in Proc. CVPR, 2026.
[27] A. Jaus, K. Yang, and R. Stiefelhagen, “Panoramic panoptic segmentation: Insights into surrounding parsing for mobile agents via unsupervised contrastive learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 4438–4453, 2023.
[28] J. Mei et al., “Waymo open dataset: Panoramic video panoptic segmentation,” in Proc. ECCV, 2022, pp. 53–72.
[29] X. Fu et al., “PanopticNeRF-360: Panoramic 3D-to-2D label transfer in urban scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 10, pp. 8804–8822, 2025.
[30] Z. Cao et al., “PanDA: Towards panoramic depth anything with unlabeled panoramas and Möbius spatial augmentation,” in Proc. CVPR, 2025, pp. 982–992.
[31] H. Jiang, Z. Song, Z. Lou, R. Xu, and M. Tan, “Depth anything in 360°: Towards scale invariance in the wild,” arXiv:2512.22819, 2025.
[32] L. Di Bella, Y. Lyu, B. Cornelis, and A. Munteanu, “HybridTrack: A hybrid approach for robust multi-object tracking,” IEEE Robotics and Automation Letters, vol. 10, no. 7, pp. 7238–7245, 2025.
[33] F. Yang, F. Li, Y. Wu, S. Sakti, and S. Nakamura, “Using panoramic videos for multi-person localization and tracking in a 3D panoramic coordinate,” in Proc. ICASSP, 2020, pp. 1863–1867.
[34] T. Fischer, Y.-H. Yang, S. Kumar, M. Sun, and F. Yu, “CC-3DT: Panoramic 3D object tracking via cross-camera fusion,” in Proc. CoRL, 2023, pp. 2294–2305.
[35] Z. Yang et al., “Robust panoramic multi-object tracking with category-aware data association and adaptive noise estimation for unmanned surface vehicles,” Expert Systems with Applications, p. 132283, 2026.
[36] H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[37] J. Shen and H. Yang, “Multi-object tracking model based on detection tracking paradigm in panoramic scenes,” Applied Sciences, vol. 14, no. 10, p. 4146, 2024.
[38] L. Lo Presti, G. Mazzola, G. Averna, E. Ardizzone, and M. La Cascia, “Depth-aware multi-object tracking in spherical videos,” in Proc. ICIAP, 2022, pp. 362–374.
[39] W. Zhao, Y. Jiang, Y. Gao, J. Li, and X. Gao, “DETrack: Depth information is predictable for tracking,” Neurocomputing, vol. 616, p. 128906, 2025.
[40] M. Khanchi, M. Amer, and C. Poullis, “Depth-aware scoring and hierarchical alignment for multiple object tracking,” in Proc. ICIP, 2025, pp. 2043–2048.
[41] C.-Y. Yang et al., “A depth-aware robust multi-object tracker for crowded scene by re-prioritizing association order,” in Proc. AVSS, 2025, pp. 1–6.
[42] J. Peng, Y. Yao, P. Wang, C. Wang, and Z. Li, “Multi-object tracking optimization in densely occluded scenarios using depth estimation,” in Proc. FCN, 2025, pp. 1–6.
[43] F. Limanta, K. Uto, and K. Shinoda, “CAMOT: Camera angle-aware multi-object tracking,” in Proc. WACV, 2024, pp. 6465–6474.
[44] K. G. Quach, P. Nguyen, C. N. Duong, T. D. Bui, and K. Luu, “Depth perspective-aware multiple object tracking,” in Engineering Applications of AI and Swarm Intelligence, 2024, pp. 181–205.
[45] H. Sun, Y. Li, G. Yang, Z. Su, and K. Luo, “View adaptive multi-object tracking method based on depth relationship cues,” Complex & Intelligent Systems, vol. 11, no. 2, p. 145, 2025.
[46] X. Han et al., “GRASPTrack: Geometry-reasoned association via segmentation and projection for multi-object tracking,” arXiv:2508.08117, 2025.
[47] T. H.-P. Tran et al., “DepthTrack: Cluster meets BEV for multi-camera multi-target 3D tracking,” in Proc. ICCVW, 2025, pp. 5348–5357.
[48] S. Chen et al., “Video depth anything: Consistent depth estimation for super-long videos,” in Proc. CVPR, 2025, pp. 22831–22840.
[49] L. Piccinelli et al., “UniK3D: Universal camera monocular 3D estimation,” in Proc. CVPR, 2025, pp. 1028–1039.
[50] L. Yang et al., “Depth anything V2,” in Proc. NeurIPS, 2024, pp. 21875–21911.
[51] H. Lin et al., “Depth anything 3: Recovering the visual space from any views,” arXiv:2511.10647, 2025.
[52] W. Hu et al., “DepthCrafter: Generating consistent long depth sequences for open-world videos,” in Proc. CVPR, 2025, pp. 2005–2015.
[53] H. Li et al., “DA²: Depth anything in any direction,” arXiv:2509.26618, 2025.
[54] J. Luiten et al., “HOTA: A higher order metric for evaluating multi-object tracking,” International Journal of Computer Vision, vol. 129, no. 2, pp. 548–578, 2021.
[55] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, 2008.
[56] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Proc. ECCVW, 2016, pp. 17–35.
[57] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, “A consistent metric for performance evaluation of multi-object filters,” IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3447–3457, 2008.
[58] N. Ravi et al., “SAM 2: Segment anything in images and videos,” in Proc. ICLR, 2025.

[1] [1]

JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments,

R. Martín-Martínet al., “JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments,”IEEE Trans- actions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6748–6765, 2023

2023

[2] [2]

Omnidirectional multi-object tracking,

K. Luoet al., “Omnidirectional multi-object tracking,” inProc. CVPR, 2025, pp. 21 959–21 969

2025

[3] [3]

Simple online and realtime tracking,

A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” inProc. ICIP, 2016, pp. 3464–3468

2016

[4] [4]

Simple online and realtime tracking with a deep association metric,

N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” inProc. ICIP, 2017, pp. 3645– 3649

2017

[5] [5]

Hybrid-SORT: Weak cues matter for online multi-object tracking,

M. Yanget al., “Hybrid-SORT: Weak cues matter for online multi-object tracking,” inProc. AAAI, 2024, pp. 6504–6512

2024

[6] [6]

DiffMOT: A real-time diffusion-based multiple object tracker with non-linear prediction,

W. Lv, Y . Huang, N. Zhang, R.-S. Lin, M. Han, and D. Zeng, “DiffMOT: A real-time diffusion-based multiple object tracker with non-linear prediction,” inProc. CVPR, 2024, pp. 19 321–19 330

2024

[7] [7]

BoT-SORT: Robust as- sociations multi-pedestrian tracking,

N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “BoT-SORT: Robust associations multi-pedestrian tracking,”arXiv:2206.14651, 2022

work page arXiv 2022

[8] A. Shenoi et al., “JRMOT: A real-time 3D multi-object tracker and a new large-scale dataset,” in Proc. IROS, 2020, pp. 10 335–10 342.

[9] B. Deng, L. Huang, K. Luo, F. Teng, and K. Yang, “DepTR-MOT: Unveiling the potential of depth-informed trajectory refinement for multi-object tracking,” arXiv:2509.17323, 2025.

[10] Z. Liu, X. Wang, C. Wang, W. Liu, and X. Bai, “SparseTrack: Multi-object tracking by performing scene decomposition based on pseudo-depth,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4870–4882, 2025.

[11] Z. Cui, T. Xu, Z. Tang, X.-j. Wu, and J. Kittler, “DepthSort: Multi-object tracking optimization for unreliable detection with depth information,” IEEE Signal Processing Letters, 2026.

[12] S. Gao, K. Yang, H. Shi, K. Wang, and J. Bai, “Review on panoramic imaging and its applications in scene understanding,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–34, 2022.

[13] H. Shi et al., “PanoFlow: Learning 360° optical flow for surrounding temporal understanding,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 5, pp. 5570–5585, 2023.

[14] J. Yang, C. Lin, L. Nie, Y. Tang, and Y. Zhao, “Towards oriented multi-object tracking for fisheye images: Dataset and framework,” IEEE Transactions on Circuits and Systems for Video Technology, 2026.

[15] K.-C. Liu, Y.-T. Shen, and L.-G. Chen, “Simple online and realtime tracking with spherical panoramic camera,” in Proc. ICCE, 2018, pp. 1–6.

[16] Y. He, W. Yu, J. Han, X. Wei, X. Hong, and Y. Gong, “Know your surroundings: Panoramic multi-object tracking by multimodality collaboration,” in Proc. CVPRW, 2021, pp. 2963–2974.

[17] Y. Zhang et al., “ByteTrack: Multi-object tracking by associating every detection box,” in Proc. ECCV, 2022, pp. 1–21.

[18] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, “Observation-centric SORT: Rethinking SORT for robust multi-object tracking,” in Proc. CVPR, 2023, pp. 9686–9696.

[19] Y. Cao et al., “Occlusion-aware seamless segmentation,” in Proc. ECCV, 2024, pp. 129–147.

[20] Z. Liu et al., “S3KF: Spherical state-space Kalman filtering for panoramic 3D multi-object tracking,” arXiv:2603.27534, 2026.

[21] J. Wu and Y. Liu, “DepthMOT: Depth cues lead to a strong multi-object tracker,” arXiv:2404.05518, 2024.

[22] Y. Wang, D. Zhang, R. Li, Z. Zheng, and M. Li, “PD-SORT: Occlusion-robust multi-object tracking using pseudo-depth cues,” IEEE Transactions on Consumer Electronics, vol. 71, no. 1, pp. 165–177, 2025.

[23] H. Ai, Z. Cao, and L. Wang, “A survey of representation learning, optimization strategies, and applications for omnidirectional vision,” International Journal of Computer Vision, vol. 133, no. 8, pp. 4973–5012, 2025.

[24] X. Lin et al., “One flight over the gap: A survey from perspective to panoramic vision,” arXiv:2509.04444, 2025.

[25] D. Zhong et al., “OmniSAM: Omnidirectional segment anything model for UDA in panoramic semantic segmentation,” in Proc. ICCV, 2025, pp. 23 892–23 901.

[26] Y. Chang, Z. Cao, X. Zheng, X. Mi, and Z. Dong, “Denoise and align: Towards source-free UDA for robust panoramic semantic segmentation,” in Proc. CVPR, 2026.

[27] A. Jaus, K. Yang, and R. Stiefelhagen, “Panoramic panoptic segmentation: Insights into surrounding parsing for mobile agents via unsupervised contrastive learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 4438–4453, 2023.

[28] J. Mei et al., “Waymo open dataset: Panoramic video panoptic segmentation,” in Proc. ECCV, 2022, pp. 53–72.

[29] X. Fu et al., “PanopticNeRF-360: Panoramic 3D-to-2D label transfer in urban scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 10, pp. 8804–8822, 2025.

[30] Z. Cao et al., “PanDA: Towards panoramic depth anything with unlabeled panoramas and Möbius spatial augmentation,” in Proc. CVPR, 2025, pp. 982–992.

[31] H. Jiang, Z. Song, Z. Lou, R. Xu, and M. Tan, “Depth anything in 360°: Towards scale invariance in the wild,” arXiv:2512.22819, 2025.

[32] L. Di Bella, Y. Lyu, B. Cornelis, and A. Munteanu, “HybridTrack: A hybrid approach for robust multi-object tracking,” IEEE Robotics and Automation Letters, vol. 10, no. 7, pp. 7238–7245, 2025.

[33] F. Yang, F. Li, Y. Wu, S. Sakti, and S. Nakamura, “Using panoramic videos for multi-person localization and tracking in a 3D panoramic coordinate,” in Proc. ICASSP, 2020, pp. 1863–1867.

[34] T. Fischer, Y.-H. Yang, S. Kumar, M. Sun, and F. Yu, “CC-3DT: Panoramic 3D object tracking via cross-camera fusion,” in Proc. CoRL, 2023, pp. 2294–2305.

[35] Z. Yang et al., “Robust panoramic multi-object tracking with category-aware data association and adaptive noise estimation for unmanned surface vehicles,” Expert Systems with Applications, p. 132283, 2026.

[36] H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.

[37] J. Shen and H. Yang, “Multi-object tracking model based on detection tracking paradigm in panoramic scenes,” Applied Sciences, vol. 14, no. 10, p. 4146, 2024.

[38] L. Lo Presti, G. Mazzola, G. Averna, E. Ardizzone, and M. La Cascia, “Depth-aware multi-object tracking in spherical videos,” in Proc. ICIAP, 2022, pp. 362–374.

[39] W. Zhao, Y. Jiang, Y. Gao, J. Li, and X. Gao, “DETrack: Depth information is predictable for tracking,” Neurocomputing, vol. 616, p. 128906, 2025.

[40] M. Khanchi, M. Amer, and C. Poullis, “Depth-aware scoring and hierarchical alignment for multiple object tracking,” in Proc. ICIP, 2025, pp. 2043–2048.

[41] C.-Y. Yang et al., “A depth-aware robust multi-object tracker for crowded scene by re-prioritizing association order,” in Proc. AVSS, 2025, pp. 1–6.

[42] J. Peng, Y. Yao, P. Wang, C. Wang, and Z. Li, “Multi-object tracking optimization in densely occluded scenarios using depth estimation,” in Proc. FCN, 2025, pp. 1–6.

[43] F. Limanta, K. Uto, and K. Shinoda, “CAMOT: Camera angle-aware multi-object tracking,” in Proc. WACV, 2024, pp. 6465–6474.

[44] K. G. Quach, P. Nguyen, C. N. Duong, T. D. Bui, and K. Luu, “Depth perspective-aware multiple object tracking,” in Engineering Applications of AI and Swarm Intelligence, 2024, pp. 181–205.

[45] H. Sun, Y. Li, G. Yang, Z. Su, and K. Luo, “View adaptive multi-object tracking method based on depth relationship cues,” Complex & Intelligent Systems, vol. 11, no. 2, p. 145, 2025.

[46] X. Han et al., “GRASPTrack: Geometry-reasoned association via segmentation and projection for multi-object tracking,” arXiv:2508.08117, 2025.

[47] T. H.-P. Tran et al., “DepthTrack: Cluster meets BEV for multi-camera multi-target 3D tracking,” in Proc. ICCVW, 2025, pp. 5348–5357.

[48] S. Chen et al., “Video depth anything: Consistent depth estimation for super-long videos,” in Proc. CVPR, 2025, pp. 22 831–22 840.

[49] L. Piccinelli et al., “UniK3D: Universal camera monocular 3D estimation,” in Proc. CVPR, 2025, pp. 1028–1039.

[50] L. Yang et al., “Depth anything V2,” in Proc. NeurIPS, 2024, pp. 21 875–21 911.

[51] H. Lin et al., “Depth anything 3: Recovering the visual space from any views,” arXiv:2511.10647, 2025.

[52] W. Hu et al., “DepthCrafter: Generating consistent long depth sequences for open-world videos,” in Proc. CVPR, 2025, pp. 2005–2015.

[53] H. Li et al., “DA²: Depth anything in any direction,” arXiv:2509.26618, 2025.

[54] J. Luiten et al., “HOTA: A higher order metric for evaluating multi-object tracking,” International Journal of Computer Vision, vol. 129, no. 2, pp. 548–578, 2021.

[55] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, 2008.

[56] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Proc. ECCVW, 2016, pp. 17–35.

[57] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, “A consistent metric for performance evaluation of multi-object filters,” IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3447–3457, 2008.

[58] N. Ravi et al., “SAM 2: Segment anything in images and videos,” in Proc. ICLR, 2025.