pith. sign in

arxiv: 2605.21957 · v1 · pith:N53MV3SAnew · submitted 2026-05-21 · 💻 cs.CV

Bounding-Box Trajectories Matter for Video Anomaly Detection

Pith reviewed 2026-05-22 07:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords video anomaly detectionbounding box trajectoriesnormalizing flowspose estimationShanghaiTechkinematic patternsMSAD dataset
0
0 comments X

The pith

Bounding-box trajectories alone can model normal video motion well enough to detect anomalies better than pose-based approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that the paths traced by bounding boxes around moving objects contain enough information about normal movement patterns to identify unusual events in videos. By using normalizing flows to learn these patterns from trajectories, the method works without the need for detailed human pose estimation, which is common in other approaches. On the ShanghaiTech dataset, the version relying only on trajectories achieves higher average precision than previous pose-focused methods, and combining both yields even better results. This matters because it points to a simpler, readily available signal that has been overlooked in efforts to improve public safety monitoring.

Core claim

We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech.

What carries the argument

Normalizing flows for modeling multi-class bounding-box trajectories as a way to capture normal kinematic patterns in videos.

Load-bearing premise

Bounding-box trajectories contain enough information about motion to distinguish normal from anomalous events across the tested video datasets.

What would settle it

Evaluating the trajectory-only model on a new dataset featuring anomalies that change body pose but keep the same bounding box path, such as a person suddenly waving arms abnormally while staying in place, and checking if detection rates fall below those of pose-based methods.

Figures

Figures reproduced from arXiv: 2605.21957 by Inpyo Song, Jangwon Lee.

Figure 1
Figure 1. Figure 1: Pose-based VAD methods (left) score anomalies from skeleton sequences and are limited to person-class tracks. TrajVAD (right) treats multi-class bounding-box trajectories as the primary signal, applicable to any detected object. TrajVAD-P adds an optional pose branch (dashed) activated only for human tracks when pose is reliable. Early deep learning approaches addressed this problem through pixel-level rec… view at source ↗
Figure 2
Figure 2. Figure 2: TrajVAD pipeline. Multi-class tracks from detection and tracking are encoded as standardized trajectory-derived feature sequences and conditioned on class embed￾dings. TrajVAD-T (top row) maps segments through a normalizing flow and uses the negative log-likelihood as the anomaly score. TrajVAD-P (both row) adds a pose branch conditioned on the trajectory latent ztraj, gated by pose reliability g. 3 Method… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison on ShanghaiTech and MSAD. Red boxes and anomaly scores (higher means anomaly) indicate detected anomalies. Top: a car in a pedestrian zone is invisible to pose-based STG-NF but flagged by TrajVAD through bounding￾box kinematics. Bottom: partial occlusion corrupts skeleton estimation, suppressing the STG-NF anomaly score, while TrajVAD maintains detection from trajectory features. 90%… view at source ↗
Figure 4
Figure 4. Figure 4: Effect of flow depth K on AUROC for TrajVAD-T and TrajVAD-P on Shang￾haiTech and MSAD. Stars mark the best K per panel. TrajVAD-T is robust across depths, while TrajVAD-P peaks at K=18. 4.7 Ablation Study Feature group ablation [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TrajVAD, a framework for video anomaly detection that models multi-class bounding-box trajectories with normalizing flows to capture normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and reports outperforming all compared pose-based methods on ShanghaiTech (87.7% AP) while achieving the best results on MSAD; the pose-augmented variant (TrajVAD-P) further improves to 88.6% AUROC and 90.9% AP on ShanghaiTech.

Significance. If the empirical results hold under detailed scrutiny, the work establishes bounding-box trajectories as a sufficient and high-performing modality for VAD, reducing dependence on pose estimation while maintaining competitive or superior accuracy on standard benchmarks. The normalizing-flow density estimation on trajectories is a technically appropriate choice for modeling normal patterns, and the provision of both T and P variants enables direct assessment of the trajectories' contribution.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the claim that TrajVAD-T surpasses all pose-based methods with 87.7% AP on ShanghaiTech requires explicit listing of the AP scores for every cited baseline (including re-implementation details) so that the ranking can be independently verified; without these numbers the superiority statement cannot be assessed.
  2. [§3.2] §3.2 (multi-class modeling): the assumption that bounding-box trajectories contain sufficient kinematic information is load-bearing for the central claim, yet the paper provides no ablation on class granularity or on trajectory-only versus appearance-augmented inputs; this leaves open whether performance gains are truly attributable to the trajectory modality.
minor comments (1)
  1. [Abstract] Abstract: performance numbers are stated without accompanying dataset statistics (e.g., number of normal/anomalous frames or trajectory counts), which should be added for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of clarity and experimental rigor that we will address in the revision. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the claim that TrajVAD-T surpasses all pose-based methods with 87.7% AP on ShanghaiTech requires explicit listing of the AP scores for every cited baseline (including re-implementation details) so that the ranking can be independently verified; without these numbers the superiority statement cannot be assessed.

    Authors: We agree that explicit numerical comparison strengthens the claim. In the revised manuscript we will add a dedicated table in Section 4 that reports the AP scores of every cited pose-based baseline on ShanghaiTech, together with a short note on any re-implementation settings used. This will allow direct verification of the reported ranking and of the 87.7% AP achieved by TrajVAD-T. revision: yes

  2. Referee: [§3.2] §3.2 (multi-class modeling): the assumption that bounding-box trajectories contain sufficient kinematic information is load-bearing for the central claim, yet the paper provides no ablation on class granularity or on trajectory-only versus appearance-augmented inputs; this leaves open whether performance gains are truly attributable to the trajectory modality.

    Authors: We recognize the value of additional ablations. The manuscript already contrasts the trajectory-only variant (TrajVAD-T) with the pose-augmented variant (TrajVAD-P), which isolates the contribution of trajectories versus an additional kinematic cue. To further address class granularity we will include a new ablation that compares single-class versus multi-class normalizing-flow modeling on ShanghaiTech. Regarding appearance-augmented inputs, we will add a brief discussion clarifying that our design deliberately avoids appearance features to isolate kinematic information; if space permits we will also report a lightweight comparison against a simple appearance baseline. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical method that applies standard normalizing flows to multi-class bounding-box trajectories for video anomaly detection, reporting direct benchmark results on ShanghaiTech and MSAD without any derivation steps, equations, or claims that reduce to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The central performance claims (e.g., 87.7% AP for TrajVAD-T) are presented as outcomes of the proposed pipeline rather than tautological restatements of inputs, and the approach remains self-contained against external datasets and comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on standard normalizing flows from prior literature without detailing any ad-hoc choices or new postulates.

pith-pipeline@v0.9.0 · 5727 in / 1185 out tokens · 51646 ms · 2026-05-22T07:02:53.418842+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

  1. [1]

    In: CVPR

    Acsintoae, A., Florescu, A., Georgescu, M.I., Mare, T., Sumedrea, P., Ionescu, R.T., Khan, F.S., Shah, M.: UBnormal: New benchmark for supervised open-set video anomaly detection. In: CVPR. pp. 20143–20153 (2022)

  2. [2]

    Computer Vision and Image Understanding229, 103656 (2023)

    Barbalau, A., Ionescu, R.T., Georgescu, M.I., Dueholm, J., Ramachandra, B., Nas- rollahi, K., Khan, F.S., Moeslund, T.B., Shah, M.: SSMTL++: Revisiting self- supervised multi-task learning for video anomaly detection. Computer Vision and Image Understanding229, 103656 (2023)

  3. [3]

    In: CVPR

    Dawoud, K., Zaheer, Z., Khan, M., Nandakumar, K., Elsaddik, A., Khan, M.H.: FusedVision: A knowledge-infusing approach for practical anomaly detection in real-world surveillance videos. In: CVPR. pp. 4036–4046 (2025)

  4. [4]

    In: ICCV

    Delić, A., Grcic, M., Šegvić, S.: Sequential keypoint density estimator: an over- looked baseline of skeleton-based video anomaly detection. In: ICCV. pp. 11579– 11589 (2025)

  5. [5]

    Density estimation using Real NVP

    Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016)

  6. [6]

    In: WACV

    Doshi, K., Yilmaz, Y.: Towards interpretable video anomaly detection. In: WACV. pp. 2655–2664 (2023)

  7. [7]

    IEEE TPAMI45(6), 7157–7173 (2022)

    Fang, H.S., Li, J., Tang, H., Xu, C., Zhu, H., Xiu, Y., Li, Y.L., Lu, C.: AlphaPose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE TPAMI45(6), 7157–7173 (2022)

  8. [8]

    In: ICCV

    Flaborea, A., Collorone, L., Di Melendugno, G.M.D., D’Arrigo, S., Prenkaj, B., Galasso, F.: Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection. In: ICCV. pp. 10318–10329 (2023)

  9. [9]

    PR156, 110817 (2024)

    Flaborea, A., di Melendugno, G.M.D., D’Arrigo, S., Sterpa, M.A., Sampieri, A., Galasso, F.: Contracting skeletal kinematics for human-related video anomaly de- tection. PR156, 110817 (2024)

  10. [10]

    YOLOX: Exceeding YOLO Series in 2021

    Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)

  11. [11]

    In: CVPR

    Georgescu, M.I., Barbalau, A., Ionescu, R.T., Khan, F.S., Popescu, M., Shah, M.: Anomaly detection in video via self-supervised and multi-task learning. In: CVPR. pp. 12742–12752 (2021)

  12. [12]

    In: ICCV

    Gong, D., Liu, L., Le, V., Saha, B., Mansour, M.R., Venkatesh, S., Hengel, A.v.d.: Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In: ICCV. pp. 1705–1714 (2019)

  13. [13]

    In: CVPR

    Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S.: Learning temporal regularity in video sequences. In: CVPR. pp. 733–742 (2016)

  14. [14]

    In: ICCV

    Hinami, R., Mei, T., Satoh, S.: Joint detection and recounting of abnormal events by learning deep generic knowledge. In: ICCV. pp. 3619–3627 (2017)

  15. [15]

    In: ICCV

    Hirschorn, O., Avidan, S.: Normalizing flows for human pose anomaly detection. In: ICCV. pp. 13545–13554 (2023)

  16. [16]

    In: ICPR

    Jain, Y., Sharma, A.K., Velmurugan, R., Banerjee, B.: Posecvae: Anomalous hu- man activity detection. In: ICPR. pp. 2927–2934 (2021)

  17. [17]

    arXiv preprint arXiv:2207.02281 (2022)

    Kanu-Asiegbu, A.M., Vasudevan, R., Du, X.: BiPOCO: Bi-directional trajectory prediction with pose constraints for pedestrian anomaly detection. arXiv preprint arXiv:2207.02281 (2022)

  18. [18]

    In: WACV

    Karami, A., Ho, T.K.K., Armanfard, N.: Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection. In: WACV. pp. 4237–4247 (2025) 16 I. Song and J. Lee

  19. [19]

    NeurIPS31(2018)

    Kingma, D.P., Dhariwal, P.: Glow: Generative flow with invertible 1x1 convolu- tions. NeurIPS31(2018)

  20. [20]

    Neurocomputing490, 482–494 (2022)

    Li, N., Chang, F., Liu, C.: Human-related anomalous event detection via spatial- temporalgraphconvolutionalautoencoderwithembeddedlongshort-termmemory network. Neurocomputing490, 482–494 (2022)

  21. [21]

    In: CVPR

    Liu,W.,Luo,W.,Lian,D.,Gao,S.:Futureframepredictionforanomalydetection– a new baseline. In: CVPR. pp. 6536–6545 (2018)

  22. [22]

    In: ICCV

    Liu, Z., Nie, Y., Long, C., Zhang, Q., Li, G.: A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame pre- diction. In: ICCV. pp. 13588–13597 (2021)

  23. [23]

    Neurocomputing444, 332–337 (2021)

    Luo, W., Liu, W., Gao, S.: Normal graph: Spatial temporal graph convolutional networks based prediction network for skeleton based video anomaly detection. Neurocomputing444, 332–337 (2021)

  24. [24]

    In: CVPR

    Markovitz, A., Sharir, G., Friedman, I., Zelnik-Manor, L., Avidan, S.: Graph em- bedded pose clustering for anomaly detection. In: CVPR. pp. 10539–10547 (2020)

  25. [25]

    In: CVPR

    Micorek, J., Possegger, H., Narnhofer, D., Bischof, H., Kozinski, M.: MULDE: Multiscale log-density estimation via denoising score matching for video anomaly detection. In: CVPR. pp. 18868–18877 (2024)

  26. [26]

    In: CVPR

    Morais, R., Le, V., Tran, T., Saha, B., Mansour, M., Venkatesh, S.: Learning regularity in skeleton trajectories for anomaly detection in videos. In: CVPR. pp. 11996–12004 (2019)

  27. [27]

    In: WACV

    Noghre, G.A., Pazho, A.D., Tabkhi, H.: An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory predic- tion. In: WACV. pp. 995–1004 (2024)

  28. [28]

    IEEE TCSVT18(11), 1544–1554 (2008)

    Piciarelli, C., Micheloni, C., Foresti, G.L.: Trajectory-based anomalous event de- tection. IEEE TCSVT18(11), 1544–1554 (2008)

  29. [29]

    In: WACV

    Rodrigues, R., Bhargava, N., Velmurugan, R., Chaudhuri, S.: Multi-timescale tra- jectory prediction for abnormal human activity detection. In: WACV. pp. 2626– 2634 (2020)

  30. [30]

    In: CVPR

    Singh, A., Jones, M.J., Learned-Miller, E.G.: EVAL: Explainable video anomaly localization. In: CVPR. pp. 18717–18726 (2023)

  31. [31]

    In: CVPR

    Singh, A., Jones, M.J., Learned-Miller, E.G.: Tracklet-based explainable video anomaly localization. In: CVPR. pp. 3992–4001 (2024)

  32. [32]

    In: ICIP (2025)

    Song, I., Lee, J.: Real-time traffic accident anticipation with feature reuse. In: ICIP (2025)

  33. [33]

    In: WACV

    Song, I., Lee, S., Joo, M., Lee, J.: Anomaly detection for people with visual impair- ments using an egocentric 360-degree camera. In: WACV. pp. 2828–2837 (2025)

  34. [34]

    In: WACV

    Stergiou, A., De Weerdt, B., Deligiannis, N.: Holistic representation learning for multitask trajectory anomaly detection. In: WACV. pp. 6729–6739 (2024)

  35. [35]

    In: ECCV

    Wang, G., Wang, Y., Qin, J., Zhang, D., Bao, X., Huang, D.: Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles. In: ECCV. pp. 494– 511 (2022)

  36. [36]

    Cluster Computing25(4), 2715–2737 (2022)

    Wu, C., Shao, S., Tunc, C., Satam, P., Hariri, S.: An explainable and efficient deep learning framework for video anomaly detection. Cluster Computing25(4), 2715–2737 (2022)

  37. [37]

    IEEE TMM (2025)

    Wu, R., Chen, Y., Xiao, J., Li, B., Fan, J., Dufaux, F., Zhu, C., Liu, Y.: DA-flow: Dual attention normalizing flow for skeleton-based video anomaly detection. IEEE TMM (2025)

  38. [38]

    In: ICCV

    Yan, C., Zhang, S., Liu, Y., Pang, G., Wang, W.: Feature prediction diffusion model for video anomaly detection. In: ICCV. pp. 5527–5537 (2023) Bounding-Box Trajectories Matter for Video Anomaly Detection 17

  39. [39]

    In: CVPR

    Zaheer, M.Z., Mahmood, A., Khan, M.H., Segu, M., Yu, F., Lee, S.I.: Generative cooperative learning for unsupervised video anomaly detection. In: CVPR. pp. 14744–14754 (2022)

  40. [40]

    Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.:ByteTrack:Multi-objecttrackingbyassociatingeverydetectionbox.In:ECCV. pp. 1–21 (2022)

  41. [41]

    In: ICCV

    Zhou,K.,Yang,Y.,Cavallaro,A.,Xiang,T.:Omni-scalefeaturelearningforperson re-identification. In: ICCV. pp. 3702–3712 (2019)

  42. [42]

    NeurIPS37, 89943–89977 (2024)

    Zhu, L., Wang, L., Raj, A., Gedeon, T., Chen, C.: Advancing video anomaly de- tection: A concise review and a new dataset. NeurIPS37, 89943–89977 (2024)