arxiv: 2604.23724 · v2 · submitted 2026-04-26 · 💻 cs.CV · cs.AI

Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference

Xiaowei Mao, Bowen Sui, Weijie Zhang, Yawen Yang, Shengnan Guo, Shilong Zhao, Jiaqi Lin, Tingrui Wu, Youfang Lin, Huaiyu Wa

Pith reviewed 2026-05-08 06:35 UTC · model grok-4.3

keywords: anomaly detection · expressway surveillance · vision-language models · Bayesian inference · far-field detection · video anomaly detection · real-time processing

The pith

VIBES uses online Bayesian inference to trigger Vision-Language Models on localized suspicious regions in expressway videos, improving far-field anomaly detection while lowering compute costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VIBES as an asynchronous framework that pairs an online Bayesian inference module with Vision-Language Models to handle anomaly detection in expressway surveillance videos. The Bayesian component tracks vehicle trajectories in real time to maintain and update probabilistic definitions of normal driving, then fires only when those boundaries are crossed. The VLM receives just the spatially and temporally localized image patches rather than full frames, which avoids attention dilution on distant subtle motions and keeps processing efficient. A sympathetic reader would care because expressway safety depends on catching rare far-field events like erratic vehicle behavior without flooding operators with false alarms or burning through cloud resources on continuous video.

Core claim

VIBES is an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. An online Bayesian inference module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. The VLM then processes only the localized visual regions indicated by the trigger instead of the continuous video stream, which prevents attention dilution and enables accurate semantic reasoning.
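The asynchronous hand-off described above can be sketched as a producer and a consumer joined by a queue. This is only an illustration of the pattern, not the paper's implementation: the per-frame scores, the threshold, and the names `trajectory_monitor`, `vlm_worker`, and `run_pipeline` are all hypothetical, and the VLM call is replaced by a placeholder string.

```python
import queue
import threading


def trajectory_monitor(scores, threshold, trigger_queue):
    """Producer: scan every frame's anomaly score, enqueue only the frames that cross the threshold."""
    for frame_idx, score in enumerate(scores):
        if score > threshold:
            trigger_queue.put(frame_idx)
    trigger_queue.put(None)  # sentinel marking the end of the stream


def vlm_worker(trigger_queue, results):
    """Consumer: wake up only when a trigger arrives (stand-in for an expensive VLM call)."""
    while True:
        frame_idx = trigger_queue.get()
        if frame_idx is None:
            break
        results.append(f"inspect frame {frame_idx}")  # hypothetical placeholder for VLM reasoning


def run_pipeline(scores, threshold=0.9):
    """Run the monitor and the worker concurrently; return what the worker handled."""
    trigger_queue = queue.Queue()
    results = []
    worker = threading.Thread(target=vlm_worker, args=(trigger_queue, results))
    worker.start()
    trajectory_monitor(scores, threshold, trigger_queue)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([0.1, 0.95, 0.2, 0.99]))
```

The point of the queue is that the cheap monitor never waits for the slow worker, and the worker is idle on frames that never trigger.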

What carries the argument

The online Bayesian inference module that evaluates vehicle trajectories to define and update probabilistic boundaries of normal behavior as an asynchronous trigger for localized VLM processing.
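Since the abstract does not specify the priors or update rules, the following is a minimal sketch of what an online Bayesian boundary could look like, assuming a single scalar feature (here, vehicle speed) and a conjugate Normal-Gamma prior. The class `NormalGammaBelief`, the multiplier `k`, and the choice to skip updates on triggered observations are assumptions made for illustration, not details from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class NormalGammaBelief:
    """Normal-Gamma posterior over the mean and precision of a scalar feature (assumed model)."""
    mu: float = 0.0     # posterior mean of the feature
    kappa: float = 1.0  # pseudo-count backing the mean
    alpha: float = 1.0  # shape of the precision posterior
    beta: float = 1.0   # rate of the precision posterior

    def predictive_scale(self):
        # Scale of the Student-t posterior predictive distribution.
        return math.sqrt(self.beta * (self.kappa + 1.0) / (self.alpha * self.kappa))

    def surprise(self, x):
        # Distance from the predicted value, in units of predictive scale.
        return abs(x - self.mu) / self.predictive_scale()

    def update(self, x):
        # Standard conjugate update for one observation.
        kappa_new = self.kappa + 1.0
        self.beta += self.kappa * (x - self.mu) ** 2 / (2.0 * kappa_new)
        self.mu = (self.kappa * self.mu + x) / kappa_new
        self.kappa = kappa_new
        self.alpha += 0.5


def monitor(speeds, belief, k=4.0):
    """Return indices that fall outside the current boundary; learn only from the rest."""
    triggers = []
    for t, x in enumerate(speeds):
        if belief.surprise(x) > k:
            triggers.append(t)  # boundary crossed: this is where a VLM query would fire
        else:
            belief.update(x)    # treated as normal: tighten or shift the boundary
    return triggers


if __name__ == "__main__":
    belief = NormalGammaBelief(mu=100.0, kappa=1.0, alpha=1.0, beta=25.0)
    print(monitor([100, 102, 98, 101, 99, 100, 20, 101], belief))
```

Excluding triggered observations from the update keeps a single anomaly from widening the boundary, at the cost of adapting more slowly if normal behavior genuinely shifts.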

If this is right

Detection accuracy rises for far-field anomalies because the VLM receives focused input instead of diluted global frames.
Computational overhead drops since the VLM processes only triggered localized regions rather than every frame.
Real-time efficiency improves to support live surveillance without constant high-resource demands.
The method supplies explainable outputs by linking Bayesian probability violations to the VLM's semantic interpretation.
Generalization holds across diverse expressway conditions once the Bayesian boundaries adapt online.
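The focused-input and overhead points above can be made concrete with a small sketch of cropping a padded region around a flagged vehicle. The bounding box, the padding ratio, and the helper names are hypothetical, and pixel count is used only as a rough proxy for VLM input size, not as a measurement of actual compute.

```python
def padded_crop_box(box, frame_w, frame_h, pad=0.5):
    """Expand a (x1, y1, x2, y2) box by `pad` times its size on each side, clamped to the frame."""
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * pad
    pad_y = (y2 - y1) * pad
    left = max(0, int(x1 - pad_x))
    top = max(0, int(y1 - pad_y))
    right = min(frame_w, int(x2 + pad_x))
    bottom = min(frame_h, int(y2 + pad_y))
    return left, top, right, bottom


def pixel_fraction(crop, frame_w, frame_h):
    """Share of the full frame's pixels covered by the crop."""
    left, top, right, bottom = crop
    return (right - left) * (bottom - top) / (frame_w * frame_h)


if __name__ == "__main__":
    # A far-field vehicle occupying 40 x 30 pixels in a 1920 x 1080 frame (illustrative numbers).
    crop = padded_crop_box((900, 200, 940, 230), 1920, 1080)
    print(crop, pixel_fraction(crop, 1920, 1080))
```

With these illustrative numbers the crop covers well under one percent of the frame, which is the intuition behind passing patches rather than full frames to the VLM.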

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Bayesian-trigger pattern could be tested on urban intersection cameras or highway toll plazas where far-field vehicles also appear small.
If the localized patches prove sufficient, the framework might pair with lighter-weight vision models instead of full VLMs to further cut latency.
Persistent drift in the Bayesian model under seasonal traffic changes would require an explicit forgetting or re-initialization schedule not detailed in the current design.
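One way to picture the forgetting schedule mentioned in the last point is an exponentially weighted running estimate, in which older observations lose influence at a fixed rate. This is a generic technique offered as an editorial sketch, not something the paper describes; the class name `ForgettingGaussian` and the `forget` factor are assumptions.

```python
class ForgettingGaussian:
    """Running mean and variance with exponential forgetting (assumed, not from the paper)."""

    def __init__(self, mean, var, forget=0.98):
        self.mean = mean
        self.var = var
        self.forget = forget  # 1.0 means the estimate never changes; smaller values adapt faster

    def update(self, x):
        weight = 1.0 - self.forget
        delta = x - self.mean
        self.mean += weight * delta
        # Exponentially weighted variance update.
        self.var = self.forget * (self.var + weight * delta * delta)


if __name__ == "__main__":
    model = ForgettingGaussian(mean=100.0, var=25.0, forget=0.9)
    for _ in range(200):
        model.update(80.0)  # traffic settles at a new typical speed
    print(round(model.mean, 3))
```

A smaller `forget` value tracks seasonal shifts sooner but also lets a burst of abnormal traffic redefine what counts as normal, so the rate would have to be tuned per deployment.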

Load-bearing premise

The premise is twofold: that the online Bayesian inference module can reliably define and update probabilistic boundaries of normal behavior in real time across varying expressway environments without excessive false positives or missed anomalies, and that the resulting localized regions provide sufficient context for accurate VLM semantic reasoning.

What would settle it

Running the system on a fresh expressway video dataset recorded under novel lighting, weather, or traffic-density conditions and observing either a sharp rise in false positives or a drop in recall for subtle far-field anomalies would falsify the generalization and reliability claims.
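The test described above comes down to comparing recall and false positive rate across recording conditions. A minimal sketch of that bookkeeping follows, assuming binary ground-truth labels and binary detector outputs per clip; the function names, the condition names, and the data in the usage example are invented for illustration and are not results from the paper.

```python
def recall_and_fpr(labels, predictions):
    """Return (recall, false positive rate); a rate is None when its denominator is zero."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    recall = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return recall, fpr


def per_condition_report(runs):
    """Map each condition name to its (recall, fpr) pair."""
    return {name: recall_and_fpr(labels, preds) for name, (labels, preds) in runs.items()}


if __name__ == "__main__":
    # Made-up labels and predictions, for shape only.
    runs = {
        "daylight": ([1, 1, 0, 0, 0], [1, 1, 0, 0, 0]),
        "night_rain": ([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]),
    }
    for condition, (recall, fpr) in per_condition_report(runs).items():
        print(condition, recall, fpr)
```

A marked gap between conditions in either number would be the signal the paragraph above describes.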

Figures

Figures reproduced from arXiv: 2604.23724 by Bowen Sui, Huaiyu Wa, Jiaqi Lin, Shengnan Guo, Shilong Zhao, Tingrui Wu, Weijie Zhang, Xiaowei Mao, Yawen Yang, Youfang Lin.

**Figure 1.** Comparison of expressway anomaly detection paradigms. (a) Global VLM Perception. Inputting full frames causes …

**Figure 2.** The architecture of the proposed VIBES framework. Left: Trajectory tracking extracts vehicle kinematics and resolves …

**Figure 4.** Ablation study results evaluating the impact of core …

**Figure 5.** Case study comparing VIBES and Qwen3-VL-8B. Red boxes denote specific frames selected for VLM processing based …

Original abstract

Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIBES pairs continuous Bayesian trajectory tracking with selective VLM calls on localized regions to handle far-field anomalies in expressway video without full-frame processing.

The core idea in this paper is to use an online Bayesian module to track normal vehicle trajectories and trigger a VLM only on specific localized areas when something looks off. This avoids running the language model on every frame and helps it focus on distant, subtle anomalies without getting lost in the whole scene.

What stands out is how they frame the collaboration: the Bayesian part runs continuously to update probabilistic boundaries of normal behavior, and only when it flags a potential issue does the VLM get a zoomed-in region to reason about semantically. That asynchronous trigger is the main novelty, building on existing anomaly detection but applying it specifically to far-field expressway cases with VLMs. It does a decent job addressing the practical problems of compute cost and attention dilution in video surveillance. The claim of better generalization across diverse conditions comes from the online updating, which makes sense on paper.

The soft spots are around validation. The abstract talks about extensive evaluations showing accuracy gains and efficiency, but I need to see the actual metrics, baselines, and how they handle varying conditions. The Bayesian update could be sensitive to parameter choices, and without details on false positive rates or how well it adapts, it's tough to judge if the gains are real or just from careful tuning. Also, the explainability part seems tied to the localization, but that might not add much new insight.

This paper is aimed at researchers and engineers working on real-time anomaly detection in transportation videos. Readers dealing with resource-constrained video analysis or multimodal models for surveillance would get some value from the pipeline description. It deserves a serious referee because the approach is coherent and targets a clear application need, even if the experiments need scrutiny. I'd recommend sending it to peer review with requests for more ablation studies and detailed results.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes VIBES, an asynchronous collaborative framework for far-field anomaly detection in expressway surveillance videos. It uses an online Bayesian inference module to continuously update probabilistic boundaries of normal vehicle trajectories from observed data, which serves as a trigger to localize anomalous regions in space and time; the VLM then performs semantic reasoning only on these focused regions rather than full frames, with the goal of improving detection accuracy for subtle motions, reducing computational overhead, and enhancing explainability and generalization across diverse scenes.

Significance. If the empirical results and implementation details support the claims, the work could be significant for real-time video surveillance applications in transportation safety, as it offers a principled way to combine online probabilistic modeling with the semantic capabilities of VLMs while addressing attention dilution and efficiency bottlenecks that typically arise when applying VLMs to high-resolution or continuous video streams.

major comments (1)

The abstract states that 'extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead' and shows 'generalization across diverse expressway conditions,' but the provided manuscript text contains no experimental section, no datasets, no quantitative metrics (e.g., AUC, F1, FPS), no baseline comparisons, and no ablation studies; without these, the central performance and generalization claims cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need for clear empirical support. We address the major comment below and will revise the manuscript accordingly.

Referee: The abstract states that 'extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead' and shows 'generalization across diverse expressway conditions,' but the provided manuscript text contains no experimental section, no datasets, no quantitative metrics (e.g., AUC, F1, FPS), no baseline comparisons, and no ablation studies; without these, the central performance and generalization claims cannot be assessed.

Authors: We agree that the version of the manuscript provided to the referee omitted the experimental section. The complete manuscript contains evaluations on multiple real-world expressway surveillance datasets, reporting quantitative metrics including AUC, F1-score, and FPS, along with comparisons against state-of-the-art baselines and ablation studies isolating the contributions of the Bayesian trigger and focused VLM reasoning. We will incorporate the full experimental section, including all datasets, metrics, tables, figures, and analyses, into the revised manuscript to substantiate the claims made in the abstract.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The provided abstract and context describe a pipeline in which an online Bayesian inference module updates probabilistic boundaries of normal trajectories to trigger localized VLM queries. No equations, self-citations, or load-bearing steps are quoted that reduce any claimed prediction or result to its own inputs by construction. The Bayesian component is presented as an adaptive trigger derived from observed data, and the VLM reasoning operates on the resulting localized inputs; these are independent modules whose outputs are not tautologically equivalent to their inputs. Empirical claims of accuracy and generalization rest on evaluations rather than definitional equivalence. This is the expected non-finding for a methods paper whose core logic does not collapse under the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Review limited to abstract; the framework relies on an online Bayesian inference module whose priors and update rules are not specified, plus assumptions about VLM behavior on cropped regions.

free parameters (1)

Bayesian prior and update parameters
Used to define and dynamically adjust probabilistic boundaries of normal driving behaviors from vehicle trajectories.
