arxiv: 2510.18034 · v2 · submitted 2025-10-20 · 💻 cs.CV · cs.AI · cs.RO

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Pith reviewed 2026-05-18 05:43 UTC · model grok-4.3

keywords: semantic anomaly detection · vision-language models · autonomous driving · structured reasoning · anomaly detection · fine-tuning · out-of-distribution · data curation

The pith

A structured reasoning framework lets VLMs detect semantic anomalies in driving scenes with about 18.5 percentage points higher recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that reformulating anomaly detection as layered semantic consistency verification allows existing vision-language models to identify rare out-of-distribution events in driving images much more reliably than standard prompting. The authors demonstrate this through a two-phase pipeline that extracts structured scene descriptions and then evaluates them across multiple semantic domains and modalities. The resulting performance lift makes it practical to automatically label thousands of real-world images, which are then used to fine-tune a smaller open-source model for accurate single-shot detection. A sympathetic reader would care because autonomous driving systems currently struggle with long-tail semantic anomalies that threaten safety, and this method offers a route to scalable data and deployable models without full reliance on proprietary systems.

Core claim

SAVANT is a model-agnostic reasoning framework that reformulates anomaly detection as layered semantic consistency verification. It applies a two-phase pipeline of structured scene description extraction and multi-modal evaluation to transform VLM-based detection into a principled decomposition across four semantic domains. On a balanced set of real-world driving scenarios, this improves VLM recall by about 18.5 percentage points (absolute) over prompting baselines. Using the framework to label around 10,000 images then allows fine-tuning a 7B open-source model to 90.8 percent recall and 93.8 percent accuracy for single-shot detection.
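The control flow of such a two-phase pipeline can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the four layer names are the ones quoted further down this page, while the function names, the prompt wording, the YES/NO protocol, and the `stub_vlm` stand-in are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Layer names as quoted on this page; everything else below is hypothetical.
LAYERS = ("Street", "Infrastructure", "Movable Objects", "Environment")

# A "VLM" here is any callable mapping (prompt, image bytes) -> text.
VLM = Callable[[str, bytes], str]


@dataclass
class Verdict:
    anomalous: bool
    flagged_layers: Tuple[str, ...]


def describe_scene(vlm: VLM, image: bytes) -> Dict[str, str]:
    """Phase 1: request one structured description per semantic layer."""
    return {
        layer: vlm(f"Describe the {layer} layer of this driving scene.", image)
        for layer in LAYERS
    }


def evaluate_scene(vlm: VLM, image: bytes, description: Dict[str, str]) -> Verdict:
    """Phase 2: check each layer's description (text plus image) for consistency."""
    flagged = []
    for layer, text in description.items():
        answer = vlm(
            f"Layer: {layer}\nDescription: {text}\n"
            "Is this consistent with a normal driving scene? Answer YES or NO.",
            image,
        )
        if answer.strip().upper().startswith("NO"):
            flagged.append(layer)
    return Verdict(anomalous=bool(flagged), flagged_layers=tuple(flagged))


def detect(vlm: VLM, image: bytes) -> Verdict:
    """Run both phases on a single image."""
    return evaluate_scene(vlm, image, describe_scene(vlm, image))


def stub_vlm(prompt: str, image: bytes) -> str:
    """Toy stand-in for a real model, used only to exercise the control flow."""
    if prompt.startswith("Describe"):
        if "Movable Objects" in prompt:
            return "a sofa lying in the lane"
        return "nothing unusual"
    return "NO" if "sofa" in prompt else "YES"


if __name__ == "__main__":
    print(detect(stub_vlm, b""))
```

Because the model is passed in as a plain callable, the same two functions would work with any backend, which is one simple way to read "model-agnostic" here.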

What carries the argument

SAVANT's two-phase pipeline of structured scene description extraction followed by multi-modal evaluation that performs semantic consistency verification across four domains.

Load-bearing premise

The reported recall gains depend on the chosen balanced set of real-world driving scenarios being representative of the long-tail semantic anomalies that occur in actual autonomous driving.

What would settle it

Testing the fine-tuned 7B model on an independent collection of driving images containing verified rare semantic anomalies outside the original labeled set would show whether the 90.8 percent recall generalizes.
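Such a check comes down to recomputing recall and accuracy on the independent set. A minimal sketch, assuming binary labels where a truthy value means "anomalous"; the function name and the toy labels are illustrative and are not data from the paper.

```python
from typing import Sequence, Tuple


def recall_and_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (recall, accuracy) for binary labels where truthy means anomalous."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # anomalies caught
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      # anomalies missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # normal scenes passed
    if tp + fn == 0:
        raise ValueError("recall is undefined without at least one true anomaly")
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(y_true)
    return recall, accuracy


if __name__ == "__main__":
    # Toy labels only: four anomalous scenes, four normal ones.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    preds = [1, 1, 1, 0, 0, 0, 0, 0]
    print(recall_and_accuracy(truth, preds))
```

Reporting both numbers matters on a balanced set: a detector can trade missed anomalies against false alarms, and recall alone would hide the second kind of error.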

Figures

Figures reproduced from arXiv: 2510.18034 by David Pop, Johannes Betz, Mattia Piccinini, Roberto Brusnicki, Yuan Gao.

**Figure 2.** Examples of semantic anomalies in autonomous driving scenarios.

**Figure 3.** Layer-wise anomaly distribution comparing CODALM…

**Figure 4.** Resolution comparison analysis across different image resolu…

**Figure 5.** Failure rates (%) across semantic layer combinations for three…

Original abstract

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs achieve significantly higher scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, SAVANT provides a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://SAV4N7.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAVANT is a two-phase structured reasoning pipeline that lifts VLM recall on semantic driving anomalies by about 18.5 percentage points over prompting baselines and then bootstraps a fine-tuned 7B open model to 90.8% recall, but the details of how the test set was constructed are thin.

The main thing to know is that this paper introduces SAVANT as a model-agnostic framework that replaces loose prompting with a two-phase process: first pulling a structured scene description from the image, then running multi-modal consistency checks across four semantic domains. On their test images this yields an 18.5% absolute recall gain over plain prompting baselines, and they use the stronger detector to auto-label roughly 10k real-world driving images. Those labels then fine-tune Qwen2.5-VL-7B to 90.8% recall and 93.8% accuracy, which beats the other models they tried and lets the system run locally at low cost.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SAVANT, a model-agnostic reasoning framework that reformulates semantic anomaly detection in autonomous driving as a two-phase pipeline of structured scene description extraction followed by multi-modal semantic consistency evaluation across four domains. It reports that applying SAVANT to existing VLMs yields an approximately 18.5% absolute recall improvement over prompting baselines on a balanced set of real-world driving scenarios. The framework is further used to auto-label around 10,000 images with a proprietary model, enabling fine-tuning of a 7B open-source VLM (Qwen2.5-VL) to 90.8% recall and 93.8% accuracy for single-shot detection.

Significance. If the reported gains hold after addressing dataset details, this work could be significant for practical deployment in autonomous systems by replacing ad-hoc VLM prompting with principled semantic decomposition and by enabling scalable, high-quality data curation for open models at low cost. The model-agnostic design and end-to-end demonstration from reasoning framework to locally deployable fine-tuned model are strengths that address data scarcity in long-tail anomaly detection.

major comments (2)

[Evaluation Dataset] The central claim of an 18.5% recall improvement is evaluated on a 'balanced set of real-world driving scenarios,' but the manuscript provides no details on selection criteria, sampling protocol for long-tail anomalies, anomaly-type distribution, filtering rules, or inter-annotator agreement. This information is required to isolate the contribution of the SAVANT pipeline from possible dataset construction effects.
[Fine-tuning and Results] The claim that high-confidence automatic labels from the proprietary model were used to create the ~10,000-image fine-tuning set lacks explicit confidence thresholds, validation against human annotations, or error analysis on the labeled data. This detail is load-bearing for interpreting the downstream 90.8% recall result as a reliable outcome of the framework.
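The two checks requested in the second comment (an explicit confidence threshold and validation against human annotations) could look roughly like the sketch below. It assumes each automatic label carries a confidence score in [0, 1] and that a human-audited subset exists; the `AutoLabel` type, the 0.9 threshold, and the toy records are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple, Optional


class AutoLabel(NamedTuple):
    image_id: str
    anomalous: bool
    confidence: float  # hypothetical model-reported score in [0, 1]


def keep_confident(labels: List[AutoLabel], threshold: float) -> List[AutoLabel]:
    """Keep only automatic labels at or above the confidence threshold."""
    return [lab for lab in labels if lab.confidence >= threshold]


def agreement_with_humans(labels: List[AutoLabel], human: Dict[str, bool]) -> Optional[float]:
    """Fraction of automatic labels that match human labels on the audited subset."""
    audited = [lab for lab in labels if lab.image_id in human]
    if not audited:
        return None  # nothing to validate against
    matches = sum(lab.anomalous == human[lab.image_id] for lab in audited)
    return matches / len(audited)


if __name__ == "__main__":
    # Toy records only.
    auto = [
        AutoLabel("a", True, 0.95),
        AutoLabel("b", False, 0.60),
        AutoLabel("c", True, 0.91),
        AutoLabel("d", False, 0.99),
    ]
    human_audit = {"a": True, "b": True, "c": False, "d": False}
    kept = keep_confident(auto, threshold=0.9)
    print([lab.image_id for lab in kept], agreement_with_humans(kept, human_audit))
```

Sweeping the threshold and reporting agreement at each setting would show how much label quality is bought by discarding low-confidence images.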

minor comments (2)

[Abstract] The supplementary material URL is given, but the main text should explicitly reference any accompanying code, prompts, or dataset splits to support reproducibility claims.
[Method] The four semantic domains in the multi-modal evaluation step are referenced but would benefit from a concise table or explicit definitions to clarify coverage of anomaly types.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and have revised the paper to incorporate the requested details on dataset construction and fine-tuning procedures.

Referee: [Evaluation Dataset] The central claim of an 18.5% recall improvement is evaluated on a 'balanced set of real-world driving scenarios,' but the manuscript provides no details on selection criteria, sampling protocol for long-tail anomalies, anomaly-type distribution, filtering rules, or inter-annotator agreement. This information is required to isolate the contribution of the SAVANT pipeline from possible dataset construction effects.

Authors: We agree that the current manuscript lacks sufficient detail on the evaluation dataset, which is necessary to substantiate the reported gains. In the revised version, we have added a dedicated subsection under Experimental Setup that describes the selection criteria for the balanced set, the sampling protocol used to ensure coverage of long-tail anomalies, the anomaly-type distribution, the filtering rules applied during curation, and the inter-annotator agreement statistics. These additions allow readers to more clearly attribute performance differences to the SAVANT framework.
revision: yes

Referee: [Fine-tuning and Results] The claim that high-confidence automatic labels from the proprietary model were used to create the ~10,000-image fine-tuning set lacks explicit confidence thresholds, validation against human annotations, or error analysis on the labeled data. This detail is load-bearing for interpreting the downstream 90.8% recall result as a reliable outcome of the framework.

Authors: We concur that greater transparency regarding the automatic labeling process is required to support interpretation of the fine-tuning results. We have revised the relevant section to specify the confidence thresholds applied when selecting labels from the proprietary model, to report a validation study on a held-out subset against human annotations, and to include a brief error analysis of the automatically labeled data. These changes strengthen the evidential basis for the reported 90.8% recall and 93.8% accuracy of the fine-tuned model.
revision: yes
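The first response promises inter-annotator agreement statistics without naming one. Cohen's kappa is a common choice for two annotators, and the sketch below shows how it is computed; whether the authors report kappa or another statistic is not stated on this page, and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy annotations: 1 = anomalous, 0 = normal.
    annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0]
    annotator_2 = [1, 1, 0, 0, 0, 0, 1, 1]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain match rate when judging label quality.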

Circularity Check

0 steps flagged

No circularity: empirical evaluation of SAVANT framework is self-contained

The paper presents SAVANT as a model-agnostic reasoning pipeline for VLM-based semantic anomaly detection in driving scenes. Claims rest on direct experimental measurements: recall improvement of ~18.5% over prompting baselines on a described dataset, followed by large-scale labeling and fine-tuning of Qwen2.5-VL to 90.8% recall. No equations, parameter fits, or derivations are invoked. No self-citations are used to justify uniqueness or load-bearing premises. Results are reported as outcomes of applying the two-phase pipeline (scene description + multi-modal evaluation) rather than any reduction to inputs by construction. The evaluation set is presented as an external benchmark, with no indication that its construction depends on the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that VLMs guided by structured multi-phase prompts can reliably extract scene descriptions and detect semantic inconsistencies; this is treated as a domain assumption rather than derived. No free parameters are described. SAVANT is introduced as a new framework without independent external validation in the abstract.

axioms (1)

Domain assumption: Vision-language models can perform reliable structured scene description extraction and multi-modal semantic consistency evaluation when provided with a layered prompting pipeline.
Invoked to justify the two-phase pipeline replacing ad hoc prompting.

invented entities (1)

SAVANT (no independent evidence)
Purpose: Model-agnostic reasoning framework that reformulates anomaly detection as layered semantic consistency verification across four domains.
Newly proposed toolkit whose effectiveness is demonstrated empirically in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "two-phase pipeline: structured scene description extraction followed by multi-modal evaluation... four semantic layers: Street, Infrastructure, Movable Objects, and Environment"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "SAVANT achieves 89.6% recall and 88.0% accuracy... fine-tuned 7B... 90.8% recall and 93.8% accuracy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
