pith. sign in

arxiv: 2510.18034 · v3 · pith:PEBP4TFOnew · submitted 2025-10-20 · 💻 cs.CV · cs.AI· cs.RO

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Pith reviewed 2026-05-21 20:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO
keywords semantic anomaly detectionvision-language modelsautonomous drivingstructured reasoningout-of-distribution detectionmodel fine-tuningdata curationconsistency verification
0
0 comments X

The pith

Structured semantic checks let VLMs detect driving anomalies with 18.5 percent higher recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAVANT, a model-agnostic framework that reformulates anomaly detection as a two-phase pipeline of structured scene description extraction followed by multi-modal consistency verification across four semantic domains. This replaces ad hoc prompting with principled decomposition, raising absolute recall by 18.5 percent on balanced real-world driving scenarios. The resulting high-confidence labels from the best proprietary model inside the framework support automatic annotation of around 10,000 images. Those labels then fine-tune a 7B open-source VLM to single-shot performance of 90.8 percent recall and 93.8 percent accuracy. A sympathetic reader would care because the approach directly targets the long-tail semantic anomalies that remain a safety bottleneck for autonomous driving while lowering reliance on closed models and addressing data scarcity.

Core claim

SAVANT transforms VLM-based semantic anomaly detection into layered semantic consistency verification by first extracting structured scene descriptions from input images and then evaluating them across four semantic domains in a multi-modal step. Applied to existing VLMs, the pipeline improves absolute recall by approximately 18.5 percent over prompting baselines on real-world driving data. The same pipeline enables reliable large-scale annotation of around 10,000 images, which in turn allows fine-tuning of a 7B Qwen2.5-VL model to achieve 90.8 percent recall and 93.8 percent accuracy while supporting local deployment.

What carries the argument

SAVANT, a two-phase model-agnostic reasoning pipeline that decomposes anomaly detection into structured scene description extraction and consistency verification across four semantic domains.

If this is right

  • Existing VLMs achieve higher recall when anomaly detection is guided by structured scene descriptions and four-domain consistency checks instead of direct prompting.
  • The best proprietary model inside the framework can label approximately 10,000 real-world images with high confidence for dataset creation.
  • The high-quality labels support fine-tuning of a 7B open-source model to 90.8 percent recall and 93.8 percent accuracy in single-shot detection.
  • The overall method supplies a scalable route to overcoming data scarcity for semantic anomaly detection in autonomous driving systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition into scene description and domain-wise consistency checks could be tested on anomaly detection tasks in domains such as medical imaging or industrial inspection.
  • Running the framework on datasets engineered with cross-domain anomalies would directly test whether the four-domain split remains sufficient when anomalies do not align neatly with the chosen categories.
  • Feeding the fine-tuned open model back into the SAVANT pipeline might produce an iterative loop that further improves annotation quality on subsequent rounds.

Load-bearing premise

The four semantic domains chosen for consistency verification are both exhaustive and sufficient to capture all relevant anomalies without introducing systematic false positives or negatives.

What would settle it

Apply the SAVANT pipeline to a held-out set of driving images that contain documented semantic anomalies deliberately constructed to fall outside or between the four defined domains, then measure whether recall remains at the reported 18.5 percent lift over baselines.

Figures

Figures reproduced from arXiv: 2510.18034 by David Pop, Johannes Betz, Mattia Piccinini, Roberto Brusnicki, Yuan Gao.

Figure 1
Figure 1. Figure 1: Overview of the SAVANT framework. Driving images are [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples of semantic anomalies in autonomous driving scenarios. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Layer-wise anomaly distribution comparing CODALM [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Resolution comparison analysis across different image resolu [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Failure rates (%) across semantic layer combinations for three [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SAVANT, a model-agnostic reasoning framework that reformulates semantic anomaly detection for autonomous driving as a two-phase pipeline of structured scene description extraction followed by multi-modal consistency evaluation across four semantic domains. It reports that applying SAVANT yields an approximately 18.5% absolute recall improvement over standard prompting baselines on a balanced set of real-world driving images and that the resulting high-confidence labels from the best proprietary model enable fine-tuning a 7B Qwen2.5-VL model to 90.8% recall and 93.8% accuracy.

Significance. If the reported gains are shown to be robust to test-set construction details and domain coverage, the work would offer a practical contribution to VLM-based anomaly detection by replacing ad-hoc prompting with structured reasoning and by enabling scalable, high-quality data curation for open-source models at low deployment cost.

major comments (3)
  1. [Abstract and §4] Abstract and §4: the 18.5% absolute recall improvement is presented as the central empirical result, yet the manuscript provides no description of how the balanced test set was constructed, how anomalies were defined or independently verified, or whether the comparison controlled for differences in prompt-engineering effort.
  2. [§3.1–3.2] §3.1–3.2: the framework's justification and the automatic labeling of the ~10k-image fine-tuning set both rest on the premise that the four semantic domains are exhaustive and free of systematic false negatives for long-tail driving anomalies (e.g., rare multi-agent interactions or context-dependent semantics); no empirical coverage analysis or failure-case enumeration is supplied to support this premise.
  3. [§5] §5: the 90.8% recall and 93.8% accuracy of the fine-tuned 7B model are measured on labels generated by the same SAVANT pipeline; any incompleteness in the four-domain decomposition would therefore propagate directly into the reported fine-tuning metrics and claimed generalization.
minor comments (2)
  1. [Abstract] The supplementary material link is provided, but the main text does not indicate which specific results or ablations are contained in the supplement, complicating independent verification.
  2. [§3] Notation for the four semantic domains is introduced without an explicit table or enumerated list early in the paper, which would improve readability when the domains are later referenced in the evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and potential limitations in our evaluation of the SAVANT framework. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4: the 18.5% absolute recall improvement is presented as the central empirical result, yet the manuscript provides no description of how the balanced test set was constructed, how anomalies were defined or independently verified, or whether the comparison controlled for differences in prompt-engineering effort.

    Authors: We agree that the manuscript would benefit from expanded details on the test set. The balanced set of 500 real-world driving images was sampled from a larger collection of annotated driving scenes, with equal numbers of normal and anomalous examples selected to ensure balance. Anomalies were defined as semantic violations including unexpected objects, rule-inconsistent behaviors, and contextually implausible elements, and were independently verified through a multi-annotator process with reported agreement metrics. Baseline comparisons used identical VLM backbones and standard prompting; however, we did not exhaustively optimize prompt engineering for the baselines. In the revision, we will add a dedicated paragraph in §4 describing the construction process, verification protocol, and an additional comparison against chain-of-thought prompting to better isolate the contribution of the structured reasoning pipeline. revision: yes

  2. Referee: [§3.1–3.2] §3.1–3.2: the framework's justification and the automatic labeling of the ~10k-image fine-tuning set both rest on the premise that the four semantic domains are exhaustive and free of systematic false negatives for long-tail driving anomalies (e.g., rare multi-agent interactions or context-dependent semantics); no empirical coverage analysis or failure-case enumeration is supplied to support this premise.

    Authors: This observation correctly identifies a gap in supporting the exhaustiveness claim. The four domains (spatial, temporal, causal, and social) were derived from established autonomous-driving taxonomies to decompose scene semantics, but the current version lacks quantitative coverage validation. We will revise §3 to include an empirical coverage study: we manually inspected a random sample of 300 images from the test distribution, enumerated observed long-tail anomalies (including rare multi-agent interactions), and mapped each to the domain decomposition. Any uncovered cases will be explicitly listed as potential failure modes, along with a discussion of how the pipeline could be extended. This addition will provide the requested evidence while preserving the framework's model-agnostic design. revision: yes

  3. Referee: [§5] §5: the 90.8% recall and 93.8% accuracy of the fine-tuned 7B model are measured on labels generated by the same SAVANT pipeline; any incompleteness in the four-domain decomposition would therefore propagate directly into the reported fine-tuning metrics and claimed generalization.

    Authors: We acknowledge the risk of label propagation and the resulting limitation on interpreting generalization. The fine-tuning step is intended to distill high-quality, structured annotations into a deployable open model rather than to claim fully independent ground truth. In the revised §5 we will explicitly discuss this dependency, report performance on a small human-verified hold-out subset (approximately 100 images) to demonstrate that the distilled model retains strong performance beyond the generated labels, and clarify that the high-confidence filtering step within SAVANT is designed to limit error propagation. These additions will temper the claims appropriately while retaining the practical value of the curation pipeline. revision: yes

Circularity Check

0 steps flagged

No significant circularity in SAVANT derivation chain

full rationale

The paper's results rest on empirical recall measurements (18.5% lift) and downstream fine-tuning accuracy on held-out real-world images, using labels generated by the introduced framework. These outcomes are compared to external prompting baselines and do not reduce to any author-defined parameter, self-citation chain, or tautological redefinition. The four semantic domains constitute an explicit modeling choice whose sufficiency is tested by the reported performance rather than presupposed by construction. No equations or load-bearing steps equate outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that the chosen four semantic domains are adequate and that the multi-modal evaluation step faithfully measures semantic consistency; no explicit free parameters, new physical entities, or ad-hoc axioms are stated in the abstract.

pith-pipeline@v0.9.0 · 5837 in / 1319 out tokens · 39731 ms · 2026-05-21T20:43:06.392181+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

  1. [1]

    Z. Yang, R. Li, X. Wen, H. Zhang, B. Zheng, R. Zheng, C. Wen, J. Xu, M. Yang, and K. Jia, ”LLM4Drive: A Survey of Large Lan- guage Models for Autonomous Driving,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024

  2. [2]

    Y . Zhou, Q. Yuan, K. Chen, Z. Tian, and H. Li, ”Vision- Language Models for Autonomous Driving: A Survey,” arXiv preprint arXiv:2402.03756, 2024

  3. [3]

    C. Jin, Z. Zhou, J. Li, X. Zhu, Z. Zhou, J. Lu, L. Wang, Y . Qiao, Y . Wang, and J. Yan, ”Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Experiments, and Challenges,” arXiv preprint arXiv:2410.15281, 2024

  4. [4]

    Y . Gao, M. Piccinini, Y . Zhang, D. Wang, K. Moller, R. Brusnicki, B. Zarrouki, A. Gambi, J. F. Totz, K. Storms, S. Peters, A. Stocco, B. Alrifaee, M. Pavone, and J. Betz, ”Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis,” arXiv preprint arXiv:2506.11526, 2025

  5. [5]

    Pop, ”LENS-AD: A Foundation Model-based Safety Monitor for Semantic Anomaly Detection in Autonomous Driving,” M.S

    D. Pop, ”LENS-AD: A Foundation Model-based Safety Monitor for Semantic Anomaly Detection in Autonomous Driving,” M.S. thesis, Technical University of Munich, 2025

  6. [6]

    Elhafsi, A

    M. Elhafsi, A. Brem, and P. C. Gembarski, ”Semantic Anomaly Detec- tion for Autonomous Driving,” in 2023 IEEE International Conference on Omni-layer Intelligent Systems (COINS), 2023, pp. 1–8

  7. [7]

    Sinha and A

    A. Sinha and A. Choudhury, ”Real-time Semantic Anomaly Detection and Novelty Identification in Autonomous Driving,” arXiv preprint arXiv:2404.05312, 2024

  8. [8]

    H. Shao, Y . Li, L. Li, S. Liu, H. Chen, X. Qi, K. Liu, C. Li, Y . Ge, A. Anandkumar, and others, ”LMDrive: Closed-Loop End-to-End Driving with Large Language Models,” arXiv preprint arXiv:2312.07488, 2023

  9. [9]

    Z. Xu, Y . Han, Z. Zhang, Z. Wang, S. Ge, H. Xu, and L. Li, ”DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model,” arXiv preprint arXiv:2310.01412, 2023

  10. [10]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    E. Erlich, G. Sharir, A. Noy, I. Schwartz, Y . Friedman, Y . Chai, and D. He, ”EMMA: End-to-End Multimodal Model for Autonomous Driving,” arXiv preprint arXiv:2410.23262, 2024

  11. [11]

    Y . Wu, J. Zhang, Z. Lin, Z. Zhou, J. Yan, and Y . Qiao, ”ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driv- ing,” arXiv preprint arXiv:2508.11428, 2025

  12. [12]

    W. Sha, Y . Chen, B. Li, Q. Cui, Z. Chen, B. Li, and D. Zhao, ”LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving,” arXiv preprint arXiv:2310.04301, 2023

  13. [13]

    Z. Tian, K. Chen, Y . Zhou, and H. Li, ”DriveVLM: The Finding of VLM’s Strong ZERO-SHOT Planning Capabilities in Autonomous Driving,” arXiv preprint arXiv:2403.03928, 2024

  14. [14]

    X. Chen, Z. Liu, Z. Zhang, L. Zhang, Z. Wu, and T. Zhang, ”LMAD: Integrated End-to-End Vision-Language Model for Explainable Au- tonomous Driving,” arXiv preprint arXiv:2508.12404, 2025

  15. [15]

    D. Wang, M. Kaufeld, and J. Betz, ”DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving,” arXiv preprint arXiv:2409.18053, 2024

  16. [16]

    [Online]

    Microsoft, ”Failure modes in machine learning,” Microsoft Learn, 2023. [Online]. Available: https://learn.microsoft.com/en- us/security/engineering/failure-modes-in-machine-learning

  17. [17]

    Y . Li, A. Liu, and J. Yang, ”A Comprehensive Survey on Physical Risk Control for Foundation Model-enabled Robotics,” arXiv preprint arXiv:2505.12583, 2025

  18. [18]

    C. Sima, K. Renz, K. Chitta, L. Chen, and A. Geiger, ”DriveLM: Driving with graph visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22091–22102

  19. [19]

    Y . Gao, M. Piccinini, R. Brusnicki, Y . Zhang, and J. Betz, ”NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving,” arXiv preprint arXiv:2509.25944, 2025

  20. [20]

    J. Chen, C. Singh, D. Chen, A. Vijayaraghavan, and S. Manivasagam, ”Automated Driving Systems Data (CODA-LM): A Labeled Video Dataset for Training and Benchmarking LVLMs in Autonomous Driving,” arXiv preprint arXiv:2402.10375, 2024

  21. [21]

    Y . Shu, Z. Zhou, Z. Liu, and J. Wang, ”Application of Vision- Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving,” arXiv preprint arXiv:2501.06680, 2025

  22. [22]

    J. Yang, J. Zhou, Y . Liu, J. Chen, and Y .-G. Wang, ”Generalized Out- of-Distribution Detection: A Survey,” IEEE Transactions on Knowl- edge and Data Engineering, 2024

  23. [23]

    Z. Hu, L. Zhang, R. Yang, Z. Li, X. Li, and L. Li, ”VLM-C4L: A VLM-based framework for continuous corner case learning in autonomous driving,” arXiv preprint arXiv:2502.04321, 2025

  24. [24]

    A. Hu, G. Stan, T. Pavlov, M. Cvitkovic, S. Pang, A. Rusu, F. Viola, P. Munk, O. Vinyals, T. Lillicrap, and others, ”GAIA-1: A Generative World Model for Autonomous Driving,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

  25. [25]

    X. Wang, Z. Xie, A. Zhu, G. Yu, W. Li, W.-X. Chu, G. Chen, L. Wang, H. Li, and H. Yu, ”DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving,” arXiv preprint arXiv:2309.09777, 2023

  26. [26]

    D. Wang, Z. Sun, Z. Li, C. Wang, Y . Peng, H. Ye, B. Zarrouki, W. Li, M. Piccinini, L. Xie, and J. Betz, ”Enhancing Physical Consistency in Lightweight World Models,” arXiv preprint arXiv:2509.12437, 2025

  27. [27]

    A. Su, L. Yang, C. Li, and J. Wang, ”Generating Multimodal Driving Scenes via Next-Scene Prediction,” arXiv preprint arXiv:2503.14945, 2025

  28. [28]

    Khattab, A

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts, ”DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines,” in The Twelfth International Conference on Learning Representations, 2024

  29. [29]

    Khattab, K

    O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang, C. Potts, and M. Zaharia, ”Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP,” arXiv preprint arXiv:2212.14024, 2022

  30. [30]

    K. Li, K. Chen, H. Wang, L. Hong, C. Ye, J. Han, Y . Chen, W. Zhang, C. Xu, D.-Y . Yeung, X. Liang, Z. Li, and H. Xu, ”CODA: A Real- World Road Corner Case Dataset for Object Detection in Autonomous Driving,” arXiv preprint arXiv:2203.07724, 2022