
arxiv: 2501.08083 · v3 · submitted 2025-01-14 · 💻 cs.CV

Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

Pith reviewed 2026-05-23 05:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords out-of-distribution detection · vision foundation models · autonomous driving · density estimation · semantic shift · covariate shift · safety monitoring · unsupervised OOD

The pith

Vision foundation model embeddings with density estimation outperform existing methods at identifying out-of-distribution inputs for autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autonomous driving perception must cope with novel objects or changed conditions never seen in training data. The paper builds an unsupervised framework that extracts features from vision foundation models and fits a density model over the training distribution to score new inputs. This single approach targets both semantic shifts from unknown objects and covariate shifts from altered styles such as lighting. Benchmarks across four foundation models and five density techniques show the combination exceeds state-of-the-art binary out-of-distribution classifiers. The method additionally flags inputs likely to produce downstream errors, raising overall task performance.

Core claim

The paper claims that combining vision foundation models as feature extractors with density modeling yields a principled, unsupervised, model-agnostic monitor that unifies detection of semantic and covariate shifts by modeling the full training feature distribution and using point density as an in-distribution score. Systematic evaluation of four VFMs and five density techniques against established baselines demonstrates superior OOD identification, and the resulting scores also mark high-risk inputs that improve downstream performance when filtered.

What carries the argument

Vision foundation model embeddings used as input to density estimation techniques that compute an in-distribution score from the modeled training feature distribution.
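The scoring idea can be sketched in a few lines. The example below is a minimal illustration, not the paper's implementation: it assumes a diagonal-covariance Gaussian as a stand-in density model (the paper benchmarks five density techniques that are not specified here) and uses synthetic 8-dimensional vectors in place of real VFM embeddings. The function names `fit_diag_gaussian` and `log_density` are hypothetical.

```python
import math
import random

def fit_diag_gaussian(features):
    """Fit a per-dimension mean and variance to in-distribution feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [
        sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-6  # small floor keeps variance positive
        for d in range(dim)
    ]
    return mean, var

def log_density(x, mean, var):
    """Log-density of x under the fitted Gaussian; higher means more in-distribution."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Synthetic stand-ins for VFM embeddings (assumption: real features would come from a VFM).
rng = random.Random(0)
train = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
mean, var = fit_diag_gaussian(train)

id_sample = [0.1] * 8    # close to the training distribution
ood_sample = [4.0] * 8   # far from the training distribution
id_score = log_density(id_sample, mean, var)
ood_score = log_density(ood_sample, mean, var)
```

The density model is fitted on training features only, so no OOD examples are needed; a threshold on the score would then separate accepted from flagged inputs.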

If this is right

  • Detects semantic shifts from novel objects and covariate shifts from style changes such as lighting within one unsupervised, model-agnostic procedure.
  • Outperforms state-of-the-art binary out-of-distribution classification methods on autonomous driving data.
  • Identifies high-risk inputs that cause errors in downstream perception tasks, allowing selective filtering that raises overall task accuracy.
  • Requires no labeled out-of-distribution examples during training or operation.
  • Supplies the first systematic comparison of multiple vision foundation models for out-of-distribution monitoring under diverse autonomous driving conditions.
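As a rough illustration of the selective-filtering point above, the sketch below rejects the lowest-scoring fraction of inputs and measures accuracy on what remains. The scores and correctness flags are made-up values, and `filter_by_score` is a hypothetical helper, not something taken from the paper.

```python
def filter_by_score(id_scores, correct, reject_fraction):
    """Drop the lowest-scoring inputs and return accuracy on the retained ones."""
    order = sorted(range(len(id_scores)), key=lambda i: id_scores[i])
    n_reject = int(len(order) * reject_fraction)
    kept = order[n_reject:]
    return sum(correct[i] for i in kept) / len(kept)

# Hypothetical monitor scores and downstream correctness flags (1 = correct prediction).
id_scores = [-9.0, -8.5, -2.0, -1.5, -1.2, -1.0, -0.9, -0.8, -0.7, -0.5]
correct = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

acc_all = filter_by_score(id_scores, correct, 0.0)       # no filtering
acc_filtered = filter_by_score(id_scores, correct, 0.2)  # reject the lowest-scoring 20%
```

Accuracy on retained inputs rises only if low scores actually coincide with downstream errors, which is the relationship the paper reports observing.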

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same density-based scoring on foundation embeddings could be tested for safety monitoring in other open-world vision domains such as robotics or medical imaging.
  • Real-time deployment of the monitor might enable continuous filtering of risky frames without requiring any out-of-distribution labels.
  • Pre-trained foundation features appear general enough that density baselines transfer across different driving datasets and sensor setups.
  • Combining embeddings from several foundation models could be examined to increase robustness against particular shift types.

Load-bearing premise

The feature distributions learned by the chosen vision foundation models on the training set are representative enough to serve as a reliable density baseline for detecting both semantic and covariate shifts in real-world autonomous driving data.

What would settle it

A controlled test on a held-out autonomous driving dataset with labeled semantic and covariate shifts in which the VFM density scores achieve lower area under the ROC curve than at least one compared state-of-the-art binary OOD classifier.
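The comparison described here comes down to computing AUROC for each monitor on the same labeled inputs. A minimal sketch follows, assuming higher scores mean "more in-distribution"; the score lists are invented for illustration and `auroc` is a hypothetical helper using the pairwise-ranking definition.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random ID input outscores a random OOD input (ties count half)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Invented scores for two monitors evaluated on the same four ID and three OOD inputs.
density_monitor = auroc([-1.0, -0.8, -1.5, -0.9], [-6.0, -1.2, -7.5])
baseline = auroc([0.9, 0.6, 0.4, 0.8], [0.5, 0.7, 0.2])
```

Under the falsification test above, the claim would be in trouble if `density_monitor` came out below `baseline` on a properly held-out dataset.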

Figures

Figures reproduced from arXiv: 2501.08083 by Alois Knoll, Gesina Schwalbe, Halil Ibrahim Orhan, Matthias Rottmann, Mert Keser, Niki Amini-Naieni.

Figure 1: A monitoring system for autonomous driving that uses a …
Figure 2: Visualization of Mask2Anomaly [63] applied to selected images exhibiting semantic and covariate shifts. The left column presents the original images from various datasets, including Lost and Found [59], ACDC Night and Rain [69], and SegmentMeIfYouCan Anomaly Track [11]. The right column displays the corresponding OOD object-level maps generated by Mask2Anomaly.
Figure 3: Example images from four subsets of ACDC Dataset
Figure 4: Example images from Bravo datasets [50] and Lost and Found Dataset [59].
Bravo Datasets: The Bravo dataset family [50] provides synthetic perturbations of Cityscapes images to evaluate model robustness under controlled distribution shifts. The collection comprises three distinct subsets.
Figure 5: Example images from Cityscapes [16], Cityscapes Fog [68], Indian Driving dataset [77] and SegmentMeIfYouCan [11].
SegmentMeIfYouCan Dataset: The SegmentMeIfYouCan [11] dataset (SMIYC) is designed for evaluating anomaly detection in semantic segmentation tasks. It includes challenging scenarios with diverse visual contexts and synthetic anomalies that closely resemble real-world driving conditions.
Figure 6: Comparison of AIC values across various model backbones, highlighting the trade-off between model complexity and goodness …
Figure 7: AIC values for ImageNet-trained backbones and autoencoders trained on reference data. Each subplot depicts the AIC profile …
Original abstract

Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised and model-agnostic method that unifies detection of semantic and covariate shifts: Find a full model of the training data's feature distribution, to then use its density at new points as in-distribution (ID) score. To implement this, we propose to combine Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance. Overall, VFMs, when coupled with robust density modeling techniques, are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Vision Foundation Models (VFMs) used as feature extractors, when paired with density modeling techniques, enable a unified, unsupervised, model-agnostic approach to detect both semantic and covariate shifts for input monitoring in autonomous driving. A benchmark of 4 VFMs and 5 density methods is presented, showing that this combination outperforms state-of-the-art binary OOD classification methods on selected datasets and can flag high-risk inputs to improve downstream task performance.

Significance. If the empirical results hold under rigorous verification, the work provides a valuable systematic comparison of VFMs for OOD detection in a safety-critical domain. The unsupervised density-based framework addresses limitations of supervised or shift-specific methods, and the focus on both semantic and covariate shifts is a constructive contribution to operational monitoring.

major comments (2)
  1. [Experiments] Experiments section: The outperformance claim over SOTA binary OOD methods depends on the assumption that VFM feature distributions fitted on the training set are representative for detecting shifts in real-world AD data. No separate validation or ablation is provided for generalization to unmodeled operational variability (e.g., rare events, sensor artifacts, or shift combinations), which is load-bearing for the reliability conclusion.
  2. [Methods] Methods/Implementation details: The manuscript lacks sufficient specification of exact training/test splits, hyperparameter choices for the 5 density techniques, and baseline implementations to enable independent reproduction of the reported superiority, directly affecting verifiability of the central empirical comparison.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'first systematic evaluation' could be tempered or supported with a brief note on prior related benchmarks to avoid overstatement.
  2. [Results] Figures/Tables: Include statistical significance measures (e.g., p-values or confidence intervals) alongside performance metrics to strengthen the outperformance statements.
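One common way to attach the uncertainty estimates requested in the second minor comment is a percentile bootstrap over the evaluation scores. The sketch below is an assumption about how that could be done, not a procedure from the paper; the scores are invented, and `bootstrap_ci` is a hypothetical helper.

```python
import random

def auroc(id_scores, ood_scores):
    """Pairwise-ranking AUROC; ties between an ID and an OOD score count half."""
    pairs = [(a > b) + 0.5 * (a == b) for a in id_scores for b in ood_scores]
    return sum(pairs) / len(pairs)

def bootstrap_ci(id_scores, ood_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling ID and OOD sets separately."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        id_bs = [rng.choice(id_scores) for _ in id_scores]
        ood_bs = [rng.choice(ood_scores) for _ in ood_scores]
        stats.append(auroc(id_bs, ood_bs))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Invented in-distribution and OOD scores for one monitor.
id_scores = [-1.0, -0.8, -1.5, -0.9, -1.1, -0.7]
ood_scores = [-6.0, -1.2, -7.5, -3.0, -0.95]
point = auroc(id_scores, ood_scores)
lower, upper = bootstrap_ci(id_scores, ood_scores)
```

Reporting the interval next to the point estimate makes it easier to judge whether a gap between two monitors exceeds resampling noise.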

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating planned revisions where appropriate to improve verifiability and strengthen the reliability claims.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The outperformance claim over SOTA binary OOD methods depends on the assumption that VFM feature distributions fitted on the training set are representative for detecting shifts in real-world AD data. No separate validation or ablation is provided for generalization to unmodeled operational variability (e.g., rare events, sensor artifacts, or shift combinations), which is load-bearing for the reliability conclusion.

    Authors: The benchmark evaluates detection across multiple semantic and covariate shifts drawn from established AD datasets (e.g., variations in objects, weather, and lighting), which are designed to capture operational variability. The density modeling is performed solely on ID training features and evaluated on held-out shifted inputs, providing direct evidence of the method's ability to flag distribution changes. We acknowledge that explicit ablations on rare events or sensor artifacts would further support generalization claims. In the revision we will add a limitations paragraph discussing these unmodeled cases and outline how the framework could be extended (e.g., via incremental density updates), while retaining the current empirical results as evidence for the tested conditions. revision: partial

  2. Referee: [Methods] Methods/Implementation details: The manuscript lacks sufficient specification of exact training/test splits, hyperparameter choices for the 5 density techniques, and baseline implementations to enable independent reproduction of the reported superiority, directly affecting verifiability of the central empirical comparison.

    Authors: We agree that additional implementation details are required for full reproducibility. The revised manuscript will include: (i) precise descriptions of the training/test splits for each dataset and VFM, (ii) the exact hyperparameter settings used for each of the five density estimators (including any grid-search or default values), and (iii) references or pseudocode for the baseline binary OOD methods. These details will be placed in the main Methods section and expanded in a new reproducibility appendix. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

full rationale

The paper is a comparative empirical study that benchmarks 4 VFMs paired with 5 density estimators against binary OOD baselines on AD datasets. No equations, uniqueness theorems, or predictive derivations are presented; performance claims rest on direct experimental measurements rather than any reduction to fitted parameters or self-citations. The method is described as unsupervised and model-agnostic, with no load-bearing steps that collapse to the inputs by construction. This is the standard case of a self-contained benchmark against external datasets and methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the approach relies on standard pre-trained VFMs and off-the-shelf density estimators whose assumptions are inherited from prior work.



Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 7 internal anchors

  1. [1]

    One-class support vector classifiers: A survey

    Shamshe Alam, Sanjay Kumar Sonbhadra, Sonali Agarwal, and P Nagabhushan. One-class support vector classifiers: A survey. Knowledge-Based Systems, 196:105754, 2020. 3, 7

  2. [2]

    Foundation models defining a new era in vision: a survey and outlook

    Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 3

  3. [3]

    Monitizer: Automating design and evaluation of neural network monitors

    Muqsit Azeem, Marta Grobelna, Sudeep Kanav, Jan Křetínský, Stefanie Mohr, and Sabine Rieder. Monitizer: Automating design and evaluation of neural network monitors. In International Conference on Computer Aided Verification, pages 265–279. Springer, 2024. 6

  4. [4]

    Simultaneous semantic segmentation and outlier detection in presence of domain shift

    Petra Bevandić, Ivan Krešo, Marin Oršić, and Siniša Šegvić. Simultaneous semantic segmentation and outlier detection in presence of domain shift. In Pattern Recognition: 41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, September 10–13, 2019, Proceedings 41, pages 33–47. Springer, 2019. 1

  5. [5]

    Dense outlier detection and open-set recognition based on training with noisy negative images

    Petra Bevandić, Ivan Krešo, Marin Oršić, and Siniša Šegvić. Dense outlier detection and open-set recognition based on training with noisy negative images. arXiv preprint arXiv:2101.09193, 2021. 2

  6. [6]

    Unsupervised domain adaptation to improve image segmentation quality both in the source and target domain

    Jan-Aike Bolte, Markus Kamp, Antonia Breuer, Silviu Homoceanu, Peter Schlicht, Fabian Huger, Daniel Lipinski, and Tim Fingscheidt. Unsupervised domain adaptation to improve image segmentation quality both in the source and target domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019. 2

  7. [7]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. 3

  8. [8]

    One-class support vector machines revisited

    Abdenour Bounsiar and Michael G Madden. One-class support vector machines revisited. In 2014 International Conference on Information Science & Applications (ICISA), pages 1–4. IEEE, 2014. 3, 7

  9. [9]

    Understanding ADAS: Lane Keep Assist, 2024

    CarADAS. Understanding ADAS: Lane Keep Assist, 2024. Accessed: 4 March 2025. 1

  10. [10]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021. 2, 3, 5

  11. [11]

    Segmentmeifyoucan: A benchmark for anomaly segmentation

    Robin Chan, Krzysztof Lis, Svenja Uhlemeyer, Hermann Blum, Sina Honari, Roland Siegwart, Pascal Fua, Mathieu Salzmann, and Matthias Rottmann. Segmentmeifyoucan: A benchmark for anomaly segmentation. arXiv preprint arXiv:2104.14812, 2021. 2, 5, 6, 7, 1

  12. [12]

    Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation

    Robin Chan, Matthias Rottmann, and Hanno Gottschalk. Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5128–5137, 2021. 2, 1

  13. [13]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017. 1

  14. [14]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018. 8

  15. [15]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 1, 3

  16. [16]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 3, 5, 6, 7, 8, 2, 4

  17. [17]

    Council of the European Union. Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts - analysis of the final compromise text with a view to agreement. https://data.consilium.europa.eu/ doc / document / ST ...

  18. [18]

    Accessed: 2024-03-23. 2

  19. [19]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5, 2

  20. [20]

    Density estimation using Real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016. 4, 8

  21. [21]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021. 5, 2

  22. [22]

    Dropout as a bayesian approximation: Representing model uncertainty in deep learning

    Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning , pages 1050–1059. PMLR, 2016. 2

  23. [23]

    Generalize or detect? towards robust semantic segmentation under multiple distribution shifts

    Zhitong Gao, Bingnan Li, Mathieu Salzmann, and Xuming He. Generalize or detect? towards robust semantic segmentation under multiple distribution shifts. arXiv preprint arXiv:2411.03829, 2024. 2

  24. [24]

    Densehybrid: Hybrid anomaly detection for dense open-set recognition

    Matej Grcić, Petra Bevandić, and Siniša Šegvić. Densehybrid: Hybrid anomaly detection for dense open-set recognition. In European Conference on Computer Vision, pages 500–517. Springer, 2022. 1

  25. [25]

    On advantages of mask-level recognition for outlier-aware segmentation

    Matej Grcić, Josip Šarić, and Siniša Šegvić. On advantages of mask-level recognition for outlier-aware segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2937–2947, 2023. 1

  26. [26]

    Detecting and mitigating system-level anomalies of vision-based controllers

    Aryaman Gupta, Kaustav Chakraborty, and Somil Bansal. Detecting and mitigating system-level anomalies of vision-based controllers. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9953–9959. IEEE,

  27. [27]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 5, 8, 2

  28. [28]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 2

  29. [29]

    Towards corner case detection by modeling the uncertainty of instance segmentation networks

    Florian Heidecker, Abdul Hannan, Maarten Bieshaar, and Bernhard Sick. Towards corner case detection by modeling the uncertainty of instance segmentation networks. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part IV, pages 361–374. Springer, 2021. 2

  30. [30]

    Monitoring perception reliability in autonomous driving: Distributional shift detection for estimating the impact of input data on prediction accuracy

    Franz Hell, Gereon Hinz, Feng Liu, Sakshi Goyal, Ke Pei, Tetiana Lytvynenko, Alois Knoll, and Chen Yiqiang. Monitoring perception reliability in autonomous driving: Distributional shift detection for estimating the impact of input data on prediction accuracy. In Proceedings of the 5th ACM Computer Science in Cars Symposium, pages 1–9, 2021. 2

  31. [31]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016. 2

  32. [32]

    Searching for mobilenetv3

    Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019. 8

  33. [33]

    On the importance of gradients for detecting distributional shifts in the wild

    Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting distributional shifts in the wild. Advances in Neural Information Processing Systems, 34:677–689, 2021. 2, 6, 7, 8

  34. [34]

    On the potential of open-vocabulary models for object detection in unusual street scenes

    Sadia Ilyas, Ido Freeman, and Matthias Rottmann. On the potential of open-vocabulary models for object detection in unusual street scenes. arXiv preprint arXiv:2408.11221 ,

  35. [35]

    ISO/PAS 8800:2024 – Road Vehicles – Safety and Artificial Intelligence, 2024

    International Organization for Standardization. ISO/PAS 8800:2024 – Road Vehicles – Safety and Artificial Intelligence, 2024. Accessed: 4 March 2025. 2

  36. [36]

    Mdetr-modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021. 4

  37. [37]

    Language-extended indoor slam (lexis): A versatile system for real-time visual scene understanding

    Christina Kassab, Matias Mattamala, Lintong Zhang, and Maurice Fallon. Language-extended indoor slam (lexis): A versatile system for real-time visual scene understanding. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15988–15994. IEEE, 2024. 2

  38. [38]

    What uncertainties do we need in bayesian deep learning for computer vision?

    Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems, 30, 2017. 1

  39. [39]

    Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

    Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015. 2

  40. [40]

    Openimages: A public dataset for large-scale multi-label and multi-class image classification

    Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2(3):18, 2017. 4

  41. [41]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 4

  42. [42]

    Out-of-distribution identification: Let detector tell which i am not sure

    Ruoqi Li, Chongyang Zhang, Hao Zhou, Chao Shi, and Yan Luo. Out-of-distribution identification: Let detector tell which i am not sure. In European Conference on Computer Vision, pages 638–654. Springer, 2022. 2

  43. [43]

    Enhancing the reliability of out-of-distribution image detection in neural networks

    Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. 6, 8

  44. [44]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 4

  45. [45]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 3

  46. [46]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 1, 3, 5, 7, 8, 2, 4, 6, 10

  47. [47]

    Energy-based out-of-distribution detection

    Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475,

  48. [48]

    Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation

    Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 1151–1161, 2023. 1

  49. [49]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 1, 5, 2, 6

  50. [50]

    Self-supervised domain mismatch estimation for autonomous perception

    Jonas Lohdefink, Justin Fehrling, Marvin Klingner, Fabian Huger, Peter Schlicht, Nico M Schmidt, and Tim Fingscheidt. Self-supervised domain mismatch estimation for autonomous perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 334–335, 2020. 2

  51. [51]

    Reliability in semantic segmentation: Can we use synthetic data?

    Thibaut Loiseau, Tuan-Hung Vu, Mickael Chen, Patrick Pérez, and Matthieu Cord. Reliability in semantic segmentation: Can we use synthetic data? arXiv preprint arXiv:2312.09231, 2023. 5, 7, 3, 4, 10, 11, 12

  52. [52]

    Entropic out-of-distribution detection

    David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano LI Oliveira, and Teresa Ludermir. Entropic out-of-distribution detection. In 2021 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2021. 2, 6, 7, 8

  53. [53]

    Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2 edition, 2021. 4

  54. [54]

    Rba: Segmenting unknown regions rejected by all

    Nazir Nayal, Misra Yavuz, Joao F Henriques, and Fatma Güney. Rba: Segmenting unknown regions rejected by all. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 711–722, 2023. 1

  55. [55]

    Oodis: Anomaly instance segmentation benchmark

    Alexey Nekrasov, Rui Zhou, Miriam Ackermann, Alexander Hermans, Bastian Leibe, and Matthias Rottmann. Oodis: Anomaly instance segmentation benchmark. arXiv preprint arXiv:2406.11835, 2024.

  56. [56]

    Road obstacle detection method based on an autoencoder with semantic segmentation

    Toshiaki Ohgushi, Kenji Horiguchi, and Masao Yamanaka. Road obstacle detection method based on an autoencoder with semantic segmentation. In Proceedings of the Asian Conference on Computer Vision, 2020.

  57. [57]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  58. [58]

    Normalizing flows for probabilistic modeling and inference

    George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.

  59. [59]

    Perceptiongpt: Effectively fusing visual perception into llm

    Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27124–27133, 2024.

  60. [60]

    Lost and found: detecting small road hazards for self-driving vehicles

    Peter Pinggera, Sebastian Ramos, Stefan Gehrig, Uwe Franke, Carsten Rother, and Rudolf Mester. Lost and found: detecting small road hazards for self-driving vehicles. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1099–1106. IEEE, 2016.

  62. [62]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.

  63. [63]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  64. [64]

    Run-time monitoring of machine learning for robotic perception: A survey of emerging trends

    Quazi Marufur Rahman, Peter Corke, and Feras Dayoub. Run-time monitoring of machine learning for robotic perception: A survey of emerging trends. IEEE Access, 9:20067–20075, 2021.

  65. [65]

    Mask2anomaly: Mask transformer for universal open-set segmentation

    Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, and Carlo Masone. Mask2anomaly: Mask transformer for universal open-set segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  66. [66]

    Grounding dino 1.5: Advance the "edge" of open-set object detection

    Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024.

  67. [67]

    Prediction error meta classification in semantic segmentation: Detection via aggregated dispersion measures of softmax probabilities

    Matthias Rottmann, Pascal Colling, Thomas Paul Hack, Robin Chan, Fabian Hüger, Peter Schlicht, and Hanno Gottschalk. Prediction error meta classification in semantic segmentation: Detection via aggregated dispersion measures of softmax probabilities. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020.

  68. [68]

    Imagenet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.

  69. [69]

    Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (SAE J3016), 2021

    SAE International. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (SAE J3016), 2021. Accessed: 4 March 2025.

  70. [70]

    Semantic foggy scene understanding with synthetic data

    Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126:973–992, 2018.

  71. [71]

    Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding

    Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021.

  72. [72]

    Improving robustness against common corruptions by covariate shift adaptation

    Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. Advances in Neural Information Processing Systems, 33:11539–11551, 2020.

  73. [73]

    Ssd: A unified framework for self-supervised outlier detection

    Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. arXiv preprint arXiv:2103.12051, 2021.

  74. [74]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8430–8439, 2019.

  75. [75]

    Transnext: Robust foveal visual perception for vision transformers

    Dai Shi. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17773–17783, 2024.

  76. [76]

    Misbehaviour prediction for autonomous driving systems

    Andrea Stocco, Michael Weiss, Marco Calzana, and Paolo Tonella. Misbehaviour prediction for autonomous driving systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 359–371,

  77. [77]

    Dice: Leveraging sparsification for out-of-distribution detection

    Yiyou Sun and Yixuan Li. Dice: Leveraging sparsification for out-of-distribution detection. In European Conference on Computer Vision, pages 691–708. Springer, 2022.

  78. [78]

    Out-of-distribution detection with deep nearest neighbors

    Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International Conference on Machine Learning, pages 20827–20840. PMLR, 2022.

  79. [79]

    Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments

    Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1743–1751. IEEE, 2019.

  80. [80]

    Image-consistent detection of road anomalies as unpredictable patches

    Tomáš Vojíř and Jiří Matas. Image-consistent detection of road anomalies as unpredictable patches. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5491–5500, 2023.

Showing first 80 references.