Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

Alois Knoll; Gesina Schwalbe; Halil Ibrahim Orhan; Matthias Rottmann; Mert Keser; Niki Amini-Naieni

arxiv: 2501.08083 · v3 · submitted 2025-01-14 · 💻 cs.CV

Mert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

Pith reviewed 2026-05-23 05:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords out-of-distribution detection · vision foundation models · autonomous driving · density estimation · semantic shift · covariate shift · safety monitoring · unsupervised OOD

The pith

Vision foundation model embeddings with density estimation outperform existing methods at identifying out-of-distribution inputs for autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autonomous driving perception must cope with novel objects or changed conditions never seen in training data. The paper builds an unsupervised framework that extracts features from vision foundation models and fits a density model over the training distribution to score new inputs. This single approach targets both semantic shifts from unknown objects and covariate shifts from altered styles such as lighting. Benchmarks across four foundation models and five density techniques show the combination exceeds state-of-the-art binary out-of-distribution classifiers. The method additionally flags inputs likely to produce downstream errors, raising overall task performance.

Core claim

The paper claims that combining vision foundation models as feature extractors with density modeling yields a principled, unsupervised, model-agnostic monitor that unifies detection of semantic and covariate shifts by modeling the full training feature distribution and using point density as an in-distribution score. Systematic evaluation of four VFMs and five density techniques against established baselines demonstrates superior OOD identification, and the resulting scores also mark high-risk inputs that improve downstream performance when filtered.

What carries the argument

Vision foundation model embeddings used as input to density estimation techniques that compute an in-distribution score from the modeled training feature distribution.
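
A minimal sketch of the scoring idea, under stated assumptions: the embeddings are synthetic 4-dimensional vectors standing in for VFM features, and a diagonal Gaussian stands in for the density model. The paper benchmarks five density techniques that this page does not detail, so this illustrates density-as-ID-score only and is not the authors' implementation. The names `fit_diag_gaussian` and `id_score` are hypothetical.

```python
import math
import random

def fit_diag_gaussian(embeddings):
    """Fit a per-dimension mean and variance to in-distribution embeddings."""
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(d)]
    var = [
        sum((e[j] - mean[j]) ** 2 for e in embeddings) / n + 1e-6  # small variance floor
        for j in range(d)
    ]
    return mean, var

def id_score(x, mean, var):
    """Log-density of x under the fitted Gaussian; higher means more in-distribution."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Assumption: synthetic vectors replace real VFM embeddings of training images.
rng = random.Random(0)
train = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(500)]
mean, var = fit_diag_gaussian(train)

near = [0.1, -0.2, 0.0, 0.3]  # resembles the training features
far = [6.0, 6.0, -6.0, 6.0]   # far from the training features
print(id_score(near, mean, var), id_score(far, mean, var))
```

The point far from the training features receives a much lower log-density, which is the signal a monitor would threshold on.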

If this is right

Detects semantic shifts from novel objects and covariate shifts from style changes such as lighting within one unsupervised, model-agnostic procedure.
Outperforms state-of-the-art binary out-of-distribution classification methods on autonomous driving data.
Identifies high-risk inputs that cause errors in downstream perception tasks, allowing selective filtering that raises overall task accuracy.
Requires no labeled out-of-distribution examples during training or operation.
Supplies the first systematic comparison of multiple vision foundation models for out-of-distribution monitoring under diverse autonomous driving conditions.
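
The selective-filtering consequence above can be sketched as follows. The scores, correctness flags, and the threshold of -5.0 are hypothetical values chosen for illustration; the page does not say how the paper sets its threshold, and `filter_by_score` is an assumed helper name.

```python
def filter_by_score(scores, correct, threshold):
    """Keep inputs whose ID score is at least `threshold`.

    Returns (coverage, accuracy on the kept inputs).
    """
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(kept) / len(scores)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy

# Hypothetical ID scores and downstream correctness flags (1 = correct prediction).
scores = [-3.1, -2.8, -4.0, -9.5, -12.0, -3.3, -15.2, -2.9]
correct = [1, 1, 1, 0, 0, 1, 1, 1]

base_cov, base_acc = filter_by_score(scores, correct, float("-inf"))  # no filtering
cov, acc = filter_by_score(scores, correct, -5.0)                      # drop low-density inputs
print(base_cov, base_acc, cov, acc)
```

Filtering trades coverage for accuracy: fewer inputs are processed, and in this toy data the rejected ones include both errors along with one input that would have been handled correctly.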

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same density-based scoring on foundation embeddings could be tested for safety monitoring in other open-world vision domains such as robotics or medical imaging.
Real-time deployment of the monitor might enable continuous filtering of risky frames without requiring any out-of-distribution labels.
Pre-trained foundation features appear general enough that density baselines transfer across different driving datasets and sensor setups.
Combining embeddings from several foundation models could be examined to increase robustness against particular shift types.

Load-bearing premise

The feature distributions learned by the chosen vision foundation models on the training set are representative enough to serve as a reliable density baseline for detecting both semantic and covariate shifts in real-world autonomous driving data.

What would settle it

A controlled test on a held-out autonomous driving dataset with labeled semantic and covariate shifts in which the VFM density scores achieve lower area under the ROC curve than at least one compared state-of-the-art binary OOD classifier.
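
For reference, the area under the ROC curve in such a test can be computed directly from scores as the probability that an in-distribution input outscores an OOD input. The sketch below uses hypothetical scores and an assumed function name `auroc`; it is quadratic in the number of samples and meant only to make the metric concrete.

```python
def auroc(id_scores, ood_scores):
    """Fraction of (ID, OOD) pairs where the ID score is higher; ties count half."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Hypothetical ID scores (higher = more in-distribution).
id_scores = [-3.0, -2.5, -4.1, -3.7]
ood_scores = [-9.0, -3.9, -12.5]

print(auroc(id_scores, ood_scores))  # 11 of the 12 pairs are ordered correctly
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation, so the settling test reduces to comparing this number across methods on the same held-out data.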

Figures

Figures reproduced from arXiv: 2501.08083 by Alois Knoll, Gesina Schwalbe, Halil Ibrahim Orhan, Matthias Rottmann, Mert Keser, Niki Amini-Naieni.

**Figure 2.** Visualization of Mask2Anomaly [63] applied to selected images exhibiting semantic and covariate shifts. The left column presents the original images from various datasets, including Lost and Found [59], ACDC Night and Rain [69], and SegmentMeIfYouCan Anomaly Track [11]. The right column displays the corresponding OOD object-level maps generated by Mask2Anomaly.

**Figure 3.** Example images from four subsets of the ACDC dataset.

**Figure 4.** Example images from the Bravo datasets [50] and the Lost and Found dataset [59]. The Bravo dataset family [50] provides synthetic perturbations of Cityscapes images to evaluate model robustness under controlled distribution shifts. The collection comprises three distinct subsets.

**Figure 5.** Example images from Cityscapes [16], Cityscapes Fog [68], the Indian Driving dataset [77], and SegmentMeIfYouCan [11]. The SegmentMeIfYouCan [11] dataset (SMYIC) is designed for evaluating anomaly detection in semantic segmentation tasks. It includes challenging scenarios with diverse visual contexts and synthetic anomalies that closely resemble real-world driving conditions.

**Figure 6.** Comparison of AIC values across various model backbones, highlighting the trade-off between model complexity and goodness …

**Figure 7.** AIC values for ImageNet-trained backbones and autoencoders trained on reference data. Each subplot depicts the AIC profile …
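
For readers unfamiliar with the criterion named in these two captions, the standard definition of the Akaike information criterion is given below. Here $k$ is the number of free parameters of the fitted density model and $\hat{L}$ its maximized likelihood on the reference data; lower values are preferred. The second line is an illustration only: it assumes a Gaussian mixture with $K$ full-covariance components in $d$ dimensions, which the captions do not confirm is the model being profiled.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}

k_{\mathrm{GMM}} = (K - 1) + Kd + K\,\frac{d(d+1)}{2}
```

The first term penalizes model complexity and the second rewards fit, which is the trade-off the Figure 6 caption refers to.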

Original abstract

Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised and model-agnostic method that unifies detection of semantic and covariate shifts: Find a full model of the training data's feature distribution, to then use its density at new points as in-distribution (ID) score. To implement this, we propose to combine Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance. Overall, VFMs, when coupled with robust density modeling techniques, are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VFM embeddings plus density estimation beat binary OOD baselines on the tested AD shifts, but the representativeness assumption for real-world variability is untested.

This paper's core finding is that vision foundation models used as feature extractors, combined with density modeling, outperform state-of-the-art binary OOD classifiers for detecting both semantic and covariate shifts in autonomous driving data. It delivers the first systematic benchmark of four different VFMs paired with five density techniques. The setup is unsupervised and model-agnostic, which addresses a real need for monitors that do not require labeled OOD examples. The results also link the density scores to reduced errors in downstream tasks by identifying high-risk inputs.

The benchmark itself is the main contribution. It applies established ideas about density estimation on embeddings to the AD setting and shows consistent gains across the tested conditions.

The soft spot is the reliance on the training feature distribution being representative enough for operational shifts. The paper demonstrates good performance on the selected datasets, but does not include separate tests for how well the density model handles unmodeled combinations of shifts or rare sensor issues. If those distributions are incomplete, the low-density scores could miss important cases. The abstract claims outperformance, but without the full methods it is difficult to judge whether the baselines were implemented at full strength.

This is for researchers working on safety monitors in autonomous driving perception. Anyone evaluating OOD methods for complex vision domains will get value from the comparisons. I would send it to peer review. The empirical protocol is clear enough to merit referee input on the experimental design and generalizability.

Referee Report

2 major / 2 minor

Summary. The paper claims that Vision Foundation Models (VFMs) used as feature extractors, when paired with density modeling techniques, enable a unified, unsupervised, model-agnostic approach to detect both semantic and covariate shifts for input monitoring in autonomous driving. A benchmark of 4 VFMs and 5 density methods is presented, showing that this combination outperforms state-of-the-art binary OOD classification methods on selected datasets and can flag high-risk inputs to improve downstream task performance.

Significance. If the empirical results hold under rigorous verification, the work provides a valuable systematic comparison of VFMs for OOD detection in a safety-critical domain. The unsupervised density-based framework addresses limitations of supervised or shift-specific methods, and the focus on both semantic and covariate shifts is a constructive contribution to operational monitoring.

major comments (2)

[Experiments] Experiments section: The outperformance claim over SOTA binary OOD methods depends on the assumption that VFM feature distributions fitted on the training set are representative for detecting shifts in real-world AD data. No separate validation or ablation is provided for generalization to unmodeled operational variability (e.g., rare events, sensor artifacts, or shift combinations), which is load-bearing for the reliability conclusion.
[Methods] Methods/Implementation details: The manuscript lacks sufficient specification of exact training/test splits, hyperparameter choices for the 5 density techniques, and baseline implementations to enable independent reproduction of the reported superiority, directly affecting verifiability of the central empirical comparison.

minor comments (2)

[Abstract] Abstract: The claim of 'first systematic evaluation' could be tempered or supported with a brief note on prior related benchmarks to avoid overstatement.
[Results] Figures/Tables: Include statistical significance measures (e.g., p-values or confidence intervals) alongside performance metrics to strengthen the outperformance statements.
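
One way to act on the request for confidence intervals is a percentile bootstrap over the test scores. The sketch below is an assumption about how this could be done, not something the paper reports: the scores are hypothetical, the helper names `auroc` and `bootstrap_ci` are invented for illustration, and ID and OOD samples are resampled separately with replacement.

```python
import random

def auroc(id_scores, ood_scores):
    """Fraction of (ID, OOD) pairs where the ID score is higher; ties count half."""
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in id_scores
        for b in ood_scores
    )
    return wins / (len(id_scores) * len(ood_scores))

def bootstrap_ci(id_scores, ood_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling each group with replacement."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = sorted(
        auroc(rng.choices(id_scores, k=len(id_scores)),
              rng.choices(ood_scores, k=len(ood_scores)))
        for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lo_idx], stats[hi_idx]

# Hypothetical ID scores for in-distribution and OOD test inputs.
id_scores = [-3.0, -2.5, -4.1, -3.7, -2.9, -3.3]
ood_scores = [-9.0, -3.9, -12.5, -3.1, -7.4]

point = auroc(id_scores, ood_scores)
lo, hi = bootstrap_ci(id_scores, ood_scores)
print(f"AUROC = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Overlapping intervals between two methods would signal that a reported ranking may not be stable under resampling of the test set.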

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating planned revisions where appropriate to improve verifiability and strengthen the reliability claims.

Referee: [Experiments] Experiments section: The outperformance claim over SOTA binary OOD methods depends on the assumption that VFM feature distributions fitted on the training set are representative for detecting shifts in real-world AD data. No separate validation or ablation is provided for generalization to unmodeled operational variability (e.g., rare events, sensor artifacts, or shift combinations), which is load-bearing for the reliability conclusion.

Authors: The benchmark evaluates detection across multiple semantic and covariate shifts drawn from established AD datasets (e.g., variations in objects, weather, and lighting), which are designed to capture operational variability. The density modeling is performed solely on ID training features and evaluated on held-out shifted inputs, providing direct evidence of the method's ability to flag distribution changes. We acknowledge that explicit ablations on rare events or sensor artifacts would further support generalization claims. In the revision we will add a limitations paragraph discussing these unmodeled cases and outline how the framework could be extended (e.g., via incremental density updates), while retaining the current empirical results as evidence for the tested conditions. revision: partial
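
The rebuttal does not specify what "incremental density updates" would look like. As one hedged illustration, a per-dimension Gaussian can be updated one embedding at a time with Welford's algorithm, without refitting on the full training set. The class name `RunningGaussian` and the toy inputs are assumptions, and a diagonal Gaussian is a stand-in for whichever density model would actually be used.

```python
class RunningGaussian:
    """Per-dimension running mean and variance via Welford's online algorithm."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new embedding into the running statistics."""
        self.n += 1
        for j, xj in enumerate(x):
            delta = xj - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (xj - self.mean[j])

    def variance(self):
        """Population variance per dimension; requires at least one update."""
        if self.n == 0:
            raise ValueError("no samples seen yet")
        return [m2 / self.n for m2 in self.m2]

# Toy 2-dimensional embeddings (hypothetical values).
g = RunningGaussian(2)
for x in [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]:
    g.update(x)
print(g.mean, g.variance())
```

Whether newly observed inputs should be folded into the reference distribution at all is a separate safety question, since absorbing shifted data would make the monitor less sensitive to that shift.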
Referee: [Methods] Methods/Implementation details: The manuscript lacks sufficient specification of exact training/test splits, hyperparameter choices for the 5 density techniques, and baseline implementations to enable independent reproduction of the reported superiority, directly affecting verifiability of the central empirical comparison.

Authors: We agree that additional implementation details are required for full reproducibility. The revised manuscript will include: (i) precise descriptions of the training/test splits for each dataset and VFM, (ii) the exact hyperparameter settings used for each of the five density estimators (including any grid-search or default values), and (iii) references or pseudocode for the baseline binary OOD methods. These details will be placed in the main Methods section and expanded in a new reproducibility appendix. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

The paper is a comparative empirical study that benchmarks 4 VFMs paired with 5 density estimators against binary OOD baselines on AD datasets. No equations, uniqueness theorems, or predictive derivations are presented; performance claims rest on direct experimental measurements rather than any reduction to fitted parameters or self-citations. The method is described as unsupervised and model-agnostic, with no load-bearing steps that collapse to the inputs by construction. This is the standard case of a self-contained benchmark against external datasets and methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the approach relies on standard pre-trained VFMs and off-the-shelf density estimators whose assumptions are inherited from prior work.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 7 internal anchors

[1]

One-class support vector classifiers: A survey

Shamshe Alam, Sanjay Kumar Sonbhadra, Sonali Agarwal, and P Nagabhushan. One-class support vector classifiers: A survey. Knowledge-Based Systems, 196:105754, 2020. 3, 7

work page 2020
[2]

Foundation models defining a new era in vision: a survey and outlook

Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 3

work page 2025
[3]

Monitizer: Automating design and evaluation of neural network monitors

Muqsit Azeem, Marta Grobelna, Sudeep Kanav, Jan Křetínský, Stefanie Mohr, and Sabine Rieder. Monitizer: Automating design and evaluation of neural network monitors. In International Conference on Computer Aided Verification, pages 265–279. Springer, 2024. 6

work page 2024
[4]

Simultaneous semantic segmentation and outlier detection in presence of domain shift

Petra Bevandić, Ivan Krešo, Marin Oršić, and Siniša Šegvić. Simultaneous semantic segmentation and outlier detection in presence of domain shift. In Pattern Recognition: 41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, September 10–13, 2019, Proceedings 41, pages 33–47. Springer, 2019. 1

work page 2019
[5]

Dense outlier detection and open-set recognition based on training with noisy negative images

Petra Bevandić, Ivan Krešo, Marin Oršić, and Siniša Šegvić. Dense outlier detection and open-set recognition based on training with noisy negative images. arXiv preprint arXiv:2101.09193, 2021. 2

work page arXiv 2021
[6]

Unsupervised domain adaptation to improve image segmentation quality both in the source and target domain

Jan-Aike Bolte, Markus Kamp, Antonia Breuer, Silviu Homoceanu, Peter Schlicht, Fabian Huger, Daniel Lipinski, and Tim Fingscheidt. Unsupervised domain adaptation to improve image segmentation quality both in the source and target domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. 2

work page 2019
[7]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

One-class support vector machines revisited

Abdenour Bounsiar and Michael G Madden. One-class support vector machines revisited. In 2014 International Conference on Information Science & Applications (ICISA), pages 1–4. IEEE, 2014. 3, 7

work page 2014
[9]

Understanding ADAS: Lane Keep Assist, 2024

CarADAS. Understanding ADAS: Lane Keep Assist, 2024. Accessed: 4 March 2025. 1

work page 2024
[10]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021. 2, 3, 5

work page 2021
[11]

Segmentmeifyoucan: A benchmark for anomaly segmentation

Robin Chan, Krzysztof Lis, Svenja Uhlemeyer, Hermann Blum, Sina Honari, Roland Siegwart, Pascal Fua, Mathieu Salzmann, and Matthias Rottmann. Segmentmeifyoucan: A benchmark for anomaly segmentation. arXiv preprint arXiv:2104.14812, 2021. 2, 5, 6, 7, 1

work page arXiv 2021
[12]

Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation

Robin Chan, Matthias Rottmann, and Hanno Gottschalk. Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5128–5137, 2021. 2, 1

work page 2021
[13]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017. 1

work page 2017
[14]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018. 8

work page 2018
[15]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 1, 3

work page 2024
[16]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 3, 5, 6, 7, 8, 2, 4

work page 2016
[17]

Council of the European Union. Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts - analysis of the final compromise text with a view to agreement. https://data.consilium.europa.eu/ doc / document / ST ...

work page 2024
[18]

Accessed: 2024-03-23. 2

work page 2024
[19]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009. 5, 2

work page 2009
[20]

Density estimation using Real NVP

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016. 4, 8

work page internal anchor Pith review Pith/arXiv arXiv 2016
[21]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021. 5, 2

work page 2021
[22]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016. 2

work page 2016
[23]

Generalize or detect? towards robust semantic segmentation under multiple distribution shifts

Zhitong Gao, Bingnan Li, Mathieu Salzmann, and Xuming He. Generalize or detect? towards robust semantic segmentation under multiple distribution shifts. arXiv preprint arXiv:2411.03829, 2024. 2

work page arXiv 2024
[24]

Densehybrid: Hybrid anomaly detection for dense open-set recognition

Matej Grcić, Petra Bevandić, and Siniša Šegvić. Densehybrid: Hybrid anomaly detection for dense open-set recognition. In European Conference on Computer Vision, pages 500–517. Springer, 2022. 1

work page 2022
[25]

On advantages of mask-level recognition for outlier-aware segmentation

Matej Grcić, Josip Šarić, and Siniša Šegvić. On advantages of mask-level recognition for outlier-aware segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2937–2947, 2023. 1

work page 2023
[26]

Detecting and mitigating system-level anomalies of vision-based controllers

Aryaman Gupta, Kaustav Chakraborty, and Somil Bansal. Detecting and mitigating system-level anomalies of vision-based controllers. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9953–9959. IEEE,

work page 2024
[27]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 5, 8, 2

work page 2016
[28]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 2

work page 2017
[29]

Towards corner case detection by modeling the uncertainty of instance segmentation networks

Florian Heidecker, Abdul Hannan, Maarten Bieshaar, and Bernhard Sick. Towards corner case detection by modeling the uncertainty of instance segmentation networks. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part IV, pages 361–374. Springer, 2021. 2

work page 2021
[30]

Monitoring perception reliability in autonomous driving: Distributional shift detection for estimating the impact of input data on prediction accuracy

Franz Hell, Gereon Hinz, Feng Liu, Sakshi Goyal, Ke Pei, Tetiana Lytvynenko, Alois Knoll, and Chen Yiqiang. Monitoring perception reliability in autonomous driving: Distributional shift detection for estimating the impact of input data on prediction accuracy. In Proceedings of the 5th ACM Computer Science in Cars Symposium, pages 1–9, 2021. 2

work page 2021
[31]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016. 2

work page internal anchor Pith review Pith/arXiv arXiv 2016
[32]

Searching for mobilenetv3

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019. 8

work page 2019
[33]

On the importance of gradients for detecting distributional shifts in the wild

Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting distributional shifts in the wild. Advances in Neural Information Processing Systems, 34:677–689, 2021. 2, 6, 7, 8

work page 2021
[34]

On the potential of open-vocabulary models for object detection in unusual street scenes

Sadia Ilyas, Ido Freeman, and Matthias Rottmann. On the potential of open-vocabulary models for object detection in unusual street scenes. arXiv preprint arXiv:2408.11221 ,

work page arXiv
[35]

ISO/PAS 8800:2024 – Road Vehicles – Safety and Artificial Intelligence, 2024

International Organization for Standardization. ISO/PAS 8800:2024 – Road Vehicles – Safety and Artificial Intelligence, 2024. Accessed: 4 March 2025. 2

work page 2024
[36]

Mdetr-modulated detection for end-to-end multi-modal understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021. 4

work page 2021
[37]

Language-extended indoor slam (lexis): A versatile system for real-time visual scene understanding

Christina Kassab, Matias Mattamala, Lintong Zhang, and Maurice Fallon. Language-extended indoor slam (lexis): A versatile system for real-time visual scene understanding. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15988–15994. IEEE, 2024. 2

work page 2024
[38]

What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems, 30, 2017

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems, 30, 2017. 1

work page 2017
[39]

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015. 2

work page internal anchor Pith review Pith/arXiv arXiv 2015
[40]

Openimages: A public dataset for large-scale multi-label and multi-class image classification

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2(3):18, 2017. 4

work page 2017
[41]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 4

work page 2017
[42]

Out-of-distribution identification: Let detector tell which i am not sure

Ruoqi Li, Chongyang Zhang, Hao Zhou, Chao Shi, and Yan Luo. Out-of-distribution identification: Let detector tell which i am not sure. In European Conference on Computer Vision, pages 638–654. Springer, 2022. 2

work page 2022
[43]

Enhancing the reliability of out-of-distribution image detection in neural networks

Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. 6, 8

work page arXiv 2017
[44]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 4

work page 2014
[45]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 3

work page 2023
[46]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 1, 3, 5, 7, 8, 2, 4, 6, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Energy-based out-of-distribution detection

Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475,

work page
[48]

Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation

Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 1151–1161, 2023. 1

[49]

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 1, 5, 2, 6

[50]

Jonas Lohdefink, Justin Fehrling, Marvin Klingner, Fabian Huger, Peter Schlicht, Nico M Schmidt, and Tim Fingscheidt. Self-supervised domain mismatch estimation for autonomous perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 334–335, 2020. 2

[51]

Thibaut Loiseau, Tuan-Hung Vu, Mickael Chen, Patrick Pérez, and Matthieu Cord. Reliability in semantic segmentation: Can we use synthetic data? arXiv preprint arXiv:2312.09231, 2023. 5, 7, 3, 4, 10, 11, 12

[52]

David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano LI Oliveira, and Teresa Ludermir. Entropic out-of-distribution detection. In 2021 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2021. 2, 6, 7, 8

[53]

Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2 edition, 2021. 4

[54]

Nazir Nayal, Misra Yavuz, Joao F Henriques, and Fatma Güney. Rba: Segmenting unknown regions rejected by all. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 711–722, 2023. 1

[55]

Alexey Nekrasov, Rui Zhou, Miriam Ackermann, Alexander Hermans, Bastian Leibe, and Matthias Rottmann. Oodis: Anomaly instance segmentation benchmark. arXiv preprint arXiv:2406.11835, 2024. 2

[56]

Toshiaki Ohgushi, Kenji Horiguchi, and Masao Yamanaka. Road obstacle detection method based on an autoencoder with semantic segmentation. In Proceedings of the Asian conference on computer vision, 2020. 2

[57]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 3, 5, 4

[58]

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. 4

[59]

Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27124–27133, 2024. 1

[60]

Peter Pinggera, Sebastian Ramos, Stefan Gehrig, Uwe Franke, Carsten Rother, and Rudolf Mester. Lost and found: detecting small road hazards for self-driving vehicles. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1099–1106. IEEE, 2016. 2, 5, 7, 1, 3, 4, 10, 11

[62]

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015. 4

[63]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5, 2, 4, 10

[64]

Quazi Marufur Rahman, Peter Corke, and Feras Dayoub. Run-time monitoring of machine learning for robotic perception: A survey of emerging trends. IEEE Access, 9:20067–20075, 2021. 2

[65]

Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, and Carlo Masone. Mask2anomaly: Mask transformer for universal open-set segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1

[66]

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024. 10

[67]

Matthias Rottmann, Pascal Colling, Thomas Paul Hack, Robin Chan, Fabian Hüger, Peter Schlicht, and Hanno Gottschalk. Prediction error meta classification in semantic segmentation: Detection via aggregated dispersion measures of softmax probabilities. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020. 1

[68]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015. 4

[69]

SAE International. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (SAE J3016), 2021. Accessed: 4 March 2025. 1

[70]

Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126:973–992, 2018. 5, 6, 7

[71]

Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021. 3, 7, 1, 4, 12, 13

[72]

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. Advances in neural information processing systems, 33:11539–11551, 2020. 2

[73]

Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. arXiv preprint arXiv:2103.12051, 2021. 2

[74]

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 4

[75]

Dai Shi. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17773–17783, 2024. 1

[76]

Andrea Stocco, Michael Weiss, Marco Calzana, and Paolo Tonella. Misbehaviour prediction for autonomous driving systems. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 359–371,

[77]

Yiyou Sun and Yixuan Li. Dice: Leveraging sparsification for out-of-distribution detection. In European Conference on Computer Vision, pages 691–708. Springer, 2022. 6, 7, 8

[78]

Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International Conference on Machine Learning, pages 20827–20840. PMLR, 2022. 2, 6, 7, 8

[79]

Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 1743–1751. IEEE, 2019. 5, 6, 7, 8

[80]

Tomáš Vojíř and Jiří Matas. Image-consistent detection of road anomalies as unpredictable patches. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5491–5500, 2023. 1



[42] [42]

Out-of-distribution identification: Let detector tell which i am not sure

Ruoqi Li, Chongyang Zhang, Hao Zhou, Chao Shi, and Yan Luo. Out-of-distribution identification: Let detector tell which i am not sure. In European Conference on Computer Vision, pages 638–654. Springer, 2022. 2

work page 2022

[43] [43]

Enhanc- ing the reliability of out-of-distribution image detection in neural networks

Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhanc- ing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. 6, 8

work page arXiv 2017

[44] [44]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 4

work page 2014

[45] [45]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 3

work page 2023

[46] [46]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 1, 3, 5, 7, 8, 2, 4, 6, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Energy-based out-of-distribution detection

Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances 10 in neural information processing systems, 33:21464–21475,

work page

[48] [48]

Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation

Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 1151–1161, 2023. 1

work page 2023

[49] [49]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 1, 5, 2, 6

work page 2021

[50] [50]

Self-supervised domain mismatch estimation for autonomous perception

Jonas Lohdefink, Justin Fehrling, Marvin Klingner, Fabian Huger, Peter Schlicht, Nico M Schmidt, and Tim Fin- gscheidt. Self-supervised domain mismatch estimation for autonomous perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 334–335, 2020. 2

work page 2020

[51] [51]

Reliability in semantic seg- mentation: Can we use synthetic data? arXiv preprint arXiv:2312.09231, 2023

Thibaut Loiseau, Tuan-Hung Vu, Mickael Chen, Patrick P´erez, and Matthieu Cord. Reliability in semantic seg- mentation: Can we use synthetic data? arXiv preprint arXiv:2312.09231, 2023. 5, 7, 3, 4, 10, 11, 12

work page arXiv 2023

[52] [52]

Entropic out-of- distribution detection

David Mac ˆedo, Tsang Ing Ren, Cleber Zanchettin, Adri- ano LI Oliveira, and Teresa Ludermir. Entropic out-of- distribution detection. In 2021 international joint conference on neural networks (IJCNN) , pages 1–8. IEEE, 2021. 2, 6, 7, 8

work page 2021

[53] [53]

Kevin P. Murphy. Machine Learning: A Probabilistic Per- spective. MIT Press, Cambridge, MA, 2 edition, 2021. 4

work page 2021

[54] [54]

Rba: Segmenting unknown regions rejected by all

Nazir Nayal, Misra Yavuz, Joao F Henriques, and Fatma G¨uney. Rba: Segmenting unknown regions rejected by all. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 711–722, 2023. 1

work page 2023

[55] [55]

Oodis: Anomaly instance segmentation benchmark

Alexey Nekrasov, Rui Zhou, Miriam Ackermann, Alexan- der Hermans, Bastian Leibe, and Matthias Rottmann. Oodis: Anomaly instance segmentation benchmark. arXiv preprint arXiv:2406.11835, 2024. 2

work page arXiv 2024

[56] [56]

Road obstacle detection method based on an autoencoder with semantic segmentation

Toshiaki Ohgushi, Kenji Horiguchi, and Masao Yamanaka. Road obstacle detection method based on an autoencoder with semantic segmentation. In proceedings of the Asian conference on computer vision, 2020. 2

work page 2020

[57] [57]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 3, 5, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[58] [58]

Normalizing flows for probabilistic modeling and inference

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. 4

work page 2021

[59] [59]

Perceptiongpt: Effectively fusing visual perception into llm

Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 27124– 27133, 2024. 1

work page 2024

[60] [60]

Lost and found: detecting small road hazards for self-driving vehi- cles

Peter Pinggera, Sebastian Ramos, Stefan Gehrig, Uwe Franke, Carsten Rother, and Rudolf Mester. Lost and found: detecting small road hazards for self-driving vehi- cles. In 2016 IEEE/RSJ International Conference on Intel- ligent Robots and Systems (IROS), pages 1099–1106. IEEE,

work page 2016

[61] [61]

2, 5, 7, 1, 3, 4, 10, 11

work page

[62] [62]

Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazeb- nik. Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models. In Pro- ceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015. 4

work page 2015

[63] [63]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5, 2, 4, 10

work page 2021

[64] [64]

Run-time monitoring of machine learning for robotic percep- tion: A survey of emerging trends

Quazi Marufur Rahman, Peter Corke, and Feras Dayoub. Run-time monitoring of machine learning for robotic percep- tion: A survey of emerging trends. IEEE Access, 9:20067– 20075, 2021. 2

work page 2021

[65] [65]

Mask2anomaly: Mask transformer for uni- versal open-set segmentation

Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, and Carlo Masone. Mask2anomaly: Mask transformer for universal open-set segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1

[66]

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024. 10

[67]

Matthias Rottmann, Pascal Colling, Thomas Paul Hack, Robin Chan, Fabian Hüger, Peter Schlicht, and Hanno Gottschalk. Prediction error meta classification in semantic segmentation: Detection via aggregated dispersion measures of softmax probabilities. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020. 1

[68]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015. 4

[69]

SAE International. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (SAE J3016), 2021. Accessed: 4 March 2025. 1

[70]

Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126:973–992, 2018. 5, 6, 7

[71]

Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021. 3, 7, 1, 4, 12, 13

[72]

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. Advances in neural information processing systems, 33:11539–11551, 2020. 2

[73]

Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. arXiv preprint arXiv:2103.12051, 2021. 2

[74]

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 4

[75]

Dai Shi. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17773–17783, 2024. 1

[76]

Andrea Stocco, Michael Weiss, Marco Calzana, and Paolo Tonella. Misbehaviour prediction for autonomous driving systems. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 359–371,

[77]

Yiyou Sun and Yixuan Li. Dice: Leveraging sparsification for out-of-distribution detection. In European Conference on Computer Vision, pages 691–708. Springer, 2022. 6, 7, 8

[78]

Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International Conference on Machine Learning, pages 20827–20840. PMLR, 2022. 2, 6, 7, 8

[79]

Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 1743–1751. IEEE, 2019. 5, 6, 7, 8

[80]

Tomáš Vojíř and Jiří Matas. Image-consistent detection of road anomalies as unpredictable patches. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5491–5500, 2023. 1
