SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3
The pith
SearchAD supplies the first large-scale benchmark for retrieving rare safety-critical driving scenes from massive datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SearchAD aggregates frames from eleven existing sources into a single benchmark of 423k images with high-quality manual annotations for 90 rare categories, complete with a held-out test set on a public server. The work shows that models directly aligning spatial visual features with language achieve the best zero-shot retrieval performance, that fine-tuning further raises accuracy, and that text-based methods consistently surpass image-based ones owing to stronger semantic grounding, while absolute retrieval quality remains unsatisfactory for practical use.
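As a concrete reading of the zero-shot setup that claim describes, the sketch below ranks frames against a text query with an OpenCLIP vision-language model; the checkpoint name, query string, and frame paths are illustrative assumptions, not details from the paper.

```python
# Minimal zero-shot text-to-image retrieval sketch with an OpenCLIP
# vision-language model. Checkpoint, query, and frame paths are illustrative.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

frame_paths = ["frame_000.jpg", "frame_001.jpg"]   # stand-ins for dataset frames
images = torch.stack([preprocess(Image.open(p).convert("RGB"))
                      for p in frame_paths])
text = tokenizer(["a deer on the road"])           # one rare-category query

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(text)
    # Cosine similarity after L2 normalisation, one score per frame.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (txt_feats @ img_feats.T).squeeze(0)

ranking = scores.argsort(descending=True)          # retrieval order for the query
```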
What carries the argument
The SearchAD dataset itself, which aggregates frames from multiple sources and supplies bounding-box annotations for rare classes to support semantic retrieval tasks and data-curation experiments.
If this is right
- Text-to-image retrieval methods can be directly compared against image-only baselines on a standardized AD-specific test set.
- Fine-tuning on the provided training split produces measurable gains over zero-shot performance for rare-class retrieval.
- The dataset enables systematic study of how retrieval can improve data curation for long-tail perception tasks.
- The public benchmark server allows ongoing evaluation of new multi-modal models against the held-out test set without releasing its annotations.
Where Pith is reading between the lines
- Curating training sets with SearchAD-retrieved samples could reduce the total data volume required to achieve reliable detection of rare events.
- The observed advantage of language-aligned features may transfer to other domains where rare events must be located in large unlabeled collections.
- Repeated use of the benchmark could expose whether current multi-modal architectures fundamentally struggle with extreme class imbalance.
Load-bearing premise
The ninety chosen rare categories and their manual annotations accurately reflect real-world safety-critical driving situations without selection bias in category choice or frame sampling.
What would settle it
A new model that achieves near-perfect retrieval of all ninety categories on the held-out test set would indicate that the needle-in-a-haystack problem has been solved, contradicting the paper's finding of unsatisfactory performance.
Original abstract
Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SearchAD, a large-scale rare image retrieval dataset for autonomous driving containing over 423k frames aggregated from 11 existing datasets. It provides manual annotations for more than 513k bounding boxes across 90 rare categories (some appearing fewer than 50 times), with defined splits supporting text-to-image, image-to-image retrieval, few-shot learning, and fine-tuning. Comprehensive evaluations indicate that text-based methods outperform image-based ones due to stronger semantic grounding, that models aligning spatial visual features with language perform best in zero-shot settings, and that a fine-tuning baseline improves results, though absolute retrieval performance remains unsatisfactory. The work releases the dataset and a public benchmark server with a held-out test set to support retrieval-driven data curation and long-tail perception research in AD.
Significance. If the annotations are reliable and representative, SearchAD could be a valuable contribution by establishing the first dedicated large-scale benchmark for the needle-in-a-haystack retrieval problem in safety-critical AD scenarios. The public server and emphasis on rare classes enable reproducible research on data curation for long-tail perception. The reported performance gap between text-based and image-based methods, if quantified rigorously, provides a concrete direction for multi-modal model development in this domain.
major comments (3)
- [Dataset Construction] Dataset Construction section: The manuscript repeatedly describes the annotations as 'high-quality manual annotations' for the 513k bounding boxes over 90 rare categories, yet provides no details on the annotation protocol, number of annotators, inter-annotator agreement, quality control procedures, or guidelines for handling ambiguous rare events. This directly undermines the central claim that SearchAD constitutes a reliable benchmark for long-tail perception and retrieval-driven curation, as the reported outperformance of text-based methods and the fine-tuning baseline rest entirely on these labels serving as trustworthy ground truth.
- [Evaluations] Evaluations section: The abstract and evaluation summary assert that 'text-based methods outperform image-based ones due to stronger inherent semantic grounding' and that 'our fine-tuning baseline significantly improves performance' while 'absolute retrieval capabilities remain unsatisfactory,' but no specific metrics (e.g., recall@K for K=1,5,10 or mAP), exact experimental setup, data split details, or comparison tables are referenced. Without these, the magnitude and statistical significance of the claimed gaps cannot be verified or reproduced (see the metric sketch after this list).
- [Introduction and Dataset] Introduction and Dataset sections: The selection of the 90 rare categories and the frame sampling strategy from the 11 source datasets are not described with sufficient rigor to exclude post-hoc selection bias or non-representative sampling of safety-critical scenarios. It remains unclear whether category rarity was determined via pre-annotation frequency analysis across the full corpus or whether frames were chosen to ensure diversity in driving conditions, both of which are load-bearing for the 'needle-in-a-haystack' framing and the dataset's claimed utility.
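For reference, the sketch below spells out the two metrics named in the second major comment, recall@K and (mean) average precision, on a toy ranked list; the relevance labels and the per-category query protocol are assumptions rather than the paper's reported setup.

```python
# Sketch of recall@K and average precision over one ranked retrieval list.
# Toy relevance labels; the paper's per-category query protocol is assumed.
import numpy as np

def recall_at_k(ranked_relevant: np.ndarray, n_relevant: int, k: int) -> float:
    """Fraction of all relevant items that appear in the top-k of the ranking."""
    return float(ranked_relevant[:k].sum()) / max(n_relevant, 1)

def average_precision(ranked_relevant: np.ndarray) -> float:
    """Mean of precision@rank taken at the ranks where relevant items occur."""
    hits = np.flatnonzero(ranked_relevant)
    if hits.size == 0:
        return 0.0
    return float((np.arange(1, hits.size + 1) / (hits + 1)).mean())

# One toy query: gallery of six frames, relevant items at ranks 1 and 4.
rel = np.array([True, False, False, True, False, False])
print(recall_at_k(rel, n_relevant=2, k=1))  # 0.5
print(recall_at_k(rel, n_relevant=2, k=5))  # 1.0
print(average_precision(rel))               # (1/1 + 2/4) / 2 = 0.75
# mAP is the mean of AP over all queries, e.g. one query per rare category.
```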
minor comments (3)
- The project page URL in the abstract should be verified to confirm that all data splits, annotations, and the benchmark server are publicly accessible as stated.
- [Related Work] Related Work section: Additional citations to recent long-tail detection and rare-event retrieval papers in AD would better situate the contribution relative to existing benchmarks.
- [Figures] Figures showing example rare categories would benefit from higher-resolution insets or zoomed views to more clearly illustrate the visual challenges of the retrieval task.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, and we will make revisions to improve the clarity and completeness of the paper.
Point-by-point responses
Referee: [Dataset Construction] The manuscript repeatedly describes the annotations as 'high-quality manual annotations' for the 513k bounding boxes over 90 rare categories, yet provides no details on the annotation protocol, number of annotators, inter-annotator agreement, quality control procedures, or guidelines for handling ambiguous rare events. This directly undermines the central claim that SearchAD constitutes a reliable benchmark for long-tail perception and retrieval-driven curation, as the reported outperformance of text-based methods and the fine-tuning baseline rest entirely on these labels serving as trustworthy ground truth.
Authors: We acknowledge that the current manuscript lacks sufficient detail on the annotation process. In the revised manuscript, we will add a subsection detailing the annotation protocol, including the use of multiple annotators, the guidelines for labeling rare categories, quality control steps such as cross-verification, and procedures for ambiguous cases. We will also include inter-annotator agreement metrics to validate the quality of the annotations. This will reinforce the trustworthiness of the ground truth for the benchmark tasks. revision: yes
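For illustration, one standard agreement statistic the promised revision could report is Cohen's kappa over per-box category labels; the metric choice and the toy labels below are assumptions, since the rebuttal does not name a specific statistic.

```python
# Sketch of Cohen's kappa over per-box category labels from two annotators.
# The metric choice and toy labels are assumptions; the rebuttal does not
# name a specific agreement statistic.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Observed agreement corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / n**2
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same five boxes with rare categories:
a = ["deer", "deer", "dog", "cow", "deer"]
b = ["deer", "dog",  "dog", "cow", "deer"]
print(cohens_kappa(a, b))  # 0.6875
```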
Referee: [Evaluations] The abstract and evaluation summary assert that 'text-based methods outperform image-based ones due to stronger inherent semantic grounding' and that 'our fine-tuning baseline significantly improves performance' while 'absolute retrieval capabilities remain unsatisfactory,' but no specific metrics (e.g., recall@K for K=1,5,10 or mAP), exact experimental setup, data split details, or comparison tables are referenced. Without these, the magnitude and statistical significance of the claimed gaps cannot be verified or reproduced.
Authors: The Evaluations section of the manuscript contains the specific metrics, including recall@K values and mAP, along with descriptions of the experimental setups and data splits. However, we agree that the abstract and summary do not sufficiently reference these. We will revise the abstract and the Evaluations section introduction to explicitly cite the relevant tables and metrics, ensuring that the claims about outperformance and improvements are directly linked to the reported results for better verifiability and reproducibility. revision: yes
Referee: [Introduction and Dataset] The selection of the 90 rare categories and the frame sampling strategy from the 11 source datasets are not described with sufficient rigor to exclude post-hoc selection bias or non-representative sampling of safety-critical scenarios. It remains unclear whether category rarity was determined via pre-annotation frequency analysis across the full corpus or whether frames were chosen to ensure diversity in driving conditions, both of which are load-bearing for the 'needle-in-a-haystack' framing and the dataset's claimed utility.
Authors: We will enhance the Introduction and Dataset sections to provide a more rigorous description of the category selection and sampling strategy. This will include explaining that the 90 categories were selected based on frequency analysis of the aggregated data from the 11 source datasets to identify those with low occurrence rates (fewer than 50 instances for some). The frame sampling was designed to capture a diverse set of driving conditions while focusing on rare events. We will detail the process to demonstrate it was systematic and to address concerns about selection bias. revision: yes
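A minimal sketch of the frequency analysis the response describes: count per-category instances across the aggregated sources and keep categories at or below a rarity threshold. The record format and helper name are assumptions; only the threshold of 50 instances echoes the abstract.

```python
# Sketch of the described frequency analysis: count per-category instances
# across the aggregated sources and keep categories at or below a rarity
# threshold. Record format and helper name are assumptions; the threshold
# of 50 echoes the abstract.
from collections import Counter

def rare_categories(records, max_count=50):
    """records: iterable of (category, source_dataset) annotation tuples."""
    counts = Counter(category for category, _ in records)
    return {cat: n for cat, n in counts.items() if n <= max_count}

toy = [("deer", "kitti"), ("deer", "bdd100k"), ("cow", "once")]
print(rare_categories(toy))  # {'deer': 2, 'cow': 1}; both rare in this toy set
```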
Circularity Check
Empirical dataset paper with no circular derivation chain
Full rationale
The paper introduces SearchAD as a new dataset with manual annotations and reports benchmark results from standard retrieval models (text-based vs image-based). No equations, fitted parameters, or derivations are present. Central claims rest on the released data and external model evaluations rather than reducing to self-defined quantities, self-citations, or ansatzes. This matches the default non-circular case for dataset/benchmark papers.
Axiom & Free-Parameter Ledger
No entries: the paper introduces no equations, fitted parameters, or derivations to track.