
arxiv: 2606.26734 · v2 · pith:5H6II7DQnew · submitted 2026-06-25 · 💻 cs.CV · cs.AI

Robust Onion: Peeling Open Vocab Object Detectors Under Noise

Pith reviewed 2026-06-30 00:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords open vocabulary object detection · robustness to noise · feature collapse · vision backbones · synthetic degradations · OV-OD

The pith

Open-vocabulary object detectors show comparable noise robustness when they share the same vision backbone because of feature collapse at matching layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study applies controlled synthetic visual degradations to open-vocabulary object detectors and examines their internal layers to track where and how performance drops under noise. It establishes that backbone architecture drives most of the robustness behavior while pretraining choices and caption supervision add little. The same backbone produces similar collapse patterns across models, and this image-domain effect explains why robustness looks consistent on COCO and LVIS yet appears inflated on datasets with large isolated objects. These observations are then used to build a lightweight plug-and-play method that raises real-world performance on BDD100K, WiderFace, and VisDRONE while training far fewer parameters than full end-to-end retraining.

Core claim

Models with similar vision backbones exhibit comparable robustness to visual noise, driven by similar feature collapse at similar layers, while robustness is primarily governed by the image domain rather than annotations or other training factors.

What carries the argument

Layer-by-layer peeling of OV-ODs under synthetic visual degradations to locate feature collapse points.

If this is right

  • Models sharing a vision backbone will display matching robustness profiles and collapse layers under noise.
  • Annotation differences contribute little to robustness gaps across detectors.
  • Datasets containing large isolated objects produce an impression of higher robustness than datasets with varied object scales.
  • A lightweight NN & TK0 plug-and-play module can raise real-world robustness using 96 times fewer trainable parameters than full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Efforts to strengthen vision backbones may improve noise robustness across many different open-vocabulary detectors at once.
  • The synthetic degradation peeling technique could be reused to diagnose robustness in other vision tasks such as segmentation or captioning.
  • The plug-and-play adaptation method points toward efficient ways to upgrade existing detectors without full retraining on new domains.

Load-bearing premise

Controlled synthetic visual degradations sufficiently represent the distribution and effects of real-world noise on detector performance and internal features.

What would settle it

Measure whether backbone similarity still predicts robustness levels when the same models are tested on uncontrolled real-world noisy images from BDD100K or VisDRONE instead of synthetic degradations.

Figures

Figures reproduced from arXiv: 2606.26734 by Aaditya Baranwal, Mukilan Karuppasamy, Priyank Pathak, Shruti Vyas, Yogesh S Rawat.

Figure 1: Effect of Noise: GLIP [28] (top) & MM-GDINO [69] (bottom) performance on COCO [30] for noises like turbulence, pixelation, and motion blur.
Figure 2: Real-World & Synthetic: For GLIP-T, the feature collapse of synthetic noisy COCO images against clean images aligns with the real-world collapse on BDD-100K. BDD-100K doesn't have explicit clean images, hence all noisy categories are in blue, except the highlighted one. 1 Introduction: Vision Language Models (VLMs) have shown strong generalization in tasks like image-to-text retrieval [43,48], open-vocabulary classification [1], image ca…
Figure 3: Models vs. pixelation: Evaluation on COCO (mAP), Shade ∝ robustness. Fine-tuned (COCO, LVIS, RefCOCO) in bold. Datasets: Robustness is evaluated on 3 benchmarks: COCO [30] (val2017), LVIS [20] (miniVal) and ODinW-13 [28] (set of 13 datasets). COCO (80 categories) and LVIS (1,203 categories) have the same images, but different annotations. Language analysis uses RefCOCO/+/g [25, 33, 64], and Flickr30k [45]. M…
Figure 4: OV-OD Overview: Architecture (black), may include additional components and losses. Fusion of text features with multi-scale vision features via self-attention cross-exchanges text-vision modality information. The role of each component in robustness against noise is described in the listed sections. The vision feature enhancer (neck) is commonly referred to as FPN / pixel decoder. Image modified from GLIP [28]…
Figure 5: (Left) Performance decline w/Severity: Models start dropping performance around severity 3. Shade shows all model variants, with solid indicating mean accuracy. (Right) Accuracy and Robustness linearity: Accuracy (zero-shot) forms an approximately linear relationship with robustness, preserving the relative ranking of models. … also use Pixelation (e.g. compression, distant objects) for severity (intensity) analysis. P…
Figure 6: (Left) Robustness vs Size: Larger models are more robust (+ve correlation with size); GLEE transformers > GLEE ResNet robustness. (Right) Similar vision backbones: Performance remains relatively consistent across models with similar backbones and depths. EVA-02 (303M) ≃ Swin-L (195M) robustness, both 24-blocks. 4 Analysis and Explainability (How, Where, Why): In our analysis visuals, y-axis will represent…
Figure 7: (Left) Zero-shot Pretraining: GLIP (green) consistent robustness regardless of pretraining size (no clear correlation). (Right) Fine-tuning Impact: Identical impact of COCO and LVIS finetuning on COCO with synthetic noise evaluation. … boost [10]. Interestingly, COCO and LVIS (same images, different annotations) have a similar impact on robustness, i.e. impact of fine-tuning is mostly governed by domain of …
Figure 8: Pixelation UMAP: Vision backbone, enhancer, and language fused features shown for sev 5 (dark) and sev 0 (lighter). The n-th layer's features are represented by ‘#n’, e.g. layer #4 is the 24th block in Swin-L & EVA-02, and the 12th in Swin-T (blocks in each layer shown on left). Feature overlap between sev 5 and sev 0 (similar) implies robustness. … similar feature collapse. This helps explain why similar-depth backbones ex…
Figure 9: (a) Object size: Larger objects are more robust. (b) Object Count: Models are very robust when they have to detect <3 objects. The jump in accuracy beyond 25 objects is likely due to too few samples in that range for averaging to smooth out. (c) Occlusion: Robustness is largely unaffected by the degree of overlap between objects. All experiments on COCO for pixelation. Other noises in Supplementary. GLIP C…
Figure 10: (Left) Dataset Dependency: ODinW-13 is more immune to noise than COCO & LVIS (both with comparable robustness) on sev 4 pixelation. (Right) Class-wise robustness: Certain COCO classes (grouped by bins of log frequency) are more robust (shade of blue) for GLIP-T. Moderate correlation with object size (dot size). … objects likely stems from the small sample size in that bin
Figure 11: (Left) Train Captions: REC fine-tuned FIBER evaluated on COCO. Despite training on captions with different degrees of expressiveness (RefCOCOg is most descriptive), robustness varies slightly, indicating limited impact on robustness. (Right) Prompt Engineering: Evaluation on Flickr30k with test captions modified with textual context of pixelation (light) has minimal impact on robustness of GLIP variants…
Figure 13: Empirical Dataset Bias: Models detect small # objects despite heavy occlusion (a, b); fail for large # objects (c, d). FIBER-B on pixelated ODinW-13. … bias for ODinW-13 (pixelated) in
Figure 1: Progressive Pixelation: GLIP [31] (top) and MM-GDINO [75] (bottom); performance degrades on COCO image (888×924) from left (clean) to right (pixelated) via downsampling by 1/2. Severity 0: (H, W); Severity 1: (H/2, W/2); Severity 2: (H/4, W/4); Severity 3: (H/8, W/8); Severity 4: (H/16, W/16); Severity 5: (H/32, W/32).
Figure 2: Models progressive degradation with pixelation.
Figure 3: Visualizations for synthetic noise. Samples of noise perturbations (detection label: Person; models: GLIP, MM-GDINO).
Figure 4: Sample detection results on real-world images collected from the
Figure 5: Real World BDD-100K all Feature Collapses. Panels: Partial Feature Collapse (overlapping ‘lumps’): JPG Compression, Snow, Rain, Fog; Minimum Feature Collapse: ISO - Blur, Motion Blur, Focus / Gaussian Blur, Turbulence.
Figure 6: Synthetic Noises mimicking Real World all Feature Collapses. … Contains 5,000 images with 80 object categories. The validation set includes approximately 36,781 object instances. LVIS [20, 26] (MiniVal): A long-tail detection dataset comprising 1,203 object categories. The MiniVal set contains 5,000 images with about 62,397 object instances. ODinW-13 [31]: A collection of 13 small out-of-distribution datase…
Figure 8: Accuracy, Robustness linear relationship
Figure 10: Similar Backbone Robustness. Extension of
Figure 11: All noises for all backbones. Same details as those of
Figure 12: Robustness vs Dataset Size for all severity on pixelation
Figure 13: Pixelation Features t-SNE: Same details as those of
Figure 14: Features UMAP: Same details as those of
Figure 15: Features t-SNE: Same details as those of
Figure 16: Robustness vs Object Size: All perturbations, extension of
Figure 17: Robustness vs Object Size: Pixelation at sev 3 & 5, extension of
Figure 18: Robustness vs num of objects/image for all noise perturbations
Figure 19: Robustness vs num of objects/image at sev 4 and 5
Figure 20: Robustness vs occlusion: Real world perturbation on COCO, like
Figure 21: Robustness vs occlusion: Pixelation for Sev 4 & 5, like
Figure 22: Robustness of dataset under real world perturbations
Figure 23: Robustness vs categories for ODinW-13 at sev 3
Figure 24: (a) shows the superclass prompting performance with the motion blur perturbation; (b) shows the superclass prompting performance with the turbulence perturbation. This follows the same trend as the pixelation perturbation, where superclass/fine-grained prompting doesn't change the performance. Object Size Distribution: As shown in Fig. 25b & Fig. 25a, LVIS has approximately 60% of objects below 32² p…
Figure 25: (a) LVIS shows a higher proportion of small objects compared to COCO, contributing to its greater vulnerability to resolution degradation. On the other hand, ODinW-13 has much larger objects. (b) Small objects are most common in the LVIS dataset and least common in the ODinW-13 dataset. (c) Occlusion patterns reveal denser objects per image in LVIS, lowering the detection performance overall. (d) Shows …
Figure 26: TKO: Trainable prompts added at every frozen layer of the transformer. … ever, validating our analysis in a noise-agnostic setting prevents us from significantly improving robustness. For example, Foggy Cityscapes would substantially benefit from fog-based pretraining. For the language-based analysis, retraining detectors with noise-aware captions, e.g., “car on a foggy road”, would help solidify our findings of …
Figure 27: NN: Non-local block or 1-headed self-attention cross-exchanges features across layers, making every layer aware of one another. Different layers (shallow and deeper) are shown in different colors. … playing a minimal role in robustness. However, such noise-based caption datasets do not exist in research. For fine-tuned detectors analysis, we don't really know if they are able to detect an object because they…
Original abstract

The impact of real-world noise on Open Vocabulary Object Detectors (OV-ODs) remains poorly understood due to their architectural complexity. We present our comprehensive analysis Robust Onion, an empirical study that uses controlled synthetic visual degradations to peel OV-ODs layer-by-layer, revealing how, why, and where robustness degrades, systematically analyzing feature collapse. Our findings reveal that models with similar vision backbones exhibit comparable robustness, driven by similar feature collapse at similar layers, while factors such as pretraining strategy, architectural nuances, and caption supervision contribute little. Robustness is primarily governed by the image domain rather than annotations, explaining the similar robustness impact on COCO and LVIS, and why datasets like ODinW-13 can give an impression of inflated robustness due to large, isolated objects. Finally, we validate our insights by improving robustness on real-world BDD100K, WiderFace, and VisDRONE via our lightweight plug-and-play NN & TK0 approach, using 96x fewer trainable parameters than end-to-end training. We also explain the prior works' robustness observations.
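
The pixelation degradation used for the severity analysis can be illustrated with a minimal sketch. The supplementary figure caption describes severity s as repeated downsampling by 1/2, from (H, W) at severity 0 to (H/32, W/32) at severity 5. The block-average downsampling, the nearest-neighbour upsampling back to the original size, the edge padding, and the function name `pixelate` are assumptions made here for illustration; the paper's exact resampling procedure is not stated on this page.

```python
import numpy as np


def pixelate(image: np.ndarray, severity: int) -> np.ndarray:
    """Pixelate an (H, W) or (H, W, C) image at severity 0..5.

    Assumed procedure (not confirmed by the source): average over
    2**severity x 2**severity blocks, then repeat each block value so the
    output keeps the input's height and width.
    """
    if not 0 <= severity <= 5:
        raise ValueError("severity must be in 0..5")
    factor = 2 ** severity
    if factor == 1:
        return image.copy()

    h, w = image.shape[:2]
    # Pad by edge replication so both sides are multiples of the factor.
    pad_h, pad_w = (-h) % factor, (-w) % factor
    pad_spec = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (image.ndim - 2)
    padded = np.pad(image, pad_spec, mode="edge")

    ph, pw = padded.shape[:2]
    # Split into factor x factor blocks and average each block.
    blocks = padded.reshape(ph // factor, factor, pw // factor, factor,
                            *padded.shape[2:])
    low_res = blocks.mean(axis=(1, 3))

    # Nearest-neighbour upsampling back to the padded size, then crop.
    up = np.repeat(np.repeat(low_res, factor, axis=0), factor, axis=1)
    return up[:h, :w].astype(image.dtype)
```

Feeding the same image through `pixelate(image, s)` for s = 0..5 yields the kind of progressive degradation series the figures describe, which can then be passed to a detector to trace where performance starts to drop.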

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Robust Onion, an empirical study of open-vocabulary object detectors (OV-ODs) under controlled synthetic visual degradations. By peeling models layer-by-layer, it identifies feature collapse locations and concludes that robustness is governed primarily by the image domain and vision backbone (similar backbones exhibit comparable collapse at similar layers) rather than annotations, pretraining strategy, or caption supervision. This explains comparable degradation on COCO and LVIS and inflated robustness impressions on ODinW-13. The authors validate the insights by transferring a lightweight plug-and-play NN & TK0 fix (96x fewer trainable parameters than end-to-end training) to improve performance on real-world datasets BDD100K, WiderFace, and VisDRONE, while also explaining prior robustness observations.

Significance. If the central empirical claims hold, the work supplies a useful diagnostic framework for locating robustness failures in OV-ODs via layer-wise feature collapse analysis and demonstrates a practical, parameter-efficient intervention that transfers across domains. Credit is due for the multi-dataset validation (synthetic peeling plus real-world transfer) and for attempting to unify prior observations under a backbone/domain-centric account. The study is purely empirical with no derivations or parameter fitting, which avoids circularity but places the full burden on the quality and representativeness of the synthetic degradations.

major comments (2)
  1. [§4 (synthetic peeling) and §5 (real-world validation)] The load-bearing claim that 'robustness is primarily governed by the image domain rather than annotations' (abstract and §5) rests on the synthetic peeling experiments showing backbone-similar collapse loci. However, the manuscript does not report a direct comparison of activation statistics (e.g., layer-wise variance, cosine similarity to clean features, or collapse thresholds) between the controlled synthetic degradations and the actual noise distributions in BDD100K/WiderFace/VisDRONE. Without this, the domain-governance conclusion does not necessarily follow from the layer-wise analysis.
  2. [Table 2 and Figure 4] Table 2 / Figure 4 (layer-wise robustness curves): the reported similarity in collapse layers across backbones is central to the 'similar vision backbones exhibit comparable robustness' claim, yet no quantitative measure of collapse (e.g., the exact layer index where mean activation norm drops below a stated threshold, or statistical test across runs) is provided. This makes it impossible to evaluate whether the loci are truly 'similar' or merely qualitatively aligned.
minor comments (2)
  1. [Abstract] The abstract states conclusions without any quantitative results, error bars, dataset sizes, or ablation counts; moving at least one key quantitative finding (e.g., parameter count ratio or mAP delta on BDD100K) into the abstract would improve readability.
  2. [§3.3] Notation for the NN & TK0 components is introduced without an explicit equation or pseudocode block; a short definition of what 'TK0' modifies (e.g., which layers or tokens) would clarify the plug-and-play claim.
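
As a rough illustration of the kind of definition minor comment 2 asks for, the sketch below implements single-head self-attention over a stack of per-layer features, following the figure caption's description of NN as a "non-local block or 1-headed self-attention" that exchanges features across layers. It is not the authors' implementation: the pooling of each layer to one vector, the projection shapes, the residual connection, and the name `cross_layer_attention` are all assumptions made for this example.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_layer_attention(layer_feats, w_q, w_k, w_v):
    """Single-head self-attention where each "token" is one backbone layer.

    layer_feats: (n_layers, dim) array, one pooled feature vector per layer
                 (pooling to a single vector is an assumption).
    w_q, w_k:    (dim, d_attn) projection matrices.
    w_v:         (dim, dim) projection so the residual sum is well defined.
    Returns the updated features and the (n_layers, n_layers) attention map.
    """
    q = layer_feats @ w_q
    k = layer_feats @ w_k
    v = layer_feats @ w_v
    # Scaled dot-product scores between every pair of layers.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (assumed, as in a standard non-local block).
    return layer_feats + attn @ v, attn
```

In a plug-and-play setting only `w_q`, `w_k`, and `w_v` would be trained while the backbone stays frozen, which is one way a small trainable-parameter budget could arise; the actual parameterization in the paper may differ.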

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we respond point-by-point to the major comments, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [§4 (synthetic peeling) and §5 (real-world validation)] The load-bearing claim that 'robustness is primarily governed by the image domain rather than annotations' (abstract and §5) rests on the synthetic peeling experiments showing backbone-similar collapse loci. However, the manuscript does not report a direct comparison of activation statistics (e.g., layer-wise variance, cosine similarity to clean features, or collapse thresholds) between the controlled synthetic degradations and the actual noise distributions in BDD100K/WiderFace/VisDRONE. Without this, the domain-governance conclusion does not necessarily follow from the layer-wise analysis.

    Authors: The domain-governance conclusion follows from two observations reported in the manuscript: (1) models sharing the same vision backbone exhibit nearly identical layer-wise collapse patterns under the controlled synthetic degradations, independent of annotation source or pretraining details, and (2) the lightweight NN & TK0 intervention derived from those patterns transfers to and improves performance on the real-world datasets BDD100K, WiderFace, and VisDRONE. While we did not include an explicit side-by-side comparison of activation statistics between synthetic and real noise distributions, the successful transfer provides empirical support that the synthetic regime captures the relevant robustness factors. We will add a short clarifying paragraph in §5 of the revision explaining this inference chain; this is a partial revision. revision: partial

  2. Referee: [Table 2 and Figure 4] Table 2 / Figure 4 (layer-wise robustness curves): the reported similarity in collapse layers across backbones is central to the 'similar vision backbones exhibit comparable robustness' claim, yet no quantitative measure of collapse (e.g., the exact layer index where mean activation norm drops below a stated threshold, or statistical test across runs) is provided. This makes it impossible to evaluate whether the loci are truly 'similar' or merely qualitatively aligned.

    Authors: We agree that an explicit quantitative definition would strengthen the claim. In the revised manuscript we will define collapse as the first layer where the mean activation norm (normalized to the clean-feature norm) falls below a fixed threshold of 0.5 and will tabulate the corresponding layer indices for each backbone alongside the existing curves. Where multiple random seeds were run we will also report the standard deviation of these indices. This constitutes a full revision of the presentation of Table 2 / Figure 4. revision: yes
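
The two quantities discussed in this exchange can be sketched in a few lines: the norm-ratio collapse criterion proposed in the second response (first layer whose mean activation norm, normalized to the clean-feature norm, falls below 0.5) and the cosine similarity to clean features mentioned in the first referee comment. This is a minimal sketch under assumptions: features are taken as one `(n_samples, dim)` array per layer, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np


def layer_stats(clean_feats, noisy_feats):
    """Per-layer statistics comparing noisy features to clean ones.

    clean_feats, noisy_feats: lists with one (n_samples, dim) array per layer,
    extracted from the same images without and with degradation.
    Returns (norm_ratios, mean_cosines), one value per layer.
    """
    norm_ratios, mean_cosines = [], []
    for clean, noisy in zip(clean_feats, noisy_feats):
        clean_norm = np.linalg.norm(clean, axis=1)
        noisy_norm = np.linalg.norm(noisy, axis=1)
        # Mean activation norm under noise, normalized to the clean norm.
        norm_ratios.append(float(noisy_norm.mean() / clean_norm.mean()))
        # Cosine similarity between each noisy feature and its clean pair.
        cos = np.sum(clean * noisy, axis=1) / (clean_norm * noisy_norm + 1e-12)
        mean_cosines.append(float(cos.mean()))
    return norm_ratios, mean_cosines


def collapse_layer(norm_ratios, threshold=0.5):
    """Index of the first layer whose norm ratio is below the threshold,
    or None if no layer collapses."""
    for index, ratio in enumerate(norm_ratios):
        if ratio < threshold:
            return index
    return None
```

Running `layer_stats` once on synthetic degradations and once on real noisy images from the same backbone would give the side-by-side comparison the referee requests, and tabulating `collapse_layer` per backbone would give the layer indices promised in the revision.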

Circularity Check

0 steps flagged

No circularity: purely empirical analysis with independent experimental validation

full rationale

This is a purely empirical study that applies controlled synthetic degradations to peel OV-OD layers and measures feature collapse across backbones. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains are load-bearing for the central claims. Robustness observations on COCO/LVIS are directly compared to real-world transfers on BDD100K/WiderFace/VisDRONE, and the backbone-dominance conclusion follows from layer-wise measurements rather than any definitional reduction or ansatz smuggled via prior work. The paper is self-contained against external benchmarks and exhibits no patterns from the enumerated circularity kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or new entities are introduced; the work is an empirical measurement study.

pith-pipeline@v0.9.1-grok · 5739 in / 1017 out tokens · 33176 ms · 2026-06-30T00:39:33.492215+00:00 · methodology


Reference graph

Works this paper leans on

85 extracted references · 27 canonical work pages · 7 internal anchors

  1. [1]

    Abdelhamed, A., Afifi, M., Go, A.: What do you see? enhancing zero-shot image classification with multimodal large language models (2025), https://arxiv.org/abs/2405.15668

  2. [2]

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)

  3. [3]

    Bao, W., Deng, R., He, J.: Mint: A simple test-time adaptation of vision-language models against common corruptions. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=yJpBVE4vfo

  4. [4]

    Baranwal, A., Mueez, A., Voelker, J., Bhatia, G., Vyas, S.: Synspill: Improved industrial spill detection with synthetic data. In: 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 1425–1434. IEEE (Oct 2025). https://doi.org/10.1109/ICCVW69036.2025.00152

  5. [5]

    Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10231–10241 (October 2021)

  6. [6]

    Bianchi, L., Carrara, F., Messina, N., Gennaro, C., Falchi, F.: The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22520–22529 (2024)

  7. [7]

    Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)

  8. [8]

    Chai, J.C.L., Ng, T.S., Low, C.Y., Park, J., Teoh, A.B.J.: Recognizability embedding enhancement for very low-resolution face recognition and quality estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9957–9967 (2023)

  9. [9]

    Chen, J., Guo, H., Yi, K., Li, B., Elhoseiny, M.: Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18030–18040 (2022)

  10. [10]

    Chen, W.T., Vong, Y.J., Kuo, S.Y., Ma, S., Wang, J.: Robustsam: Segment anything robustly on degraded images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4081–4091 (June 2024)

  11. [11]

    Cheng, K., Song, W., Fan, J., Ma, Z., Sun, Q., Xu, F., Yan, C., Chen, N., Zhang, J., Chen, J.: Caparena: Benchmarking and analyzing detailed image captioning in the llm era (2025), https://arxiv.org/abs/2503.12329

  12. [12]

    Cheng, Z., Zhu, X., Gong, S.: Low-resolution face recognition. In: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14. pp. 605–621. Springer (2019)

  13. [13]

    Chhipa, P.C., De, K., Chippa, M.S., Saini, R., Liwicki, M.: Open-vocabulary object detectors: Robustness challenges under distribution shifts. In: European Conference on Computer Vision. pp. 62–79. Springer (2024)

  14. [14]

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  15. [15]

    Davila, D., Du, D., Lewis, B., Funk, C., Van Pelt, J., Collins, R., Corona, K., Brown, M., McCloskey, S., Hoogs, A., et al.: Mevid: Multi-view extended videos with identities for video person re-identification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1634–1643 (2023)

  16. [16]

    Deng, J., Zhang, H., Ding, K., Hu, J., Zhang, X., Wang, Y.: Zero-shot generalizable incremental learning for vision-language object detection (2024), https://arxiv.org/abs/2403.01680

  17. [17]

    Dou, Z.Y., Kamath, A., Gan, Z., Zhang, P., Wang, J., Li, L., Liu, Z., Liu, C., LeCun, Y., Peng, N., Gao, J., Wang, L.: Coarse-to-fine vision-language pre-training with fusion in the backbone. In: NeurIPS (2022)

  18. [18]

    Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., Tian, Q.: Uavdt dataset (2018), https://sites.google.com/view/grli-uavdt/%E9%A6%96%E9%A1%B5

  19. [19]

    Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)

  20. [20]

    Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)

  21. [22]

    Gupta, H., Kotlyar, O., Andreasson, H., Lilienthal, A.J.: Robust object detection in challenging weather conditions. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 7508–7517 (2024). https://doi.org/10.1109/WACV57701.2024.00735

  22. [23]

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015), https://arxiv.org/abs/1512.03385

  23. [24]

    He, W., Deng, Y., Tang, S., Chen, Q., Xie, Q., Wang, Y., Bai, L., Zhu, F., Zhao, R., Ouyang, W., et al.: Instruct-reid: A multi-purpose person re-identification task with instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17521–17531 (2024)

  24. [25]

    Huynh, N.D., Bouadjenek, M.R., Aryal, S., Razzak, I., Hacid, H.: Visual question answering: from early developments to recent advances – a survey (2025), https://arxiv.org/abs/2501.03939

  25. [26]

    Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: Mdetr-modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1780–1790 (2021)

  26. [27]

    Kazemzadeh, S., Ordonez, V., andre Matten, M., Berg, T.L.: Referitgame: Referring to objects in photographs of natural scenes. In: Conference on Empirical Methods in Natural Language Processing (2014), https://api.semanticscholar.org/CorpusID:6308361

  27. [28]

    Kenk, M.A., Hassaballah, M.: Dawn: vehicle detection in adverse weather nature dataset. arXiv preprint arXiv:2008.05402 (2020)

  28. [29]

    Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Veit, A., et al.: Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2(3), 18 (2017)

  29. [30]

    Li, C., Chen, X., Zhao, K., Zhu, J., Chen, J.: Zero-shot quantization for object detection (2025), https://openreview.net/forum?id=XNr6sexQGj

  30. [31]

    Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10965–10975 (2022)

  31. [32]

    Li, P., Prieto, L., Mery, D., Flynn, P.J.: On low-resolution face recognition in the wild: Comparisons and new techniques. IEEE Transactions on Information Forensics and Security 14(8), 2000–2012 (2019). https://doi.org/10.1109/TIFS.2018.2890812

  32. [33]

    Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context (2015)

  33. [34]

    Ling, X., Lu, Y., Xu, W., Deng, W., Zhang, Y., Cui, X., Shi, H., Wen, D.: Dive into the resolution augmentations and metrics in low resolution face recognition: A plain yet effective new baseline. arXiv preprint arXiv:2302.05621 (2023)

  34. [35]

    Liu, J., Wang, Z., Ma, L., Fang, C., Bai, T., Zhang, X., Liu, J., Chen, Z.: Benchmarking object detection robustness against real-world corruptions. Int. J. Comput. Vision 132(10), 4398–4416 (May 2024). https://doi.org/10.1007/s11263-024-02096-6

  35. [36]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows (2021), https://arxiv.org/abs/2103.14030

  36. [37]

    Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016)

  37. [38]

    Mao, X., Chen, Y., Zhu, Y., Chen, D., Su, H., Zhang, R., Xue, H.: Coco-o: A benchmark for object detectors under natural distribution shifts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6339–6350 (2023)

  38. [39]

    Mao, X., Chen, Y., Zhu, Y., Chen, D., Su, H., Zhang, R., Xue, H.: Coco-o: A benchmark for object detectors under natural distribution shifts (2023), https://arxiv.org/abs/2307.12730

  39. [40]

    Mao, Z., Chimitt, N., Chan, S.H.: Accelerating atmospheric turbulence simulation via learned phase-to-space transform. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2021)

  40. [41]

    McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)

  41. [42]

    Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., Houlsby, N.: Simple open-vocabulary object detection with vision transformers (2022), https://arxiv.org/abs/2205.06230

  42. [43]

    Minderer, M., Gritsenko, A.A., Houlsby, N.: Scaling open-vocabulary object detection. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=mQPNcBWjGc

  43. [44]

    Mishra, M., Stallone, M., Zhang, G., Shen, Y., Prasad, A., Soria, A.M., Merler, M., Selvam, P., Surendran, S., Singh, S., et al.: Granite code models: A family of open foundation models for code intelligence. arXiv preprint arXiv:2405.04324 (2024)

  44. [45]

    Pathak, P., Marjit, S., Vyas, S., Rawat, Y.S.: LR0.FM: Low-Res Benchmark and Improving robustness for Zero-Shot Classification in Foundation Models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=AsFxRSLtqR

  45. [46]

    Pathak, P., Rawat, Y.S.: Coarse attribute prediction with task agnostic distillation for real world clothes changing reid. In: 36th British Machine Vision Conference 2025, BMVC 2025, Sheffield, UK, November 24-27, 2025. BMVA Press (2025)

  46. [47]

    Pathak, P., Rawat, Y.S.: Colors see colors ignore: Clothes changing reid with color disentanglement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16797–16807 (October 2025)

  47. [48]

    Patil, J.S., Pawase, R.S., Dandawate, Y.H.: Classification of low resolution astronomical images using convolutional neural networks. In: 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT). pp. 1168–1172. IEEE (2017)

  48. [49]

    Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 2641–2649 (2015). https://doi.org/10.1109/ICCV.2015.303

  49. [50]

    Qin, Q., Chang, K., Huang, M., Li, G.: Denet: Detection-driven enhancement network for object detection under adverse weather conditions. In: Proceedings of the Asian Conference on Computer Vision. pp. 2813–2829 (2022)

  50. [51]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)

  51. [52]

    Saha, O., Horn, G.V., Maji, S.: Improved zero-shot classification by adapting vlms with text descriptions (2024), https://arxiv.org/abs/2401.02460

  52. [53]

    Sakaridis, C., Dai, D., Van Gool, L.: Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126(9), 973–992 (Sep 2018), https://doi.org/10.1007/s11263-018-1072-8

  53. [54]

    Schiappa, M.C., Azad, S., Vs, S., Ge, Y., Miksik, O., Rawat, Y.S., Vineet, V.: Robustness analysis on foundational segmentation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1786–1796 (2024)

  54. [55]

    Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8430–8439 (2019)

  55. [56]

    Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018)

  56. [57]

    Shen, H., Zhao, T., Zhu, M., Yin, J.: Groundvlp: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4766–4775 (2024)

  57. [58]

    Shermeyer, J., Van Etten, A.: The effects of super-resolution on object detection performance in satellite imagery. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (June 2019)

  58. [59]

    Tian, X., Gu, J., Li, B., Liu, Y., Hu, C., Wang, Y., Zhan, K., Jia, P., Lang, X., Zhao, H.: Drivevlm: The convergence of autonomous driving and large vision-language models. ArXiv abs/2402.12289 (2024), https://api.semanticscholar.org/CorpusID:267750682

  59. [60]

    Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34, 200–212 (2021)

  60. [61]

    Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7794–7803 (2018)

  61. [62]

    Wu, J., Jiang, Y., Liu, Q., Yuan, Z., Bai, X., Bai, S.: General object foundation model for images and videos at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3783–3795 (June 2024)

  62. [63]

    Xue, X., Wei, G., Chen, H., Zhang, H., Lin, F., Shen, C., Zhu, X.X.: Reo-vlm: Transforming vlm to meet regression challenges in earth observation. arXiv preprint arXiv:2412.16583 (2024)

  63. [64]

    Yamada, Y., Otani, M.: Does robustness on imagenet transfer to downstream tasks? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9215–9224 (June 2022)

  64. [65]

    Yang, S., Luo, P., Loy, C.C., Tang, X.: Wider face: A face detection benchmark. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  65. [66]

    Yao, L., Pi, R., Han, J., Liang, X., Xu, H., Zhang, W., Li, Z., Xu, D.: Detclipv3: Towards versatile generative open-vocabulary object detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5610–5619 (2024), https://api.semanticscholar.org/CorpusID:269148793

  66. [67]

    Yoo, J., Lee, D., Chung, I., Kim, D., Kwak, N.: What how and when should object detectors update in continually changing test domains? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23354–23363 (2024)

  67. [68]

    Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

  68. [69]

    Yu, H., Yi, S., Niu, K., Zhuo, M., Li, B.: Umit: Unifying medical imaging tasks via vision-language models. arXiv preprint arXiv:2503.15892 (2025)

  69. [70]

    Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. pp. 69–85. Springer (2016)

  70. [71]

    Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14393–14402 (2021)

  71. [72]

    Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A survey (2024), https://arxiv.org/abs/2304.00685

  72. [73]

    Zhang, Z., Gong, H., Feng, Y., Chu, Z., Liu, H.: Enhancing object detection in adverse weather conditions through entropy and guided multimodal fusion. In: Proceedings of the Asian Conference on Computer Vision. pp. 2922–2938 (2024)

  73. [74]

    Zhao, S., Schulter, S., Zhao, L., Zhang, Z., Suh, Y., Chandraker, M., Metaxas, D.N., et al.: Taming self-training for open-vocabulary object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13938–13947 (2024)

  74. [75]

    Zhao, X., Chen, Y., Xu, S., Li, X., Wang, X., Li, Y., Huang, H.: An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361 (2024)

  75. [76]

    Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: Regionclip: Region-based language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16793–16803 (2022)

  76. [77]

    Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., Alvarez, J.M.: Understanding the robustness in vision transformers. In: International conference on machine learning. pp. 27378–27394. PMLR (2022)

  77. [78]

    Zhu, P., Wen, L., Du, D., Bian, X., Fan, H., Hu, Q., Ling, H.: Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11), 7380–7399 (2021)

    Robust Onion: Peeling Open Vocab Object Detectors Under Noise
    Supplementary
    Priyank Pathak*, Mukilan Karuppasamy*, Aaditya Baranwal, Shruti Vyas, and Yogesh S ...

  78. [79]

    Section 1 highlights some RGB examples and predictions on various synthetic and real-world noise examples.

  79. [80]

    Section 2 and Section 3 give details of the datasets and models used in our analysis.

  80. [81]

    Section 4 presents variants of the analyses shown in the main submission, generalized to all severities, noises, and the LVIS dataset.
