pith. machine review for the scientific record.

arxiv: 2604.10637 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords object detection · hazy environments · CLIP · language prompts · semantic enhancement · cross-entropy loss · synthetic hazy dataset

The pith

Language prompts from CLIP strengthen weakened object semantics in haze to improve detection without any image enhancement step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that hazy conditions weaken the semantic meaning of objects so much that standard detectors struggle, and that image enhancement often adds instability rather than fixing this. Instead of cleaning the image, the work uses CLIP's language embeddings to measure how much each object's meaning has degraded and then guides the detector's training with a specially weighted loss. Back-propagating this weighted loss directly boosts the detector's ability to recognize the degraded objects. The authors also add a fine-tuning step for the weights and release a large synthetic hazy dataset to support the experiments, reporting state-of-the-art detection results.

Core claim

By designing Approximation of Mutual Exclusion (AME) to generate credible per-object weights for a CLIP-guided Cross-Entropy Loss, the method lets backpropagation enhance the semantic features of objects whose meaning has been weakened by haze, producing a detector that outperforms enhancement-based approaches on both synthetic and real hazy scenes.

What carries the argument

The Approximation of Mutual Exclusion (AME) that supplies per-object weights for the CLIP-guided Cross-Entropy Loss, allowing the loss to focus training effort on the most semantically degraded objects.
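
To make the mechanism concrete, here is a minimal sketch of how such per-object weights could be computed from a frozen CLIP model, following the Figure 3 caption's description (crop each object, embed it, compare against mutually exclusive class prompts, and read negative similarity as the degradation signal). The exact AME formula, prompt templates, and normalization are not given in the material above, so everything below is an illustrative assumption rather than the authors' implementation.

```python
# Hypothetical sketch of AME-style per-object weights from a frozen CLIP model;
# the prompts, the (1 - similarity) score, and the rescaling are assumptions, not the paper's.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["car", "person", "bicycle"]  # illustrative category set
# One mutually exclusive prompt per category.
prompt_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def ame_weights(crops, labels):
    """crops: list of PIL.Image object crops; labels: ground-truth class index per crop.
    Returns one weight per crop; a larger weight means stronger assumed semantic weakening."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(prompt_tokens)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sim = img_emb @ txt_emb.T                              # cosine similarity to every class prompt
    idx = torch.as_tensor(labels, device=device)
    own = sim[torch.arange(len(labels), device=device), idx]  # similarity to the true-class prompt
    weights = (1.0 - own).clamp(min=0.0)                   # negative-similarity reading of Figure 3
    return weights / (weights.max() + 1e-8)                # rescale to [0, 1] for use in CLIP-CE
```

In training, weights of this kind would multiply each object's cross-entropy term to form the CLIP-guided loss, so the optimizer spends more effort on the crops whose CLIP similarity to their class prompt has collapsed.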

If this is right

  • Detectors trained this way identify objects in haze more reliably because their features receive direct semantic reinforcement from language guidance.
  • The method sidesteps the instability that arises when an enhancement network is inserted before the detector.
  • Fine-tuned AME adapts the weights to the detector's current confidence, reducing imbalance during optimization.
  • The released HazyCOCO dataset of 61,258 images supplies a standardized benchmark for evaluating haze-robust detection.
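
For context on the last point: synthetic hazy datasets of this kind are typically built with the standard atmospheric scattering model, which the Figure 4 to Figure 6 captions below appear to use (clear image plus depth map, with a scattering coefficient controlling density). A minimal sketch follows; the actual HazyCOCO generation settings (airlight, beta range, depth clamping) are not specified here, so the values are placeholders.

```python
# Hypothetical sketch of standard atmospheric-scattering haze synthesis;
# the HazyCOCO parameters are assumptions, not the released pipeline.
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, airlight=0.9):
    """clear: HxWx3 float image in [0, 1]; depth: HxW depth map scaled to [0, 1]
    (e.g. from a monocular depth estimator such as Depth-Anything);
    beta: scattering coefficient controlling haze density."""
    t = np.exp(-beta * depth)[..., None]        # transmission t(x) = exp(-beta * d(x))
    hazy = clear * t + airlight * (1.0 - t)     # I(x) = J(x) t(x) + A (1 - t(x))
    return np.clip(hazy, 0.0, 1.0)

# Sweeping beta yields different haze densities, as visualized in Figure 6:
# hazy_versions = [synthesize_haze(img, depth, beta=b) for b in (0.5, 1.0, 2.0)]
```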

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same language-weighting idea could be tested on other degradations such as rain or low light where object semantics are similarly diluted.
  • If CLIP assessments prove stable across domains, the approach might allow a single detector to handle multiple weather types without separate enhancement modules for each.
  • Combining the CLIP-CE loss with a small amount of real hazy data could further close the gap between synthetic training and real-world performance.

Load-bearing premise

CLIP's pre-trained embeddings can reliably quantify how much haze has weakened each object's semantic content, and back-propagating the resulting weighted loss will strengthen the detector's features without causing instability or negative transfer.
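
One compact way to state the first half of this premise (our notation, not the paper's) is to define a per-object weakening score from CLIP's frozen image and text encoders and assume it grows monotonically with haze density:

```latex
s_i(\beta) \;=\; 1 - \cos\!\big(f_I(x_i(\beta)),\, f_T(t_{y_i})\big),
\qquad
\frac{\partial s_i}{\partial \beta} \;>\; 0 \quad \text{(assumed)},
```

where $x_i(\beta)$ is the object crop rendered at scattering coefficient $\beta$, $f_I$ and $f_T$ are CLIP's frozen image and text encoders, and $t_{y_i}$ is the prompt for the ground-truth class. The referee's first major comment below asks precisely for evidence of this monotonicity.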

What would settle it

Train the detector with CLIP-CE on the HazyCOCO dataset and evaluate average precision on a held-out set of real hazy photographs; if mAP does not exceed that of a plain detector or an enhancement-based baseline, the central claim is falsified.
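
A hedged sketch of that evaluation step, using the standard COCO protocol; the file names and the real-hazy ground truth are placeholders, not the paper's released artifacts.

```python
# Hypothetical evaluation harness for the falsification test above.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def real_hazy_map(gt_json, det_json):
    """gt_json: COCO-format annotations for held-out real hazy photos;
    det_json: detector outputs in COCO results format."""
    gt = COCO(gt_json)
    dt = gt.loadRes(det_json)
    ev = COCOeval(gt, dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]  # mAP at IoU 0.50:0.95

# plain   = real_hazy_map("real_hazy_gt.json", "plain_detector_results.json")
# clip_ce = real_hazy_map("real_hazy_gt.json", "clip_ce_detector_results.json")
# The central claim fails if clip_ce does not exceed plain (or an enhancement-based baseline).
```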

Figures

Figures reproduced from arXiv: 2604.10637 by Baodi Liu, Bingfeng Zhang, Dapeng Tao, Jian Pang, Jin Wang, Weifeng Liu.

Figure 1: Pipeline comparison. Our method can detect overlooked objects using …
Figure 2: Failure cases of IA-YOLO. GT: Ground Truth. Involving image …
Figure 3: Overall of our pipeline. In detection pipeline: The image is processed to obtain ROI features. The predicted logits of objects are combined with estimated weights to form CLIP-CE. In AME: We crop objects within the image and extract their visual embeddings. Meanwhile, we get text embeddings from mutually exclusive prompts. We calculate the negative similarity between these embeddings to determine the weigh…
Figure 4: Left part: Using the clear image and its depth map to generate …
Figure 5: Visualizations of the depth map with and without clamping. The depth map without clamping is directly generated by Depth-Anything. The HazyCOCO …
Figure 6: Visualizations for the HazyCOCO images with different hazy densities.
Figure 7: Visualization for the AME weights and Focal weights. The AME weights …
Figure 8: Visualization of detection results on HazyCOCO and RTTS datasets. We recommend examining the zoomed-in areas (highlighted with red dashed boxes) …
Figure 9: Visualization of detection results. We recommend examining the zoomed-in areas for a clearer comparison. GT: ground truth. IA-YOLO, UHDFour and …
read the original abstract

Object detection in hazy environments is challenging because degraded objects are nearly invisible and their semantics are weakened by environmental noise, making it difficult for detectors to identify. Common approaches involve image enhancement to boost weakened semantics, but these methods are limited by the instability of enhanced modules. This paper proposes a novel solution by employing language prompts to enhance weakened semantics without image enhancement. Specifically, we design Approximation of Mutual Exclusion (AME) to provide credible weights for Cross-Entropy Loss, resulting in CLIP-guided Cross-Entropy Loss (CLIP-CE). The provided weights assess the semantic weakening of objects. Through the backpropagation of CLIP-CE, weakened semantics are enhanced, making degraded objects easier to detect. In addition, we present Fine-tuned AME (FAME) which adaptively fine-tunes the weight of AME based on the predicted confidence. The proposed FAME compensates for the imbalanced optimization in AME. Furthermore, we present HazyCOCO, a large-scale synthetic hazy dataset comprising 61258 images. Experimental results demonstrate that our method achieves state-of-the-art performance. The code and dataset will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes replacing image enhancement with language-prompt guidance from a frozen CLIP model for object detection under haze. It introduces an Approximation of Mutual Exclusion (AME) to derive per-object weights for a modified cross-entropy loss (CLIP-CE), a Fine-tuned AME (FAME) variant that adapts those weights by the predicted confidence, and the synthetic HazyCOCO dataset (61k images). The central claim is that back-propagating through CLIP-CE strengthens detector features for semantically weakened objects, yielding state-of-the-art detection performance without the instability of enhancement modules.

Significance. If the AME weighting is shown to be reliable and the back-propagation effect is isolated, the approach would provide a lightweight, training-stable alternative to enhancement pipelines that is directly compatible with existing detectors. The public release of HazyCOCO and code would further increase impact by enabling controlled benchmarking of haze-robust detection methods.

major comments (3)
  1. [§3.2] §3.2 (AME definition): the manuscript does not supply a derivation or empirical verification that cosine similarity between CLIP image embeddings and language prompts decreases monotonically with increasing atmospheric scattering parameters; without this, the claim that AME supplies 'credible weights' for semantic weakening remains ungrounded.
  2. [§4] §4 (Experiments): no ablation isolates the contribution of gradient flow through CLIP-CE versus the base detector loss or FAME adaptation; tables report only final mAP, leaving open whether the detector features actually move toward cleaner representations or simply overfit the auxiliary signal.
  3. [§3.3] §3.3 (FAME): the adaptive fine-tuning rule is described only at the level of 'compensates for imbalanced optimization'; the precise functional form, the hyper-parameters that are learned versus fixed, and any stability analysis under back-propagation are missing, making it impossible to assess whether FAME introduces new instabilities.
minor comments (2)
  1. [§1] The abstract and §1 repeatedly contrast the method with 'image enhancement' but never cite the specific enhancement baselines used in the experiments; a table listing the exact enhancement modules and their training protocols would improve clarity.
  2. [§3] Notation for the CLIP-CE loss (Eq. 3 or equivalent) mixes the standard cross-entropy term with the AME scalar without explicitly showing the gradient path through the frozen CLIP encoder; a short derivation of the effective gradient on the detector backbone would help readers.
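
On the second minor point: assuming CLIP-CE is simply the AME-weighted softmax cross-entropy described in the abstract, with the weight treated as a constant during back-propagation (the frozen-CLIP reading adopted elsewhere on this page), the effective gradient on the detector's classification logits is the usual cross-entropy gradient scaled per object. This is our reconstruction, not an equation from the paper.

```latex
\mathcal{L}_{\text{CLIP-CE}} \;=\; -\sum_{i} w_i^{\text{AME}} \log p_{i,y_i},
\qquad
p_{i,c} \;=\; \frac{e^{z_{i,c}}}{\sum_{c'} e^{z_{i,c'}}},
\qquad
\frac{\partial \mathcal{L}_{\text{CLIP-CE}}}{\partial z_{i,c}}
\;=\; w_i^{\text{AME}} \big( p_{i,c} - \mathbf{1}[c = y_i] \big).
```

Under this reading no gradient reaches the CLIP encoder at all; CLIP contributes only the scalar $w_i^{\text{AME}}$, which is also what the circularity check below relies on.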

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the manuscript. We address each major comment below and indicate the specific revisions planned.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (AME definition): the manuscript does not supply a derivation or empirical verification that cosine similarity between CLIP image embeddings and language prompts decreases monotonically with increasing atmospheric scattering parameters; without this, the claim that AME supplies 'credible weights' for semantic weakening remains ungrounded.

    Authors: We agree that an explicit derivation and empirical verification are needed to ground the use of AME weights. In the revised manuscript we will add a short theoretical derivation based on the atmospheric scattering model and the behavior of CLIP embeddings under increasing haze dominance, together with empirical plots of cosine similarity versus scattering coefficient on controlled synthetic data. These additions will directly support the claim that AME supplies credible weights for semantic weakening. revision: yes

  2. Referee: [§4] §4 (Experiments): no ablation isolates the contribution of gradient flow through CLIP-CE versus the base detector loss or FAME adaptation; tables report only final mAP, leaving open whether the detector features actually move toward cleaner representations or simply overfit the auxiliary signal.

    Authors: We acknowledge that isolating the gradient-flow contribution and verifying feature improvement are essential. We will add ablations that block gradients from CLIP-CE while retaining the base loss, report the resulting mAP drop, and include feature-level analysis (e.g., cosine similarity of detector embeddings to clean-image references and t-SNE visualizations) to demonstrate that representations move toward cleaner semantics rather than merely overfitting the auxiliary signal (a minimal sketch of such a gradient-blocking harness appears after these responses). Updated tables and figures will present these results. revision: yes

  3. Referee: [§3.3] §3.3 (FAME): the adaptive fine-tuning rule is described only at the level of 'compensates for imbalanced optimization'; the precise functional form, the hyper-parameters that are learned versus fixed, and any stability analysis under back-propagation are missing, making it impossible to assess whether FAME introduces new instabilities.

    Authors: We recognize that the current description of FAME is too high-level. In the revision we will supply the exact mathematical formulation of the adaptive weighting rule, explicitly state which scalars are learned versus held fixed, and add a stability analysis comprising gradient-norm statistics and loss-convergence curves under back-propagation. These additions will allow readers to evaluate whether FAME introduces instabilities. revision: yes
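
A minimal PyTorch sketch of the gradient-isolation ablation promised in response 2; the loss names, modes, and weighting are placeholders rather than the authors' implementation.

```python
# Hypothetical harness: keep CLIP-CE in the objective and logs while optionally
# blocking its gradient, so any mAP change isolates the gradient-flow contribution.
import torch

def total_loss(base_loss: torch.Tensor, clip_ce_loss: torch.Tensor, mode: str = "full") -> torch.Tensor:
    """mode 'full':    base detection loss + CLIP-CE, gradients flow from both terms;
       mode 'blocked': CLIP-CE is still computed and logged but detached, so only the
                       base loss updates the detector;
       mode 'base':    base detection loss only."""
    if mode == "full":
        return base_loss + clip_ce_loss
    if mode == "blocked":
        return base_loss + clip_ce_loss.detach()
    return base_loss

# Training-step fragment (detector, optimizer, base_loss, clip_ce_loss assumed given):
# loss = total_loss(base_loss, clip_ce_loss, mode="blocked")
# optimizer.zero_grad(); loss.backward(); optimizer.step()
# Comparing final mAP across the three modes separates the effect of CLIP-CE's gradient
# from that of merely adding the term to the reported objective.
```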

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained CLIP embeddings

full rationale

The paper's central mechanism defines AME to compute per-object weights from cosine similarities between CLIP vision embeddings and language prompts, then applies those weights to modulate the cross-entropy loss. Because CLIP is a frozen, externally pre-trained model whose embeddings are not fitted or redefined inside this work, the weighting step does not reduce to a self-definition, a fitted input renamed as prediction, or a self-citation chain. No equation in the abstract or described method equates the loss modulation back to the detector's own outputs by construction. The weighting therefore rests on an independent external reference rather than on the detector's own predictions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The approach rests on CLIP providing usable semantic degradation signals for hazy inputs and on the AME approximation being sufficiently accurate to produce beneficial weights; no free parameters are explicitly named in the abstract but the loss weighting and FAME adaptation imply tunable components.

free parameters (2)
  • AME weighting parameters
    Parameters defining the approximation of mutual exclusion for loss weights; their specific values or fitting procedure are not detailed.
  • FAME adaptation hyperparameters
    Parameters controlling adaptive fine-tuning of AME weights based on predicted confidence.
axioms (1)
  • domain assumption: CLIP embeddings remain informative for assessing object semantic weakening even under haze degradation
    The method uses CLIP outputs directly to generate loss weights without additional adaptation or validation for hazy inputs.
invented entities (2)
  • CLIP-CE loss (no independent evidence)
    purpose: To provide weighted supervision that enhances weakened object semantics via backpropagation
    New loss formulation combining CLIP guidance with cross-entropy.
  • AME approximation (no independent evidence)
    purpose: To generate credible per-object weights for the loss
    Novel weighting scheme introduced to approximate mutual exclusion.

pith-pipeline@v0.9.0 · 5514 in / 1450 out tokens · 59561 ms · 2026-05-10T16:28:42.454150+00:00 · methodology

discussion (0)

