arxiv: 2606.31147 · v1 · pith:UJCS2VWXnew · submitted 2026-06-30 · 💻 cs.CV

WaterGen: Decoupling Scene and Medium in Underwater Image Generation

Pith reviewed 2026-07-01 06:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords underwater image generation · latent diffusion · scene-medium decoupling · synthetic underwater data · underwater restoration · semantic segmentation · physical degradation modeling

The pith

Decoupling scene generation from water medium effects inside a latent diffusion model produces controllable and physically accurate underwater images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome limited training data for underwater vision by creating synthetic images in which the objects and layout can be varied independently from the scattering and absorption caused by the water itself. It achieves this by first fine-tuning a latent diffusion U-Net solely on clean underwater scenes so the model learns diverse scene content, then treating the physical degradation as a separate conditional decoding step applied to those latent embeddings. A reader would care because the separation removes the usual trade-off between scene variety and accurate water physics, yielding large datasets whose pseudo-labels and appearance both match real conditions more closely than earlier synthesis methods.

Core claim

Underwater image synthesis can be decomposed into two stages inside a latent diffusion framework: first, fine-tuning the U-Net on degradation-free underwater images to generate diverse latent embeddings of scene content; second, formulating the physically accurate medium degradation as a conditional decoding process on those embeddings. This produces images with independent control over scene structure and water appearance, from which large-scale synthetic datasets can be built that improve downstream underwater restoration and semantic segmentation performance.

What carries the argument

Two-stage latent diffusion process that first generates scene-content embeddings and then applies conditional medium degradation synthesis.
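
To make the decoupling concrete, the second stage can be pictured with the simplified underwater image formation model, in which a clean scene is attenuated with distance and blended with background light. The sketch below is an illustrative stand-in, not the paper's learned conditional decoder: it applies the closed-form model directly in pixel space with NumPy. The function name `apply_medium`, the random toy scene, the water-type names, and all coefficient values are assumptions chosen for illustration.

```python
import numpy as np

def apply_medium(clean, depth, beta, b_inf):
    """Simplified formation model: I = J * t + B_inf * (1 - t), with t = exp(-beta * d).

    clean: (H, W, 3) float array in [0, 1], the degradation-free scene J.
    depth: (H, W) float array of scene distances d (metres).
    beta:  three per-channel attenuation coefficients (1/m); hypothetical values.
    b_inf: three-channel background light colour in [0, 1]; hypothetical values.
    """
    beta = np.asarray(beta, dtype=np.float64)
    b_inf = np.asarray(b_inf, dtype=np.float64)
    t = np.exp(-depth[..., None] * beta)      # (H, W, 3) transmission map
    return clean * t + b_inf * (1.0 - t)      # attenuated signal + backscatter

# "One scene, multiple waters": the scene stays fixed, only the medium changes.
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, size=(4, 4, 3))     # toy stand-in for a clean scene
depth = np.linspace(0.5, 8.0, 16).reshape(4, 4)   # toy depth map

waters = {
    "clear_blue": dict(beta=(0.40, 0.10, 0.05), b_inf=(0.05, 0.35, 0.60)),
    "green_coastal": dict(beta=(0.60, 0.15, 0.30), b_inf=(0.10, 0.45, 0.30)),
}
renders = {name: apply_medium(scene, depth, **params) for name, params in waters.items()}
```

In this toy version the scene array never depends on the water parameters, which is the property the two-stage design is meant to preserve; WaterGen itself is described as operating on latent embeddings with a trained decoder rather than on pixels.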

If this is right

  • Large synthetic datasets become feasible with independent control over scene diversity and water conditions.
  • Training on the generated data raises accuracy in underwater image restoration tasks.
  • Semantic segmentation models benefit from both realistic appearance and reliable pseudo-labels.
  • The approach removes the realism-controllability trade-off present in prior single-stage underwater synthesis methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of content and medium could be tested on other scattering environments such as fog or turbid air.
  • Varying water parameters at inference time without retraining the scene generator may support rapid simulation of different ocean conditions.
  • The resulting datasets could serve as a testbed for studying how much physical accuracy is actually required for downstream task gains.

Load-bearing premise

Scene generation and medium modeling can be decoupled within a latent diffusion framework while preserving both realism and physical accuracy of the water effects.

What would settle it

A direct comparison showing that images generated with independently varied medium parameters fail to reproduce the measured attenuation and backscattering statistics of real paired clean-degraded underwater photographs of identical scenes.
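
One way such a comparison might be set up is to estimate medium coefficients from a clean/degraded pair and compare them with the values that were requested at generation time. The sketch below is a minimal illustration under strong assumptions: a simplified single-coefficient-per-channel formation model, a known depth map, and noise-free toy data. The helpers `synthesize` and `recover_medium`, the grid range, and the coefficient values are hypothetical and do not come from the paper.

```python
import numpy as np

def synthesize(clean, depth, beta, b_inf):
    """Simplified model: I = J * t + B_inf * (1 - t), with t = exp(-beta * d)."""
    t = np.exp(-depth[..., None] * np.asarray(beta, dtype=np.float64))
    return clean * t + np.asarray(b_inf, dtype=np.float64) * (1.0 - t)

def recover_medium(clean, degraded, depth, beta_grid):
    """Per-channel grid search over beta; B_inf is fitted in closed form for each beta."""
    d = depth.ravel()
    betas, b_infs = [], []
    for c in range(3):
        j = clean[..., c].ravel()
        i = degraded[..., c].ravel()
        best_err, best_beta, best_b = np.inf, None, None
        for beta in beta_grid:
            t = np.exp(-beta * d)
            w = 1.0 - t
            denom = np.dot(w, w)
            # Least-squares B_inf for this beta: minimise ||(i - j*t) - b*w||^2.
            b = np.dot(i - j * t, w) / denom if denom > 0 else 0.0
            err = np.sum((i - j * t - b * w) ** 2)
            if err < best_err:
                best_err, best_beta, best_b = err, beta, b
        betas.append(best_beta)
        b_infs.append(best_b)
    return np.array(betas), np.array(b_infs)

# Toy check: synthesise with known coefficients, then try to recover them.
rng = np.random.default_rng(1)
clean = rng.uniform(0.0, 1.0, size=(16, 16, 3))
depth = rng.uniform(0.5, 10.0, size=(16, 16))
true_beta, true_b = (0.40, 0.10, 0.05), (0.05, 0.35, 0.60)
degraded = synthesize(clean, depth, true_beta, true_b)

beta_grid = np.linspace(0.01, 1.0, 100)   # step 0.01, so the true values lie on the grid
est_beta, est_b = recover_medium(clean, degraded, depth, beta_grid)
```

With real photographs the clean reference, the depth map, and the sensor noise would all be uncertain, so a mismatch between requested and recovered coefficients would need to be judged against that measurement error rather than against zero.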

Figures

Figures reproduced from arXiv: 2606.31147 by Christopher Metzler, Cornelia Fermuller, Dehao Yuan, Jiayi Wu, Md Jahidul Islam, Tianfu Wang, Tianyi Xiong, Xiaomin Lin, Yiannis Aloimonos.

Figure 1: We introduce WaterGen, an underwater image generation method…
Figure 2: Advantage of our WaterGen decoupled generation model.
Figure 3: Overview of the WaterGen pipeline. Given a text prompt and water parameters Φ, WaterGen decouples scene generation from medium control to synthesize realistic underwater images. A LoRA-adapted denoising U-Net first produces clean scene latents from text. A medium-conditioned decoder then applies attenuation and scattering consistent with Φ. This design natively supports one scene, multiple waters: the same…
Figure 4: Training pipeline of WaterGen. We adopt a two-stage isolated training strategy to decouple scene generation from medium degradation. Stage 1 fine-tunes a latent diffusion backbone with LoRA on restored, water-free underwater images to learn scene geometry and layout. Stage 2 independently trains the decoder on physically accurate degraded terrestrial data to learn to conditionally inject water medium effec…
Figure 5: Qualitative results on synthesis fidelity and diverse water types com…
Figure 6: Qualitative comparison under strong underwater degradation.
Figure 7: Qualitative ablation study on different medium injection mecha…
Figure 8: Qualitative decoder ablation on degradation-free image generation.
Figure 9: Depth-map visualizations for generated scenes.
Figure 10: Qualitative restoration comparison. From top to bottom: input underwater images, Phaseformer [23] trained on real data only, and Phaseformer trained on real data augmented with our generated data. Training with our data produces clearer structure, improved contrast, and more faithful colors across diverse scenes.
Figure 11: Additional restoration comparisons across training settings.
Figure 12: Qualitative segmentation comparison. SegFormer trained with Real+Ours produces more complete object masks under strong underwater degradation, with fewer missed regions and sharper boundaries than the compared training settings
Figure 13: Visual ablation of internal training components.
Figure 14: Precise medium degradation synthesis. We fix a single clean scene latent and vary the physical medium parameters to visualize how WaterGen independently controls underwater appearance. Columns change the background light B∞ (shown as RGB triplets at the bottom), while rows vary the attenuation coefficients β. The bottom image shows the degradation-free reference. WaterGen produces smooth and physically c…
Figure 15: Diverse zero-shot generation of underwater environments.
Original abstract

Underwater computer vision tasks, such as detection, restoration, and segmentation, are limited by the scarcity of large-scale and diverse training data. We introduce WaterGen, a method for generating large-scale, realistic, and diverse underwater images that provides independent control of the scene and water medium conditions. Our approach treats underwater image generation as the decoupled control of two factors: realistic and diverse scene content (what is in the image), and accurate and controllable water medium effects (what the water does to the image). Existing methods generally achieve only part of this objective: they either provide controllability with limited realism or diversity, or generate realistic scenes without accurately and independently modeling water-medium effects. Our key insight, that allows us to avoid this compromise, is that scene generation and medium modeling can be decoupled within a latent diffusion framework, enabling diverse scene generation together with accurate and controllable underwater appearance. To do this, we decompose underwater image synthesis into two stages. First, we fine-tune the latent diffusion U-Net using degradation-free underwater images so that it learns to generate diverse and realistic latent embeddings of underwater scene content without medium-induced degradation. Second, we formulate the physically accurate medium degradation synthesis as a conditional decoding process applied to these latent embeddings. This decoupled design allows our model to generate diverse scenes with full control of underwater appearance. We leverage WaterGen to build large-scale synthetic underwater datasets that are diverse in scene structures and accurate in water effects and pseudo-labels. We demonstrate that our synthetic data consistently improve downstream performance in underwater restoration and semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes WaterGen, a two-stage latent diffusion approach for underwater image synthesis that decouples realistic scene content generation (via fine-tuning a U-Net on degradation-free images) from physically accurate medium degradation (via conditional decoding of the resulting latents). This enables independent control over scene diversity and water effects (attenuation, scattering) to produce large-scale synthetic datasets, which are shown to improve downstream underwater restoration and semantic segmentation performance.

Significance. If the decoupling holds and the medium stage produces physically grounded effects rather than merely plausible ones, the method could meaningfully alleviate data scarcity in underwater CV by supplying controllable, diverse training data with reliable pseudo-labels. The explicit two-stage design within a latent diffusion framework is a clear conceptual strength that avoids the realism-controllability trade-offs noted in prior work.

major comments (2)
  1. [§3.2] §3.2 (medium degradation synthesis): The claim that the conditional decoding stage produces 'physically accurate' water effects is load-bearing for the central contribution, yet the manuscript provides no direct validation against the image formation model (e.g., recovered attenuation coefficients, scattering parameters, or Jerlov-type transmission maps). Downstream task gains can arise from any consistent appearance shift and therefore do not secure the physical-accuracy assertion.
  2. [§4] §4 (experiments): The reported improvements on restoration and segmentation lack an ablation that isolates the effect of independent medium control from the benefit of simply having more diverse scene content; without this, it is unclear whether the decoupling itself drives the gains or whether a single-stage model with comparable scene diversity would suffice.
minor comments (2)
  1. [§3] Notation for the conditional decoder inputs (latent embeddings vs. explicit medium parameters) is introduced without a consolidated table or diagram, making the two-stage pipeline harder to follow on first reading.
  2. [Figures 4-6] Figure captions for the generated examples should explicitly state the water-type or parameter settings used so readers can verify controllability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, clarifying our approach and indicating where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (medium degradation synthesis): The claim that the conditional decoding stage produces 'physically accurate' water effects is load-bearing for the central contribution, yet the manuscript provides no direct validation against the image formation model (e.g., recovered attenuation coefficients, scattering parameters, or Jerlov-type transmission maps). Downstream task gains can arise from any consistent appearance shift and therefore do not secure the physical-accuracy assertion.

    Authors: We agree that direct empirical validation against the image formation model would provide stronger support for the physical-accuracy claim. Our conditional decoding stage is formulated to implement the standard underwater image formation model (attenuation and scattering) as a learned conditional process on the latent embeddings. While this design ensures the effects are grounded in the physical model by construction, we did not include parameter-recovery experiments. We will add such validation in the revision, for instance by synthesizing images with known coefficients and demonstrating recovery of those parameters from the generated outputs. revision: yes

  2. Referee: [§4] §4 (experiments): The reported improvements on restoration and segmentation lack an ablation that isolates the effect of independent medium control from the benefit of simply having more diverse scene content; without this, it is unclear whether the decoupling itself drives the gains or whether a single-stage model with comparable scene diversity would suffice.

    Authors: This is a fair point; an explicit ablation would better isolate the contribution of the decoupling. The two-stage design is motivated precisely by the need for independent control, which a single-stage model cannot achieve without trading off either scene diversity or medium accuracy. We will add an ablation comparing WaterGen against a single-stage latent diffusion baseline trained on equivalent scene content, quantifying the gains attributable to separate medium control in both controllability metrics and downstream task performance. revision: yes

Circularity Check

0 steps flagged

No circularity; forward two-stage synthesis pipeline with external empirical validation

Full rationale

The paper describes a constructive two-stage latent diffusion method: (1) fine-tuning a U-Net on degradation-free underwater images to generate scene latents, then (2) applying medium degradation via conditional decoding. This is a proposed generative architecture, not a derivation that reduces any claimed result to its own inputs by construction. The downstream improvements in restoration and segmentation are presented as empirical outcomes on separate tasks, not as quantities fitted or renamed from the generation process itself. No self-citations, uniqueness theorems, ansatzes, or fitted-input-as-prediction patterns appear in the provided abstract or description. The derivation chain is self-contained as an engineering pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about latent diffusion models and image formation physics; no free parameters, new entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (2)
  • Domain assumption: Latent diffusion U-Nets can be fine-tuned on degradation-free images to capture scene content independently of medium effects.
    Invoked in the first stage of the two-stage synthesis process.
  • Domain assumption: Underwater medium degradation can be accurately modeled as a conditional decoding process applied to latent scene embeddings.
    Invoked in the second stage of the synthesis process.

pith-pipeline@v0.9.1-grok · 5837 in / 1214 out tokens · 37608 ms · 2026-07-01T06:42:48.288109+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 13 canonical work pages · 4 internal anchors

  [1] Akkaynak, D., Treibitz, T.: A Revised Underwater Image Formation Model. In: CVPR. pp. 6723–6732 (2018)
  [2] Blasinski, H., Farrell, J.: A three parameter underwater image formation model. Electronic Imaging 2016(18), 1–8 (2016). https://doi.org/10.2352/ISSN.2470-1173.2016.18.DPMI-252
  [3] Bochkovskii, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S.R., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073 (2024)
  [4] Cai, H., He, J., Qiao, Y., Dong, C.: Toward interactive modulation for photo-realistic image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 294–303 (2021)
  [5] Cai, H., Huang, T.W., Gehlot, S., Feng, B.Y., Shah, S., Su, G.M., Metzler, C.: Parametric shadow control for portrait generation in text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18207–18217 (2025)
  [6] Desai, C., Benur, S., Patil, U., Mudenagudi, U.: Rsuigm: Realistic synthetic underwater image generation with image formation model. ACM Transactions on Multimedia Computing, Communications and Applications 21(1), 1–22 (2024)
  [7] Desai, C., Tabib, R.A., Reddy, S.S., Patil, U., Mudenagudi, U.: Ruig: Realistic underwater image generation towards restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2181–2189 (2021)
  [8] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  [9] He, B., Shahidzadeh, A.H., Chen, Y., Wu, J., Guan, T., Chen, G., Choset, H., Manocha, D., Chou, G., Fermuller, C., et al.: Navmoe: Hybrid model- and learning-based traversability estimation for local navigation via mixture of experts. arXiv preprint arXiv:2509.12747 (2025)
  [10] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  [11] Hong, L., Wang, X., Zhang, G., Zhao, M.: Usod10k: a new benchmark dataset for underwater salient object detection. IEEE transactions on image processing 34, 1602–1615 (2023)
  [12] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)
  [13] Islam, M.J., Edge, C., Xiao, Y., Luo, P., Mehtaz, M., Morse, C., Enan, S.S., Sattar, J.: Semantic segmentation of underwater imagery: Dataset and benchmark. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). pp. 1769–1776. IEEE (2020)
  [14] Islam, M.J., Li, A.Q., Girdhar, Y.A., Rekleitis, I.: Computer vision applications in underwater robotics and oceanography. In: Computer Vision, pp. 173–204. Chapman and Hall/CRC (2024)
  [15] Islam, M.J., Xia, Y., Sattar, J.: Fast underwater image enhancement for improved visual perception. IEEE robotics and automation letters 5(2), 3227–3234 (2020)
  [16] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1125–1134 (2017)
  [17] Jaffe, J.S.: Computer modeling and the design of optimal underwater imaging systems. IEEE journal of oceanic engineering 15(2), 101–111 (1990)
  [18] Jia, Y., Lin, Q., Li, H., Li, Y., Kwong, S., Cong, R.: Vit-uwa: Vision transformer underwater-adapter for dense predictions beneath the water surface. IEEE Transactions on Image Processing (2026)
  [19] Jia, Y., Hoyer, L., Huang, S., Wang, T., Van Gool, L., Schindler, K., Obukhov, A.: Dginstyle: Domain-generalizable semantic segmentation with image diffusion models and stylized semantic control. In: European Conference on Computer Vision. pp. 91–109. Springer (2024)
  [20] Kaneko, R., Ueda, T., Higashi, H., Tanaka, Y.: Phiswid: Physics-inspired underwater image dataset synthesized from rgb-d images. arXiv preprint arXiv:2404.03998 (2024)
  [21] Ke, B., Qu, K., Wang, T., Metzger, N., Huang, S., Li, B., Obukhov, A., Schindler, K.: Marigold: Affordable adaptation of diffusion-based image generators for image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  [22] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer. In: ICCV. pp. 5148–5157 (2021)
  [23] Khan, M., Negi, A., Kulkarni, A., Phutke, S.S., Vipparthi, S.K., Murala, S.: Phaseformer: Phase-based attention mechanism for underwater image restoration and beyond. arXiv preprint arXiv:2412.01456 (2024)
  [24] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  [25] Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S., Tao, D.: An underwater image enhancement benchmark dataset and beyond. IEEE transactions on image processing 29, 4376–4389 (2019)
  [26] Li, H., Lian, S., Li, Z., Cong, R., Li, C., Yang, L.T., Zhang, W., Kwong, S.: Advancing marine research: Uwsam framework and uiis10k dataset for precise underwater instance segmentation. arXiv preprint arXiv:2505.15581 (2025)
  [27] Li, H., Lin, G., Li, Z., Kwong, S., Cong, R.: Fscdiff: Frequency-spatial entangled conditional diffusion model for underwater salient object detection. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 8379–8388 (2025)
  [28] Li, J., Skinner, K.A., Eustice, R.M., Johnson-Roberson, M.: Watergan: Unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robotics and Automation letters 3(1), 387–394 (2017)
  [29] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)
  [30] Lian, S., Li, H., Cong, R., Li, S., Zhang, W., Kwong, S.: Watermask: Instance segmentation for underwater imagery. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1305–1315 (2023)
  [31] Lian, S., Zhang, Z., Li, H., Li, W., Yang, L.T., Kwong, S., Cong, R.: Diving into underwater: Segment anything model guided underwater salient instance segmentation and a large-scale dataset. In: Proceedings of the 41st International Conference on Machine Learning. pp. 29545–29559 (2024)
  [32] Lin, H., Liang, D., Qi, Z., Bai, X.: A unified image-dense annotation generation model for underwater scenes. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 961–970 (2025)
  [33] Lv, Q., Dong, J., Li, Y., Chen, S., Yu, H., Zhang, S., Wang, W.: Uwstereo: A large synthetic dataset for underwater stereo matching. IEEE Transactions on Circuits and Systems for Video Technology (2025)
  [34] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI conference on artificial intelligence (2024)
  [35] Panetta, K., Gao, C., Agaian, S.: Human-visual-system-inspired underwater image quality measures. IEEE Journal of Oceanic Engineering 41(3), 541–551 (2015)
  [36] Peng, Y.T., Chen, Y.R., Chen, G.R., Liao, C.J.: Histoformer: Histogram-based transformer for efficient underwater image enhancement. IEEE Journal of Oceanic Engineering (2024)
  [37] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  [38] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  [39] Rajyaguru, N., Wang, T., Tajne, A., He, B., Wu, J., Fermuller, C., Metzler, C., Aloimonos, Y.: Polardepth: Polarization-guided monocular depth for visual odometry. IEEE Robotics and Automation Letters (2026)
  [40] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  [41] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, 36479–36494 (2022)
  [42] Sharma, P., Bisht, I., Sur, A.: Wavelength-based attributed deep neural network for underwater image restoration. ACM TOMM (2023)
  [43] Siddique, M.A.B., Ramesh, V., Liu, J., Singh, P., Islam, M.J.: Ustyle: Waterbody style transfer of underwater scenes by depth-guided feature synthesis. IEEE Journal of Oceanic Engineering (JOE) (2025)
  [44] Siddique, M.A.B., Wu, J., Rekleitis, I., Islam, M.J.: Aquafuse: Waterbody fusion for physics-guided view synthesis of underwater scenes. IEEE Robotics and Automation Letters (2025)
  [45] Song, W., Wang, Y., Huang, D., Tjondronegoro, D.: A rapid scene depth estimation model based on underwater light attenuation prior for underwater image restoration. In: Pacific rim conference on multimedia. pp. 678–688. Springer (2018)
  [46] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: ICLR (2021)
  [47] Ueda, T., Yamada, K., Tanaka, Y.: Underwater image synthesis from rgb-d images and its application to deep underwater image restoration. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 2115–2119. IEEE (2019)
  [48] Wang, C., Li, H., Li, C., Liu, H., Tang, X., Kwong, S.: Expose camouflage in the water: Underwater camouflaged instance segmentation and dataset. IEEE Transactions on Image Processing (2026)
  [49] Wang, N., Zhou, Y., Han, F., Zhu, H., Yao, J.: Uwgan: Underwater gan for real-world underwater color restoration and dehazing. arXiv preprint arXiv:1912.10269 (2019)
  [50] Wang, T., Kanakis, M., Schindler, K., Van Gool, L., Obukhov, A.: Breathing new life into 3d assets with generative repainting. arXiv preprint arXiv:2309.08523 (2023)
  [51] Wang, T., Xie, M., Cai, H., Shah, S., Metzler, C.A.: Flash-split: 2d reflection removal with flash cues and latent diffusion separation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5688–5698 (2025)
  [52] Wu, J.: Low-cost depth estimation and 3d reconstruction in scattering medium. Ph.D. thesis, University of Florida (2023)
  [53] Wu, J., Cai, H., Fermuller, C., Metzler, C., Aloimonos, Y.: Real2sam2real: Generative 3d caches as complementary context for video diffusion. arXiv preprint arXiv:2606.00299 (2026)
  [54] Wu, J., Lin, X., He, B., Fermüller, C., Aloimonos, Y.: Viewactive: Active viewpoint optimization from a single image. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 11812–11818. IEEE (2025)
  [55] Wu, J., Lin, X., Negahdaripour, S., Fermüller, C., Aloimonos, Y.: Marvis: Motion & geometry aware real and virtual image segmentation. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2778–2785. IEEE (2024)
  [56] Wu, J., Wang, T., Siddique, M.A.B., Islam, M.J., Fermuller, C., Aloimonos, Y., Metzler, C.A.: Single-step latent diffusion for underwater image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  [57] Wu, J., Yu, B., Islam, M.J.: 3d reconstruction of underwater scenes using nonlinear domain projection. In: 2023 IEEE Conference on Artificial Intelligence (CAI). pp. 359–361. IEEE (2023)
  [58] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems 34, 12077–12090 (2021)
  [59] Xiong, T., Wu, J., He, B., Fermuller, C., Aloimonos, Y., Huang, H., Metzler, C.A.: Event3dgs: Event-based 3d gaussian splatting for high-speed robot egomotion. arXiv preprint arXiv:2406.02972 (2024)
  [60] Xu, D., Zhou, J., Liu, Y., Min, X.: Underwater image enhancement based on hybrid enhanced generative adversarial network. Journal of Marine Science and Engineering 11(9), 1657 (2023)
  [61] Yang, D., Zhang, T., Li, B., Li, M., Chen, W., Li, X., Wang, X.: Underwater image translation via multi-scale generative adversarial network. Journal of Marine Science and Engineering 11(10), 1929 (2023)
  [62] Yu, B., Wu, J., Islam, M.J.: Udepth: Fast monocular depth estimation for visually-guided underwater robots. arXiv preprint arXiv:2209.12358 (2022)
  [63] Yuan, D., Burner, L., Wu, J., Liu, M., Chen, J., Aloimonos, Y., Fermüller, C.: Learning normal flow directly from events. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7969–7979 (2025)
  [64] Zhang, F., You, S., Li, Y., Fu, Y.: Atlantis: Enabling underwater depth estimation with stable diffusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11852–11861 (2024)
  [65] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)
  [66] Zhao, Q., Xin, Z., Yu, Z., Zheng, B.: Unpaired underwater image synthesis with a disentangled representation for underwater depth map prediction. Sensors 21(9), 3268 (2021)
  [67] Zhao, Q., Zheng, Z., Zeng, H., Yu, Z., Zheng, H., Zheng, B.: The synthesis of unpaired underwater images for monocular underwater depth prediction. Frontiers in Marine Science 8, 690962 (2021)
  [68] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)