Semantic-guided Gaussian Splatting for High-Fidelity Underwater Scene Reconstruction

Brett Seymour; Guoxi Huang; Haoran Wang; Nantheera Anantrasirichai; Zhuodong Jiang

arxiv: 2509.00800 · v3 · submitted 2025-08-31 · 💻 cs.CV

Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, Nantheera Anantrasirichai

Pith reviewed 2026-05-18 19:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords underwater scene reconstruction · 3D Gaussian Splatting · semantic guidance · CLIP embeddings · neural rendering · adaptive primitive allocation · photogrammetry · visibility degradation

The pith

Augmenting 3D Gaussians with CLIP semantic features and adaptive reallocation improves underwater reconstruction where photometric signals alone fall short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SWAGSplatting to address imbalanced reconstruction in underwater scenes, where scattering and attenuation create regions of uneven information quality. It augments each Gaussian primitive with a learnable semantic feature supervised by region-level CLIP embeddings and introduces a semantic consistency loss to align the geometry with high-level semantics. An adaptive reallocation strategy redistributes primitives according to importance and reconstruction error to better cover low-visibility areas. Experiments across real underwater datasets show consistent gains in PSNR, SSIM, and LPIPS over prior methods. This approach aims to produce more coherent 3D models without raising computational cost.

Core claim

Each Gaussian primitive is augmented with a learnable semantic feature supervised by CLIP-based region embeddings. A semantic consistency loss aligns the geometric reconstruction with these high-level semantics, while an adaptive reallocation strategy redistributes representation capacity based on primitive importance and error to mitigate imbalance from conventional densification. The result is improved structural coherence, preserved object boundaries, and more effective modeling of low-visibility regions in underwater environments.

What carries the argument

Learnable semantic features attached to each Gaussian primitive and aligned through a semantic consistency loss to CLIP-derived region embeddings, paired with an adaptive primitive reallocation mechanism driven by reconstruction error.
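
As a rough illustration of the alignment described above, the sketch below scores how far a set of rendered semantic features sits from its target region embeddings. It assumes a cosine-distance form, which is a common choice but is not confirmed here as the paper's own loss; the function names and the idea of one pooled feature vector per region are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no semantic signal
    return dot / (norm_a * norm_b)


def semantic_consistency_loss(rendered_features, region_embeddings):
    """Mean (1 - cosine similarity) over regions.

    rendered_features: one feature vector per region, e.g. rendered
        per-Gaussian semantic features pooled over that region (assumption).
    region_embeddings: the matching CLIP embedding for each region.
    """
    if len(rendered_features) != len(region_embeddings):
        raise ValueError("need one target embedding per region")
    if not rendered_features:
        return 0.0
    total = sum(
        1.0 - cosine_similarity(f, e)
        for f, e in zip(rendered_features, region_embeddings)
    )
    return total / len(rendered_features)
```

Under this assumed form the loss is 0 when every feature points in the same direction as its target, 1 when they are orthogonal, and 2 when they are opposite, so it only constrains direction and ignores feature magnitude.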

If this is right

Structural coherence improves and salient object boundaries are preserved under challenging underwater visibility conditions.
Representation capacity shifts toward low-visibility regions without increasing overall computational cost.
Overfitting in well-observed areas decreases while detail in sparsely observed or hazy areas increases.
Average PSNR, SSIM, and LPIPS improve over state-of-the-art baselines on real-world underwater datasets.
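
For readers unfamiliar with the metrics named above: higher PSNR and SSIM are better, while lower LPIPS is better. PSNR is the simplest of the three and can be sketched directly from its definition, 10·log10(MAX² / MSE). The helper below is illustrative only and works on flat lists of pixel intensities; SSIM and LPIPS are omitted because they need windowed statistics and a learned network respectively.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

With intensities in [0, 1], a uniform error of 0.1 per pixel gives an MSE of 0.01 and therefore a PSNR of about 20 dB.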

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same semantic-augmentation pattern could be tested in other domains with non-uniform image quality, such as foggy terrestrial or low-light indoor scenes.
Adaptive reallocation driven by error might reduce the total number of primitives needed for acceptable fidelity in field photogrammetry.
Combining the semantic priors with additional sensor modalities, such as depth from sonar, could further stabilize reconstruction where optical data is weakest.

Load-bearing premise

CLIP embeddings trained on natural images supply reliable semantic supervision even for underwater scenes whose appearance is altered by scattering, attenuation, and color shifts.

What would settle it

An ablation study on the SeaThru-NeRF or S-UW datasets that removes the semantic consistency loss and measures whether the reported gains in PSNR, SSIM, and LPIPS disappear would test whether the semantic component is necessary.

Figures

Figures reproduced from arXiv: 2509.00800 by Brett Seymour, Guoxi Huang, Haoran Wang, Nantheera Anantrasirichai, Zhuodong Jiang.

**Figure 1.** The semantic prompt generated from the ground truth image and the illustration of the rendering results. From left to right is the ground truth, the …

**Figure 2.** Pipeline of SWAGSplatting. Yellow highlights indicate the proposed contributions: (1) semantic-guided loss …

**Figure 3.** Novel view rendering comparison on the Submerged3D and SeaThru-NeRF datasets. The first row shows results from the IUI-Redsea scene from the …

**Figure 4.** Ablation study results on the performance of modules in terms of PSNR, SSIM, and LPIPS.

Original abstract

Accurate 3D reconstruction in degraded imaging conditions remains a key challenge in photogrammetry and neural rendering. In underwater environments, spatially varying visibility caused by scattering, attenuation, and sparse observations leads to highly non-uniform information quality. Existing 3D Gaussian Splatting (3DGS) methods typically optimize primitives based on photometric signals alone, resulting in imbalanced representation, with overfitting in well-observed regions and insufficient reconstruction in degraded areas. In this paper, we propose SWAGSplatting (Semantic-guided Water-scene Augmented Gaussian Splatting), a multimodal framework that integrates semantic priors into 3DGS for robust, high-fidelity underwater reconstruction. Each Gaussian primitive is augmented with a learnable semantic feature, supervised by CLIP-based embeddings derived from region-level cues. A semantic consistency loss is introduced to align geometric reconstruction with high-level semantics, improving structural coherence and preserving salient object boundaries under challenging conditions. Furthermore, we propose an adaptive Gaussian primitive reallocation strategy that redistributes representation capacity based on both primitive importance and reconstruction error, mitigating the imbalance introduced by conventional densification. This enables more effective modeling of low-visibility regions without increasing computational cost. Extensive experiments on real-world datasets, including SeaThru-NeRF, Submerged3D, and S-UW, demonstrate that the proposed method consistently outperforms state-of-the-art approaches in terms of average PSNR, SSIM, and LPIPS. The results validate the effectiveness of integrating semantic priors for high-fidelity underwater scene reconstruction. Code is available at https://github.com/theflash987/SWAGSplatting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This adds per-Gaussian semantic features supervised by CLIP plus error-driven reallocation to 3DGS for underwater scenes, with reported gains on real datasets but a clear risk that the CLIP signal is mismatched to the domain.

The core move is straightforward: each Gaussian gets an extra learnable semantic vector, pulled toward CLIP region embeddings via a consistency loss, while an adaptive rule shifts primitives toward high-error areas instead of uniform densification. That combination is not in the 3DGS papers they cite, so the specific recipe is new even if the pieces are familiar from segmentation and adaptive sampling work. They test on SeaThru-NeRF, Submerged3D, and S-UW and say the numbers beat prior methods on PSNR, SSIM, and LPIPS, which is the main empirical claim. Releasing code helps anyone who wants to check the implementation directly. The reallocation step looks like the most immediately reusable part for other low-visibility settings.

The obvious soft spot is the CLIP supervision. CLIP was trained on terrestrial images; underwater patches carry strong color casts, backscatter, and wavelength-dependent loss that shift the visual statistics. Nothing in the abstract or described method indicates domain adaptation or an underwater-tuned embedding, so the semantic loss could be injecting noisy or misaligned targets rather than reliable high-level structure. If the gains survive when that loss is ablated or replaced with a weaker signal, the story changes. The abstract also does not break out whether the semantic term improves geometry or mainly cleans up appearance, which matters for the underwater use case.

This is aimed at people already working on neural rendering or photogrammetry for marine robotics and ocean mapping. A reader who needs better handling of non-uniform visibility in 3DGS will find the reallocation idea worth trying even if the CLIP piece needs work. The paper is coherent enough on its own terms and ships code, so it deserves a serious referee rather than a desk reject. I would send it out but ask the reviewers to press on the domain gap and the ablation controls for the semantic loss.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SWAGSplatting, an extension of 3D Gaussian Splatting for underwater scenes. Each Gaussian primitive is augmented with a learnable semantic feature supervised by CLIP embeddings from region-level cues; a semantic consistency loss aligns geometric reconstruction with high-level semantics to improve coherence in low-visibility areas; an adaptive reallocation strategy redistributes primitives according to importance and error. Experiments on SeaThru-NeRF, Submerged3D, and S-UW report consistent gains in average PSNR, SSIM, and LPIPS over prior methods, with code released.

Significance. If the reported gains hold under proper controls, the work offers a practical route to mitigate non-uniform reconstruction quality in scattering media by injecting semantic priors, which could benefit downstream tasks such as underwater mapping and inspection. The explicit code release and multi-dataset evaluation are positive for reproducibility and generalizability assessment.

major comments (2)

[§3.2] Semantic consistency loss. The loss is defined directly on CLIP embeddings extracted from underwater region patches with no domain adaptation, fine-tuning, or underwater-specific variant. Because CLIP was trained on terrestrial natural-image distributions, the embeddings are subject to strong distribution shift from attenuation, backscatter, and color cast; the manuscript must demonstrate that these embeddings remain semantically meaningful rather than noisy or misaligned, for example via qualitative embedding visualization or an ablation that replaces CLIP with random features.
[§4.3] Adaptive reallocation. The strategy is presented as redistributing representation capacity based on primitive importance and reconstruction error, yet the precise definition of the importance score, the reallocation rule, and its interaction with the semantic loss are not fully specified. Without these details it is impossible to determine whether the reported improvements in degraded regions are driven by the semantic term, the reallocation, or their combination.
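
The random-feature control requested in the first comment could be set up along the following lines. This is a minimal sketch, assuming the CLIP targets are available as a list of equal-length vectors; both function names are hypothetical and nothing here is taken from the released code.

```python
import math
import random


def random_unit_embeddings(num_regions, dim, seed=0):
    """Random unit-norm vectors with the same shape as the CLIP targets."""
    rng = random.Random(seed)  # seeded so the control run is reproducible
    embeddings = []
    for _ in range(num_regions):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        embeddings.append([x / norm for x in v])
    return embeddings


def make_control_targets(clip_embeddings, seed=0):
    """Build random targets matching the count and dimension of the CLIP ones."""
    if not clip_embeddings:
        raise ValueError("need at least one CLIP embedding to match")
    dim = len(clip_embeddings[0])
    return random_unit_embeddings(len(clip_embeddings), dim, seed)
```

Training once with the real targets and once with `make_control_targets(...)`, with everything else held fixed, would indicate whether the gains depend on the semantic content of the embeddings or merely on having an extra regularizing term.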

minor comments (2)

[Table 2] The per-scene metric tables would benefit from reporting the number of Gaussians or total compute at convergence to confirm that gains are not simply the result of increased primitive count.
[§5.1] The reallocation described as 'parameter-free' appears to depend on two tunable hyperparameters (semantic loss weight and feature dimension); clarify or remove the phrasing.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and provide additional validation where needed.

Referee: [§3.2] Semantic consistency loss. The loss is defined directly on CLIP embeddings extracted from underwater region patches with no domain adaptation, fine-tuning, or underwater-specific variant. Because CLIP was trained on terrestrial natural-image distributions, the embeddings are subject to strong distribution shift from attenuation, backscatter, and color cast; the manuscript must demonstrate that these embeddings remain semantically meaningful rather than noisy or misaligned, for example via qualitative embedding visualization or an ablation that replaces CLIP with random features.

Authors: We acknowledge the referee's concern regarding the potential domain shift affecting CLIP embeddings in underwater conditions. While the consistent performance gains across multiple datasets indicate that the semantic features provide useful guidance beyond pure photometry, we agree that explicit validation would strengthen the presentation. In the revised manuscript we will add qualitative visualizations of the CLIP-derived embeddings on underwater patches together with an ablation that substitutes random features for CLIP embeddings, thereby quantifying their contribution and confirming semantic relevance despite the distribution shift.
Revision: yes

Referee: [§4.3] Adaptive reallocation. The strategy is presented as redistributing representation capacity based on primitive importance and reconstruction error, yet the precise definition of the importance score, the reallocation rule, and its interaction with the semantic loss are not fully specified. Without these details it is impossible to determine whether the reported improvements in degraded regions are driven by the semantic term, the reallocation, or their combination.

Authors: We appreciate the referee highlighting the need for greater precision in describing the adaptive reallocation. The importance score is computed as a weighted sum of each primitive's contribution to the semantic consistency loss and its photometric reconstruction error. Reallocation proceeds by pruning low-importance primitives and densifying high-error regions, with the semantic term biasing allocation toward semantically salient structures in low-visibility areas. We will revise §4.3 and add an appendix containing the exact formulas, pseudocode for the reallocation procedure, and an explicit discussion of its interplay with the semantic loss to clarify the drivers of the observed improvements.
Revision: yes
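
The procedure described in the simulated response above can be sketched as follows. This follows the rebuttal's wording (a weighted-sum importance score, pruning of low-importance primitives, densification where error is high) and is not taken from the paper or its code. The `Primitive` fields, the equal default weights, and the fixed-budget rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    ident: int
    semantic_contribution: float  # share of the semantic consistency loss
    photometric_error: float      # accumulated reconstruction error


def importance(p, w_semantic=0.5, w_photometric=0.5):
    """Weighted sum of the two terms; the weights are illustrative."""
    return w_semantic * p.semantic_contribution + w_photometric * p.photometric_error


def reallocate(primitives, num_moves, w_semantic=0.5, w_photometric=0.5):
    """Prune the num_moves least important primitives and clone the
    num_moves highest-error survivors, keeping the total count fixed."""
    if num_moves < 0 or 2 * num_moves > len(primitives):
        raise ValueError("num_moves must be between 0 and half the primitive count")
    ranked = sorted(primitives, key=lambda p: importance(p, w_semantic, w_photometric))
    survivors = ranked[num_moves:]  # drop the lowest-importance primitives
    donors = sorted(survivors, key=lambda p: p.photometric_error, reverse=True)[:num_moves]
    next_id = max((p.ident for p in primitives), default=-1) + 1
    clones = [
        Primitive(next_id + i, d.semantic_contribution, d.photometric_error)
        for i, d in enumerate(donors)
    ]
    return survivors + clones
```

Because every pruned primitive is replaced by exactly one clone, the primitive count stays constant, which is one way to read the claim that capacity is redistributed without added cost. A real implementation would also perturb or split the cloned Gaussians' positions and scales; that step is left out here.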

Circularity Check

0 steps flagged

No circularity: empirical method validated on external benchmarks

The paper introduces SWAGSplatting as a multimodal extension to 3DGS, adding learnable semantic features supervised by CLIP embeddings and a semantic consistency loss plus adaptive reallocation. These are presented as design choices with associated hyperparameters, not as derived predictions. Performance is reported via direct comparison of PSNR/SSIM/LPIPS on held-out real-world datasets (SeaThru-NeRF, Submerged3D, S-UW) against prior methods. No equation or claim reduces by construction to a fitted parameter renamed as output, no self-citation chain supplies a uniqueness theorem, and no ansatz is smuggled via prior work. The derivation chain is therefore self-contained: new components are motivated, implemented, and evaluated independently of the reported gains.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the effectiveness of CLIP embeddings as semantic priors for underwater data and on the assumption that redistributing Gaussian primitives according to reconstruction error improves coverage without introducing new artifacts. No explicit free parameters are named in the abstract, but typical loss-balancing weights and feature dimensionality are implicit.

free parameters (2)

semantic consistency loss weight
Controls the strength of alignment between geometric primitives and CLIP embeddings; must be chosen to balance photometric and semantic objectives.
semantic feature dimension
Dimensionality of the learnable vector attached to each Gaussian; chosen to match CLIP embedding size or a reduced projection.
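
To make the role of these two parameters concrete, the sketch below shows one plausible way they would enter training: the weight scales the semantic term in a combined objective, and the dimension fixes the size of each per-Gaussian feature. The class name, the additive form of the objective, and both default values are placeholders rather than values from the paper; 512 is simply the embedding size of CLIP ViT-B/32, and the variant the authors use is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticHyperparams:
    semantic_loss_weight: float = 0.1  # placeholder value, not from the paper
    semantic_feature_dim: int = 512    # placeholder; CLIP ViT-B/32 embeddings are 512-d

    def __post_init__(self):
        if self.semantic_loss_weight < 0.0:
            raise ValueError("semantic_loss_weight must be non-negative")
        if self.semantic_feature_dim <= 0:
            raise ValueError("semantic_feature_dim must be positive")


def total_loss(photometric_loss, semantic_loss, params):
    """Assumed additive objective: photometric term plus weighted semantic term."""
    return photometric_loss + params.semantic_loss_weight * semantic_loss
```

Setting `semantic_loss_weight` to 0 recovers a purely photometric objective, which is the configuration an ablation of the semantic term would compare against.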

axioms (2)

domain assumption: CLIP embeddings derived from region-level cues remain semantically meaningful when applied to underwater images whose color and contrast statistics differ from CLIP's training distribution.
Invoked when the semantic feature is supervised by CLIP-based embeddings.
domain assumption: Standard 3D Gaussian Splatting densification and pruning rules can be replaced by an importance-plus-error reallocation without breaking the underlying rendering pipeline.
Invoked by the adaptive Gaussian primitive reallocation strategy.

invented entities (1)

learnable semantic feature per Gaussian primitive (no independent evidence)
purpose: To carry high-level semantic information that guides reconstruction in low-visibility regions.
New per-primitive attribute introduced to enable semantic supervision.

pith-pipeline@v0.9.0 · 5829 in / 1690 out tokens · 35634 ms · 2026-05-18T19:55:01.559484+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Each Gaussian primitive is augmented with a learnable semantic feature, supervised by CLIP-based embeddings... A semantic consistency loss is introduced to align geometric reconstruction with high-level semantics (Eq. 6)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 2 internal anchors

[1] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset. arXiv preprint arXiv:2505.09568 (2025).
[2] Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, and Nantheera Anantrasirichai. RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction. arXiv preprint arXiv:2505.15737 (2025).
[3] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 42, 4 (2023), 139–1.
[4] WaterSplatting: Fast Underwater 3D Scene Reconstruction using Gaussian Splatting. 3DV (2025).
[5] Shaohua Liu, Junzhe Lu, Zuoya Gu, Jiajun Li, and Yue Deng. Aquatic-GS: A Hybrid 3D Representation for Underwater Scenes. arXiv preprint arXiv:2411.00239 (2024).
[6] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 65, 1 (2021), 99–106.
[7] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–15.
[8] Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv preprint arXiv:2401.14159 (2024).
