Semantic-guided Gaussian Splatting for High-Fidelity Underwater Scene Reconstruction
Pith reviewed 2026-05-18 19:55 UTC · model grok-4.3
The pith
Augmenting 3D Gaussians with CLIP semantic features and adaptive reallocation improves underwater reconstruction where photometric signals alone fall short.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Each Gaussian primitive is augmented with a learnable semantic feature supervised by CLIP-based region embeddings. A semantic consistency loss aligns the geometric reconstruction with these high-level semantics, while an adaptive reallocation strategy redistributes representation capacity based on primitive importance and error to mitigate imbalance from conventional densification. The result is improved structural coherence, preserved object boundaries, and more effective modeling of low-visibility regions in underwater environments.
What carries the argument
Learnable semantic features attached to each Gaussian primitive and aligned through a semantic consistency loss to CLIP-derived region embeddings, paired with an adaptive primitive reallocation mechanism driven by reconstruction error.
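As a rough illustration of the mechanism, the sketch below computes one plausible form of a semantic consistency loss: the mean cosine distance between semantic features rendered at each pixel and the CLIP-style embedding of the region that pixel belongs to. The function names, the list-based data layout, and the choice of cosine distance are assumptions made for illustration; the paper's own loss (its Eq. 6) is not reproduced on this page and may differ.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity of two equal-length vectors (eps guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def semantic_consistency_loss(rendered_feats, region_ids, region_embeds):
    """Mean cosine distance between rendered features and region embeddings.

    rendered_feats: list of per-pixel feature vectors (hypothetical layout).
    region_ids:     region index for each pixel.
    region_embeds:  one CLIP-style embedding per region (hypothetical).
    """
    if not rendered_feats or len(rendered_feats) != len(region_ids):
        raise ValueError("need one region id per rendered feature")
    total = 0.0
    for feat, rid in zip(rendered_feats, region_ids):
        # Distance is 0 when aligned, 1 when orthogonal, 2 when opposite.
        total += 1.0 - cosine_similarity(feat, region_embeds[rid])
    return total / len(rendered_feats)
```

Plain Python lists keep the sketch dependency-free; a real training loop would compute the same quantity with differentiable tensor operations so that gradients reach the per-Gaussian features.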
If this is right
- Structural coherence improves and salient object boundaries are preserved under challenging underwater visibility conditions.
- Representation capacity shifts toward low-visibility regions without increasing overall computational cost.
- Overfitting in well-observed areas decreases while detail in sparsely observed or hazy areas increases.
- Average PSNR, SSIM, and LPIPS improve over state-of-the-art baselines on real-world underwater datasets.
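For readers less familiar with the metrics, PSNR is the simplest of the three: it is 10·log10(MAX² / MSE) between the reference and rendered images, so higher is better (as for SSIM), whereas lower is better for LPIPS. The sketch below uses flattened pixel lists scaled to [0, 1]; the function name and data layout are illustrative assumptions, not the paper's evaluation code.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

SSIM and LPIPS are not sketched here; LPIPS in particular depends on a pretrained network.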
Where Pith is reading between the lines
- The same semantic-augmentation pattern could be tested in other domains with non-uniform image quality, such as foggy terrestrial or low-light indoor scenes.
- Adaptive reallocation driven by error might reduce the total number of primitives needed for acceptable fidelity in field photogrammetry.
- Combining the semantic priors with additional sensor modalities, such as depth from sonar, could further stabilize reconstruction where optical data is weakest.
Load-bearing premise
CLIP embeddings trained on natural images supply reliable semantic supervision even for underwater scenes whose appearance is altered by scattering, attenuation, and color shifts.
What would settle it
An ablation on the SeaThru-NeRF or S-UW datasets that removes the semantic consistency loss would settle it: if the reported gains in PSNR, SSIM, and LPIPS disappear, the semantic component is necessary; if they persist, it is not.
Original abstract
Accurate 3D reconstruction in degraded imaging conditions remains a key challenge in photogrammetry and neural rendering. In underwater environments, spatially varying visibility caused by scattering, attenuation, and sparse observations leads to highly non-uniform information quality. Existing 3D Gaussian Splatting (3DGS) methods typically optimize primitives based on photometric signals alone, resulting in imbalanced representation, with overfitting in well-observed regions and insufficient reconstruction in degraded areas. In this paper, we propose SWAGSplatting (Semantic-guided Water-scene Augmented Gaussian Splatting), a multimodal framework that integrates semantic priors into 3DGS for robust, high-fidelity underwater reconstruction. Each Gaussian primitive is augmented with a learnable semantic feature, supervised by CLIP-based embeddings derived from region-level cues. A semantic consistency loss is introduced to align geometric reconstruction with high-level semantics, improving structural coherence and preserving salient object boundaries under challenging conditions. Furthermore, we propose an adaptive Gaussian primitive reallocation strategy that redistributes representation capacity based on both primitive importance and reconstruction error, mitigating the imbalance introduced by conventional densification. This enables more effective modeling of low-visibility regions without increasing computational cost. Extensive experiments on real-world datasets, including SeaThru-NeRF, Submerged3D, and S-UW, demonstrate that the proposed method consistently outperforms state-of-the-art approaches in terms of average PSNR, SSIM, and LPIPS. The results validate the effectiveness of integrating semantic priors for high-fidelity underwater scene reconstruction. Code is available at https://github.com/theflash987/SWAGSplatting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SWAGSplatting, an extension of 3D Gaussian Splatting for underwater scenes. Each Gaussian primitive is augmented with a learnable semantic feature supervised by CLIP embeddings from region-level cues; a semantic consistency loss aligns geometric reconstruction with high-level semantics to improve coherence in low-visibility areas; an adaptive reallocation strategy redistributes primitives according to importance and error. Experiments on SeaThru-NeRF, Submerged3D, and S-UW report consistent gains in average PSNR, SSIM, and LPIPS over prior methods, with code released.
Significance. If the reported gains hold under proper controls, the work offers a practical route to mitigate non-uniform reconstruction quality in scattering media by injecting semantic priors, which could benefit downstream tasks such as underwater mapping and inspection. The explicit code release and multi-dataset evaluation are positive for reproducibility and generalizability assessment.
major comments (2)
- [§3.2] §3.2 (Semantic consistency loss): The loss is defined directly on CLIP embeddings extracted from underwater region patches with no domain adaptation, fine-tuning, or underwater-specific variant. Because CLIP was trained on terrestrial natural-image distributions, the embeddings are subject to strong distribution shift from attenuation, backscatter, and color cast; the manuscript must demonstrate that these embeddings remain semantically meaningful rather than noisy or misaligned, for example via qualitative embedding visualization or an ablation that replaces CLIP with random features.
- [§4.3] §4.3 (Adaptive reallocation): The strategy is presented as redistributing representation capacity based on primitive importance and reconstruction error, yet the precise definition of the importance score, the reallocation rule, and its interaction with the semantic loss are not fully specified. Without these details it is impossible to determine whether the reported improvements in degraded regions are driven by the semantic term, the reallocation, or their combination.
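One way to approximate the check requested in the first major comment is a separation score over labeled underwater patches: the mean cosine similarity between patches with the same label minus the mean similarity between patches with different labels. Embeddings that carry semantic structure should score clearly above zero, while random features should score near zero. The function name, the labels, and the scoring rule below are assumptions for illustration; this is a sketch of a diagnostic, not an experiment from the paper.

```python
import math
from itertools import combinations

def _cosine(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def semantic_separation(embeddings, labels):
    """Mean same-label similarity minus mean different-label similarity."""
    same, different = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = _cosine(embeddings[i], embeddings[j])
        (same if labels[i] == labels[j] else different).append(sim)
    if not same or not different:
        raise ValueError("need both same-label and different-label pairs")
    return sum(same) / len(same) - sum(different) / len(different)
```

Running the same function on CLIP embeddings and on random vectors of equal dimension, for the same patches and labels, would give a direct comparison of the kind the referee asks for.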
minor comments (2)
- [Table 2] Table 2: the per-scene metric tables would benefit from reporting the number of Gaussians or total compute at convergence to confirm that gains are not simply the result of increased primitive count.
- [§5.1] §5.1: the reallocation is described as 'parameter-free', yet it appears to depend on two tunable hyperparameters (the semantic loss weight and the semantic feature dimension); clarify or remove the phrasing.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and provide additional validation where needed.
- Referee: [§3.2] §3.2 (Semantic consistency loss): The loss is defined directly on CLIP embeddings extracted from underwater region patches with no domain adaptation, fine-tuning, or underwater-specific variant. Because CLIP was trained on terrestrial natural-image distributions, the embeddings are subject to strong distribution shift from attenuation, backscatter, and color cast; the manuscript must demonstrate that these embeddings remain semantically meaningful rather than noisy or misaligned, for example via qualitative embedding visualization or an ablation that replaces CLIP with random features.
Authors: We acknowledge the referee's concern regarding the potential domain shift affecting CLIP embeddings in underwater conditions. While the consistent performance gains across multiple datasets indicate that the semantic features provide useful guidance beyond pure photometry, we agree that explicit validation would strengthen the presentation. In the revised manuscript we will add qualitative visualizations of the CLIP-derived embeddings on underwater patches together with an ablation that substitutes random features for CLIP embeddings, thereby quantifying their contribution and confirming semantic relevance despite the distribution shift. revision: yes
- Referee: [§4.3] §4.3 (Adaptive reallocation): The strategy is presented as redistributing representation capacity based on primitive importance and reconstruction error, yet the precise definition of the importance score, the reallocation rule, and its interaction with the semantic loss are not fully specified. Without these details it is impossible to determine whether the reported improvements in degraded regions are driven by the semantic term, the reallocation, or their combination.
Authors: We appreciate the referee highlighting the need for greater precision in describing the adaptive reallocation. The importance score is computed as a weighted sum of each primitive's contribution to the semantic consistency loss and its photometric reconstruction error. Reallocation proceeds by pruning low-importance primitives and densifying high-error regions, with the semantic term biasing allocation toward semantically salient structures in low-visibility areas. We will revise §4.3 and add an appendix containing the exact formulas, pseudocode for the reallocation procedure, and an explicit discussion of its interplay with the semantic loss to clarify the drivers of the observed improvements. revision: yes
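The rebuttal's verbal description of reallocation can be sketched as follows, under stated assumptions. Importance is a weighted sum of a primitive's semantic-loss contribution and its photometric error; the k least important primitives are pruned, and the k surviving primitives with the highest error are cloned, so the total count is unchanged. The weights, the budget k, and the clone-the-highest-error rule are hypothetical choices for illustration; the exact formulas are promised for the revised appendix and are not available here.

```python
def importance_scores(semantic_contrib, photometric_error, w_sem=0.5, w_photo=0.5):
    """Weighted sum per primitive; both weights are illustrative assumptions."""
    return [w_sem * s + w_photo * e
            for s, e in zip(semantic_contrib, photometric_error)]

def reallocate(semantic_contrib, photometric_error, k, w_sem=0.5, w_photo=0.5):
    """Return (pruned, cloned) index lists, each of length k.

    Pruning k primitives and cloning k others keeps the count fixed.
    """
    n = len(photometric_error)
    if len(semantic_contrib) != n or not 0 <= 2 * k <= n:
        raise ValueError("need matching inputs and 0 <= 2k <= n")
    scores = importance_scores(semantic_contrib, photometric_error, w_sem, w_photo)
    by_importance = sorted(range(n), key=lambda i: scores[i])
    pruned = by_importance[:k]        # least important primitives
    survivors = by_importance[k:]
    # Densify where reconstruction error is highest among the survivors.
    cloned = sorted(survivors, key=lambda i: photometric_error[i], reverse=True)[:k]
    return pruned, cloned
```

Because the semantic term raises the importance of primitives in semantically salient regions, such primitives are less likely to be pruned even when their photometric error is modest, which matches the bias the authors describe.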
Circularity Check
No circularity: empirical method validated on external benchmarks
The paper introduces SWAGSplatting as a multimodal extension to 3DGS, adding learnable semantic features supervised by CLIP embeddings and a semantic consistency loss plus adaptive reallocation. These are presented as design choices with associated hyperparameters, not as derived predictions. Performance is reported via direct comparison of PSNR/SSIM/LPIPS on held-out real-world datasets (SeaThru-NeRF, Submerged3D, S-UW) against prior methods. No equation or claim reduces by construction to a fitted parameter renamed as output, no self-citation chain supplies a uniqueness theorem, and no ansatz is smuggled via prior work. The derivation chain is therefore self-contained: new components are motivated, implemented, and evaluated independently of the reported gains.
Axiom & Free-Parameter Ledger
free parameters (2)
- semantic consistency loss weight
- semantic feature dimension
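A hedged way to read these two parameters, assuming the semantic term is simply added to the photometric objective (the paper's exact formulation is not reproduced on this page): the weight scales the semantic term, and the dimension fixes the size of the feature attached to each Gaussian. The symbols below are illustrative notation, not the paper's.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{photo}} + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}},
\qquad \mathbf{f}_i \in \mathbb{R}^{d}
```

Here \lambda_{\text{sem}} stands for the semantic consistency loss weight and d for the semantic feature dimension of the i-th primitive's feature \mathbf{f}_i.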
axioms (2)
- domain assumption: CLIP embeddings derived from region-level cues remain semantically meaningful when applied to underwater images whose color and contrast statistics differ from CLIP's training distribution.
- domain assumption: Standard 3D Gaussian Splatting densification and pruning rules can be replaced by an importance-plus-error reallocation without breaking the underlying rendering pipeline.
invented entities (1)
- learnable semantic feature per Gaussian primitive: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Each Gaussian primitive is augmented with a learnable semantic feature, supervised by CLIP-based embeddings... A semantic consistency loss is introduced to align geometric reconstruction with high-level semantics (Eq. 6)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.