E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting
Pith reviewed 2026-05-10 06:33 UTC · model grok-4.3
The pith
E2E-WAVE embeds semantic similarity into physical-layer waveforms to deliver real-time underwater video where bit errors would otherwise destroy conventional transmissions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By jointly optimizing tokenization, waveform mapping, and OFDM transmission in a single differentiable pipeline, E2E-WAVE forces the physical-layer signal to carry semantic proximity: when decoding errors occur, the receiver naturally selects the token whose waveform is closest in the learned space, converting random bit corruption into graceful semantic degradation.
What carries the argument
A trainable waveform bank that maps semantic video tokens to differentiable OFDM signals, optimized end-to-end so that Euclidean distance in waveform space aligns with semantic similarity.
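The mechanism can be illustrated with a toy minimum-distance decoder. The bank values, dimensions, and token set below are hypothetical (the paper's bank is learned end-to-end); the sketch only shows why a semantically aligned bank turns noise into neighbor-token errors.

```python
import math

# Toy "waveform bank": each semantic token maps to a fixed waveform vector.
# Values and dimensions are illustrative, not from the paper.
WAVEFORM_BANK = {
    0: [1.0, 1.0, 0.0, 0.0],   # token 0
    1: [1.0, 0.9, 0.1, 0.0],   # token 1: semantic neighbor of 0 -> nearby waveform
    2: [0.0, 0.0, 1.0, 1.0],   # token 2: semantically distant -> far waveform
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def decode(received):
    """Minimum-distance decoding over the waveform bank."""
    return min(WAVEFORM_BANK, key=lambda tok: euclidean(received, WAVEFORM_BANK[tok]))

# A noisy version of token 0's waveform decodes to token 0 or its semantic
# neighbor (token 1), never to the distant token 2: corruption degrades gracefully.
print(decode([0.9, 1.05, 0.05, -0.1]))
```

If semantic similarity and waveform distance are aligned, the worst case under noise is a semantically adjacent token rather than an arbitrary one.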
If this is right
- Real-time 16 FPS video at 128x128 resolution becomes feasible over 2.3 kbps channels that defeat conventional digital modulation.
- Performance advantage over FEC-protected baselines widens as channel conditions worsen.
- A single trained model works across different underwater environments without channel-specific retraining.
- Video codecs that assume low bit-error rates or AWGN statistics become unnecessary in this regime.
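A back-of-the-envelope check makes the first bullet concrete. Only the quoted 16 FPS, 128x128 resolution, and 2.3 kbps rate come from the paper; the rest is arithmetic showing how tight the per-frame bit budget is, and why sub-5 kbps rates leave conventional codecs no room for headers and motion data.

```python
# Assumed: 8-bit RGB frames; the paper's actual pixel format is not stated here.
raw_bps = 128 * 128 * 3 * 8 * 16   # uncompressed video rate in bits/s (~6.3 Mbps)
channel_bps = 2300                  # reported acoustic channel rate
bits_per_frame = channel_bps / 16   # budget available per frame at 16 FPS

print(f"raw rate:           {raw_bps / 1e6:.2f} Mbps")
print(f"needed compression: {raw_bps / channel_bps:.0f}x")
print(f"bit budget/frame:   {bits_per_frame:.1f} bits")
```

Roughly 144 bits per frame must carry everything, which is why token-level semantic compression rather than per-pixel coding is the operating regime.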
Where Pith is reading between the lines
- The same semantic-waveform principle could be tested on other high-error-rate links such as satellite or deep-space channels.
- Physical-layer semantic coding might reduce the total amount of error-correction overhead required at higher protocol layers.
- Extending the approach to multi-user multicasting would require verifying whether the learned waveform bank remains stable when multiple receivers decode simultaneously.
Load-bearing premise
That training waveforms on a single underwater channel will produce signals whose semantic error behavior generalizes to other unseen underwater environments without retraining.
What would settle it
Train the model on the NOF1 channel and then measure PSNR and SSIM on the BCH1 or NCS1 channels; if the reported gains over the FEC baseline disappear or reverse, the generalization claim is false.
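PSNR, the primary metric in this falsification test, is simple to state. A pure-Python sketch for 8-bit pixel values follows (real evaluations would use a vetted implementation, and SSIM requires considerably more machinery); note that the reported +5 dB gap corresponds to roughly a 3.16x reduction in mean squared error.

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(ref) != len(test):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)

# Tiny example: three pixels with small errors.
print(psnr([100, 120, 140], [101, 119, 142]))
```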
Original abstract
We present E2E-WAVE, the first end-to-end learned waveform generation system for underwater video multicasting. Acoustic channels exhibit 20--46% bit error rates where forward error correction becomes counterproductive -- LDPC increases rather than decreases errors beyond its decoding threshold. E2E-WAVE addresses this by embedding semantic similarity directly into physical layer waveforms: when decoding errors are unavoidable, the system preferentially selects semantically similar tokens rather than arbitrary corruption. Combining VideoGPT tokenization (1024x compression) with a trainable waveform bank and fully differentiable OFDM transmission, E2E-WAVE achieves +5 dB (19.26%) PSNR and +0.10 (14.28%) SSIM over the strongest FEC-protected baseline in the less challenging underwater channel (NOF1) while delivering real-time 16 FPS video at 128x128 resolution over 2.3 kbps channels -- impossible for conventional digital modulation. The performance gap only increases in harsher channels (BCH1, NCS1). Trained on a single channel, E2E-WAVE generalizes to unseen underwater environments without retraining, while HEVC fails at sub-5 kbps rates and SoftCast's AWGN assumptions collapse on frequency-selective channels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces E2E-WAVE, the first end-to-end learned waveform generation system for underwater video multicasting. It tokenizes video via VideoGPT (1024x compression), uses a trainable waveform bank, and employs fully differentiable OFDM to embed semantic similarity into physical-layer signals. This allows the decoder to map unavoidable channel errors (20-46% BER) to semantically similar tokens rather than arbitrary corruption. The system reports +5 dB PSNR (19.26%) and +0.10 SSIM (14.28%) gains over the strongest FEC-protected baseline in the NOF1 channel, supports real-time 16 FPS at 128x128 resolution over 2.3 kbps, generalizes to unseen channels (BCH1, NCS1) without retraining, and outperforms HEVC and SoftCast under frequency-selective underwater conditions.
Significance. If the central mechanism holds, the work would be significant for acoustic communications where conventional FEC fails and semantic-aware physical-layer design could enable reliable low-rate video. The end-to-end differentiability, real-time low-bitrate operation, and reported cross-channel generalization are notable strengths. However, the absence of direct evidence for the semantic-preference claim limits the ability to attribute gains specifically to the proposed mechanism rather than other implementation factors.
Major comments (1)
- The core claim that the trainable waveform bank causes the decoder to preferentially map errors to semantically similar VideoGPT tokens (rather than arbitrary corruption) is load-bearing for the contribution, yet the evaluation provides only aggregate PSNR/SSIM and BER. No token-level error pattern analysis, embedding-space distance measurements between erroneous and correct tokens, or ablation removing the semantic loss term is reported. This leaves open whether the +5 dB gain in NOF1 (and larger gaps in BCH1/NCS1) arises from the semantic embedding or from other factors such as improved synchronization or effective BER reduction under the simulated channels.
Minor comments (2)
- The abstract and results lack explicit details on data splits, number of training/test videos, error bars, or statistical significance testing for the reported PSNR/SSIM deltas.
- Notation for the waveform bank parameters and the semantic loss term should be introduced earlier and used consistently when describing the end-to-end training pipeline.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of E2E-WAVE's significance for underwater acoustic communications and for the constructive feedback on strengthening the evidence for the core mechanism. We address the major comment below and will incorporate the requested analyses in the revision.
Point-by-point responses
- Referee: The core claim that the trainable waveform bank causes the decoder to preferentially map errors to semantically similar VideoGPT tokens (rather than arbitrary corruption) is load-bearing for the contribution, yet the evaluation provides only aggregate PSNR/SSIM and BER. No token-level error pattern analysis, embedding-space distance measurements between erroneous and correct tokens, or ablation removing the semantic loss term is reported. This leaves open whether the +5 dB gain in NOF1 (and larger gaps in BCH1/NCS1) arises from the semantic embedding or from other factors such as improved synchronization or effective BER reduction under the simulated channels.
  Authors: We agree that the current evaluation relies on aggregate metrics and does not include the direct token-level or ablation analyses needed to isolate the semantic embedding contribution. In the revised manuscript we will add: (1) token-level error pattern analysis showing the distribution of decoded VideoGPT tokens under channel errors and their semantic similarity to ground truth; (2) embedding-space distance measurements (cosine similarity and Euclidean distance in the VideoGPT latent space) between correct and erroneous tokens, compared against random corruption; and (3) an ablation removing the semantic loss term to quantify its effect on PSNR/SSIM gains. These will appear in a dedicated subsection of the experiments. We will also clarify that synchronization is handled identically to the baselines and that post-decoding BER values are comparable, so the gains are unlikely to stem from those factors. The observed generalization to unseen channels (BCH1, NCS1) without retraining provides supporting evidence that the learned waveforms capture semantic robustness rather than channel-specific adaptations.
  Revision: yes
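The promised embedding-distance analysis could look like the following sketch. The codebook here is a hypothetical random one (not VideoGPT's latent space): if decoding errors are semantic, decoded tokens should be measurably closer to ground truth under cosine similarity than randomly corrupted tokens are.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
# Hypothetical codebook: 64 tokens with 8-dim embeddings (VideoGPT's is larger).
codebook = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]

def nearest_neighbor(c):
    """Most similar other token -- a stand-in for a 'semantic' decoding error."""
    return max((d for d in range(64) if d != c),
               key=lambda d: cosine(codebook[c], codebook[d]))

def mean_similarity(pairs):
    """Average cosine similarity over (correct, decoded) token pairs."""
    return sum(cosine(codebook[c], codebook[d]) for c, d in pairs) / len(pairs)

# Semantic errors pair each token with a near neighbor; random corruption pairs
# it with an arbitrary token. The gap between these averages is the evidence
# the referee asks for.
semantic_pairs = [(c, nearest_neighbor(c)) for c in range(64)]
random_pairs = [(random.randrange(64), random.randrange(64)) for _ in range(200)]
print(round(mean_similarity(semantic_pairs), 3),
      round(mean_similarity(random_pairs), 3))
```

On real decoder outputs, the same comparison (decoded-vs-correct similarity against a random-corruption baseline) would directly test the semantic-preference claim.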
Circularity Check
No significant circularity; empirical end-to-end training results are self-contained
Full rationale
The paper describes an end-to-end differentiable system combining VideoGPT tokenization, a trainable waveform bank, and OFDM modulation, with performance evaluated via direct comparisons to external baselines (FEC-protected systems, HEVC, SoftCast) on simulated channels. No equations or steps reduce by construction to their own inputs, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems from the same authors are invoked to force the architecture. Generalization claims are presented as empirical outcomes of training on one channel model, without tautological redefinition of the target metrics.
Axiom & Free-Parameter Ledger
Free parameters (1)
- trainable waveform bank parameters
Axioms (1)
- Domain assumption: fully differentiable OFDM transmission enables end-to-end gradient-based training.
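This axiom is plausible because OFDM modulation is a fixed linear map of the subcarrier symbols (an inverse DFT), so gradients pass through it exactly. A minimal pure-Python round-trip sketch, omitting the cyclic prefix, pilots, and channel model a real system would add:

```python
import cmath

def idft(symbols):
    """OFDM modulation: inverse DFT mapping subcarrier symbols to time samples."""
    n = len(symbols)
    return [sum(s * cmath.exp(2j * cmath.pi * k * t / n)
                for k, s in enumerate(symbols)) / n
            for t in range(n)]

def dft(samples):
    """OFDM demodulation: forward DFT recovering the subcarrier symbols."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples))
            for k in range(n)]

# QPSK-like symbols on 4 subcarriers -> time-domain OFDM samples -> back.
tx = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
rx = dft(idft(tx))
print(all(abs(a - b) < 1e-9 for a, b in zip(tx, rx)))
```

Because both maps are linear in the symbols, any loss on the received side backpropagates to the waveform bank without approximation; the harder differentiability questions live in the channel model, not the OFDM block.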
Invented entities (1)
- trainable waveform bank (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Sahoo, S. K. Dwivedy, and P. Robi, “Advancements in the field of autonomous underwater vehicle,” Ocean Engineering, vol. 181, pp. 145–160, 2019.
- [2] I. F. Akyildiz, D. Pompili, and T. Melodia, “Underwater acoustic sensor networks: research challenges,” Ad Hoc Networks, vol. 3, no. 3, pp. 257–279, 2005.
- [3] J. Heidemann, M. Stojanovic, and M. Zorzi, “Underwater sensor networks: applications, advances and challenges,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 370, no. 1958, pp. 158–175, 2012.
- [4] ISO Central Secretary, “ISO/IEC 14496-10:2022,” standard, International Organization for Standardization, 2022.
- [5] M. O. Martínez-Rach, H. Migallón, O. López-Granado, V. Galiano, and M. P. Malumbres, “Performance overview of the latest video coding proposals: HEVC, JEM and VVC,” Journal of Imaging, vol. 7, no. 2, p. 39, 2021.
- [6] S. Jakubczak and D. Katabi, “SoftCast: Clean-slate scalable wireless video,” in Proceedings of the 2010 ACM Workshop on Wireless of the Students, by the Students, for the Students, pp. 9–12, 2010.
- [7] Z. Zhang, D. Liu, X. Ma, and X. Wang, “Ecast: An enhanced video transmission design for wireless multicast systems over fading channels,” IEEE Systems Journal, vol. 11, no. 4, pp. 2566–2577, 2015.
- [8] M. Abdi, K. F. Haque, F. Meneghello, J. Ashdown, and F. Restuccia, “PhyDNNs: Bringing deep neural networks to the physical layer,” in IEEE INFOCOM 2025 (IEEE Conference on Computer Communications), pp. 1–10, IEEE, 2025.
- [9] NVIDIA, N. Agarwal, A. Ali, M. Bala, Y. Balaji, et al., “Cosmos world foundation model platform for physical AI,” 2025.
- [10] J. Wang, Y. Jiang, Z. Yuan, B. Peng, Z. Wu, and Y.-G. Jiang, “OmniTokenizer: A joint image-video tokenizer for visual generation,” 2024.
- [11] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “VideoGPT: Video generation using VQ-VAE and transformers,” arXiv preprint arXiv:2104.10157, 2021.
- [12] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, 2021.
- [13] T. J. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
- [14] M. Furqan Ali, N. K. Jayakody, T. D. Ponnimbaduge Perera, K. Srinivasan, A. Sharma, I. Krikidis, et al., “Underwater communications: Recent advances,” ETIC Conference, 2019.
- [15] K. Anjum, Z. Qi, and D. Pompili, “Deep joint source-channel coding for underwater image transmission,” in Proceedings of the 16th International Conference on Underwater Networks & Systems (WUWNet ’22), New York, NY, USA, Association for Computing Machinery, 2022.
- [16] H. Akutsu, A. Suzuki, Z. Zhong, and K. Aizawa, “Ultra low bitrate learned image compression by selective detail decoding,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 524–528, 2020.
- [17] J. Rudander, T. Husøy, P. A. van Walree, and P. Orten, “Experimental evaluation of a real-time FPGA platform for multichannel coherent acoustic communication,” IEEE Journal of Oceanic Engineering, pp. 1–10, 2023.
- [18] P. A. van Walree, F.-X. Socheleau, R. Otnes, and T. Jenserud, “The Watermark benchmark for underwater acoustic modulation schemes,” IEEE Journal of Oceanic Engineering, vol. 42, no. 4, pp. 1007–1018, 2017.
- [19] N. Varghese, A. Kumar, and A. N. Rajagopalan, “DRUVA: Deep-sea robotic underwater video archive dataset for seafloor exploration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12248–12258, 2022. Available at: https://github.com/nishavarghese15/DRUVA
- [20] L. Kezebou, K. Panetta, V. Oludare, and S. Agaian, “UOT100: Comprehensive underwater object tracking benchmark dataset,” IEEE Journal of Oceanic Engineering, vol. 47, no. 1, pp. 59–75, 2023. Available at: https://www.kaggle.com/datasets/landrykezebou/uot100-underwater-object-tracking-dataset
- [21] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human action classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
- [22] Q. Qi, Y. Zhang, F. Tian, Q. J. Wu, K. Li, X. Luan, and D. Song, “Underwater image co-enhancement with correlation feature matching and joint learning,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
- [23] F.-X. Socheleau, A. Pottier, and C. Laot, “Watermark: BCH1 dataset description,” technical report, Institut Mines-Telecom; TELECOM Bretagne, UMR CNRS 6285 Lab-STICC, 2016. HAL Id: hal-01404491. Available at: https://hal.science/hal-01404491v1/file/WATERMARK_BCH1.pdf
- [24] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599–618, 2001.