Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

Amir Hussain; Anis Hamadouche; Haifeng Luo; Mathini Sellathurai; Tharm Ratnarajah

arxiv: 2508.08468 · v5 · submitted 2025-08-11 · 💻 cs.SD · eess.SP


Pith reviewed 2026-05-18 23:01 UTC · model grok-4.3


keywords audio-visual speech enhancement · 5G edge computing · cloud-edge architecture · LSTM fusion network · real-time multimedia · network latency bottlenecks · compression trade-offs · Vodafone AWS Wavelength


The pith

Public 5G edge networks can sustain real-time audio-visual speech enhancement when compute and uplink resources are orchestrated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs and evaluates a complete cloud-edge AVSE system that processes audio with CNNs, extracts facial features via OpenCV, and fuses them through an LSTM network to maintain temporal coherence. It deploys this on a Vodafone-compatible AWS Wavelength edge cloud and stress-tests performance across network loads and adaptive compression profiles. Results establish that edge placement is essential to meet delay bounds for interactive use, while uplink capacity emerges as the primary constraint. Aggressive compression cuts payloads by up to 80 percent with little perceptual loss, and a clear trade-off appears between model complexity, latency, and quality in low-SNR conditions.

Core claim

The central claim is that public 5G edge environments can sustain real-time, interactive AVSE workloads when network and compute resources are carefully orchestrated, although performance margins remain tighter than in dedicated infrastructures. Compute placement at the network edge proves critical for meeting real-time coherence constraints, uplink capacity is the dominant bottleneck, only 5G and wired Ethernet consistently satisfy the communication delay bound for uncompressed chunks, and compression enables robust operation under constrained conditions.
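To make the uplink argument concrete, the sketch below estimates how long one audio-video chunk takes to upload and checks it against a delay bound, with and without compression. The link rates, the 2 MB chunk size, and the 200 ms bound are illustrative assumptions rather than values from the paper; only the "up to 80 percent" payload reduction comes from the abstract. Propagation delay and queueing are ignored.

```python
def upload_time_ms(payload_bytes: int, uplink_mbps: float) -> float:
    """Time to push one chunk through the uplink, ignoring propagation and queueing."""
    return payload_bytes * 8 / (uplink_mbps * 1_000_000) * 1000.0


def compressed_size(payload_bytes: int, reduction: float) -> int:
    """Payload size after compression removes a fraction `reduction` (0..1) of the bytes."""
    return round(payload_bytes * (1.0 - reduction))


def links_meeting_bound(payload_bytes: int, uplinks_mbps: dict, bound_ms: float) -> list:
    """Names of the links whose upload time stays within the delay bound."""
    return [name for name, rate in uplinks_mbps.items()
            if upload_time_ms(payload_bytes, rate) <= bound_ms]


# All numbers below are hypothetical placeholders, not measurements from the paper.
UPLINKS_MBPS = {"4G": 20.0, "5G": 100.0, "Ethernet": 1000.0}
CHUNK_BYTES = 2_000_000   # assumed size of one uncompressed audio-video chunk
DELAY_BOUND_MS = 200.0    # assumed communication delay bound

raw_ok = links_meeting_bound(CHUNK_BYTES, UPLINKS_MBPS, DELAY_BOUND_MS)
small = compressed_size(CHUNK_BYTES, 0.8)  # "up to 80 percent" reduction from the abstract
compressed_ok = links_meeting_bound(small, UPLINKS_MBPS, DELAY_BOUND_MS)
print(raw_ok, compressed_ok)
```

Under these assumed numbers the uncompressed chunk fits the bound only on the two fastest links, while the compressed chunk also fits on the slowest one, which mirrors the qualitative pattern the paper reports.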

What carries the argument

The LSTM fusion network that integrates CNN acoustic enhancement and OpenCV facial features to preserve temporal coherence across the cloud-edge pipeline.
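As a rough illustration of how a recurrent fusion stage carries context from frame to frame, here is a minimal pure-Python LSTM cell that consumes concatenated audio and visual feature vectors. This is a sketch under stated assumptions, not the authors' network: the weights are random and untrained, biases are omitted, fusion by concatenation is an assumption, and `TinyLSTMCell`, `fuse_sequence`, and the feature sizes are hypothetical names and values.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class TinyLSTMCell:
    """Single LSTM cell with random, untrained weights and no biases (illustration only)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        width = input_size + hidden_size

        def matrix():
            return [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                    for _ in range(hidden_size)]

        self.hidden_size = hidden_size
        self.w_in, self.w_forget, self.w_out, self.w_cand = (
            matrix(), matrix(), matrix(), matrix())

    def step(self, x, h, c):
        z = x + h  # list concatenation: current input followed by previous hidden state
        i = [sigmoid(a) for a in matvec(self.w_in, z)]      # input gate
        f = [sigmoid(a) for a in matvec(self.w_forget, z)]  # forget gate
        o = [sigmoid(a) for a in matvec(self.w_out, z)]     # output gate
        g = [math.tanh(a) for a in matvec(self.w_cand, z)]  # candidate cell state
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def fuse_sequence(audio_frames, visual_frames, cell):
    """Run the cell over time-aligned audio and visual feature frames."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for audio, visual in zip(audio_frames, visual_frames):
        h, c = cell.step(audio + visual, h, c)  # early fusion by concatenation (assumed)
        outputs.append(h)
    return outputs


# Hypothetical sizes: 3 audio features and 2 visual features per frame.
cell = TinyLSTMCell(input_size=5, hidden_size=4)
fused = fuse_sequence([[0.1, 0.2, 0.3]] * 3, [[0.5, -0.5]] * 3, cell)
```

Because the hidden and cell states persist across steps, identical inputs at successive frames still produce different outputs, which is the property that lets a recurrent fusion stage track temporal context.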

If this is right

Edge compute placement is required to keep processing and communication delays low enough for temporal coherence.
Uplink capacity must be treated as the primary limit when designing interactive AVSE services.
Reduced model complexity lowers delay at the direct expense of reconstruction quality in noisy conditions.
Compression profiles can be tuned to maintain perceptual quality while enabling operation on constrained links.
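The complexity-versus-latency trade-off in the list above can be sketched as a selection problem: pick the highest-quality model variant whose processing time still fits the delay budget left over after network transfer. The variant names, latencies, quality scores, and the 200 ms budget are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelVariant:
    name: str
    processing_ms: float   # per-chunk inference latency (hypothetical)
    quality_score: float   # higher is better (hypothetical scale)


def pick_variant(variants: Sequence[ModelVariant],
                 total_budget_ms: float,
                 network_ms: float) -> Optional[ModelVariant]:
    """Return the best-quality variant that fits the remaining budget, or None."""
    remaining_ms = total_budget_ms - network_ms
    feasible = [v for v in variants if v.processing_ms <= remaining_ms]
    return max(feasible, key=lambda v: v.quality_score, default=None)


# Hypothetical variants: larger models are slower but enhance better.
VARIANTS = [
    ModelVariant("small", processing_ms=20.0, quality_score=0.60),
    ModelVariant("medium", processing_ms=60.0, quality_score=0.75),
    ModelVariant("large", processing_ms=150.0, quality_score=0.85),
]

fast_link = pick_variant(VARIANTS, total_budget_ms=200.0, network_ms=20.0)
slow_link = pick_variant(VARIANTS, total_budget_ms=200.0, network_ms=100.0)
```

With a fast link the large variant fits; as network delay consumes more of the budget, the selection falls back to a smaller variant, and returns `None` when nothing fits.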

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same orchestration approach could apply to other latency-sensitive multimedia tasks such as live translation or AR overlays.
Dynamic feedback loops that adjust compression in real time based on measured uplink load would be a natural next extension.
Similar edge-cloud splits may prove useful for perceptual enhancement services beyond speech, such as audio-visual source separation.
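The feedback-loop extension mentioned above could look roughly like the following: smooth the measured uplink throughput with an exponentially weighted moving average, then choose the lightest compression profile whose predicted upload time fits the delay bound. This is a hypothetical design, not something the paper implements; the profile names, the 0.5 reduction level, and the smoothing factor are assumptions, and only the 0.8 reduction echoes the abstract.

```python
class UplinkEstimator:
    """Exponentially weighted moving average of measured uplink throughput."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha          # weight given to the newest sample
        self.estimate_mbps = None   # no estimate until the first measurement

    def update(self, payload_bytes: int, transfer_s: float) -> float:
        sample = payload_bytes * 8 / transfer_s / 1_000_000
        if self.estimate_mbps is None:
            self.estimate_mbps = sample
        else:
            self.estimate_mbps = (self.alpha * sample
                                  + (1.0 - self.alpha) * self.estimate_mbps)
        return self.estimate_mbps


# (name, fraction of bytes removed), ordered from lightest to most aggressive.
PROFILES = [("none", 0.0), ("moderate", 0.5), ("aggressive", 0.8)]


def choose_profile(raw_bytes: int, estimate_mbps: float, bound_ms: float,
                   profiles=PROFILES) -> str:
    """Lightest profile predicted to meet the bound; most aggressive as fallback."""
    for name, reduction in profiles:
        bits = raw_bytes * (1.0 - reduction) * 8
        predicted_ms = bits / (estimate_mbps * 1_000_000) * 1000.0
        if predicted_ms <= bound_ms:
            return name
    return profiles[-1][0]
```

A caller would feed each completed upload into `update` and pass the resulting estimate to `choose_profile` before encoding the next chunk. A real controller would also need hysteresis to avoid oscillating between profiles.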

Load-bearing premise

The stress-testing scenarios and chosen compression levels accurately reflect real interactive use cases without unmodeled degradation in temporal coherence under observed network delays.

What would settle it

An end-to-end delay measurement under typical variable-load interactive conversation patterns that exceeds the coherence threshold even with edge placement and 80 percent compression.
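A measurement of this kind could be summarized as below: given per-chunk end-to-end delays, report the mean, sample standard deviation, a nearest-rank 95th percentile, and the fraction of chunks that stay within the threshold. The function names and the example threshold are assumptions for illustration; the paper's actual threshold and data are not reproduced here.

```python
import math
import statistics


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize_delays(samples_ms, bound_ms: float) -> dict:
    """Summary statistics for per-chunk delays against a delay threshold."""
    return {
        "n": len(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),  # sample standard deviation
        "p95_ms": percentile(samples_ms, 95),
        "within_bound": sum(d <= bound_ms for d in samples_ms) / len(samples_ms),
    }


# Made-up delays in milliseconds, used only to exercise the functions.
example = summarize_delays([10.0, 20.0, 30.0, 40.0, 50.0], bound_ms=35.0)
```

Reporting the tail percentile alongside the mean matters here because an interactive service is judged by its worst chunks, not its average ones.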

Figures

Figures reproduced from arXiv: 2508.08468 by Amir Hussain, Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Tharm Ratnarajah.

**Figure 2.** A schematic describes the experienced latency of this service.

**Figure 3.** Network latency measured in real-world environments (round-trip latency for transferring …

**Figure 4.** Audio-video data size versus compression factors.

**Figure 5.** Algorithm processing latency versus input chunk size.

**Figure 6.** Time plot and spectrogram of the input noisy audio and output enhanced audio processed with different input chunk …

**Figure 7.** Time plot and spectrogram of the input noisy audio and output enhanced audio processed with different network …

Original abstract

Real-time audio-visual speech enhancement (AVSE) is a key enabler for immersive and interactive multimedia services, yet its performance is tightly constrained by network latency, uplink capacity, and computational delay. This paper presents the design, deployment, and evaluation of a complete cloud-edge-assisted AVSE system operating over a public 5G edge network. The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence, and is deployed on a Vodafone-compatible AWS Wavelength edge cloud. Through extensive stress testing, we analyze end-to-end performance under varying network load and adaptive multimedia profiles. Results show that compute placement at the network edge is critical for meeting real-time coherence constraints, and that uplink capacity is often the dominant bottleneck for interactive AVSE services. Only 5G and wired Ethernet consistently satisfied the required communication delay bound for uncompressed audio-video chunks, while aggressive compression reduced payload sizes by up to 80% with negligible perceptual degradation, enabling robust operation under constrained conditions. We further demonstrate a fundamental trade-off between processing latency and enhancement quality, where reduced model complexity lowers delay but degrades reconstruction performance in low-SNR scenarios. Our findings indicate that public 5G edge environments can sustain real-time, interactive AVSE workloads when network and compute resources are carefully orchestrated, although performance margins remain tighter than in dedicated infrastructures. The architectural insights derived from this study provide practical guidelines for the design of delay-sensitive multimedia and perceptual enhancement services on emerging 5G edge-cloud platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a practical deployment study of AVSE on one 5G edge testbed that flags real bottlenecks but offers thin evidence for broader claims.


The paper walks through building and testing an audio-visual speech enhancement pipeline on a public 5G edge setup with AWS Wavelength. They use a CNN for audio cleanup, OpenCV for face tracking, and an LSTM to hold timing together, then run stress tests under load with different compression levels. The useful part is the concrete observation that edge placement helps latency while uplink capacity is usually the real limiter, and that heavy compression can cut data size by 80 percent with little quality hit. Only 5G and wired links met the delay target in their runs, and they note the trade-off when you simplify the model to save time at the cost of weaker enhancement in noisy conditions. That kind of end-to-end measurement on actual hardware is the main contribution here, and it gives engineers some workable rules of thumb for similar services.

The soft spots sit in the data presentation and scope. Specific numbers, error bars, and dataset details are missing, which makes it difficult to judge how stable the margins really are. All the results come from a single Vodafone-compatible site, so the idea that careful orchestration will keep things working across variable public 5G conditions rests on limited ground. Public networks differ by carrier, location, and load, and the paper does not show replication that would back the wider claim.

This work is aimed at people who design or deploy delay-sensitive multimedia on edge platforms rather than at researchers looking for new algorithms. Readers who need deployment examples could pick up some practical pointers. I would send it for peer review so referees can press on the experimental details and ask for more cross-site data if the authors want the generalization to stick.

Referee Report

2 major / 2 minor

Summary. The paper presents the design, deployment, and evaluation of a cloud-edge-assisted audio-visual speech enhancement (AVSE) system over a public 5G edge network using a Vodafone-compatible AWS Wavelength platform. It integrates CNN-based acoustic enhancement, OpenCV facial feature extraction, and an LSTM fusion network to maintain temporal coherence, with stress testing under varying network loads and adaptive compression profiles. Key findings include the criticality of edge compute placement, uplink capacity as the dominant bottleneck, satisfaction of delay bounds only by 5G and wired Ethernet for uncompressed streams, up to 80% payload reduction via compression with negligible perceptual loss, and a trade-off between model complexity and enhancement quality in low-SNR conditions. The central claim is that public 5G edge environments can sustain real-time interactive AVSE workloads with careful orchestration, though with tighter margins than dedicated infrastructures.

Significance. If the empirical results hold, the work offers practical architectural guidelines and deployment insights for delay-sensitive perceptual enhancement services on emerging 5G edge-cloud platforms, particularly emphasizing resource orchestration and compression strategies. The stress-testing on a real 5G edge platform is a strength, but the single-provider testbed and absence of detailed quantitative metrics limit the strength of the broader claims.

major comments (2)

[Abstract] Abstract (final paragraph): The claim that 'public 5G edge environments can sustain real-time, interactive AVSE workloads when network and compute resources are carefully orchestrated' is load-bearing for the central contribution, yet rests exclusively on end-to-end measurements from a single Vodafone-compatible AWS Wavelength deployment. No multi-carrier, multi-site, or time-of-day replication is reported to address known variability in public 5G uplink capacity, jitter, and edge availability, weakening the generalization.
[Results] Results paragraph on end-to-end performance: The abstract states that 'only 5G and wired Ethernet consistently satisfied the required communication delay bound' and reports compression benefits, but provides no specific quantitative metrics (e.g., mean delays with standard deviations, exact payload sizes before/after compression, or error bars), full dataset details, or statistical tests. This absence makes it difficult to assess whether the observed margins are robust or specific to the low-congestion test conditions.

minor comments (2)

[System Design] The description of the LSTM fusion network could include more detail on its architecture (e.g., number of layers, hidden units, or training procedure) to allow reproducibility, even if the focus is deployment rather than model innovation.
[Evaluation] Figure captions or tables summarizing the stress-test scenarios (network loads, compression levels, and resulting delays) would improve clarity and help readers quickly compare conditions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on generalization and quantitative reporting. We address the major comments point by point below and have made targeted revisions to improve the manuscript.


Referee: [Abstract] Abstract (final paragraph): The claim that 'public 5G edge environments can sustain real-time, interactive AVSE workloads when network and compute resources are carefully orchestrated' is load-bearing for the central contribution, yet rests exclusively on end-to-end measurements from a single Vodafone-compatible AWS Wavelength deployment. No multi-carrier, multi-site, or time-of-day replication is reported to address known variability in public 5G uplink capacity, jitter, and edge availability, weakening the generalization.

Authors: We agree that the central claim would benefit from explicit qualification regarding the single-deployment scope. The study was performed on a representative public 5G edge platform (Vodafone-compatible AWS Wavelength) to demonstrate feasibility under real-world conditions. We have revised the abstract to state that the findings pertain to this specific deployment and added a dedicated limitations paragraph in the discussion section that acknowledges variability in uplink capacity, jitter, and edge availability across carriers and conditions. This revision maintains the practical insights while avoiding over-generalization.

Revision: yes

Referee: [Results] Results paragraph on end-to-end performance: The abstract states that 'only 5G and wired Ethernet consistently satisfied the required communication delay bound' and reports compression benefits, but provides no specific quantitative metrics (e.g., mean delays with standard deviations, exact payload sizes before/after compression, or error bars), full dataset details, or statistical tests. This absence makes it difficult to assess whether the observed margins are robust or specific to the low-congestion test conditions.

Authors: We thank the referee for highlighting the need for greater transparency in the quantitative results. The manuscript already states the 80% payload reduction and delay-bound satisfaction for 5G and Ethernet, but we accept that mean values, standard deviations, exact pre/post-compression sizes, error bars, and expanded dataset descriptions would strengthen the presentation. We have added these details to the results section, including a new table summarizing delay statistics and payload sizes across profiles, and clarified the experimental conditions (controlled stress-test loads on the specific testbed). Statistical hypothesis tests were not performed because the measurements derive from deterministic, repeatable network-load scenarios rather than stochastic sampling; we have instead emphasized the controlled nature of the tests so readers can evaluate robustness directly.

Revision: yes

standing simulated objections not resolved

Replication of the end-to-end measurements on additional carriers, sites, or under varied time-of-day conditions to quantify public 5G variability, as this would require access to multiple independent 5G edge platforms beyond the resources of the present study.

Circularity Check

0 steps flagged

Empirical deployment study with no derivation chain


The paper describes the design, deployment, and experimental evaluation of an AVSE system on a public 5G edge testbed, reporting measured end-to-end delays, compression effects, and performance under load. No mathematical derivation, first-principles prediction, or fitted-parameter model is presented whose output is claimed to follow from equations or self-citations; all central claims rest on direct observations from stress testing. Consequently no step reduces by construction to its own inputs, and the work is self-contained as an empirical study.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard neural-network architectures and network-performance assumptions without introducing new physical entities or many ad-hoc fitted constants beyond deployment choices such as compression aggressiveness and model size.

free parameters (2)

model complexity reduction: Choice to lower model complexity to reduce latency, with specific values and exact impact on low-SNR performance not detailed in the abstract.
compression aggressiveness: Level of compression applied to audio-video chunks, reported to achieve up to 80% size reduction.

axioms (2)

domain assumption: LSTM fusion network preserves temporal coherence under network-induced delays. Invoked to justify real-time operation of the AVSE pipeline (abstract, system description).
domain assumption: Perceptual degradation from aggressive compression remains negligible for the target use cases. Used to support the claim that compression enables robust operation (abstract, results on compression).


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence... only 5G and wired Ethernet consistently satisfied the required communication delay bound

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 4 internal anchors

[1] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The Annals of Mathematical Statistics (1951), pp. 400–407
[2] Jack Kiefer and Jacob Wolfowitz. “Stochastic estimation of the maximum of a regression function”. In: The Annals of Mathematical Statistics (1952), pp. 462–466
[3] Harry McGurk and John MacDonald. “Hearing lips and seeing voices”. In: Nature 264.5588 (1976), pp. 746–748
[4] Bruce D Lucas and Takeo Kanade. “An iterative image registration technique with an application to stereo vision”. In: IJCAI’81: 7th international joint conference on Artificial intelligence. Vol. 2. 1981, pp. 674–679
[5] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Nature 323.6088 (1986), pp. 533–536
[6] Yann LeCun et al. “Generalization and network design strategies”. In: Connectionism in perspective 19.143-155 (1989), p. 18
[7] Carlo Tomasi and Takeo Kanade. “Detection and tracking of point”. In: Int J Comput Vis 9.137-154 (1991), p. 3
[8] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780
[9] Sarah Partan and Peter Marler. “Communication goes multimodal”. In: Science 283.5406 (1999), pp. 1272–1273
[10] Paul Viola and Michael J Jones. “Robust real-time face detection”. In: International journal of computer vision 57 (2004), pp. 137–154
[11] Davis E King. “Dlib-ml: A machine learning toolkit”. In: The Journal of Machine Learning Research 10 (2009), pp. 1755–1758
[12] Tijmen Tieleman, Geoffrey Hinton, et al. “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”. In: COURSERA: Neural networks for machine learning 4.2 (2012), pp. 26–31
[13] Elana Zion Golumbic et al. “Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party””. In: Journal of Neuroscience 33.4 (2013), pp. 1417–1426
[14] Kyunghyun Cho et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation”. In: arXiv preprint arXiv:1406.1078 (2014)
[15] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014)
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016
[17] Wei Liu et al. “Ssd: Single shot multibox detector”. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer. 2016, pp. 21–37
[18] Zhenzhou Wu et al. “Multi-modal hybrid deep neural network for speech enhancement”. In: arXiv preprint arXiv:1606.04750 (2016)
[19] Aviv Gabbay, Asaph Shamir, and Shmuel Peleg. “Visual speech enhancement”. In: arXiv preprint arXiv:1711.08789 (2017)
[20] Dhanesh Ramachandram and Graham W Taylor. “Deep multimodal learning: A survey on recent advances and trends”. In: IEEE signal processing magazine 34.6 (2017), pp. 96–108
[21] Aviv Gabbay et al. “Seeing through noise: Visually driven speaker separation and enhancement”. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2018, pp. 3051–3055
[22] Faheem Ullah Khan, Ben P Milner, and Thomas Le Cornu. “Using visual speech information in masking methods for audio speaker separation”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 26.10 (2018), pp. 1742–1754
[23] Giovanni Morrone et al. “Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 6900–6904
[24] Jian Wu et al. “Time domain audio visual speech separation”. In: 2019 IEEE automatic speech recognition and understanding workshop (ASRU). IEEE. 2019, pp. 667–673
[25] Zakaria Aldeneh et al. “On the role of visual cues in audiovisual speech enhancement”. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 8423–8427
[26] Mounira Chaiani et al. “Voice disorder classification using speech enhancement and deep learning models”. In: Biocybernetics and Biomedical Engineering 42.2 (2022), pp. 463–480
[27] Siddharth Chhetri et al. “Speech Enhancement: A Survey of Approaches and Applications”. In: 2023 2nd International Conference on Edge Computing and Applications (ICECAA). IEEE. 2023, pp. 848–856
