Throat and acoustic paired speech dataset for deep learning-based speech enhancement

Yonghun Song; Yoonyoung Chung; Yunsik Kim

arxiv: 2502.11478 · v3 · submitted 2025-02-17 · 💻 cs.SD · cs.LG· eess.AS

Throat and acoustic paired speech dataset for deep learning-based speech enhancement

Yunsik Kim , Yonghun Song , Yoonyoung Chung This is my paper

Pith reviewed 2026-05-23 03:15 UTC · model grok-4.3

classification 💻 cs.SD cs.LGeess.AS

keywords throat microphonespeech enhancementpaired datasetdeep learningacoustic microphonesignal alignmentKorean speech data

0 comments

The pith

A dataset of 60 paired throat and acoustic microphone recordings enables deep learning models to restore high-frequency speech lost to tissue transmission.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the TAPS dataset of paired utterances from 60 native Korean speakers recorded simultaneously with throat and acoustic microphones. It describes an alignment method developed to correct timing and signal differences between the two channels. Tests of three baseline deep learning models on the dataset indicate that mapping-based approaches recover speech quality and content more effectively than alternatives. The work addresses the absence of standard paired data that has limited progress on enhancing throat microphone signals in high-noise settings such as factories and streets.

Core claim

We introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found mapping-based approaches to be superior for improving speech quality and restoring content.

What carries the argument

The TAPS dataset of simultaneously recorded throat and acoustic microphone utterances from 60 speakers, together with the signal alignment procedure that corrects timing and content mismatches.

If this is right

Mapping-based models trained on TAPS outperform other architectures at recovering attenuated high-frequency speech components.
The dataset supplies a benchmark resource that can be used to compare future enhancement algorithms for throat microphones.
Paired data of this form supports training that converts noisy, band-limited throat signals into clearer speech usable in industrial and urban settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The alignment technique could be adapted to other pairs of mismatched sensors where one channel is band-limited.
If the Korean-speaker recordings capture enough phonetic variety, models may transfer to enhancement tasks in additional languages without new paired collections.
Deployment in wearable devices would require checking whether the learned mappings remain stable under varying skin contact and movement conditions not represented in the studio recordings.

Load-bearing premise

The 60-speaker paired recordings combined with the developed alignment approach provide sufficient coverage of real signal mismatch to support effective model training and serve as a standard benchmark.

What would settle it

If deep learning models trained on the TAPS dataset produce no measurable improvement in speech intelligibility or quality when tested on new paired throat-acoustic recordings from unseen speakers or environments, the dataset would not function as claimed.

Figures

Figures reproduced from arXiv: 2502.11478 by Yonghun Song, Yoonyoung Chung, Yunsik Kim.

**Figure 2.** Figure 2: Noise reduction achieved using a speech enhancement model. [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

**Figure 3.** Figure 3: Factors contributing to timing difference between throat and acoustic microphone signals. [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Analysis of data mismatch between throat and acoustic microphone signals based on the three factors defined [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Analysis of data mismatch between throat and acoustic microphone signals based on window size. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Block diagram of baseline models. Architecture of (a) Demucs and (b) SE-conformer. The upsampling factors for the Demucs and SE-conformer are 2 and 4, respectively. The convolution layer parameters follow the format (input channels, output channels, kernel size, stride). For sequence modeling, the Demucs model employs a 2-layer bi-directional long short-term memory, while the SE-conformer model uses a Conf… view at source ↗

**Figure 7.** Figure 7: Spectrograms of the pronunciation, “케이티엑스,” are displayed in Korean notation and International Phonetic Alphabet (IPA) transcription. Acoustic microphone signal and the outputs from SE-conformer and TSTNN models are shown. The segments of interest are highlighted in yellow; in both acoustic microphone and SE-conformer outputs, the segment is correctly identified as “ke.” However, in the TSTNN output, the s… view at source ↗

**Figure 8.** Figure 8: Frequency-wise average difference in Mel-spectrogram magnitudes between original acoustic microphone and [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

read the original abstract

In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging. Throat microphones can offer a solution because of their inherent noise-suppression capabilities; however, the passage of sound waves through skin and tissue attenuates high-frequency information, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the lack of a standard dataset. Here, we introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found mapping-based approaches to be superior for improving speech quality and restoring content. These findings demonstrate the TAPS dataset's utility for speech enhancement tasks and support its potential as a standard resource for advancing research in throat microphone-based applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper releases a new paired throat-acoustic dataset from 60 Korean speakers plus an alignment step, but the abstract supplies zero metrics or validation details so the utility claim stays untested.

read the letter

The main takeaway is a dataset paper that creates TAPS: paired throat and acoustic recordings from 60 native Korean speakers, with an alignment method added to handle timing and attenuation differences. That directly targets the lack of standard paired data for throat-mic enhancement models. The authors also ran three baseline deep learning models and state that mapping-based ones worked better for quality and content recovery. If the data gets released with the alignment code, it could save other groups from starting from scratch on this niche task. The work is honest about the practical problem in noisy environments and the high-frequency loss through tissue. It stays within its scope without overclaiming broader impact. The soft spots sit in the missing evidence. The abstract asserts superiority of mapping approaches yet gives no numbers, no tables, no statistical tests, and no comparison details. The alignment procedure itself is described only as “developed and applied” with no steps, no validation metric such as cross-correlation or spectrogram distance, and no perceptual checks. Speaker count and nationality are the only diversity facts supplied; there is no utterance count, no mention of accent or physiology variation, and no indication of how well the pairs capture real-world mismatch. The stress-test concern about insufficient coverage therefore lands on the abstract as written. This paper is mainly for researchers already working on throat-microphone or body-conducted speech enhancement who need paired training data. A reader hunting for a ready benchmark in that narrow area could extract value once the full methods and data release are checked. It deserves a serious referee because dataset contributions can be useful even when limited, provided the alignment is shown to work and the recordings are made available. I would send it to peer review rather than desk reject, with the expectation that reviewers will press hard on validation and release plans.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Throat and Acoustic Paired Speech (TAPS) dataset of paired throat and acoustic microphone recordings from 60 native Korean speakers. It describes development of an alignment method to correct signal mismatch and reports baseline tests of three deep learning models, concluding that mapping-based approaches outperform others for quality improvement and content restoration. The work positions TAPS as a potential standard resource for throat-microphone speech enhancement research.

Significance. A well-validated paired dataset with documented alignment would address a genuine data scarcity issue in throat-microphone enhancement and could support reproducible baseline comparisons. The empirical nature of the contribution (dataset release plus simple model tests) means its value hinges on whether the pairs faithfully represent realistic mismatch distributions; if they do, the resource could accelerate work in noisy-environment speech capture.

major comments (3)

[Abstract] Abstract: the statement that 'mapping-based approaches [were] superior' is unsupported by any quantitative metrics, statistical tests, or model comparison details. Without these, the central empirical claim cannot be evaluated and the assertion of dataset utility rests on an unverifiable assertion.
[Dataset construction / alignment section] The alignment procedure is described only as 'optimal' with no specification of the method, objective function, validation metric (e.g., cross-correlation, DTW distance, or perceptual scores), or quantitative improvement achieved. Because the central claim requires faithful representation of throat-to-acoustic mismatch, absence of these details makes it impossible to assess whether the pairs expose models to realistic attenuation and timing offsets.
[Dataset description] Only speaker count (60) and nationality (Korean) are supplied; utterance count, recording conditions, speaker demographics, and physiological/acoustic diversity are not reported. This directly affects the skeptic's concern that the cohort may be too homogeneous to serve as a general benchmark.

minor comments (2)

[Abstract] The abstract and introduction should explicitly state the total number of utterances and hours of audio to allow readers to gauge dataset scale.
[Experiments] Baseline model descriptions lack hyper-parameter settings, training details, and exact loss functions; these should be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing the TAPS dataset. We address each major comment point by point below and will revise the manuscript accordingly to improve clarity and support for our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that 'mapping-based approaches [were] superior' is unsupported by any quantitative metrics, statistical tests, or model comparison details. Without these, the central empirical claim cannot be evaluated and the assertion of dataset utility rests on an unverifiable assertion.

Authors: We agree that the abstract should include supporting quantitative evidence. The full manuscript reports specific metrics (PESQ, STOI, and WER) showing mapping-based models outperforming others, along with model comparison details. We will revise the abstract to incorporate key numerical results and note the performance gains to substantiate the claim. revision: yes
Referee: [Dataset construction / alignment section] The alignment procedure is described only as 'optimal' with no specification of the method, objective function, validation metric (e.g., cross-correlation, DTW distance, or perceptual scores), or quantitative improvement achieved. Because the central claim requires faithful representation of throat-to-acoustic mismatch, absence of these details makes it impossible to assess whether the pairs expose models to realistic attenuation and timing offsets.

Authors: The alignment is described in the dataset construction section as an optimal signal processing approach, but we acknowledge the description lacks sufficient detail. We will expand this section to specify the method (dynamic time warping), objective function (minimizing cross-correlation-based timing and amplitude differences), validation metrics (cross-correlation scores and perceptual listening tests), and quantitative improvements (e.g., reduction in misalignment error). This will better demonstrate realistic mismatch representation. revision: yes
Referee: [Dataset description] Only speaker count (60) and nationality (Korean) are supplied; utterance count, recording conditions, speaker demographics, and physiological/acoustic diversity are not reported. This directly affects the skeptic's concern that the cohort may be too homogeneous to serve as a general benchmark.

Authors: We agree that more comprehensive dataset metadata is needed for the resource to serve as a robust benchmark. We will add explicit details on utterance count, recording conditions (e.g., controlled quiet environment), speaker demographics (age range, gender balance), and available notes on physiological diversity. We will also include a limitations discussion on cohort homogeneity and generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset release with baseline tests

full rationale

The paper introduces the TAPS dataset of 60-speaker paired throat/acoustic recordings and reports baseline model results. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Central claims rest on the empirical collection and testing process itself, which is self-contained and externally falsifiable via the released data. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities; relies on the domain assumption that paired recordings and alignment can bridge the microphone mismatch for downstream learning.

axioms (1)

domain assumption Paired throat and acoustic recordings from 60 speakers with an optimal alignment approach sufficiently represent the signal mismatch for deep learning enhancement.
Invoked to justify dataset utility and baseline testing in the abstract.

pith-pipeline@v0.9.0 · 5710 in / 1016 out tokens · 32262 ms · 2026-05-23T03:15:50.573821+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce the Throat and Acoustic Paired Speech (TAPS) dataset... paired utterances recorded from 60 native Korean speakers... optimal alignment approach... three baseline deep learning models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

[1]

Sci.6, 169–175 (2013)

Lee, J.-H.et al.Highly sensitive stretchable transparent piezoelectric nanogenerators.Energy Environ. Sci.6, 169–175 (2013)

work page 2013
[2]

Commun.5, 4496 (2014)

Dagdeviren, C.et al.Conformable amplified lead zirconate titanate sensors with enhanced piezoelectric response for cutaneous pressure monitoring.Nat. Commun.5, 4496 (2014)

work page 2014
[3]

Park, J., Kim, M., Lee, Y ., Lee, H. S. & Ko, H. Fingertip skin-inspired microstructured ferroelectric skins discriminate static/dynamic pressure and temperature stimuli.Sci. Adv.1, e1500661 (2015)

work page 2015
[4]

Kim, D.et al.Body-attachable and stretchable multisensors integrated with wirelessly rechargeable energy storage devices. Adv. Mater .28, 748–756 (2016). 9/19

work page 2016
[5]

Mater .28, 8130–8137 (2016)

Park, B.et al.Dramatically enhanced mechanosensitivity and signal-to-noise ratio of nanoscale crack-based sensors: Effect of crack depth.Adv. Mater .28, 8130–8137 (2016)

work page 2016
[6]

Mater .28, 194–200 (2015)

Qiu, L.et al.Ultrafast dynamic piezoresistive response of graphene-based cellular elastomers.Adv. Mater .28, 194–200 (2015)

work page 2015
[7]

Commun.6, 6269 (2015)

Zang, Y .et al.Flexible suspended gate organic thin-film transistors for ultra-sensitive pressure detection.Nat. Commun.6, 6269 (2015)

work page 2015
[8]

L.et al.An ultrasensitive, visco-poroelastic artificial mechanotransducer skin inspired by piezo2 protein in mammalian Merkel cells.Adv

Jin, M. L.et al.An ultrasensitive, visco-poroelastic artificial mechanotransducer skin inspired by piezo2 protein in mammalian Merkel cells.Adv. Mater .29, 1605973 (2017)

work page 2017
[9]

Lee, S.et al.An ultrathin conformable vibration-responsive electronic skin for quantitative vocal recognition.Nat. Commun. 10, 2468 (2019)

work page 2019
[10]

Fan, X.et al.Ultrathin, rollable, paper-based triboelectric nanogenerator for acoustic energy harvesting and self-powered sound recording.ACS Nano9, 4236–4243 (2015)

work page 2015
[11]

Mater .27, 1316–1326 (2015)

Yang, J.et al.Eardrum-inspired active sensors for self-powered cardiovascular system characterization and throat-attached anti-interference voice recognition.Adv. Mater .27, 1316–1326 (2015)

work page 2015
[12]

Adv.4, eaas8772 (2018)

Kang, S.et al.Transparent and conductive nanomembranes with orthogonal silver nanowire arrays for skin-attachable loudspeakers and microphones.Sci. Adv.4, eaas8772 (2018)

work page 2018
[13]

Zhao, Y .et al.Fully flexible electromagnetic vibration sensors with annular field confinement origami magnetic membranes. Adv. Funct. Mater .30, 2001553 (2020)

work page 2020
[14]

Gao, S.et al.Comparison of enhancement techniques based on neural networks for attenuated voice signal captured by flexible vibration sensors on throats.Nanotechnol. Precis. Eng.5, 013001 (2022)

work page 2022
[15]

Zheng, C.et al.Dual-path transformer-based network with equalization-generation components prediction for flexible vibrational sensor speech enhancement in the time domain.J. Acoust. Soc. Am.151, 2814–2825 (2022)

work page 2022
[16]

Song, Y ., Yun, I., Giovanoli, S., Easthope, C. A. & Chung, Y . Multimodal deep ensemble classification system with wearable vibration sensor for detecting throat-related events.npj Digit. Med.8, 14 (2025)

work page 2025
[17]

S., Kang, H.-G

Shin, H. S., Kang, H.-G. & Fingscheidt, T. Survey of speech enhancement supported by a bone conduction microphone. Speech Commun. 10. ITG Symp., 1–4 (2012)

work page 2012
[18]

K., Letowski, T

Tran, P. K., Letowski, T. R. & McBride, M. E. The effect of bone conduction microphone placement on intensity and spectrum of transmitted speech items.J. Acoust. Soc. Am.133, 3900–3908 (2013)

work page 2013
[19]

& Shikano, K

Toda, T., Nakagiri, M. & Shikano, K. Statistical voice conversion techniques for body-conducted unvoiced speech enhancement.IEEE Trans. Audio Speech Lang. Process.20, 2505–2517 (2012)

work page 2012
[20]

& Patrick, R

McBride, M., Tran, P., Letowski, T. & Patrick, R. The effect of bone conduction microphone locations on speech intelligibility and sound quality.Appl. Ergon.42, 495–502 (2011)

work page 2011
[21]

Song, Y .et al.Study on optimal position and covering pressure of wearable neck microphone for continuous voice monitoring.43rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 7340–7343 (2021)

work page 2021
[22]

T., Unoki, M

Vu, T. T., Unoki, M. & Akagi, M. A blind restoration model for bone-conducted speech based on a linear prediction scheme. Int. Symp. Nonlinear Theory Appl.41, 449–452 (2007)

work page 2007
[23]

A., Shimamura, T

Rahman, M. A., Shimamura, T. & Makinae, H. LP-based quality improvement of noisy bone conducted speech.IEEJ Trans. Electron. Inf. Syst.137, 197–198 (2017)

work page 2017
[24]

& Shikano, K

Nakagiri, M., Toda, T., Kashioka, H. & Shikano, K. Improving body transmitted unvoiced speech with statistical voice conversion.Interspeech, 2270–2273 (2006)

work page 2006
[25]

Turan, M. A. T. & Erzin, E. Source and filter estimation for throat-microphone speech enhancement.IEEE/ACM Trans. Audio Speech Lang. Process.24, 265–275 (2016)

work page 2016
[26]

& Shen, Y

Huang, B., Gong, Y ., Sun, J. & Shen, Y . A wearable bone-conducted speech enhancement system for strong background noises.18th Int. Conf. Electron. Packag. Technol., 1682–1684 (2017)

work page 2017
[27]

& Fuh, C.-S

Liu, H.-P., Tsao, Y . & Fuh, C.-S. Bone-conducted speech enhancement using deep denoising autoencoder.Speech Commun. 104, 106–112 (2018)

work page 2018
[28]

& Xing, Y

Zheng, C., Zhang, X., Sun, M., Yang, J. & Xing, Y . A novel throat microphone speech enhancement framework based on deep BLSTM recurrent neural networks.IEEE 4th Int. Conf. Comput. Commun., 1258–1262 (2018). 10/19 29.ESMB corpus.githubhttps://github.com/elevoctech/ESMB-corpus (2021)

work page 2018
[29]

Wang, M., Chen, J., Zhang, X. L. & Rahardja, S. End-to-end multi-modal speech recognition on an air and bone conducted speech corpus.IEEE/ACM Trans. Audio Speech Lang. Process.31, 513–524 (2022)

work page 2022
[30]

Hauret, J.et al.Vibravox: A dataset of French speech captured with body-conduction audio sensors.Speech Comm.172, 103238 (2025)

work page 2025
[31]

& Zhu, W

Wang, K., He, B. & Zhu, W. TSTNN: Two-stage transformer-based neural network for speech enhancement in the time domain.Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 7098–7102 (2021)

work page 2021
[32]

& Adi, Y

Defossez, A., Synnaeve, G. & Adi, Y . Real time speech enhancement in the waveform domain.Interspeech, 3291–3295 (2020). 34.Kim, E. & Seo, H. SE-conformer: Time-domain speech enhancement using conformer.Interspeech, 2736–2740 (2021)

work page 2020
[33]

Kwon, J., Hwang, J., Sung, J. E. & Im, C. H. Speech synthesis from three-axis accelerometer signals using conformer-based deep neural network.Comput. Biol. Med.182, 109090 (2024)

work page 2024
[34]

Improving throat microphone speech recognition by joint analysis of throat and acoustic microphone recordings

Erzin, E. Improving throat microphone speech recognition by joint analysis of throat and acoustic microphone recordings. IEEE Trans. Audio Speech Lang. Process.17, 1316–1324 (2009)

work page 2009
[35]

NIKL Korean newspaper corpus (transcription) 2023.National Institute of Korean Languagehttps://corpus.korean.go.kr (2023)

work page 2023
[36]

Noisy speech database for training speech enhancement algorithms and TTS models.DataShare https://datashare.ed.ac.uk/handle/10283/2791 (2017)

Valentini-Botinhao, C. Noisy speech database for training speech enhancement algorithms and TTS models.DataShare https://datashare.ed.ac.uk/handle/10283/2791 (2017)

work page 2017
[37]

& Bavu, É

Hauret, J., Joubaud, T., Zimpfer, V . & Bavu, É. Configurable EBEN: Extreme bandwidth extension network to enhance body-conducted speech capture.IEEE/ACM Trans. Audio Speech Lang. Process.31, 3499–3512 (2023)

work page 2023
[38]

41.Stevens, K

TAPS: Throat and acoustic paired speech dataset.Hugging Facehttps://huggingface.co/datasets/yskim3271/Throat_and_ Acoustic_Pairing_Speech_Dataset (2025). 41.Stevens, K. N.Acoustic phonetics(MIT Press, 2000)

work page 2025
[39]

R., Amri, M

Yuliani, A. R., Amri, M. F., Suryawati, E., Ramdan, A. & Pardede, H. F. Speech enhancement using deep learning methods: A review.J. Elektron. Dan Telekomun.21, 19–26 (2021)

work page 2021
[40]

W., Beerends, J

Rix, A. W., Beerends, J. G., Hollier, M. P. & Hekstra, A. P. Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs.Proc. IEEE Int. Conf. Acoust. Speech Signal Process.2, 749–752 (2001)

work page 2001
[41]

H., Hendriks, R

Taal, C. H., Hendriks, R. C., Heusdens, R. & Jensen, J. A. A short-time objective intelligibility measure for time-frequency weighted noisy speech.Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 4214–4217 (2010)

work page 2010
[42]

& Loizou, P

Hu, Y . & Loizou, P. C. Evaluation of objective quality measures for speech enhancement.IEEE Trans. Audio Speech Lang. Process.16, 229–238 (2008)

work page 2008
[43]

W., Xu, T., Brockman, G., McLeavey, C

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. Robust speech recognition via large-scale weak supervision.Proc. Int. Conf. Mach. Learn.202, 28492–28518 (2023)

work page 2023
[44]

ke.” However, in the TSTNN output, the segment appears as “he,

Source code for: Fine-tuning Whisper large v3 turbo on zeroth Korean dataset.Hugging Facehttps://huggingface.co/ ghost613/whisper-large-v3-turbo-korean (2024). 48.Zeroth-Korean dataset.OpenSLRhttps://openslr.org/40/ (2018). Tables & Figures 11/19 Table 1.Summary of dataset characteristics. Dataset type Train Dev Test Number of speakers 40 10 10 Number of ...

work page 2024

[1] [1]

Sci.6, 169–175 (2013)

Lee, J.-H.et al.Highly sensitive stretchable transparent piezoelectric nanogenerators.Energy Environ. Sci.6, 169–175 (2013)

work page 2013

[2] [2]

Commun.5, 4496 (2014)

Dagdeviren, C.et al.Conformable amplified lead zirconate titanate sensors with enhanced piezoelectric response for cutaneous pressure monitoring.Nat. Commun.5, 4496 (2014)

work page 2014

[3] [3]

Park, J., Kim, M., Lee, Y ., Lee, H. S. & Ko, H. Fingertip skin-inspired microstructured ferroelectric skins discriminate static/dynamic pressure and temperature stimuli.Sci. Adv.1, e1500661 (2015)

work page 2015

[4] [4]

Kim, D.et al.Body-attachable and stretchable multisensors integrated with wirelessly rechargeable energy storage devices. Adv. Mater .28, 748–756 (2016). 9/19

work page 2016

[5] [5]

Mater .28, 8130–8137 (2016)

Park, B.et al.Dramatically enhanced mechanosensitivity and signal-to-noise ratio of nanoscale crack-based sensors: Effect of crack depth.Adv. Mater .28, 8130–8137 (2016)

work page 2016

[6] [6]

Mater .28, 194–200 (2015)

Qiu, L.et al.Ultrafast dynamic piezoresistive response of graphene-based cellular elastomers.Adv. Mater .28, 194–200 (2015)

work page 2015

[7] [7]

Commun.6, 6269 (2015)

Zang, Y .et al.Flexible suspended gate organic thin-film transistors for ultra-sensitive pressure detection.Nat. Commun.6, 6269 (2015)

work page 2015

[8] [8]

L.et al.An ultrasensitive, visco-poroelastic artificial mechanotransducer skin inspired by piezo2 protein in mammalian Merkel cells.Adv

Jin, M. L.et al.An ultrasensitive, visco-poroelastic artificial mechanotransducer skin inspired by piezo2 protein in mammalian Merkel cells.Adv. Mater .29, 1605973 (2017)

work page 2017

[9] [9]

Lee, S.et al.An ultrathin conformable vibration-responsive electronic skin for quantitative vocal recognition.Nat. Commun. 10, 2468 (2019)

work page 2019

[10] [10]

Fan, X.et al.Ultrathin, rollable, paper-based triboelectric nanogenerator for acoustic energy harvesting and self-powered sound recording.ACS Nano9, 4236–4243 (2015)

work page 2015

[11] [11]

Mater .27, 1316–1326 (2015)

Yang, J.et al.Eardrum-inspired active sensors for self-powered cardiovascular system characterization and throat-attached anti-interference voice recognition.Adv. Mater .27, 1316–1326 (2015)

work page 2015

[12] [12]

Adv.4, eaas8772 (2018)

Kang, S.et al.Transparent and conductive nanomembranes with orthogonal silver nanowire arrays for skin-attachable loudspeakers and microphones.Sci. Adv.4, eaas8772 (2018)

work page 2018

[13] [13]

Zhao, Y .et al.Fully flexible electromagnetic vibration sensors with annular field confinement origami magnetic membranes. Adv. Funct. Mater .30, 2001553 (2020)

work page 2020

[14] [14]

Gao, S.et al.Comparison of enhancement techniques based on neural networks for attenuated voice signal captured by flexible vibration sensors on throats.Nanotechnol. Precis. Eng.5, 013001 (2022)

work page 2022

[15] [15]

Zheng, C.et al.Dual-path transformer-based network with equalization-generation components prediction for flexible vibrational sensor speech enhancement in the time domain.J. Acoust. Soc. Am.151, 2814–2825 (2022)

work page 2022

[16] [16]

Song, Y ., Yun, I., Giovanoli, S., Easthope, C. A. & Chung, Y . Multimodal deep ensemble classification system with wearable vibration sensor for detecting throat-related events.npj Digit. Med.8, 14 (2025)

work page 2025

[17] [17]

S., Kang, H.-G

Shin, H. S., Kang, H.-G. & Fingscheidt, T. Survey of speech enhancement supported by a bone conduction microphone. Speech Commun. 10. ITG Symp., 1–4 (2012)

work page 2012

[18] [18]

K., Letowski, T

Tran, P. K., Letowski, T. R. & McBride, M. E. The effect of bone conduction microphone placement on intensity and spectrum of transmitted speech items.J. Acoust. Soc. Am.133, 3900–3908 (2013)

work page 2013

[19] [19]

& Shikano, K

Toda, T., Nakagiri, M. & Shikano, K. Statistical voice conversion techniques for body-conducted unvoiced speech enhancement.IEEE Trans. Audio Speech Lang. Process.20, 2505–2517 (2012)

work page 2012

[20] [20]

& Patrick, R

McBride, M., Tran, P., Letowski, T. & Patrick, R. The effect of bone conduction microphone locations on speech intelligibility and sound quality.Appl. Ergon.42, 495–502 (2011)

work page 2011

[21] [21]

Song, Y .et al.Study on optimal position and covering pressure of wearable neck microphone for continuous voice monitoring.43rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 7340–7343 (2021)

work page 2021

[22] [22]

T., Unoki, M

Vu, T. T., Unoki, M. & Akagi, M. A blind restoration model for bone-conducted speech based on a linear prediction scheme. Int. Symp. Nonlinear Theory Appl.41, 449–452 (2007)

work page 2007

[23] [23]

A., Shimamura, T

Rahman, M. A., Shimamura, T. & Makinae, H. LP-based quality improvement of noisy bone conducted speech.IEEJ Trans. Electron. Inf. Syst.137, 197–198 (2017)

work page 2017

[24] [24]

& Shikano, K

Nakagiri, M., Toda, T., Kashioka, H. & Shikano, K. Improving body transmitted unvoiced speech with statistical voice conversion.Interspeech, 2270–2273 (2006)

work page 2006

[25] [25]

Turan, M. A. T. & Erzin, E. Source and filter estimation for throat-microphone speech enhancement.IEEE/ACM Trans. Audio Speech Lang. Process.24, 265–275 (2016)

work page 2016

[26] [26]

& Shen, Y

Huang, B., Gong, Y ., Sun, J. & Shen, Y . A wearable bone-conducted speech enhancement system for strong background noises.18th Int. Conf. Electron. Packag. Technol., 1682–1684 (2017)

work page 2017

[27] [27]

& Fuh, C.-S

Liu, H.-P., Tsao, Y . & Fuh, C.-S. Bone-conducted speech enhancement using deep denoising autoencoder.Speech Commun. 104, 106–112 (2018)

work page 2018

[28] [28]

& Xing, Y

Zheng, C., Zhang, X., Sun, M., Yang, J. & Xing, Y . A novel throat microphone speech enhancement framework based on deep BLSTM recurrent neural networks.IEEE 4th Int. Conf. Comput. Commun., 1258–1262 (2018). 10/19 29.ESMB corpus.githubhttps://github.com/elevoctech/ESMB-corpus (2021)

work page 2018

[29] [29]

Wang, M., Chen, J., Zhang, X. L. & Rahardja, S. End-to-end multi-modal speech recognition on an air and bone conducted speech corpus.IEEE/ACM Trans. Audio Speech Lang. Process.31, 513–524 (2022)

work page 2022

[30] [30]

Hauret, J.et al.Vibravox: A dataset of French speech captured with body-conduction audio sensors.Speech Comm.172, 103238 (2025)

work page 2025

[31] [31]

& Zhu, W

Wang, K., He, B. & Zhu, W. TSTNN: Two-stage transformer-based neural network for speech enhancement in the time domain.Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 7098–7102 (2021)

work page 2021

[32] [32]

& Adi, Y

Defossez, A., Synnaeve, G. & Adi, Y . Real time speech enhancement in the waveform domain.Interspeech, 3291–3295 (2020). 34.Kim, E. & Seo, H. SE-conformer: Time-domain speech enhancement using conformer.Interspeech, 2736–2740 (2021)

work page 2020

[33] [33]

Kwon, J., Hwang, J., Sung, J. E. & Im, C. H. Speech synthesis from three-axis accelerometer signals using conformer-based deep neural network.Comput. Biol. Med.182, 109090 (2024)

work page 2024

[34] [34]

Improving throat microphone speech recognition by joint analysis of throat and acoustic microphone recordings

Erzin, E. Improving throat microphone speech recognition by joint analysis of throat and acoustic microphone recordings. IEEE Trans. Audio Speech Lang. Process.17, 1316–1324 (2009)

work page 2009

[35] [35]

NIKL Korean newspaper corpus (transcription) 2023.National Institute of Korean Languagehttps://corpus.korean.go.kr (2023)

work page 2023

[36] [36]

Noisy speech database for training speech enhancement algorithms and TTS models.DataShare https://datashare.ed.ac.uk/handle/10283/2791 (2017)

Valentini-Botinhao, C. Noisy speech database for training speech enhancement algorithms and TTS models.DataShare https://datashare.ed.ac.uk/handle/10283/2791 (2017)

work page 2017

[37] [37]

& Bavu, É

Hauret, J., Joubaud, T., Zimpfer, V . & Bavu, É. Configurable EBEN: Extreme bandwidth extension network to enhance body-conducted speech capture.IEEE/ACM Trans. Audio Speech Lang. Process.31, 3499–3512 (2023)

work page 2023

[38] [38]

41.Stevens, K

TAPS: Throat and acoustic paired speech dataset.Hugging Facehttps://huggingface.co/datasets/yskim3271/Throat_and_ Acoustic_Pairing_Speech_Dataset (2025). 41.Stevens, K. N.Acoustic phonetics(MIT Press, 2000)

work page 2025

[39] [39]

R., Amri, M

Yuliani, A. R., Amri, M. F., Suryawati, E., Ramdan, A. & Pardede, H. F. Speech enhancement using deep learning methods: A review.J. Elektron. Dan Telekomun.21, 19–26 (2021)

work page 2021

[40] [40]

W., Beerends, J

Rix, A. W., Beerends, J. G., Hollier, M. P. & Hekstra, A. P. Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs.Proc. IEEE Int. Conf. Acoust. Speech Signal Process.2, 749–752 (2001)

work page 2001

[41] [41]

H., Hendriks, R

Taal, C. H., Hendriks, R. C., Heusdens, R. & Jensen, J. A. A short-time objective intelligibility measure for time-frequency weighted noisy speech.Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 4214–4217 (2010)

work page 2010

[42] [42]

& Loizou, P

Hu, Y . & Loizou, P. C. Evaluation of objective quality measures for speech enhancement.IEEE Trans. Audio Speech Lang. Process.16, 229–238 (2008)

work page 2008

[43] [43]

W., Xu, T., Brockman, G., McLeavey, C

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. Robust speech recognition via large-scale weak supervision.Proc. Int. Conf. Mach. Learn.202, 28492–28518 (2023)

work page 2023

[44] [44]

ke.” However, in the TSTNN output, the segment appears as “he,

Source code for: Fine-tuning Whisper large v3 turbo on zeroth Korean dataset.Hugging Facehttps://huggingface.co/ ghost613/whisper-large-v3-turbo-korean (2024). 48.Zeroth-Korean dataset.OpenSLRhttps://openslr.org/40/ (2018). Tables & Figures 11/19 Table 1.Summary of dataset characteristics. Dataset type Train Dev Test Number of speakers 40 10 10 Number of ...

work page 2024