arxiv: 2604.19477 · v1 · submitted 2026-04-21 · 💻 cs.SD · cs.CL

Recognition: unknown

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

Hyunjung Joo , GyeongTaek Lee

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:23 UTC · model grok-4.3

classification 💻 cs.SD cs.CL

keywords pitch accent classificationSeoul Koreancontrastive learningF0 contoursintonational phonologyaccentual phrasesspeech processingAutosegmental-Metrical model

0 comments

The pith

Supervised contrastive learning classifies Seoul Korean pitch accents by learning consistent F0 contour shapes despite surface variation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a contrastive framework can reliably assign continuous pitch contours to discrete tonal categories for pitch accents in Seoul Korean, even when real speech shows large F0 variability. It does so by introducing Dual-Glob, which forces the model to produce matching representations for clean and augmented versions of the same contour in a shared latent space while using label supervision. A sympathetic reader would care because this supplies a data-driven route to validate the Autosegmental-Metrical model of intonation and because accurate accent detection improves downstream speech technologies for Korean. The authors also release a manually annotated collection of 10,093 accentual phrases as a public benchmark.

Core claim

Dual-Glob captures the holistic shape of F0 contours for fine-grained pitch-accent classification by enforcing structural consistency between clean and augmented contour views in a shared latent space, achieving higher accuracy than local predictive baselines on a new dataset of 10,093 manually labeled accentual phrases and thereby providing empirical support for AM-based intonational phonology.

What carries the argument

Dual-Glob, a supervised contrastive framework that aligns latent representations of clean and augmented F0 contours while using accent labels to guide the embedding space toward invariant accent identities.

If this is right

Models that enforce global contour consistency outperform those relying only on local F0 predictions for accent classification.
Contrastive training on augmented views supplies a practical way to test whether discrete tonal categories remain stable across phonetic variation.
The released 10,093-phrase dataset can serve as a fixed benchmark for comparing future intonation classifiers.
Improved accent detection can directly feed into Korean speech synthesis and recognition systems that must recover intonational structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contrastive alignment strategy could be tested on other languages whose intonation is described with autosegmental categories.
Extending the augmentations to include speaker or channel variation might reveal how much of the learned invariance is truly accent-specific.
Combining the F0-only contour representations with lexical or syntactic features could further improve classification in full sentences.

Load-bearing premise

That the manually annotated discrete tonal categories accurately reflect stable, invariant accent types even when F0 realizations vary across speakers and contexts, and that the chosen augmentations preserve accent identity without altering the underlying category.

What would settle it

A new dataset of Seoul Korean phrases labeled independently by multiple experts shows low inter-annotator agreement on the tonal categories, or the Dual-Glob model fails to outperform standard classifiers on speaker-disjoint test splits.

Figures

Figures reproduced from arXiv: 2604.19477 by GyeongTaek Lee, Hyunjung Joo.

**Figure 1.** Figure 1: Intonational structure of Seoul Korean (Jun, 1998). The AP-initial tone (T) is realized as H for aspirated and tense consonants, otherwise L. The % symbol refers to a boundary tone (e.g., L% or H%) at the end of an IP. is modeled as a sequence of discrete tonal targets such as Lows (L) and Highs (H), which are interpolated with one another. The intonational structure of Seoul Korean is hierarchically orga… view at source ↗

**Figure 2.** Figure 2: Overview of the proposed Dual-Glob framework. The model processes entire F0 contours via parallel clean (xc) and augmented (xa) views using a shared encoder. A composite supervised contrastive objective (LT otal) enforces structural consistency across both views to learn robust representations. 2. Augmented-view SupCon (LAug): This term addresses the inherent instability of pitch extraction in real-world e… view at source ↗

**Figure 3.** Figure 3: t-SNE visualization of the validation set. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 5.** Figure 5: Failure cases demonstrating the ambiguity in sustained tones. In both cases, the model misinterprets the lengthened final L tone as a sequence of multiple L tones (LL). mistakes a long and flat low pitch for several different patterns. For more details on these errors, please see Appendix D. 5 Discussion To address the aforementioned limitations, we incorporated syllable count constraints into the classi… view at source ↗

**Figure 6.** Figure 6: Schematic F0 contours of sixteen pitch accent patterns for an AP in Seoul Korean (Jun, 2000) Appendix B: Implementation Details Environment and Data Split. All models were implemented using PyTorch and trained on an NVIDIA GPU RTX 2070. To ensure a robust evaluation, we employed 5-fold stratified crossvalidation with a fixed random seed (42). For each fold, the dataset was split into training and validati… view at source ↗

**Figure 7.** Figure 7: Confusion matrix of the proposed Dual-Glob method with LR. The strong diagonal density [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization of common misclassification patterns. Each subfigure displays two examples [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: t-SNE visualization of the feature space learned by the proposed model. Female speakers (b) [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Visual analysis of various F0 discontiunations or pitch track errors in Seoul Korean speech data, including devoicing, pitch halving, glottalization, and F0 perturbation [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

read the original abstract

The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this aim, we introduce the first large-scale benchmark dataset, consisting of manually annotated 10,093 Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper ships a new 10k-phrase annotated dataset for Seoul Korean pitch accents and tests a Dual-Glob contrastive model on holistic F0 shapes, with modest gains but a low F1 that flags ongoing difficulty.

read the letter

Colleague, the main things to know are that this work releases the first sizable manually annotated dataset of 10,093 Accentual Phrases and introduces Dual-Glob, a supervised contrastive architecture that pulls clean and augmented F0 contour views together in latent space to classify fine-grained pitch accent categories in Seoul Korean. The abstract frames it as supporting AM intonational phonology with data-driven tools rather than local predictors. The dataset release stands out as the clearest practical step; anyone working on Korean prosody or contour classification now has a benchmark they can use directly. The contrastive setup is a straightforward adaptation of supervised contrastive loss to enforce structural consistency across augmentations like time warping, pitch scaling, and noise, and the reported 77.75% accuracy beats the baselines they tried. The full methods section apparently spells out the architecture, splits, and loss, and the stress-test finds the numbers internally consistent with that setup. The low F1 of 51.54% is the clearest soft spot; it points to class imbalance or residual label noise that the paper does not appear to mitigate with re-weighting or deeper error analysis. Baseline implementation details and statistical significance are also thin in the abstract, though the manuscript supplies more on those fronts. The core assumption that the chosen augmentations preserve accent identity is stated but not heavily stress-tested beyond the contrastive objective. This is useful for a narrow audience: speech processing researchers focused on Korean or other languages with AM-style tonal categories, plus anyone adapting contrastive methods to prosodic time series. The dataset alone gives it value for replication or extension work. I would send it to peer review; the new resource and the empirical comparison are concrete enough to merit referee input even if the evaluation needs tightening.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces Dual-Glob, a deep supervised contrastive learning framework that learns invariant representations of F0 contours for fine-grained pitch accent classification in Seoul Korean. It enforces structural consistency between clean and augmented views (time warping, pitch scaling, noise injection) in a shared latent space, contrasting with local predictive models. The authors release a new benchmark dataset of 10,093 manually annotated accentual phrases and report outperforming strong baselines with 77.75% accuracy and 51.54% F1-score, thereby providing data-driven support for the Autosegmental-Metrical model of intonational phonology.

Significance. If the reported gains prove reproducible with full experimental details, the work would be significant for computational phonology and speech processing. It supplies the first large-scale annotated resource for Seoul Korean pitch accents and demonstrates that supervised contrastive learning can capture holistic contour structure despite real-world F0 variability. This could encourage similar data-driven validations of intonational categories in other languages and improve robustness in downstream applications such as speech synthesis or recognition for tonal systems.

major comments (2)

§4 (Experimental setup): The abstract and results claim clear outperformance, yet the manuscript provides insufficient detail on baseline implementations (exact architectures, hyperparameter search, training protocols) and the train/test split procedure (e.g., speaker-independent partitioning of the 10,093 phrases). These omissions are load-bearing for the central empirical claim, as they prevent independent verification of the 77.75% accuracy and 51.54% F1 gains.
Results section, Table reporting per-class metrics: The F1-score of 51.54% is substantially lower than accuracy, consistent with possible class imbalance or label noise, but no error analysis, confusion matrix, or per-accent performance breakdown is supplied. This weakens the assertion of 'robust' classification and requires explicit discussion to support the headline numbers.

minor comments (3)

Abstract: The phrase 'state-of-the-art accuracy (77.75%) and F1-score (51.54%)' would benefit from an explicit statement of the previous best F1 on this task or dataset to contextualize the improvement magnitude.
Methods, notation: Ensure F0 and AM are defined at first use; the contrastive loss formulation should include the exact temperature parameter and positive/negative pair construction for full reproducibility.
Dataset description: Provide a breakdown of the 10,093 phrases by accent category and speaker to allow readers to assess potential imbalance or generalization issues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of reproducibility and analysis that we will address to strengthen the manuscript. We respond to each major comment below.

read point-by-point responses

Referee: §4 (Experimental setup): The abstract and results claim clear outperformance, yet the manuscript provides insufficient detail on baseline implementations (exact architectures, hyperparameter search, training protocols) and the train/test split procedure (e.g., speaker-independent partitioning of the 10,093 phrases). These omissions are load-bearing for the central empirical claim, as they prevent independent verification of the 77.75% accuracy and 51.54% F1 gains.

Authors: We agree that the current level of detail is insufficient for full reproducibility. In the revised manuscript, Section 4 will be expanded to specify the exact architectures of all baseline models, the hyperparameter search procedure and selected values, complete training protocols (including optimizer settings, learning rates, batch sizes, and early stopping criteria), and a precise description of the train/test split. We will confirm that the partitioning is speaker-independent, report the exact ratios used, and describe any stratification or speaker-disjoint constraints applied to the 10,093 phrases. These additions will directly support verification of the reported metrics. revision: yes
Referee: Results section, Table reporting per-class metrics: The F1-score of 51.54% is substantially lower than accuracy, consistent with possible class imbalance or label noise, but no error analysis, confusion matrix, or per-accent performance breakdown is supplied. This weakens the assertion of 'robust' classification and requires explicit discussion to support the headline numbers.

Authors: We concur that a per-class breakdown and error analysis are needed to substantiate the robustness claim. The revised results section will include a confusion matrix, per-accent F1 scores and accuracies, and an accompanying discussion of the accuracy-F1 gap. This discussion will address potential class imbalance in the dataset, possible sources of label noise in manual annotation, and implications for the data-driven validation of Autosegmental-Metrical categories. These additions will provide a more transparent evaluation of model performance across pitch accent types. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML evaluation on held-out data

full rationale

The paper describes a supervised contrastive learning pipeline (Dual-Glob) trained on a manually annotated 10,093-phrase dataset with standard augmentations and evaluated via accuracy/F1 on held-out splits. No equations, derivations, or predictions are presented that reduce by construction to fitted inputs, self-citations, or ansatzes. The headline performance numbers are direct experimental outcomes, not tautological renamings or self-referential definitions. The work is self-contained as a standard empirical benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The performance claim rests on the assumption that the manually labeled categories are reliable ground truth and that the augmentation pipeline preserves accent identity; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption Manually annotated discrete tonal categories in the dataset correspond to the invariant categories of the Autosegmental-Metrical model despite surface F0 variation.
Stated in the abstract as the core challenge the model addresses.

pith-pipeline@v0.9.0 · 5498 in / 1256 out tokens · 30865 ms · 2026-05-10T01:23:00.597261+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

79 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Journal of Phonetics , volume=

On (and off) ramps in intonational phonology: Rises, falls, and the Tonal Center of Gravity , author=. Journal of Phonetics , volume=. 2021 , publisher=

2021
[2]

Laboratory Phonology , volume=

Tonal Center of Gravity: A global approach to tonal implementation in a level-based intonational phonology , author=. Laboratory Phonology , volume=. 2012 , publisher=

2012
[3]

Ohio State University , year=

The ToBI annotation conventions , author=. Ohio State University , year=
[4]

Phonology , volume=

Intonational structure in Japanese and English , author=. Phonology , volume=. 1986 , publisher=

1986
[5]

Language and Speech , pages=

The Perception of Lexical Pitch Accent in South Kyungsang Korean: The Relevance of Accent Shape , author=. Language and Speech , pages=. 2025 , publisher=

2025
[6]

Word , volume=

Intonation: levels versus configurations , author=. Word , volume=. 1951 , publisher=

1951
[7]

Machine learning , volume=

Random forests , author=. Machine learning , volume=. 2001 , publisher=

2001
[8]

Brain Sciences , volume=

Computational modelling of tone perception based on direct processing of f 0 contours , author=. Brain Sciences , volume=. 2022 , publisher=

2022
[9]

Laboratory Phonology , volume=

New methods for prosodic transcription: Capturing variability as a source of information , author=. Laboratory Phonology , volume=. 2016 , publisher=

2016
[10]

Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

The regression analysis of binary sequences , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 1958 , publisher=

1958
[11]

Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining , pages=

Minirocket: A very fast (almost) deterministic transform for time series classification , author=. Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining , pages=
[12]

Data mining and knowledge discovery , volume=

Inceptiontime: Finding alexnet for time series classification , author=. Data mining and knowledge discovery , volume=. 2020 , publisher=

2020
[13]

Journal of Phonetics , volume=

Focus-induced tonal distribution in Seoul Korean as an edge-prominence language , author=. Journal of Phonetics , volume=. 2024 , publisher=

2024
[14]

Language and communication , volume=

Fact and fiction in the description of female and male pitch , author=. Language and communication , volume=
[15]

Neural computation , volume=

Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

1997
[16]

Journal of Phonetics , volume=

The relation between the continuous and the discrete: A note on the first principles of speech dynamics , author=. Journal of Phonetics , volume=. 2017 , publisher=

2017
[17]

Proceedings of the International Congress of Phonetic Sciences , year=

American English pitch accent dynamics: A minimal dynamical model , author=. Proceedings of the International Congress of Phonetic Sciences , year=
[18]

2026 , publisher=

A quantal dynamical theory of F0 contours: Bridging the phonetics and phonology of intonation In: Developments in the Modeling of Speech Prosody , author=. 2026 , publisher=

2026
[19]

Phonology , volume=

The accentual phrase in the Korean prosodic hierarchy , author=. Phonology , volume=. 1998 , publisher=

1998
[20]

UCLA working papers in phonetics , volume=

K-ToBI (KOREAN ToBI) labelling conventions , author=. UCLA working papers in phonetics , volume=
[21]

proceedings of the XVth international congress of phonetic sciences , pages=

The effect of phrase length and speech rate on prosodic phrasing , author=. proceedings of the XVth international congress of phonetic sciences , pages=
[22]

Prosodic typology: The phonology of intonation and phrasing , volume=

Korean intonational phonology and prosodic transcription , author=. Prosodic typology: The phonology of intonation and phrasing , volume=. 2005 , publisher=

2005
[23]

Advances in neural information processing systems , volume=

Lightgbm: A highly efficient gradient boosting decision tree , author=. Advances in neural information processing systems , volume=
[24]

Advances in neural information processing systems , volume=

Supervised contrastive learning , author=. Advances in neural information processing systems , volume=
[25]

Speech Sciences , volume=

Intonational pattern frequency of Seoul Korean and its implication to word segmentation , author=. Speech Sciences , volume=
[26]

2008 , publisher=

Intonational phonology , author=. 2008 , publisher=

2008
[27]

2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) , pages=

Korean dialect identification based on intonation modeling , author=. 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) , pages=. 2021 , organization=

2021
[28]

, author=

Context in multi-lingual tone and pitch accent recognition. , author=. Interspeech , pages=. 2005 , organization=

2005
[29]

Journal of machine learning research , volume=

Visualizing data using t-SNE , author=. Journal of machine learning research , volume=
[30]

2022 , howpublished=

Broadcasting Content Conversational Data , author=. 2022 , howpublished=

2022
[32]

Speech prosody 7 , pages=

Male and female speech: a study of mean f0, f0 range, phonation type and speech rate in Parisian French and American English speakers , author=. Speech prosody 7 , pages=
[33]

IEEE transactions on Signal Processing , volume=

Bidirectional recurrent neural networks , author=. IEEE transactions on Signal Processing , volume=. 1997 , publisher=

1997
[34]

2003 , publisher=

Perceptual Study of Intonation , author=. 2003 , publisher=

2003
[35]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=
[36]

2017 International joint conference on neural networks (IJCNN) , pages=

Time series classification from scratch with deep neural networks: A strong baseline , author=. 2017 International joint conference on neural networks (IJCNN) , pages=. 2017 , organization=

2017
[38]

Speech communication , volume=

Speech melody as articulatorily implemented communicative functions , author=. Speech communication , volume=. 2005 , publisher=

2005
[39]

2018 IEEE spoken language technology workshop (SLT) , pages=

Multimodal speech emotion recognition using audio and text , author=. 2018 IEEE spoken language technology workshop (SLT) , pages=. 2018 , organization=

2018
[40]

Proceedings of the AAAI conference on artificial intelligence , volume=

Are transformers effective for time series forecasting? , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[42]

Jonathan Barnes, Alejna Brugos, Nanette Veilleux, and Stefanie Shattuck-Hufnagel. 2021. On (and off) ramps in intonational phonology: Rises, falls, and the tonal center of gravity. Journal of Phonetics, 85:101020

2021
[43]

Jonathan Barnes, Nanette Veilleux, Alejna Brugos, and Stefanie Shattuck-Hufnagel. 2012. Tonal center of gravity: A global approach to tonal implementation in a level-based intonational phonology. Laboratory Phonology, 3(2):337--383

2012
[44]

Mary E Beckman and Julia Hirschberg. 1994. The tobi annotation conventions. Ohio State University

1994
[45]

Mary E Beckman and Janet B Pierrehumbert. 1986. Intonational structure in japanese and english. Phonology, 3:255--309

1986
[46]

Dwight L Bolinger. 1951. Intonation: levels versus configurations. Word, 7(3):199--210

1951
[47]

Leo Breiman. 2001. Random forests. Machine learning, 45(1):5--32

2001
[48]

Yue Chen, Yingming Gao, and Yi Xu. 2022. Computational modelling of tone perception based on direct processing of f 0 contours. Brain Sciences, 12(3):337

2022
[49]

Jennifer Cole and Stefanie Shattuck-Hufnagel. 2016. New methods for prosodic transcription: Capturing variability as a source of information. Laboratory Phonology, 7(1)

2016
[50]

David R Cox. 1958. The regression analysis of binary sequences. Journal of the Royal Statistical Society Series B: Statistical Methodology, 20(2):215--232

1958
[51]

Angus Dempster, Daniel F Schmidt, and Geoffrey I Webb. 2021. Minirocket: A very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages 248--257

2021
[52]

Johan't Hart, JT Hart, R Collier, A Cohen, Rene Collier, and Antonie Cohen. 2003. Perceptual Study of Intonation. Cambridge

2003
[53]

Richard Hatcher, Hyunjung Joo, Sahyang Kim, and Taehong Cho. 2024. Focus-induced tonal distribution in seoul korean as an edge-prominence language. Journal of Phonetics, 107:101353

2024
[54]

Caroline G Henton. 1989. Fact and fiction in the description of female and male pitch. Language and communication, 9(4):299--311

1989
[55]

Sepp Hochreiter and J \"u rgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735--1780

1997
[56]

Khalil Iskarous. 2017. The relation between the continuous and the discrete: A note on the first principles of speech dynamics. Journal of Phonetics, 64:8--20

2017
[57]

Khalil Iskarous, J Cole, and J Steffman. 2023. American english pitch accent dynamics: A minimal dynamical model. In Proceedings of the International Congress of Phonetic Sciences. Guarant International

2023
[58]

Khalil Iskarous and Jennifer Cole. 2026. A quantal dynamical theory of f0 contours: Bridging the phonetics and phonology of intonation in: Developments in the modeling of speech prosody

2026
[59]

Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and Fran c ois Petitjean. 2020. Inceptiontime: Finding alexnet for time series classification. Data mining and knowledge discovery, 34(6):1936--1962

2020
[60]

Hyunjung Joo and Mariapaola D’Imperio. 2025. The perception of lexical pitch accent in south kyungsang korean: The relevance of accent shape. Language and Speech, page 00238309251368294

2025
[61]

Sun-Ah Jun. 1998. The accentual phrase in the korean prosodic hierarchy. Phonology, 15(2):189--226

1998
[62]

Sun-Ah Jun. 2000. K-tobi (korean tobi) labelling conventions. UCLA working papers in phonetics, 99:149--173

2000
[63]

Sun-Ah Jun. 2003. The effect of phrase length and speech rate on prosodic phrasing. In proceedings of the XVth international congress of phonetic sciences, pages 483--486

2003
[64]

Sun-Ah Jun. 2005. Korean intonational phonology and prosodic transcription. Prosodic typology: The phonology of intonation and phrasing, 1:201

2005
[65]

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30

2017
[66]

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in neural information processing systems, 33:18661--18673

2020
[67]

Sahyang Kim. 2008. Intonational pattern frequency of seoul korean and its implication to word segmentation. Speech Sciences, 15(2):21--32

2008
[68]

D Robert Ladd. 2008. Intonational phonology. Cambridge University Press

2008
[69]

Jooyoung Lee, Kyungwha Kim, and Minhwa Chung. 2021. Korean dialect identification based on intonation modeling. In 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pages 168--173. IEEE

2021
[70]

Gina-Anne Levow. 2005. Context in multi-lingual tone and pitch accent recognition. In Interspeech, pages 1809--1812. Lisbon

2005
[71]

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579--2605

2008
[72]

National Information Society Agency . 2022. Broadcasting content conversational data. https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71557

2022
[73]

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

work page internal anchor Pith review Pith/arXiv arXiv 2018
[74]

Erwan P \'e piot. 2014. Male and female speech: a study of mean f0, f0 range, phonation type and speech rate in parisian french and american english speakers. In Speech prosody 7, pages 305--309

2014
[75]

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673--2681

1997
[76]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

2017
[77]

Zhiguang Wang, Weizhong Yan, and Tim Oates. 2017. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 International joint conference on neural networks (IJCNN), pages 1578--1585. IEEE

2017
[78]

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2022. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186

work page internal anchor Pith review arXiv 2022
[79]

Yi Xu. 2005. Speech melody as articulatorily implemented communicative functions. Speech communication, 46(3-4):220--251

2005
[80]

Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung. 2018. Multimodal speech emotion recognition using audio and text. In 2018 IEEE spoken language technology workshop (SLT), pages 112--118. IEEE

2018
[81]

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. 2023. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121--11128

2023
[82]

Xiaochen Zheng, Xingyu Chen, Manuel Sch \"u rch, Amina Mollaysa, Ahmed Allam, and Michael Krauthammer. 2023. Simts: Rethinking contrastive representation learning for time series forecasting. arXiv preprint arXiv:2303.18205

work page arXiv 2023