A CNN-based tool for automatic tongue contour tracking in ultrasound images
Pith reviewed 2026-05-24 17:05 UTC · model grok-4.3
The pith
Two convolutional neural networks can automatically track tongue contours in ultrasound images with accuracy that depends more on loss function and data augmentation than on network architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fully automatic tongue contour tracking is feasible with standard CNN segmentation architectures; U-Net and DenseU-Net achieve similar accuracy, DenseU-Net generalizes more reliably across datasets, U-Net runs faster, and loss function plus data augmentation exert larger effects on results than architecture choice.
What carries the argument
U-Net and DenseU-Net architectures applied to semantic segmentation of ultrasound tongue images.
If this is right
- Automatic contour extraction removes the need for time-consuming manual tracing in ultrasound-based speech studies.
- Loss function selection and data augmentation produce larger gains in tracking performance than switching between U-Net and DenseU-Net.
- DenseU-Net offers better performance when the test data differ from the training distribution.
- U-Net enables faster processing when speed is prioritized over maximum generalization.
- The released open-source tool allows immediate use for automated annotation of new ultrasound recordings.
Where Pith is reading between the lines
- The same segmentation approach could be retrained on other articulatory imaging modalities such as real-time MRI.
- Real-time deployment of the faster U-Net variant might support live feedback during speech therapy sessions.
- Collecting larger and more varied ultrasound datasets would likely reduce the observed differences in generalization between the two models.
Load-bearing premise
The ultrasound datasets used for training and testing represent the range of speakers, recording equipment, and speaking styles encountered in typical speech research.
What would settle it
Running the trained models on ultrasound images from a new speaker population or different scanner and obtaining substantially lower contour accuracy than reported on the paper's test sets.
Original abstract
For speech research, ultrasound tongue imaging provides a non-invasive means for visualizing tongue position and movement during articulation. Extracting tongue contours from ultrasound images is a basic step in analyzing ultrasound data, but this task often requires non-trivial manual annotation. This study presents an open-source tool for fully automatic tracking of tongue contours in ultrasound frames using neural-network-based methods. We have implemented and systematically compared two convolutional neural networks, U-Net and DenseU-Net, under different conditions. Though both models can perform automatic contour tracking with comparable accuracy, the DenseU-Net architecture seems more generalizable across test datasets while U-Net has faster extraction speed. Our comparison also shows that the choice of loss function and data augmentation has a greater effect on tracking performance in this task. This publicly available segmentation tool shows considerable promise for the automated tongue contour annotation of ultrasound images in speech research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an open-source CNN-based tool for fully automatic tongue contour tracking in ultrasound images for speech research. It implements and compares U-Net and DenseU-Net architectures under varying loss functions and data augmentations, claiming that both models achieve comparable accuracy on held-out images, that DenseU-Net appears more generalizable across test datasets, that U-Net offers faster extraction speed, and that loss function and augmentation choices have a greater effect on performance than architecture.
Significance. If the empirical results hold with proper quantification, the work supplies a practical, publicly available segmentation tool that could substantially reduce manual annotation labor in ultrasound-based speech research. The systematic ablation of loss functions and augmentations also provides domain-specific guidance for similar medical image segmentation tasks.
major comments (2)
- [Abstract] Abstract: performance conclusions (comparable accuracy, greater generalizability of DenseU-Net, greater effect of loss/augmentation) are stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents evaluation of the central empirical claims.
- [Results / Dataset description] The claim that Dense U-Net 'seems more generalizable across test datasets' rests on observed performance differences, yet the manuscript provides no details on dataset composition (speaker counts, probe types, accents, speaking styles) or the criteria used for train/test splits. Without this characterization, any advantage could reflect dataset idiosyncrasies rather than architectural robustness, directly undermining the comparative conclusion.
minor comments (1)
- [Abstract] The abstract should be revised to include at least summary quantitative results (e.g., mean Dice scores or pixel errors with standard deviations) so that readers can immediately gauge the strength of the reported findings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.
- Referee: [Abstract] Abstract: performance conclusions (comparable accuracy, greater generalizability of DenseU-Net, greater effect of loss/augmentation) are stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents evaluation of the central empirical claims.
  Authors: We agree that the abstract would benefit from quantitative support. In the revision we will include specific metrics such as mean Dice scores with standard deviations for the compared models and conditions, training and test set sizes, and any statistical tests performed, while keeping the abstract concise.
  Revision: yes
- Referee: [Results / Dataset description] The claim that Dense U-Net 'seems more generalizable across test datasets' rests on observed performance differences, yet the manuscript provides no details on dataset composition (speaker counts, probe types, accents, speaking styles) or the criteria used for train/test splits. Without this characterization, any advantage could reflect dataset idiosyncrasies rather than architectural robustness, directly undermining the comparative conclusion.
  Authors: We acknowledge that expanded dataset characterization is needed to support the generalization claim. The revised manuscript will add a dedicated subsection detailing speaker counts, probe types, accents, speaking styles, and the train/test split criteria (including whether splits were speaker-independent). This will clarify that cross-dataset testing used entirely held-out corpora collected under different conditions.
  Revision: yes
Circularity Check
No circularity: purely empirical comparison of trained CNNs on held-out images
The paper presents no derivation chain or first-principles predictions. All claims rest on training U-Net and DenseU-Net variants, evaluating the Mean Sum of Distance (MSD) against human annotations on held-out ultrasound frames, and reporting empirical differences in accuracy, speed, and cross-dataset performance. No fitted parameter is renamed as a prediction, no self-citation supplies a uniqueness theorem or ansatz, and no result reduces to its inputs by construction. The generalizability observation is an empirical finding subject to dataset caveats, not a circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Network weights (U-Net and DenseU-Net)
axioms (2)
- domain assumption Hand-annotated tongue contours are sufficiently accurate and consistent to serve as ground truth for supervised training.
- domain assumption Standard data-augmentation operations (flips, rotations) preserve the semantic correctness of tongue contours in ultrasound.
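The second assumption only holds if the image and its annotation mask receive exactly the same geometric transform. A minimal sketch of that pairing, assuming NumPy arrays for both; the helper name `augment_pair` is hypothetical, and rotations are restricted to multiples of 90 degrees here purely to avoid interpolation (the paper's actual augmentation settings are not given in this excerpt):

```python
import numpy as np

def augment_pair(image, mask, flip=False, k_rot90=0):
    """Apply the same flip/rotation to an image and its mask (hypothetical helper)."""
    if image.shape != mask.shape:
        raise ValueError("image and mask must have the same shape")
    if flip:
        # Horizontal flip, applied identically to both arrays.
        image = np.fliplr(image)
        mask = np.fliplr(mask)
    # Rotate both by k_rot90 * 90 degrees counter-clockwise.
    image = np.rot90(image, k=k_rot90)
    mask = np.rot90(mask, k=k_rot90)
    return image, mask

# Toy example: the mask marks every pixel whose value exceeds 8.
image = np.arange(12, dtype=float).reshape(3, 4)
mask = (image > 8).astype(np.uint8)
aug_image, aug_mask = augment_pair(image, mask, flip=True, k_rot90=1)
```

Because both arrays go through the same operations, the augmented mask still marks the same pixels of the augmented image.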
Reference graph
Works this paper leans on
-
[1]
Introduction Ultrasound tongue imaging provides a non-invasive means for assessing tongue position and movement during speech production. However, the presence of speckle noise and irrelevant high contrast edges often degrades the usability of ultrasound images by obscuring the tongue surface [1]. Consequently, extracting tongue contours from ultrasound...
-
[2]
Neural network based methods are promising for fully automatic segmentation
or particle filtering [4] can gear the algorithm towards more automatic segmentation. Neural network based methods are promising for fully automatic segmentation. Prior works utilized deep neural networks [12, 13] and Boltzmann machines [6]; recently fully convolutional neural networks such as variants of the U-Net [14] have been adapted to segment ton...
-
[3]
A CNN-based tool for automatic tongue contour tracking in ultrasound images
Method In our approach, we first train a convolutional neural network to segment the brightest edge corresponding to the tongue tissue-air interface from a noisy ultrasound image, and then derive a tongue surface curve through post-processing of the segmented image. The source code, pre-trained models and some of the test data are available at https://git...
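The excerpt above does not say how the curve is derived from the segmented image. As one possible illustration only, and not the authors' procedure, the sketch below thresholds a predicted probability map and takes the mean row of the foreground pixels in each column; the function name `mask_to_curve` and the 0.5 threshold are assumptions:

```python
import numpy as np

def mask_to_curve(prob_map, threshold=0.5):
    """Return one (column, row) point per image column that contains foreground.

    Illustrative stand-in for the post-processing step; not the paper's method.
    """
    binary = prob_map >= threshold
    rows = np.arange(prob_map.shape[0])[:, None]   # row index of every pixel
    counts = binary.sum(axis=0)                    # foreground pixels per column
    cols = np.nonzero(counts)[0]                   # columns with any foreground
    mean_rows = (binary * rows).sum(axis=0)[cols] / counts[cols]
    return np.column_stack([cols, mean_rows])

# Toy probability map: column 2 has no pixel above the threshold.
prob = np.zeros((5, 4))
prob[1, 0] = 0.9
prob[2, 1] = 0.8
prob[3, 1] = 0.7
prob[2, 3] = 0.6
curve = mask_to_curve(prob)
```

Columns without any foreground are simply skipped, so gaps in the segmentation show up as gaps in the curve rather than as invented points.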
-
[4]
only penalizes the mismatch between the predicted white pixels (representing the tongue region) and the white edge in the mask, while excluding all background pixels and noise during the optimization process. Thus, the learning task can be formulated as minimizing the following loss function: $L_{DSC} = -\dfrac{2\sum_{i=1}^{N} s_i r_i + \epsilon}{\sum_{i=1}^{N} s_i + \sum_{i=1}^{N} r_i + \epsilon}$ (1) where $s_i$ is ...
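Equation (1) can be checked numerically with a short NumPy sketch. The excerpt truncates the symbol definitions, so treating s as the predicted pixel values and r as the reference mask is an assumption, as is the smoothing value `eps=1.0`:

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    """Negative soft Dice coefficient over all pixels, following Eq. (1).

    pred   -- predicted pixel values s_i in [0, 1] (assumed meaning of s)
    target -- reference mask values r_i in {0, 1} (assumed meaning of r)
    eps    -- smoothing constant; the value 1.0 is an assumption
    """
    s = np.asarray(pred, dtype=float).ravel()
    r = np.asarray(target, dtype=float).ravel()
    overlap = np.sum(s * r)
    return -(2.0 * overlap + eps) / (np.sum(s) + np.sum(r) + eps)

# Perfect overlap: -(2*2 + 1) / (2 + 2 + 1) = -1.0
perfect = dice_loss([0, 1, 1, 0], [0, 1, 1, 0])
# No overlap: -(0 + 1) / (2 + 2 + 1) = -0.2
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])
```

Background pixels where both s and r are zero contribute nothing to either sum, which is why the loss is driven by the thin white edge rather than by the much larger background.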
-
[5]
Data Midsagittal ultrasound data was collected as MPEG video at 60 frames per second, using a Zonare Z.One Ultrasound Unit, operating at 4 MHz and 70 Hz scan rate with a P4-1C transducer. Tongue shape curves were annotated with Mark Tiede’s GetContours package for MATLAB [23], generating a 100-point spline for each curve from human-specified anchor poi...
-
[6]
Experiments The training data were divided into multiple mini-batches, each with a size of 32 images. We used the Adam optimizer [25] with a learning rate of 0.0001, and the model was trained for 30 epochs. The training process took approximately 2 hours using an NVIDIA Tesla K40 GPU in the University of Michigan’s FLUX computing cluster. The model that a...
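The optimization settings quoted above (mini-batches of 32, Adam with learning rate 0.0001) can be made concrete with a framework-free sketch. Only the batch size and learning rate come from the text; the beta and epsilon values are the defaults from Kingma and Ba and are assumed here, and both function names are hypothetical:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

def iterate_minibatches(n_samples, batch_size=32):
    """Yield index slices that cover the data in batches of batch_size."""
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.5, -0.25])
w_new, m, v = adam_step(w, grad, m, v, t=1)
batches = list(iterate_minibatches(100))
```

On the first step the bias-corrected update is close to the learning rate times the sign of the gradient, so each weight moves by roughly 0.0001 regardless of gradient magnitude.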
-
[7]
Evaluation The metric for evaluation of error from human annotation is the Mean Sum of Distance (MSD), which permits the comparison of two curves without requiring point-wise alignment [2]. The MSD between two sequences U and V can be computed as the average distance between a given point and its nearest point in the other sequence: $D(U, V) = \frac{1}{2n} \big( \sum_{i=1}^{n}$ ...
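The formula is truncated in this excerpt, so the sketch below assumes the usual symmetric nearest-neighbour form that matches the verbal description: sum each point's distance to its nearest point on the other curve, in both directions, and divide by the total number of points (2n when both curves have n points). The function name is hypothetical:

```python
import numpy as np

def mean_sum_of_distances(u, v):
    """Symmetric mean nearest-point distance between two curves.

    u, v -- arrays of shape (n, 2) and (k, 2) holding (x, y) points.
    Assumed form of the MSD; the excerpt's formula is cut off.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Pairwise Euclidean distances, shape (n, k).
    dists = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=2)
    u_to_v = dists.min(axis=1).sum()   # each u point to its nearest v point
    v_to_u = dists.min(axis=0).sum()   # each v point to its nearest u point
    return (u_to_v + v_to_u) / (len(u) + len(v))

curve_a = [[0, 0], [1, 0], [2, 0]]
curve_b = [[0, 1], [1, 1], [2, 1]]   # same curve shifted by one pixel
msd_shifted = mean_sum_of_distances(curve_a, curve_b)
msd_same = mean_sum_of_distances(curve_a, curve_a)
```

Since no point-wise correspondence is needed, the two curves may be sampled at different positions or with different numbers of points.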
-
[8]
because of deprecated dependencies.
Table 2: Mean (standard deviation) of Mean Sum of Distance (in pixels; 1 pixel ≈ 0.25 mm) for the NS test set, as compared to three human annotators A, B and C.
          A            B            C
A         0 (0)        2.33 (1.57)  2.83 (1.85)
B         2.33 (1.57)  0 (0)        3.21 (2.21)
C         2.83 (1.85)  3.21 (2.21)  0 (0)
UNet-WC   6.65 (2.92)  6.44 (2.74)  7.25 (3.24)
UNet-Dic...
-
[9]
Error analysis As the CNN is trained to identify the white edges directly corresponding to the tongue surface, additional or missing white edges due to bad image quality or speaker physiology can lead to failures in identifying parts of the tongue surface. In the absence of prior knowledge of plausible tongue shapes, the model will sometimes generate ...
-
[10]
The implemented models are tested extensively on multiple test datasets
Conclusions In this study, we present a new open source tool for fully automated tongue contour extraction based on U-Net and Dense U-Net models. The implemented models are tested extensively on multiple test datasets. Though both models can perform automatic contour tracking with comparable accuracy, Dense U-Net architecture seems more generalizable...
-
[11]
Acknowledgements We are grateful to Patrice Speeter Beddor, Andries Coetzee, Thomas Hueber and the UltraSuite research group for making available their ultrasound data. The data from Beddor and Coetzee were collected for a different project supported by NSF grant BCS-1348150
-
[12]
A guide to analysing tongue motion from ultrasound images,
M. Stone, “A guide to analysing tongue motion from ultrasound images,” Clinical Linguistics & Phonetics, vol. 19, no. 6-7, pp. 455–501, Jan. 2005
-
[13]
Automatic contour tracking in ultrasound images,
M. Li, C. Kambhamettu, and M. Stone, “Automatic contour tracking in ultrasound images,” Clinical Linguistics & Phonetics, vol. 19, no. 6-7, pp. 545–554, Jan. 2005
-
[14]
Robust contour tracking in ultrasound tongue image sequences,
K. Xu, Y. Yang, M. Stone, A. Jaumard-Hakoun, C. Leboullenger, G. Dreyfus, P. Roussel, and B. Denby, “Robust contour tracking in ultrasound tongue image sequences,” Clinical Linguistics & Phonetics, vol. 30, no. 3-5, pp. 313–327, May 2016
-
[15]
C. Laporte and L. Ménard, “Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech,” Medical Image Analysis, vol. 44, pp. 98–114, 2018
-
[16]
L. Tang and G. Hamarneh, “Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal regularization,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, Jun. 2010, pp. 154–161
-
[17]
Tongue contour extraction from ultrasound images based on deep neural network
A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, and B. Denby, “Tongue contour extraction from ultrasound images based on deep neural network,” arXiv:1605.05912 [cs], May 2016
-
[18]
Dynamics of tongue gestures extracted automatically from ultrasound,
J. Berry and I. Fasel, “Dynamics of tongue gestures extracted automatically from ultrasound,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 557–560
-
[19]
D. Fabre, T. Hueber, F. Bocquelet, and P. Badin, “Tongue Tracking in Ultrasound Images using EigenTongue Decomposition and Artificial Neural Networks,” in 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, Sep. 2015
-
[20]
Automatic tongue contour segmentation using deep learning,
S. Wen, “Automatic tongue contour segmentation using deep learning,” Master’s thesis, Université d’Ottawa/University of Ottawa, 2018
-
[21]
Automatic tongue contour extraction in ultrasound images with convolutional neural networks,
J. Zhu, W. Styler, and I. C. Calloway, “Automatic tongue contour extraction in ultrasound images with convolutional neural networks,” The Journal of the Acoustical Society of America, vol. 143, no. 3, pp. 1966–1966, 2018
-
[22]
Bownet: Dilated convolution neural network for ultrasound tongue contour extraction,
M. H. Mozaffari and W.-S. Lee, “Bownet: Dilated convolution neural network for ultrasound tongue contour extraction,” arXiv preprint arXiv:1906.04232, 2019
-
[23]
Automatic classification of tongue gestures in ultrasound images,
J. Berry, D. Archangeli, and I. Fasel, “Automatic classification of tongue gestures in ultrasound images,” in Proceedings of 12th Conference on Laboratory Phonology, 2010
-
[24]
Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract,
D. Fabre, T. Hueber, L. Girin, X. Alameda-Pineda, and P. Badin, “Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract,” Speech Communication, vol. 93, pp. 63–75, 2017
-
[25]
U-net: Convolutional networks for biomedical image segmentation,
O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241
-
[26]
Transfer learning for ultrasound tongue contour extraction with different domains,
M. H. Mozaffari and W.-S. Lee, “Transfer learning for ultrasound tongue contour extraction with different domains,” arXiv preprint arXiv:1906.04301, 2019
-
[27]
K. Xu, T. Gabor Csapo, P. Roussel, and B. Denby, “A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. EL154–EL160, May 2016
-
[28]
Error analysis of extracted tongue contours from 2d ultrasound images,
T. G. Csapo and S. M. Lulich, “Error analysis of extracted tongue contours from 2d ultrasound images,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015
-
[29]
Densely connected convolutional networks
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, vol. 1, no. 2, 2017, p. 3
-
[30]
H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from ct volumes,
X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from ct volumes,” IEEE Transactions on Medical Imaging, 2018
-
[31]
Fully Dense UNet for 2D Sparse Photoacoustic Tomography Artifact Removal
S. Guan, A. Khan, S. Sikdar, and P. V. Chitnis, “Fully dense unet for 2d sparse photoacoustic tomography artifact removal,” arXiv preprint arXiv:1808.10848, 2018
-
[32]
V-net: Fully convolutional neural networks for volumetric medical image segmentation,
F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571
-
[33]
Holistically-nested edge detection,
S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403
-
[34]
Getcontours: An interactive tongue surface extraction tool,
M. Tiede and D. Whalen, “Getcontours: An interactive tongue surface extraction tool,” Proceedings of Ultrafest VII, 2015
-
[35]
Ultrasuite: A repository of ultrasound and acoustic data from child speech therapy sessions,
A. Eshky, M. S. Ribeiro, J. Cleland, K. Richmond, Z. Roxburgh, J. M. Scobbie, and A. A. Wrench, “Ultrasuite: A repository of ultrasound and acoustic data from child speech therapy sessions,” in INTERSPEECH 2018: Proceedings of the 19th Annual Conference of the International Speech Communication Association (ISCA), 2-6 September 2018, Hyderabad, India. ...
-
[36]
Adam: A Method for Stochastic Optimization
D. P. Kingma and J. Ba, “Adam: A method for stochastic opti- mization,” arXiv preprint arXiv:1412.6980, 2014
-
[37]
A fast parallel algorithm for thinning digital patterns,
T. Zhang and C. Y. Suen, “A fast parallel algorithm for thinning digital patterns,” Communications of the ACM, vol. 27, no. 3, pp. 236–239, 1984
-
[38]
Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves,
L. Tang, T. Bressmann, and G. Hamarneh, “Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves,” Medical Image Analysis, vol. 16, no. 8, pp. 1503–1520, Dec. 2012
-
[39]
Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech,
I. Fasel and J. Berry, “Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech,” in 2010 20th International Conference on Pattern Recognition, Aug. 2010, pp. 1493–1496