
arXiv: 1907.10210 · v1 · pith:RJT7AXDUnew · submitted 2019-07-24 · 📡 eess.IV · cs.CL · cs.CV

A CNN-based tool for automatic tongue contour tracking in ultrasound images

Pith reviewed 2026-05-24 17:05 UTC · model grok-4.3

classification 📡 eess.IV · cs.CL · cs.CV
keywords tongue contour tracking · ultrasound imaging · convolutional neural networks · U-Net · DenseU-Net · speech research · image segmentation · automatic annotation

The pith

Two convolutional neural networks can automatically track tongue contours in ultrasound images with accuracy that depends more on loss function and data augmentation than on network architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper implements and compares U-Net and DenseU-Net models to extract tongue outlines from ultrasound frames used in speech research, replacing manual annotation. Both networks reach comparable accuracy on the task, yet DenseU-Net generalizes better to unseen test sets while U-Net processes frames faster. The authors further find that the choice of loss function and the use of data augmentation affect performance more than the specific network design. An open-source tool is released so researchers can apply the method directly to new ultrasound recordings.

Core claim

Fully automatic tongue contour tracking is feasible with standard CNN segmentation architectures; U-Net and DenseU-Net achieve similar accuracy, DenseU-Net generalizes more reliably across datasets, U-Net runs faster, and loss function plus data augmentation exert larger effects on results than architecture choice.

What carries the argument

U-Net and DenseU-Net architectures applied to semantic segmentation of ultrasound tongue images.

If this is right

  • Automatic contour extraction removes the need for time-consuming manual tracing in ultrasound-based speech studies.
  • Loss function selection and data augmentation produce larger gains in tracking performance than switching between U-Net and DenseU-Net.
  • DenseU-Net offers better performance when the test data differ from the training distribution.
  • U-Net enables faster processing when speed is prioritized over maximum generalization.
  • The released open-source tool allows immediate use for automated annotation of new ultrasound recordings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same segmentation approach could be retrained on other articulatory imaging modalities such as MRI.
  • Real-time deployment of the faster U-Net variant might support live feedback during speech therapy sessions.
  • Collecting larger and more varied ultrasound datasets would likely reduce the observed differences in generalization between the two models.

Load-bearing premise

The ultrasound datasets used for training and testing represent the range of speakers, recording equipment, and speaking styles encountered in typical speech research.

What would settle it

Running the trained models on ultrasound images from a new speaker population or different scanner and obtaining substantially lower contour accuracy than reported on the paper's test sets.

Figures

Figures reproduced from arXiv: 1907.10210 by Ian Calloway, Jian Zhu, Will Styler.

Figure 1: The U-Net architecture. Each rectangle represents the output feature maps and arrows represent different operations. The vertically displayed number to the left of the rectangles indicates the image size at that block (e.g., the vertically displayed 128 stands for an image size of 128 × 128). The number at the top gives the number of feature maps, or image channels.
2.2. DenseNet and Dense U-Net. The Dense Con…
Figure 3: A sample ultrasound frame and its corresponding mask.
functions as a regularizer to control the overconfidence given by DSC, forcing the model to generate a more gradient probabilistic heatmap. $L_{\mathrm{Compound}} = L_{\mathrm{DSC}} + \lambda L_{C}$ (4) By adjusting λ, we can tune the predicted heatmap. We set λ = 5 in the current task based on pilot experiments with validation data. In order to assess the effect of these loss functi…
Figure 5: Sample predictions given by D U-Net-Compound. Upper panels are from the NS test set; lower panels are from the UltraSpeech test set. Speckle noise in ultrasound images can sometimes lead to failures in identifying parts of the tongue surface.
will likely suffer from implausible curvatures as interpolation in post-processing attempts to connect these regions. There are some potential solutions to these pr…
Original abstract

For speech research, ultrasound tongue imaging provides a non-invasive means for visualizing tongue position and movement during articulation. Extracting tongue contours from ultrasound images is a basic step in analyzing ultrasound data, but this task often requires non-trivial manual annotation. This study presents an open-source tool for fully automatic tracking of tongue contours in ultrasound frames using neural-network-based methods. We have implemented and systematically compared two convolutional neural networks, U-Net and Dense U-Net, under different conditions. Though both models can perform automatic contour tracking with comparable accuracy, the Dense U-Net architecture seems more generalizable across test datasets while U-Net has faster extraction speed. Our comparison also shows that the choice of loss function and data augmentation have a greater effect on tracking performance in this task. This publicly available segmentation tool shows considerable promise for the automated tongue contour annotation of ultrasound images in speech research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an open-source CNN-based tool for fully automatic tongue contour tracking in ultrasound images for speech research. It implements and compares U-Net and DenseU-Net architectures under varying loss functions and data augmentations, claiming that both models achieve comparable accuracy on held-out images, that DenseU-Net appears more generalizable across test datasets, that U-Net offers faster extraction speed, and that loss function and augmentation choices have a greater effect on performance than architecture.

Significance. If the empirical results hold with proper quantification, the work supplies a practical, publicly available segmentation tool that could substantially reduce manual annotation labor in ultrasound-based speech research. The systematic ablation of loss functions and augmentations also provides domain-specific guidance for similar medical image segmentation tasks.

major comments (2)
  1. [Abstract] Abstract: performance conclusions (comparable accuracy, greater generalizability of DenseU-Net, greater effect of loss/augmentation) are stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents evaluation of the central empirical claims.
  2. [Results / Dataset description] The claim that Dense U-Net 'seems more generalizable across test datasets' rests on observed performance differences, yet the manuscript provides no details on dataset composition (speaker counts, probe types, accents, speaking styles) or the criteria used for train/test splits. Without this characterization, any advantage could reflect dataset idiosyncrasies rather than architectural robustness, directly undermining the comparative conclusion.
minor comments (1)
  1. [Abstract] The abstract should be revised to include at least summary quantitative results (e.g., mean Dice scores or pixel errors with standard deviations) so that readers can immediately gauge the strength of the reported findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: performance conclusions (comparable accuracy, greater generalizability of DenseU-Net, greater effect of loss/augmentation) are stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents evaluation of the central empirical claims.

    Authors: We agree that the abstract would benefit from quantitative support. In the revision we will include specific metrics such as mean Dice scores with standard deviations for the compared models and conditions, training and test set sizes, and any statistical tests performed, while keeping the abstract concise. revision: yes

  2. Referee: [Results / Dataset description] The claim that Dense U-Net 'seems more generalizable across test datasets' rests on observed performance differences, yet the manuscript provides no details on dataset composition (speaker counts, probe types, accents, speaking styles) or the criteria used for train/test splits. Without this characterization, any advantage could reflect dataset idiosyncrasies rather than architectural robustness, directly undermining the comparative conclusion.

    Authors: We acknowledge that expanded dataset characterization is needed to support the generalization claim. The revised manuscript will add a dedicated subsection detailing speaker counts, probe types, accents, speaking styles, and the train/test split criteria (including whether splits were speaker-independent). This will clarify that cross-dataset testing used entirely held-out corpora collected under different conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of trained CNNs on held-out images

full rationale

The paper presents no derivation chain or first-principles predictions. All claims rest on training U-Net and DenseU-Net variants, evaluating the Mean Sum of Distance (MSD) against human annotations on held-out ultrasound frames, and reporting empirical differences in accuracy, speed, and cross-dataset performance. No fitted parameter is renamed as a prediction, no self-citation supplies a uniqueness theorem or ansatz, and no result reduces to its inputs by construction. The generalizability observation is an empirical finding subject to dataset caveats, not a circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard supervised segmentation assumptions: that hand-drawn contours constitute reliable ground truth, that the chosen loss functions and augmentation strategies are appropriate for ultrasound speckle noise, and that the network weights learned on the training splits generalize to new speakers and scanners.

free parameters (1)
  • Network weights (U-Net and DenseU-Net)
    Millions of parameters fitted by gradient descent on manually annotated ultrasound frames.
axioms (2)
  • domain assumption Hand-annotated tongue contours are sufficiently accurate and consistent to serve as ground truth for supervised training.
    Invoked implicitly when the networks are trained to match the provided contours.
  • domain assumption Standard data-augmentation operations (flips, rotations) preserve the semantic correctness of tongue contours in ultrasound.
    Used to improve generalization without explicit validation that the augmented images remain anatomically valid.
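The second axiom only holds if every geometric transform is applied identically to the frame and to its mask. The sketch below illustrates that pairing with a horizontal flip on small nested lists. It is a hypothetical example: the ledger names flips and rotations, but which operations the authors actually used is not stated in the excerpts, and the function names are invented for illustration.

```python
def hflip(grid):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [list(reversed(row)) for row in grid]


def augment_pair(image, mask, flip):
    """Apply the same flip decision to an image and its mask.

    Hypothetical helper: flipping only one of the two would leave the
    annotated contour pointing at the wrong pixels.
    """
    if flip:
        return hflip(image), hflip(mask)
    # Return copies so callers can modify the result safely.
    return [list(r) for r in image], [list(r) for r in mask]


image = [[1, 2, 3],
         [4, 5, 6]]
mask = [[0, 0, 1],
        [0, 1, 0]]
flipped_image, flipped_mask = augment_pair(image, mask, flip=True)
```

After the flip, the pixels marked by the mask are still the ones with values 3 and 5, which is the sense in which the annotation stays aligned with the image.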



Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

  1. [1]

    However, the presence of speckle noise and irrelevant high contrast edges often degrades the usability of ultrasound images by obscuring the tongue surface [1]

    Introduction. Ultrasound tongue imaging provides a non-invasive means for assessing tongue position and movement during speech production. However, the presence of speckle noise and irrelevant high contrast edges often degrades the usability of ultrasound images by obscuring the tongue surface [1]. Consequently, extracting tongue contours from ultrasound...

  2. [2]

    Neural network based methods are promising for fully automatic segmentation

    or particle filtering [4] can gear the algorithm towards more automatic segmentation. Neural network based methods are promising for fully automatic segmentation. Prior works utilized deep neural networks [12, 13] and Boltzmann machines [6]; recently fully convolutional neural networks such as variants of the U-Net [14] have been adapted to segment ton...

  3. [3]

    A CNN-based tool for automatic tongue contour tracking in ultrasound images

    Method. In our approach, we first train a convolutional neural network to segment the brightest edge corresponding to the tongue tissue-air interface from a noisy ultrasound image, and then derive a tongue surface curve through post-processing of the segmented image. The source code, pre-trained models and some of the test data are available at https://git...

  4. [4]

    only penalizes the mismatch between the predicted white pixels (representing the tongue region) and the white edge in the mask, while excluding all background pixels and noise during the optimization process. Thus, the learning task can be formulated as minimizing the following loss function: $L_{\mathrm{DSC}} = -\dfrac{2\sum_{i=1}^{N} s_i r_i + \epsilon}{\sum_{i=1}^{N} s_i + \sum_{i=1}^{N} r_i + \epsilon}$ (1) where $s_i$ is ...
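As a rough illustration of Eq. (1) and of the compound loss quoted in the Figure 3 excerpt (Eq. 4), the sketch below evaluates both on flat lists of pixel values in plain Python. Several things here are assumptions rather than the authors' code: the function names, the smoothing value eps = 1.0, and in particular the reading of L_C as a mean binary cross-entropy, which the excerpts do not define. Only λ = 5 comes from the text. Which of s_i and r_i denotes the prediction is cut off in the excerpt; the Dice term is symmetric in the two, so the sketch does not depend on it.

```python
import math


def dice_loss(pred, mask, eps=1.0):
    """Negative soft Dice coefficient, following the form of Eq. (1).

    pred: predicted pixel probabilities in [0, 1] (flat list).
    mask: ground-truth pixel labels, 0 or 1 (flat list).
    eps:  smoothing term; its value here is an assumption.
    """
    overlap = sum(s * r for s, r in zip(pred, mask))
    return -(2.0 * overlap + eps) / (sum(pred) + sum(mask) + eps)


def cross_entropy(pred, mask, tiny=1e-7):
    """Mean binary cross-entropy (ASSUMED meaning of L_C)."""
    total = 0.0
    for s, r in zip(pred, mask):
        s = min(max(s, tiny), 1.0 - tiny)  # keep log() finite
        total -= r * math.log(s) + (1 - r) * math.log(1.0 - s)
    return total / len(pred)


def compound_loss(pred, mask, lam=5.0):
    """L_Compound = L_DSC + lambda * L_C, with lambda = 5 as in the text."""
    return dice_loss(pred, mask) + lam * cross_entropy(pred, mask)


mask = [0, 1, 1, 0]
perfect = dice_loss([0.0, 1.0, 1.0, 0.0], mask)
uncertain = compound_loss([0.5, 0.5, 0.5, 0.5], mask)
```

With a perfect prediction the Dice term reaches its minimum of -1. Under the cross-entropy assumption, the second term is what penalizes a prediction of 0.5 everywhere, which overlap alone scores fairly leniently.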

  5. [5]

    Tongue shape curves were annotated with Mark Tiede’s GetContours package for MATLAB [23], generating a 100 point spline for each curve from human-specified anchor points

    Data. Midsagittal ultrasound data was collected as MPEG video at 60 frames per second, using a Zonare Z.One Ultrasound Unit, operating at 4 MHz and 70 Hz scan rate with a P4-1C transducer. Tongue shape curves were annotated with Mark Tiede’s GetContours package for MATLAB [23], generating a 100 point spline for each curve from human-specified anchor poi...

  6. [6]

    We used the Adam optimizer [25] with a learning rate of 0.0001, and the model was trained for 30 epochs

    Experiments. The training data were divided into multiple mini-batches, each with a size of 32 images. We used the Adam optimizer [25] with a learning rate of 0.0001, and the model was trained for 30 epochs. The training process took approximately 2 hours using an NVIDIA Tesla K40 GPU in the University of Michigan’s FLUX computing cluster. The model that a...
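To make the stated settings concrete, the sketch below writes out a single Adam update for one scalar parameter, plus the number of mini-batches per epoch at a batch size of 32. The learning rate (0.0001), batch size and epoch count are taken from the excerpt. The decay rates and epsilon are the defaults from the Adam paper and are assumed here, since the excerpt does not say whether the authors changed them. The toy objective and all function names are illustrative only.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def batches_per_epoch(n_images, batch_size=32):
    """Number of mini-batches needed to cover n_images once."""
    return math.ceil(n_images / batch_size)


def run_toy_training(theta=1.0, steps=30):
    """Minimise f(theta) = theta**2 for a few steps (toy objective)."""
    m = v = 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * theta
        theta, m, v = adam_step(theta, grad, m, v, t)
    return theta


final_theta = run_toy_training()
```

Because Adam normalises the gradient by its running magnitude, the first update moves the parameter by roughly the learning rate regardless of the gradient's scale, which is one reason a small value such as 0.0001 gives slow but stable progress.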

  7. [7]

    the tongue contour

    Evaluation. The metric for evaluation of error from human annotation is the Mean Sum of Distance (MSD), which permits the comparison of two curves without requiring point-wise alignment [2]. The MSD between two sequences U and V can be computed as the average distance between a given point and its nearest point in another sequence: $D(U, V) = \frac{1}{2n} \Big( \sum_{i=1}^{n}$ ...
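The equation above is cut off in the excerpt, so the following is a minimal sketch built from the prose description only: each point is matched to its nearest point on the other curve, and the distances are averaged in both directions. It assumes the usual symmetric form and divides by the total number of points, which equals 2n when both curves have n points. The function name and the sample curves are made up for illustration.

```python
import math


def mean_sum_of_distances(u, v):
    """Symmetric mean nearest-point distance between two 2-D curves.

    u, v: sequences of (x, y) points, not necessarily aligned or of
    equal length. Returns a distance in the same units as the inputs.
    """
    if not u or not v:
        raise ValueError("both curves need at least one point")

    def nearest(point, curve):
        # Distance from one point to the closest point on the other curve.
        return min(math.dist(point, q) for q in curve)

    total = sum(nearest(p, v) for p in u) + sum(nearest(q, u) for q in v)
    return total / (len(u) + len(v))


# Two flat curves one pixel apart (illustrative data, not from the paper).
curve_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
curve_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
msd_pixels = mean_sum_of_distances(curve_a, curve_b)
msd_mm = msd_pixels * 0.25  # using the 1 pixel ≈ 0.25 mm scale quoted for Table 2
```

Since no point-wise alignment is needed, the same function works when a predicted contour and a human contour have different numbers of points.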

  8. [8]

    Table 2: Mean and (Standard Deviation) of Mean Sum of Distance (in pixels, 1 pixel ≈ 0.25 mm) for the NS test set, as compared to three human annotators A, B and C

    because of deprecated dependencies. Table 2: Mean and (Standard Deviation) of Mean Sum of Distance (in pixels, 1 pixel ≈ 0.25 mm) for the NS test set, as compared to three human annotators A, B and C.

                 A              B              C
    A            0 (0)          2.33 (1.57)    2.83 (1.85)
    B            2.33 (1.57)    0 (0)          3.21 (2.21)
    C            2.83 (1.85)    3.21 (2.21)    0 (0)
    UNet-WC      6.65 (2.92)    6.44 (2.74)    7.25 (3.24)
    UNet-Dic...

  9. [9]

    In the absence of prior knowledge of plausible tongue shapes, the model will sometimes generate tracking errors when the white edge becomes blurry or interrupted

    Error analysis. As the CNN is trained to identify the white edges directly corresponding to the tongue surface, additional or missing white edges due to bad image quality or speaker physiology can lead to failures in identifying parts of the tongue surface. In the absence of prior knowledge of plausible tongue shapes, the model will sometimes generate ...

  10. [10]

    The implemented models are tested extensively on multiple test datasets

    Conclusions. In this study, we present a new open source tool for fully automated tongue contour extraction based on U-Net and Dense U-Net models. The implemented models are tested extensively on multiple test datasets. Though both models can perform automatic contour tracking with comparable accuracy, Dense U-Net architecture seems more generalizable...

  11. [11]

    The data from Beddor and Coetzee were collected for a different project supported by NSF grant BCS-1348150

    Acknowledgements. We are grateful to Patrice Speeter Beddor, Andries Coetzee, Thomas Hueber and the UltraSuite research group for making available their ultrasound data. The data from Beddor and Coetzee were collected for a different project supported by NSF grant BCS-1348150

  12. [12]

    A guide to analysing tongue motion from ultrasound images,

    M. Stone, “A guide to analysing tongue motion from ultrasound images,” Clinical Linguistics & Phonetics, vol. 19, no. 6-7, pp. 455–501, Jan. 2005

  13. [13]

    Automatic contour tracking in ultrasound images,

    M. Li, C. Kambhamettu, and M. Stone, “Automatic contour tracking in ultrasound images,” Clinical Linguistics & Phonetics, vol. 19, no. 6-7, pp. 545–554, Jan. 2005

  14. [14]

    Robust contour tracking in ultrasound tongue image sequences,

    K. Xu, Y. Yang, M. Stone, A. Jaumard-Hakoun, C. Leboullenger, G. Dreyfus, P. Roussel, and B. Denby, “Robust contour tracking in ultrasound tongue image sequences,” Clinical Linguistics & Phonetics, vol. 30, no. 3-5, pp. 313–327, May 2016

  15. [15]

    Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech,

    C. Laporte and L. Ménard, “Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech,” Medical Image Analysis, vol. 44, pp. 98–114, 2018

  16. [16]

    Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal regularization,

    L. Tang and G. Hamarneh, “Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal regularization,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, Jun. 2010, pp. 154–161

  17. [17]

    Tongue contour extraction from ultrasound images based on deep neural network

    A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, and B. Denby, “Tongue contour extraction from ultrasound images based on deep neural network,” arXiv:1605.05912 [cs], May 2016

  18. [18]

    Dynamics of tongue gestures extracted automatically from ultrasound,

    J. Berry and I. Fasel, “Dynamics of tongue gestures extracted automatically from ultrasound,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 557–560

  19. [19]

    Tongue Tracking in Ultrasound Images using EigenTongue Decomposition and Artificial Neural Networks,

    D. Fabre, T. Hueber, F. Bocquelet, and P. Badin, “Tongue Tracking in Ultrasound Images using EigenTongue Decomposition and Artificial Neural Networks,” in 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, Sep. 2015

  20. [20]

    Automatic tongue contour segmentation using deep learning,

    S. Wen, “Automatic tongue contour segmentation using deep learning,” Master’s thesis, Université d’Ottawa/University of Ottawa, 2018

  21. [21]

    Automatic tongue contour extraction in ultrasound images with convolutional neural networks,

    J. Zhu, W. Styler, and I. C. Calloway, “Automatic tongue contour extraction in ultrasound images with convolutional neural networks,” The Journal of the Acoustical Society of America, vol. 143, no. 3, pp. 1966–1966, 2018

  22. [22]

    Bownet: Dilated convolution neural network for ultrasound tongue contour extraction,

    M. H. Mozaffari and W.-S. Lee, “Bownet: Dilated convolution neural network for ultrasound tongue contour extraction,” arXiv preprint arXiv:1906.04232, 2019

  23. [23]

    Automatic classification of tongue gestures in ultrasound images,

    J. Berry, D. Archangeli, and I. Fasel, “Automatic classification of tongue gestures in ultrasound images,” in Proceedings of 12th Conference on Laboratory Phonology, 2010

  24. [24]

    Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract,

    D. Fabre, T. Hueber, L. Girin, X. Alameda-Pineda, and P. Badin, “Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract,” Speech Communication, vol. 93, pp. 63–75, 2017

  25. [25]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241

  26. [26]

    Transfer learning for ultrasound tongue contour extraction with different domains,

    M. H. Mozaffari and W.-S. Lee, “Transfer learning for ultrasound tongue contour extraction with different domains,” arXiv preprint arXiv:1906.04301, 2019

  27. [27]

    A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization,

    K. Xu, T. Gabor Csapo, P. Roussel, and B. Denby, “A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. EL154–EL160, May 2016

  28. [28]

    Error analysis of extracted tongue contours from 2d ultrasound images,

    T. G. Csapo and S. M. Lulich, “Error analysis of extracted tongue contours from 2d ultrasound images,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015

  29. [29]

    Densely connected convolutional networks

    G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, vol. 1, no. 2, 2017, p. 3

  30. [30]

    H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from ct volumes,

    X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from ct volumes,” IEEE Transactions on Medical Imaging, 2018

  31. [31]

    Fully Dense UNet for 2D Sparse Photoacoustic Tomography Artifact Removal

    S. Guan, A. Khan, S. Sikdar, and P. V. Chitnis, “Fully dense unet for 2d sparse photoacoustic tomography artifact removal,” arXiv preprint arXiv:1808.10848, 2018

  32. [32]

    V-net: Fully convolutional neural networks for volumetric medical image segmentation,

    F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571

  33. [33]

    Holistically-nested edge detection,

    S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403

  34. [34]

    Getcontours: An interactive tongue surface extraction tool,

    M. Tiede and D. Whalen, “Getcontours: An interactive tongue surface extraction tool,” Proceedings of Ultrafest VII, 2015

  35. [35]

    Ultrasuite: A repository of ultrasound and acoustic data from child speech therapy sessions,

    A. Eshky, M. S. Ribeiro, J. Cleland, K. Richmond, Z. Roxburgh, J. M. Scobbie, and A. A. Wrench, “Ultrasuite: A repository of ultrasound and acoustic data from child speech therapy sessions,” in INTERSPEECH 2018: Proceedings of the 19th Annual Conference of the International Speech Communication Association (ISCA), 2-6 September 2018, Hyderabad, India. ...

  36. [36]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  37. [37]

    A fast parallel algorithm for thinning digital patterns,

    T. Zhang and C. Y. Suen, “A fast parallel algorithm for thinning digital patterns,” Communications of the ACM, vol. 27, no. 3, pp. 236–239, 1984

  38. [38]

    Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves,

    L. Tang, T. Bressmann, and G. Hamarneh, “Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves,” Medical Image Analysis, vol. 16, no. 8, pp. 1503–1520, Dec. 2012

  39. [39]

    Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech,

    I. Fasel and J. Berry, “Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech,” in 2010 20th International Conference on Pattern Recognition, Aug. 2010, pp. 1493–1496