Attention model for articulatory features detection

Dmytro Tkanov; Ievgen Karaulov

arxiv: 1907.01914 · v1 · pith:T4IFZV2Fnew · submitted 2019-07-02 · eess.AS · cs.CL · cs.LG · cs.SD · stat.ML

Pith reviewed 2026-05-25 10:55 UTC · model grok-4.3

classification eess.AS · cs.CL · cs.LG · cs.SD · stat.ML

keywords articulatory features · attention models · end-to-end training · phone recognition · multitask learning · distinctive features · speech processing · decoding technique

The pith

A novel decoding technique enables attention models to detect articulatory features end-to-end using only phone labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies an attention-based sequence model to recognize phones from audio on limited training data. It introduces a decoding approach that repurposes the attention component to also identify manners and places of articulation. Training happens jointly with phone recognition in a multitask setup. This setup avoids any requirement for separate labeled data on sound production. The result is a way to obtain these detectors directly from standard phone transcriptions.

Core claim

The paper establishes that a novel decoding technique makes it possible to train an attention model end-to-end so that it produces detectors for the manners and places of articulation, relying solely on phone labels rather than additional articulatory annotations, and that this can be combined with phone recognition through multitask learning.
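
To make "relying solely on phone labels" concrete, here is a minimal Python sketch of how a phone transcription could be turned into training targets for manner and place detectors through a lookup table. The table `PHONE_FEATURES`, its small consonant inventory, and the function `phones_to_feature_targets` are hypothetical illustrations; the mapping actually used by the authors is not given in the text above.

```python
# Hypothetical, partial phone-to-feature table for illustration only.
# The paper's real mapping is not reproduced in the text above.
PHONE_FEATURES = {
    "p": {"manner": "stop", "place": "bilabial"},
    "t": {"manner": "stop", "place": "alveolar"},
    "k": {"manner": "stop", "place": "velar"},
    "s": {"manner": "fricative", "place": "alveolar"},
    "m": {"manner": "nasal", "place": "bilabial"},
    "n": {"manner": "nasal", "place": "alveolar"},
}


def phones_to_feature_targets(phones, feature):
    """Map a phone label sequence to a target sequence for one feature group.

    `feature` is either "manner" or "place". Unknown phones raise KeyError,
    which makes gaps in the table visible instead of silently skipping them.
    """
    return [PHONE_FEATURES[phone][feature] for phone in phones]


if __name__ == "__main__":
    transcription = ["s", "t", "m"]
    print(phones_to_feature_targets(transcription, "manner"))
    print(phones_to_feature_targets(transcription, "place"))
```

Because the targets come from a fixed lookup, no articulatory annotation beyond the phone transcription is needed to build them; whether the trained detectors learn more than this lookup is a separate question.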

What carries the argument

The novel decoding technique that adapts the attention mechanism to generate articulatory feature detectors alongside phone recognition.

If this is right

The model performs both phone recognition and articulatory feature detection simultaneously through multitask learning.
Reliable articulatory detectors can be obtained without explicit supervision on production data.
The approach supports phone recognition on small training sets by incorporating the additional detection task.
End-to-end training becomes feasible for distinctive features in speech processing tasks.
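
A multitask objective of the kind described in this list is commonly written as the phone loss plus a weighted sum of the losses for each feature group. The Python sketch below shows that arithmetic for a single output step. It is an assumption about the general shape of such an objective, not the loss the authors report; `feature_weight` and both function names are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def multitask_loss(phone_probs, phone_target, feature_probs, feature_targets,
                   feature_weight=0.5):
    """Phone loss plus weighted per-feature-group losses for one output step.

    feature_probs and feature_targets are dicts keyed by feature group
    (for example "manner" and "place"). feature_weight is a hypothetical
    hyperparameter that balances the auxiliary tasks against the phone task.
    """
    loss = cross_entropy(phone_probs, phone_target)
    for group, probs in feature_probs.items():
        loss += feature_weight * cross_entropy(probs, feature_targets[group])
    return loss


if __name__ == "__main__":
    loss = multitask_loss(
        phone_probs=[0.5, 0.5], phone_target=0,
        feature_probs={"manner": [0.25, 0.75]},
        feature_targets={"manner": 1},
    )
    print(round(loss, 4))
```

Setting `feature_weight` to zero recovers plain phone recognition, which gives a natural single-task baseline for comparison.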

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could reduce data needs for systems that provide feedback on speech production details.
It might produce more interpretable outputs for applications that rely on phonetic structure.
The same decoding approach could be tested on other attention-based audio models to see if the benefit generalizes.

Load-bearing premise

The decoding technique can successfully adapt the attention model to produce reliable articulatory feature detectors without requiring additional explicit supervision or labeled production data beyond the phone labels.

What would settle it

Train the model on a standard phone-labeled audio corpus and inspect whether the resulting detectors align with established phonetic classifications of manners and places for held-out utterances; systematic mismatch on multiple test cases would falsify the claim.
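
One way to quantify "systematic mismatch" is a sequence-level error rate: the edit distance between the detected feature sequence and the reference sequence derived from the phone transcription, normalised by the reference length. The Python sketch below assumes this metric; the text above does not state which metric the paper uses, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_label in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_label in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (ref_label != hyp_label)  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance over all utterances divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


if __name__ == "__main__":
    refs = [["fricative", "stop", "nasal"]]
    hyps = [["fricative", "nasal"]]
    print(error_rate(refs, hyps))
```

Computing this separately for manner and for place on held-out utterances would show whether errors concentrate in particular feature groups.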

Figures

Figures reproduced from arXiv: 1907.01914 by Dmytro Tkanov, Ievgen Karaulov.

**Figure 2.** Articulatory feature posteriors for phrase SX100 from the test set: “The best way to learn is to solve extra problems”. Ground-truth features are marked with white circles. To the best of our knowledge, there are no previously published results for sequence-level articulatory features detection on TIMIT. In [14] the authors report articulatory features detection results on Wall Street Journal (WSJ) corp…

Original abstract

Articulatory distinctive features, as well as phonetic transcription, play important role in speech-related tasks: computer-assisted pronunciation training, text-to-speech conversion (TTS), studying speech production mechanisms, speech recognition for low-resourced languages. End-to-end approaches to speech-related tasks got a lot of traction in recent years. We apply Listen, Attend and Spell (LAS) [Chan et al., 2016] architecture to phones recognition on a small small training set, like TIMIT [Garofolo et al., 1992]. Also, we introduce a novel decoding technique that allows to train manners and places of articulation detectors end-to-end using attention models. We also explore joint phones recognition and articulatory features detection in multitask learning setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies LAS to TIMIT phones and claims a novel decoding trick for end-to-end articulatory feature detection, but the abstract supplies no equations, diagrams, or results, leaving the core claim untestable.

The main takeaway is that the authors take the Listen, Attend and Spell model, run it on phone recognition with a small set like TIMIT, and add what they describe as a new decoding technique to produce manners and places of articulation detectors directly from the attention setup. They also test a joint multitask version with phones and features together. That decoding step is presented as the fresh element relative to the original LAS work, and the multitask framing makes sense for the low-resource applications they mention, such as pronunciation training or TTS in limited data conditions. The paper does a clear job stating why articulatory features matter in those settings and why end-to-end attention models are worth trying there. The framing is practical rather than overly ambitious.

The obvious limitation is the complete absence of any implementation details, equations for the decoding method, or experimental numbers. Without those, it is impossible to judge whether the technique actually pulls feature predictions from attention weights or decoder states in a genuine end-to-end way, or whether it simply derives the features as a post-hoc function of the phone outputs. If the latter holds, the approach collapses to standard post-processing and does not deliver the claimed end-to-end benefit without extra supervision. The stress-test concern lands directly on this point.

This work would mainly interest researchers already focused on attention-based speech models for articulatory or low-resource tasks. A reader could extract the basic idea from the abstract, but there is not enough substance here to justify sending it out for serious refereeing. It would need the method written out with some concrete results before it merits that step.

Referee Report

2 major / 1 minor

Summary. The paper applies the Listen, Attend and Spell (LAS) architecture to phone recognition on small datasets such as TIMIT. It introduces a novel decoding technique claimed to enable end-to-end training of detectors for manners and places of articulation using attention models, and explores joint training of phones and articulatory features in a multitask setting.

Significance. If the claimed decoding technique successfully repurposes the LAS attention mechanism to produce reliable articulatory feature predictions directly from acoustic input (rather than as a deterministic function of phone outputs), the work could advance end-to-end modeling for pronunciation training, TTS, and low-resource ASR by reducing the need for explicit articulatory supervision.

major comments (2)

[Abstract] Abstract: The novel decoding technique is asserted to allow end-to-end articulatory feature detection, yet no equations, pseudocode, architectural diagram, or description of how attention weights or decoder states are mapped to manner/place predictions is provided. This omission leaves unresolved whether the method extracts features independently or reduces to post-processing of phone predictions, which is load-bearing for the central claim of genuine end-to-end detection without extra supervision.
[Abstract] Abstract and multitask section: The claim that the technique trains detectors 'end-to-end using attention models' from phone labels alone requires explicit verification that feature predictions are not derived deterministically from the phone recognizer; without implementation details or ablation showing independent feature learning, the multitask benefit cannot be assessed.
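
The verification asked for in these comments could start with a simple disagreement check, sketched below in Python. It assumes the model emits one phone label and one feature label per decoder step, which the text above does not confirm; `disagreement_rate` and the `MANNER` table are hypothetical. A rate of exactly zero over a held-out set would be consistent with features being a deterministic function of the phone outputs, although it would not prove it; a non-zero rate shows that the feature output is not simply a lookup on the predicted phone.

```python
# Hypothetical phone-to-manner table for illustration only.
MANNER = {"s": "fricative", "t": "stop", "m": "nasal", "n": "nasal"}


def disagreement_rate(phone_preds, feature_preds, phone_to_feature):
    """Fraction of steps where the predicted feature differs from the
    feature implied by the predicted phone."""
    if len(phone_preds) != len(feature_preds):
        raise ValueError("phone and feature predictions must have equal length")
    if not phone_preds:
        return 0.0
    mismatches = sum(
        phone_to_feature[phone] != feature
        for phone, feature in zip(phone_preds, feature_preds)
    )
    return mismatches / len(phone_preds)


if __name__ == "__main__":
    phones = ["s", "t", "m", "n"]
    manners = ["fricative", "stop", "stop", "nasal"]
    print(disagreement_rate(phones, manners, MANNER))
```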

minor comments (1)

[Abstract] Abstract: Duplicate wording 'small small training set' should be corrected to 'small training set'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. We address the major points below and will revise the abstract to improve clarity on the decoding technique while preserving the manuscript's core contributions.

Referee: [Abstract] Abstract: The novel decoding technique is asserted to allow end-to-end articulatory feature detection, yet no equations, pseudocode, architectural diagram, or description of how attention weights or decoder states are mapped to manner/place predictions is provided. This omission leaves unresolved whether the method extracts features independently or reduces to post-processing of phone predictions, which is load-bearing for the central claim of genuine end-to-end detection without extra supervision.

Authors: The body of the manuscript describes the novel decoding technique, including the adaptation of the LAS attention mechanism to produce articulatory feature predictions directly from acoustic inputs in a multitask setup. We agree the abstract is too terse on this point. In revision we will expand the abstract with a high-level description of the mapping from attention weights and decoder states to manner/place outputs, plus a pointer to the methods section for equations and pseudocode. This will make explicit that the feature predictions are not a deterministic post-processing step.
revision: yes

Referee: [Abstract] Abstract and multitask section: The claim that the technique trains detectors 'end-to-end using attention models' from phone labels alone requires explicit verification that feature predictions are not derived deterministically from the phone recognizer; without implementation details or ablation showing independent feature learning, the multitask benefit cannot be assessed.

Authors: The multitask experiments demonstrate that joint training improves both phone recognition and feature detection accuracy compared with single-task baselines, which would not occur if features were merely derived from phone outputs. We will revise the abstract and multitask section to state more explicitly that feature predictions are produced by a separate output head operating on the shared attention context, trained from phone labels only. An additional sentence clarifying the independence of the two prediction pathways will be added.
revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external LAS architecture and independent novel technique

The paper applies the Listen, Attend and Spell model from an external citation (Chan et al. 2016) to phone recognition on TIMIT and introduces a novel decoding technique for articulatory feature detectors. No self-citations, no fitted parameters renamed as predictions, no self-definitional loops, and no uniqueness theorems imported from prior author work appear in the provided text. The central claim about end-to-end training via the new technique stands as an independent architectural proposal without reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities used in the method.

pith-pipeline@v0.9.0 · 5658 in / 899 out tokens · 20681 ms · 2026-05-25T10:55:09.852817+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1] Introduction. End-to-end approaches emerged in neural translation and later significantly changed automatic speech recognition (ASR) and TTS. While conventional pipelines still provide decent results, especially on smaller datasets, end-to-end models quickly catch up and are already state-of-the-art on some tasks [3]. End-to-end models are typically seque...

[2] Previous work. The conventional approach to estimation of phonological features is akin to the standard ASR pipeline. It requires forced alignment of phones to utterances. As a result, training is usually done either on fine-labeled data with alignments or on data that have good acoustic models available. This limits research to well-studied mainstream ...

[3] Model description. 3.1. Attention-based models. Typical end-to-end models in speech domain are based on sequence to sequence neural networks. Most common architectures are CTC, recurrent neural network transducer (RNN-T) [10] and encoder-decoder with attention [16]. In this work we will focus on attention-based models. One of important features of at...

[4] Experiments. 4.1. Dataset and features. All experiments were performed on the TIMIT corpus. For training we used the standard 462-speaker set without SA records. Test results in tables below correspond to the core test set of 192 utterances. A development set was collected from the remaining part of the test set, i.e. the non-core part. We explicitly checke...

[5] Discussion. A LAS encoder is a stack of pyramidal layers. As a result, the top-most layer has a typical window step of 40–80 ms depending on the number of encoders layers. This interval determines inaccuracy that would be even in case of the ideal projection of sequence symbols to frames through attention. One way to infer more accurate phone boundarie...

[6] Conclusions. The paper proposes a novel approach to end-to-end articulatory features detection. The resulting model yields posteriorgrams for articulatory features, rough alignments with acoustic data and competitive phone error rates even in low-resource settings. In future, we would like to study the possibilities of applying our approach to recognitio...

[7] Acknowledgments. We would like to thank Tzu-Wei Sung from National Taiwan University for his implementation of “Listen, Attend and Spell” model (footnote 2) that we used as a starting point for our experiments. Also we would like to thank Olga Zvyeryeva for preparing mappings from phones to articulatory features. Footnotes: 1. https://github.com/espeak-ng/espeak-ng 2. https://gith...

[8] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964, 2016.

[9] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, V. Zue, “TIMIT Acoustic-phonetic Continuous Speech Corpus,” LDC93S1, Linguistic Data Consortium, 1992.

[10] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” in Interspeech 2018 – 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, Proceedings, 2018.

[11] N. Chomsky, M. Halle, “The Sound Pattern of English,” MIT Press, 1968.

[12] B. Abraham, S. Umesh, N. M. Joy, “Joint Estimation of Articulatory Features and Acoustic models for Low-Resource Languages,” in Interspeech 2017 – 18th Annual Conference of the International Speech Communication Association, August 20–24, Stockholm, Sweden, Proceedings, 2017.

[13] I. Bromberg et al., “Detection-Based ASR in the Automatic Speech Attribute Transcription Project,” in Interspeech 2007, 2007.

[14] S. M. Siniscalchi, C. Lee, “An attribute detection based approach to automatic speech processing,” Loquens, 1(1), e005, 2014.

[15] M. Eskenazi, “An overview of spoken language technology for education,” Speech Communication, vol. 51, no. 10, pp. 832–844, 2009.

[16] H. Ryu, M. Chung, “Mispronunciation Diagnosis of L2 English at Articulatory Level Using Articulatory Goodness-Of-Pronunciation Features,” in Proc. 7th ISCA Workshop on Speech and Language Technology in Education, 2017.

[17] A. Graves, A. Mohamed, G. Hinton, “Speech recognition with Deep Recurrent Neural Networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[18] S. King, P. Taylor, “Detection of Phonological Features in Continuous Speech using Neural Networks,” Computer Speech & Language, vol. 14, no. 4, pp. 333–353, 2000.

[19] R. Prasad et al., “Discriminating Nasals and Approximants in English Language Using Zero Time Windowing,” in Interspeech 2018 – 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, Proceedings, 2018.

[20] D. Merkx, O. Scharenborg, “Articulatory Feature Classification Using Convolutional Neural Networks,” in Interspeech 2018 – 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, Proceedings, 2018.

[21] L. Qu et al., “Combining Articulatory Features with End-to-end Learning in Speech Recognition,” in Proc. 27th International Conference on Artificial Neural Networks (ICANN), 2018.

[22] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, “Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006.

[23] J. Chorowski et al., “Attention-Based Models for Speech Recognition,” in NIPS 2015, 2015.

[24] R. Caruana, “Multitask learning,” in Learning to learn. Springer, 1998, pp. 95–133.

[25] K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, November 1989.

[26] R. Lyon, “A Computational Model of Filtering, Detection, and Compression in the Cochlea,” in Proceedings, 1982 IEEE ICASSP, Paris, 1982.

[27] D. B. Paul, J. M. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proceedings of the workshop on Speech and Natural Language, pp. 357–362, 1992.

[28] S. Salvador, P. Chan, “FastDTW: Toward accurate dynamic time warping in linear time and space,” Intelligent Data Analysis 11.5 (2007): 561–580.
