Acoustic Model Optimization Based On Evolutionary Stochastic Gradient Descent with Anchors for Automatic Speech Recognition

Michael Picheny; Xiaodong Cui

arxiv: 1907.04882 · v1 · pith:NGJMJ4GZnew · submitted 2019-07-10 · 💻 cs.CL · cs.LG · eess.AS


Pith reviewed 2026-05-24 23:39 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · eess.AS

keywords acoustic model optimization · evolutionary stochastic gradient descent · automatic speech recognition · anchor models · ESGD · speech recognition · optimization algorithms


The pith

Evolutionary stochastic gradient descent using anchor models improves acoustic models for speech recognition while guaranteeing no performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a variant of evolutionary stochastic gradient descent (ESGD) that incorporates a well-trained acoustic model as an anchor in the parent population. This anchor's good properties are propagated to offspring models during evolution. The method ensures that the population's best fitness never falls below that of the anchor model. Experiments on 50-hour Broadcast News and 300-hour Switchboard datasets demonstrate improvements in loss and ASR performance over the existing models.

Core claim

By assuming the existence of a well-trained acoustic model and using it as an anchor, the ESGD algorithm can be modified to propagate its good genes while guaranteeing that the best fitness of the population will never degrade from the anchor model, leading to further improvements in acoustic model optimization for ASR.

What carries the argument

The anchor model placed in the parent population of ESGD, which propagates good properties to offspring while enforcing that population fitness never drops below the anchor.
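
A minimal sketch of how such an anchor-preserving selection step could look, under assumptions not taken from the paper: fitness is a loss where lower is better, the surviving population has size `mu`, and the names `Individual`, `select_with_anchor`, and `next_anchor` are hypothetical. It illustrates the elitist idea only and is not the authors' algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Individual:
    name: str
    loss: float  # fitness; lower is better (assumption)


def select_with_anchor(anchor: Individual,
                       offspring: List[Individual],
                       mu: int) -> List[Individual]:
    """Keep the mu best candidates, with the anchor always in the pool."""
    if mu < 1:
        raise ValueError("mu must be at least 1")
    # The anchor competes with the offspring, so the best survivor
    # can never have a higher loss than the anchor itself.
    pool = [anchor] + list(offspring)
    return sorted(pool, key=lambda ind: ind.loss)[:mu]


def next_anchor(survivors: List[Individual]) -> Individual:
    """Anchor switching: the best survivor becomes the new anchor."""
    return min(survivors, key=lambda ind: ind.loss)
```

Because the anchor always enters the candidate pool, the lowest loss among the survivors cannot exceed the anchor's loss. Promoting the best survivor to anchor for the next generation corresponds to the anchor switching mentioned in the paper's summary.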

If this is right

Acoustic models can be further optimized beyond current well-trained states without risk of performance degradation.
Population-based optimization that mixes gradient-aware and gradient-free search can be stabilized by anchors.
Loss reductions and ASR gains are achievable on 50-hour Broadcast News and 300-hour Switchboard data.
Evolutionary search can build directly on strong initial models while preserving their fitness level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar anchoring could stabilize evolutionary optimization in other machine-learning domains that already possess good base solutions.
Starting from strong models may lower the cost of hyperparameter search in large-scale ASR training.
The method could be tested on additional speech corpora to check whether gains hold beyond the reported Broadcast News and Switchboard sets.
Combining the anchor mechanism with different evolutionary operators might produce further performance lifts.

Load-bearing premise

A well-trained acoustic model exists that can be used as an anchor whose good properties propagate without the population ever performing worse than it.

What would settle it

Running the ESGD with anchors on the BN50 or SWB300 datasets and observing that the resulting models have higher loss or worse ASR performance than the original anchor model would falsify the guarantee and improvement claim.
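
The check described above could be automated along these lines. This is a hypothetical helper, assuming the best loss of each generation was logged on the same data used to score the anchor; `guarantee_violations` and `tol` are illustrative names, not part of the paper.

```python
from typing import List, Sequence


def guarantee_violations(anchor_loss: float,
                         best_loss_per_generation: Sequence[float],
                         tol: float = 0.0) -> List[int]:
    """Return the generation indices whose best loss exceeds the anchor's."""
    return [g for g, loss in enumerate(best_loss_per_generation)
            if loss > anchor_loss + tol]
```

An empty result is consistent with the non-degradation guarantee on the logged fitness; any returned index marks a generation that would contradict it. The check says nothing about word error rate unless that is the logged quantity.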

Figures

Figures reproduced from arXiv: 1907.04882 by Michael Picheny, Xiaodong Cui.

**Figure 1.** ESGD with anchors on BN50. Left panel shows the CE loss over generations using SGD baseline as initial anchor. Right panel shows the CE loss over generations using ESGD baseline as initial anchor. The ESGD optimization with anchors can be conducted in an iterative fashion. After each round of evolution (in this case 20 generations), the best model can be used as the initial anchor model for the next round…

**Figure 2.** ESGD with iterative anchors on BN50.

Original abstract

Evolutionary stochastic gradient descent (ESGD) was proposed as a population-based approach that combines the merits of gradient-aware and gradient-free optimization algorithms for superior overall optimization performance. In this paper we investigate a variant of ESGD for optimization of acoustic models for automatic speech recognition (ASR). In this variant, we assume the existence of a well-trained acoustic model and use it as an anchor in the parent population whose good "gene" will propagate in the evolution to the offsprings. We propose an ESGD algorithm leveraging the anchor models such that it guarantees the best fitness of the population will never degrade from the anchor model. Experiments on 50-hour Broadcast News (BN50) and 300-hour Switchboard (SWB300) show that the ESGD with anchors can further improve the loss and ASR performance over the existing well-trained acoustic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a variant of evolutionary stochastic gradient descent (ESGD) for acoustic model optimization in ASR. It assumes a well-trained model exists and inserts it as an 'anchor' into the parent population so that its parameters propagate to offspring; the algorithm is constructed to guarantee that the best fitness in any subsequent population never falls below the anchor's fitness. Experiments on the 50-hour Broadcast News (BN50) and 300-hour Switchboard (SWB300) corpora are reported to show further reductions in loss and word-error rate relative to the already-trained anchor models.

Significance. If the non-degradation guarantee is rigorously enforced and the reported gains are reproducible and statistically supported, the method would offer a low-risk way to continue optimizing converged acoustic models by blending population-based search with gradient information. The approach is directly relevant to large-scale ASR training pipelines where further improvement of strong baselines is valuable.

major comments (2)

[§3.2] (Algorithm description): The claim that the procedure 'guarantees the best fitness of the population will never degrade from the anchor model' is load-bearing for the central contribution, yet the text does not specify the exact replacement or selection rule that enforces anchor retention. Without an explicit elitist step, a fitness-comparison rule, or a parent-preservation mechanism stated in pseudocode or equations, it is impossible to verify that an offspring cannot displace the anchor and thereby violate the guarantee.
[§4] (Experiments): The abstract states that ESGD-with-anchors improves both loss and ASR performance on BN50 and SWB300, but no baseline details, number of runs, error bars, or statistical significance tests are referenced in the provided description. If the only comparison is against the single anchor model without additional SGD steps or alternative population methods, the improvement cannot be attributed to the evolutionary mechanism rather than extra optimization budget.

minor comments (1)

Notation for the anchor model and its 'gene' propagation should be defined consistently with the population-update equations; currently the description mixes informal language with algorithmic claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating revisions that will be incorporated to strengthen the paper.


Referee: [§3.2] (Algorithm description): The claim that the procedure 'guarantees the best fitness of the population will never degrade from the anchor model' is load-bearing for the central contribution, yet the text does not specify the exact replacement or selection rule that enforces anchor retention. Without an explicit elitist step, a fitness-comparison rule, or a parent-preservation mechanism stated in pseudocode or equations, it is impossible to verify that an offspring cannot displace the anchor and thereby violate the guarantee.

Authors: We agree that §3.2 would benefit from an explicit statement of the retention mechanism. The non-degradation guarantee is realized by always retaining the anchor as an elitist parent that is never replaced by offspring; the population is formed by selecting the top individuals after fitness evaluation, and the anchor is guaranteed inclusion whenever its fitness is the best. In the revised version we will state this rule in equations and provide pseudocode that shows the anchor-preservation step, making the guarantee directly verifiable from the algorithm description.

Revision: yes

Referee: [§4] (Experiments): The abstract states that ESGD-with-anchors improves both loss and ASR performance on BN50 and SWB300, but no baseline details, number of runs, error bars, or statistical significance tests are referenced in the provided description. If the only comparison is against the single anchor model without additional SGD steps or alternative population methods, the improvement cannot be attributed to the evolutionary mechanism rather than extra optimization budget.

Authors: We accept that the experimental reporting requires augmentation for rigor. The revised manuscript will report the number of independent runs, include error bars on loss and WER, and add statistical significance tests. To address attribution, we will also include a controlled comparison against continued SGD training from the anchor using an equivalent computational budget, to test whether the observed gains exceed those obtainable by additional gradient steps alone.

Revision: yes
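
As an illustration of the kind of significance test the response refers to, the sketch below implements an exact paired sign-flip (randomization) test. It is an assumption about what such a test might look like, not something the paper or the rebuttal specifies; `paired_sign_flip_pvalue` is a hypothetical name, and the inputs would be per-run differences in WER or loss between the continued-SGD control and ESGD with anchors.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(diffs: Sequence[float]) -> float:
    """Exact two-sided sign-flip test on per-run paired differences.

    Each entry is (control metric - ESGD metric) for one matched run.
    Under the null hypothesis of no difference, each sign is equally likely.
    """
    n = len(diffs)
    if n == 0:
        raise ValueError("need at least one paired difference")
    observed = abs(sum(diffs))
    hits = 0
    # Enumerate all 2**n sign assignments; feasible only for small n.
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / 2 ** n
```

With only a handful of runs the smallest attainable p-value is 2 / 2^n, so three matched runs can never fall below 0.25. This is one reason the number of independent runs matters for the requested comparison.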

Circularity Check

0 steps flagged

No circularity; claims rest on experimental outcomes and algorithmic construction, not self-referential derivations.


The paper describes an ESGD variant that incorporates a well-trained anchor model by design to ensure non-degradation of population fitness. This guarantee is an explicit property of the proposed algorithm rather than a derived result. Performance improvements are asserted via experiments on BN50 and SWB300, with no equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems invoked. No steps match the enumerated circularity patterns; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no free parameters, axioms, or invented entities are specified at this level of description.

pith-pipeline@v0.9.0 · 5678 in / 1062 out tokens · 26562 ms · 2026-05-24T23:39:48.389428+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

[1]

Introduction. Evolutionary stochastic gradient descent (ESGD) was proposed in [1] for optimization of deep neural networks (DNNs). It is a population-based [2] approach that integrates gradient-aware SGD and gradient-free evolutionary strategy (ES) [3][4] in one framework to take advantage of the merits of both families of algorithms to deal with complic...
[2]

We further assume $\theta$ follows distribution $p(\theta)$, data follows distribution $p(\omega)$ and consider the expected empirical risk over $p(\theta)$ and $p(\omega)$: $J = \mathbb{E}_{\theta}[\mathbb{E}_{\omega}[l_{\omega}(\theta)]]$

Mathematical Formulation. Define the loss function $l_i(\theta) \triangleq \ell(h(x_i; \theta), y_i)$ (1), where $h$ is the function to be learned with parameter $\theta$ which maps the input space $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ to the output space $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ and $\{(x_i, y_i)\}_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y}$. We further assume $\theta$ follows distribution $p(\theta)$, data follows distribution $p(\omega)$ and consider the expected empirical risk over $p(\theta)$...
[3]

Conventionally, we would pick an optimizer and its hyper-parameters and optimize the model until certain conditions (e.g

ESGD with Anchors. Suppose we have a well-trained model in hand and want to further improve it without changing its architecture. Conventionally, we would pick an optimizer and its hyper-parameters and optimize the model until certain conditions (e.g. no improvement on the validation loss) met. This process is usually repeated multiple times and...
[4]

The 50-hour data in BN50 consists of a 45-hour training set and a 5-hour validation set

Experiments. Experiments are conducted on two datasets: BN50 and SWB300. The 50-hour data in BN50 consists of a 45-hour training set and a 5-hour validation set. The test set comprises 3 hours of audio. The acoustic models are fully-connected feed-forward network with 6 hidden layers and one softmax output layer with 5,000 states. There are 1,024 units ...
[5]

The single baseline is trained using SGD with a batch size 128 without momentum for 20 epochs

as the references. The single baseline is trained using SGD with a batch size 128 without momentum for 20 epochs. The initial learning rate is 0.001 for BN50 and 0.025 for SWB300. The learning rate is annealed by 2x every time the loss on the validation set of the current epoch is worse than the previous epoch and meanwhile the model is backed off to th...
[6]

genes” of an anchor can spread out (with probability) to the next generations until it is replaced by another anchor with better “genes

Discussion. Parallel computing is a necessity for ESGD which is a powerful approach when there is strong computational power in hand. The reported experiments are carried out in a distributed manner where SGD and fitness evaluation are conducted on multiple GPUs in parallel, the number of which is roughly the number of individuals in the parent popula...
[7]

We use this model as an anchor in the population to accelerate the evolution and improve the quality of offsprings

Summary. In this paper, we investigated a population-based ESGD algorithm assuming some well-trained model exists. We use this model as an anchor in the population to accelerate the evolution and improve the quality of offsprings. We introduced anchor switching in the population and also an iterative way of applying ESGD with anchors to monotonicall...
[8] X. Cui, W. Zhang, Z. Tuske, and M. Picheny, “Evolutionary stochastic gradient descent for optimization of deep neural networks,” Advances in Neural Information Processing Systems (NeurIPS), pp. 6048–6058, 2018
[9] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu, “Population based training of neural networks,” arXiv preprint arXiv:1711.09846, 2017
[10] H.-G. Beyer and H.-P. Schwefel, “Evolution strategies: a comprehensive introduction,” Natural Computing, vol. 1, no. 1, pp. 3–52, 2002
[11] J. Lehman, J. Chen, J. Clune, and K. O. Stanley, “ES is more than just a traditional finite-difference approximator,” arXiv preprint arXiv:1712.06568, 2017
[12] N. Garcia-Pedrajas, C. Hervas-Martinez, and J. Munoz-Perez, “COVNET: a cooperative coevolutionary model for evolving artificial neural networks,” IEEE Trans. on Neural Networks, vol. 14, no. 3, pp. 575–595, 2003
[13] N. Garcia-Pedrajas, C. Hervas-Martinez, and D. Ortiz-Boyer, “Cooperative coevolution of artificial neural network ensembles for pattern recognition,” IEEE Trans. on Evolutionary Computation, vol. 9, no. 3, pp. 271–302, 2005
[14] N. Hansen, “The CMA evolution strategy: a tutorial,” arXiv preprint arXiv:1604.00772, 2016
[15] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997
[17] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011
[18] D. P. Kingma and J. L. Ba, “ADAM: a method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015
[19] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 192–204, 2015
[20] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,” International Conference on Learning Representations (ICLR), 2017
[21] B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 3761–3764
[22] D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2002, pp. 105–108
