One-vs-All Models for Asynchronous Training: An Empirical Analysis

Aman Alok; Rahul Gupta; Shankar Ananthakrishnan

arxiv: 1906.08858 · v1 · pith:ZBYKCIVBnew · submitted 2019-06-20 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 19:23 UTC · model grok-4.3

keywords one-vs-all, asynchronous training, dataset divergence metric, natural language understanding, spoken language understanding, empirical analysis, multi-class classification, model updates

The pith

A metric quantifying training dataset differences in One-vs-All systems correlates strongly with overall accuracy after asynchronous updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

One-vs-All models allow each class to have its own classifier that can be retrained on its own schedule. Asynchronous retraining creates mismatches in the data each model has seen. The paper defines a metric for these mismatches and tests its relation to system accuracy on natural language understanding tasks and a spoken language understanding system. The metric tracks accuracy changes more closely than the sheer number of classes or examples. This setup matters for large-scale systems that must incorporate new data frequently without full retraining.
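The setup described above can be sketched in a few lines. This is a toy illustration only, not the authors' models: the class names `KeywordOVAModel` and `OVASystem`, the token-counting scorer, and the example utterances are all assumptions made for the sketch. It shows one model per class, each retrained on its own data snapshot, with the system prediction taken as the class whose model gives the highest confidence.

```python
from collections import Counter


class KeywordOVAModel:
    """Toy one-vs-all scorer for a single class (hypothetical, for illustration)."""

    def __init__(self, label):
        self.label = label
        self.pos = Counter()  # token counts from examples of this class
        self.neg = Counter()  # token counts from examples of all other classes

    def fit(self, examples):
        # `examples` is one snapshot of (text, label) pairs; refitting discards
        # whatever this model learned from earlier snapshots.
        self.pos.clear()
        self.neg.clear()
        for text, label in examples:
            target = self.pos if label == self.label else self.neg
            target.update(text.lower().split())
        return self

    def score(self, text):
        # Confidence in [0, 1]: share of tokens seen more often in positives.
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if self.pos[t] > self.neg[t])
        return hits / len(tokens)


class OVASystem:
    """One model per class; each can be retrained on its own schedule."""

    def __init__(self, labels):
        self.models = {label: KeywordOVAModel(label) for label in labels}

    def retrain(self, label, snapshot):
        # Only the named model is touched; the others keep their older data.
        self.models[label].fit(snapshot)

    def predict(self, text):
        return max(self.models, key=lambda label: self.models[label].score(text))


# Synthetic data: the "news" class arrives after the first snapshot was taken.
old_snapshot = [("play some jazz", "music"), ("weather in paris", "weather")]
new_snapshot = old_snapshot + [
    ("play the news", "news"),
    ("news headlines today", "news"),
]

system = OVASystem(["music", "weather", "news"])
system.retrain("music", old_snapshot)    # stale: never saw the news examples
system.retrain("weather", old_snapshot)  # stale
system.retrain("news", new_snapshot)     # fresh

print(system.predict("play the news"))   # news
print(system.predict("play some jazz"))  # music
```

In this sketch the stale `music` model still assigns a nonzero score to "play the news", because it never saw that utterance as a negative example. That is the kind of mismatch the paper's metric is meant to quantify.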

Core claim

The paper shows that in One-vs-All classification, differences in the training datasets of the individual models caused by asynchronous updates can be measured by a proposed divergence metric, and this metric correlates strongly with the accuracy of the combined system in both natural language understanding and spoken language understanding experiments.

What carries the argument

A metric to quantify differences in training datasets across OVA models due to asynchronous updates.
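The definition of the paper's metric is not reproduced on this page, so the following is a stand-in rather than the authors' formula. It is a minimal sketch of one plausible way to score dataset differences across OVA models, assuming each model's training set can be represented as a set of example identifiers and using the mean pairwise Jaccard distance. The function names and the example IDs are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; 0.0 means identical sets, 1.0 means disjoint."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_divergence(snapshots):
    """Average Jaccard distance over all pairs of per-model training sets.

    `snapshots` maps a class label to the IDs of the examples its OVA model
    was last trained on. This is an illustrative proxy, NOT the paper's metric.
    """
    sets = [set(ids) for ids in snapshots.values()]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Synthetic example: two models share a snapshot, one has seen two newer examples.
snapshots = {
    "music": {1, 2, 3, 4},
    "weather": {1, 2, 3, 4},
    "news": {1, 2, 3, 4, 5, 6},
}
divergence = mean_pairwise_divergence(snapshots)
print(round(divergence, 4))  # 0.2222
```

A score of 0 means every model saw the same data (fully synchronous training); larger values mean the models' training sets have drifted further apart.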

If this is right

The proposed dataset divergence metric is the dominant factor affecting OVA system accuracy compared to number of classes or data points.
Accuracy impact can be estimated from the level of asynchrony in training.
Increased asynchrony in SLU systems reduces accuracy in line with the metric.
Independent updates to OVA models are viable when dataset divergences remain small.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The metric could guide decisions on when to force a full system retrain versus allowing independent updates.
Similar divergence measures might apply to other modular training setups like ensemble methods.
Future work could test whether controlling the metric directly during data collection improves outcomes.
OVA with async updates may scale better than joint multi-class training in high-update environments.
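The first extension above, using the metric to decide between independent updates and a full retrain, could look roughly like the following. This is a sketch under stated assumptions: the function name `choose_update_mode`, the additive `tolerance`, and the numeric values are hypothetical, and the paper does not prescribe this rule.

```python
def choose_update_mode(current_divergence, baseline_divergence, tolerance=0.10):
    """Pick an update strategy from a divergence score (hypothetical policy).

    `baseline_divergence` is the score measured at a state of the OVA system
    judged acceptable; `tolerance` is an assumed allowance above that baseline.
    """
    if current_divergence < 0 or baseline_divergence < 0 or tolerance < 0:
        raise ValueError("divergence scores and tolerance must be non-negative")
    if current_divergence <= baseline_divergence + tolerance:
        return "independent"   # models may keep updating on their own schedule
    return "full_retrain"      # resynchronize all models on a common snapshot


print(choose_update_mode(0.12, 0.10))  # independent
print(choose_update_mode(0.35, 0.10))  # full_retrain
```

The threshold would have to be calibrated against observed accuracy on the system in question; the default here is arbitrary.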

Load-bearing premise

Changes in accuracy are driven mainly by the dataset divergences captured by the metric rather than by unmeasured factors like optimization paths or model capacities.

What would settle it

An experiment that holds dataset contents fixed across OVA models but varies only the order or optimizer and measures whether accuracy variance matches the levels seen under asynchronous dataset changes.

Figures

Figures reproduced from arXiv: 1906.08858 by Aman Alok, Rahul Gupta, Shankar Ananthakrishnan.

**Figure 2.** Relative SemER degradation in an OVA SLU system compared to a multi-class SLU system. Staleness is determined in terms of month time units.

… degradation is also high (0.84), suggesting the applicability of this metric in real world systems. In a real world setting, we suggest an operational model where a baseline value for metric α can be computed for a given acceptable state of the OVA models. Metric α ca…

Abstract

Any given classification problem can be modeled using multi-class or One-vs-All (OVA) architecture. An OVA system consists of as many OVA models as the number of classes, providing the advantage of asynchrony, where each OVA model can be re-trained independent of other models. This is particularly advantageous in settings where scalable model training is a consideration (for instance in an industrial environment where multiple and frequent updates need to be made to the classification system). In this paper, we conduct empirical analysis on realizing independent updates to OVA models and its impact on the accuracy of the overall OVA system. Given that asynchronous updates lead to differences in training datasets for OVA models, we first define a metric to quantify the differences in datasets. Thereafter, using Natural Language Understanding as a task of interest, we estimate the impact of three factors: (i) number of classes, (ii) number of data points and, (iii) divergences in training datasets across OVA models; on the OVA system accuracy. Finally, we observe the accuracy impact of increased asynchrony in a Spoken Language Understanding system. We analyze the results and establish that the proposed metric correlates strongly with the model performances in both the experimental settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a dataset divergence metric for async OVA training and reports its correlation with accuracy on NLU/SLU tasks, but the correlation may not isolate the metric from other factors.

The punchline is that the authors define a dataset-divergence metric for OVA models under asynchronous updates and report a strong correlation with accuracy in their NLU and SLU tests. This is a narrow but practical extension of existing ideas on one-vs-all and distributed training. They do a solid job laying out the advantage of asynchrony for scalable training and then quantifying the downside through the metric. Studying the three factors—class count, data points, and divergences—on actual tasks gives the work some grounding in real data rather than pure theory. The soft spot is in the correlation analysis. The abstract says the metric correlates strongly, but it studied multiple factors at once without indicating whether the reported correlation accounts for class number or data volume. Other unmeasured things like optimizer choices or learning rates could also drive the accuracy changes when models are retrained independently. Without those controls or tests, the metric's dominance isn't fully established. This paper is aimed at engineers maintaining large classification systems in production, where models get updated at different times. Someone in that setting could use the metric as a quick check on when asynchrony starts hurting performance. It shows clear thinking on a deployment issue and engages with the literature on OVA, so it deserves a serious referee even if revisions are needed for the experimental rigor.

Referee Report

2 major / 1 minor

Summary. The paper claims that asynchronous updates to One-vs-All (OVA) classifiers induce dataset divergences that can be quantified by a proposed metric; through controlled experiments on NLU and SLU tasks it shows that this metric, together with class count and data-point count, affects overall OVA accuracy, and that the metric correlates strongly with observed performance.

Significance. If the reported correlation survives proper isolation from confounding variables, the metric would supply a practical diagnostic for deciding when independent retraining of OVA components remains safe, which is directly relevant to large-scale industrial classification pipelines that require frequent model updates.

major comments (2)

[Abstract] The central claim that the proposed divergence metric 'correlates strongly with the model performances' is presented without any indication that the correlation analysis controls for the other two studied factors (class count, data-point count) or for unmentioned variables such as per-model optimizer choice, learning-rate schedules, or capacity differences; without such isolation the observed correlation cannot be attributed primarily to dataset divergence.
[Abstract] No statistical tests, confidence intervals, or baseline comparisons are mentioned for the reported correlations, so it is impossible to judge whether the claimed strong relationship exceeds what would be expected from the experimental design alone.
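One standard way to check the isolation raised in the first major comment is a partial correlation, which measures the association between two variables after removing the linear effect of a third. The sketch below uses the textbook first-order formula; the variable names and the tiny synthetic dataset are assumptions for illustration and do not come from the paper.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (rxy - rxz * ryz) / denom


# Synthetic, centered values (not from the paper): the accuracy drop is built
# as divergence + class-count effect, so class count is a pure confounder.
divergence = [-1.0, 0.0, 1.0, 0.0]
class_count = [0.0, -1.0, 0.0, 1.0]
accuracy_drop = [d + c for d, c in zip(divergence, class_count)]

raw = pearson(divergence, accuracy_drop)                            # about 0.71
adjusted = partial_correlation(divergence, accuracy_drop, class_count)  # about 1.0
print(round(raw, 2), round(adjusted, 2))
```

Controlling for more than one factor (for example class count and data-point count together) needs either a recursive application of the same formula or a regression on residuals; this sketch covers only the single-control case.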

minor comments (1)

[Abstract] The abstract states that three factors were studied but does not name the concrete datasets or the exact definition of the divergence metric; both should be stated explicitly in the opening paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments correctly note that the abstract is high-level and does not explicitly describe the controlled nature of the experiments or the statistical details. We address each point below and will revise the abstract accordingly.

Referee: [Abstract] The central claim that the proposed divergence metric 'correlates strongly with the model performances' is presented without any indication that the correlation analysis controls for the other two studied factors (class count, data-point count) or for unmentioned variables such as per-model optimizer choice, learning-rate schedules, or capacity differences; without such isolation the observed correlation cannot be attributed primarily to dataset divergence.

Authors: The full paper describes controlled experiments in which class count and data-point count are held fixed while divergence is varied (see experimental setup and results sections). All OVA models within each run use identical optimizer, learning-rate schedule, and model capacity to isolate the divergence effect. We agree the abstract does not state this explicitly and will revise it to note that the reported correlation is obtained from controlled experiments that account for the other two factors.
Revision: yes

Referee: [Abstract] No statistical tests, confidence intervals, or baseline comparisons are mentioned for the reported correlations, so it is impossible to judge whether the claimed strong relationship exceeds what would be expected from the experimental design alone.

Authors: We agree the abstract omits these details. The results section reports Pearson correlations; we will add p-values, confidence intervals, and a brief baseline comparison (e.g., against random dataset divergence) to both the abstract and the results presentation in the revision.
Revision: yes
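The statistics promised in the second response can be computed with standard tools. The sketch below pairs a Pearson coefficient with a Fisher z-transform confidence interval and a permutation p-value. The permutation test is swapped in here as a simple, assumption-light baseline; it is not necessarily what the authors would use, and the data are synthetic.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def fisher_confidence_interval(r, n, z_crit=1.959963984540054):
    """Approximate 95% CI for r via the Fisher z-transform (needs n >= 4)."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    if abs(r) >= 1:
        return (r, r)  # atanh is undefined at +/-1
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: how often a shuffled y matches or beats |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Synthetic values (not from the paper): divergence score vs. relative degradation.
divergence = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
degradation = [0.1, 0.9, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.1]

r = pearson(divergence, degradation)
low, high = fisher_confidence_interval(r, len(divergence))
p = permutation_p_value(divergence, degradation)
print(f"r={r:.3f}, 95% CI=({low:.3f}, {high:.3f}), p={p:.4f}")
```

With few experimental conditions the Fisher interval is wide, which is one reason a bare correlation coefficient is hard to interpret without it.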

Circularity Check

0 steps flagged

No significant circularity; purely empirical study

The paper defines a dataset-divergence metric independently of performance, then reports empirical correlations with observed OVA accuracy across controlled factors (class count, data-point count). No derivation chain, fitted parameter renamed as prediction, or self-citation that reduces the central claim to its own inputs exists. The analysis is observational and falsifiable against external accuracy measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical analysis paper; no free parameters, axioms, or invented entities are invoked or required by the central claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1] Introduction. Any classification problem can be solved in either a multi-class setup or an One-vs-All (OVA) setup [1]. While similar modeling architectures can be used for both the setups, certain inherent differences exist between them. Firstly, a multi-class modeling with K classes yields a single model, while an OVA modeling yields K different mo...

[2] exist for such tasks, a comparison between the two systems with asynchrony capabilities added in the OVA setup has not been explored. We conduct two sets of experiments in this paper. First, a synthetic setup evaluates the impact of asynchronous training of OVA models on the overall performance of the OVA system. Using domain classification as the NLU...

[3] Setup of an OVA system. Consider a multi-class classification task with K classes {y1, y2, .., yK}, where data-points belonging to the kth classes are part of the dataset Dk. An OVA classification setup for this task will consist of K models, where the model (Mk) is trained to predict confidence scores for the class k against all other classes. The model is...

[4] Analysis on the behavior of one-vs-all systems. In this experiment, we compare the performance gap between an OVA system and a multi-class system, as more asynchrony is injected in the OVA system. We conduct an analysis on the behavior of OVA systems as a function of three factors: (i) the asynchrony between a dataset and its copies, (ii) the size of t...

[5] Analysis on an One-vs-all SLU system. In this section, we conduct an analysis on a more complex real world OVA system: an SLU system consisting of multiple components operating together. We use the SLU model setup developed by Su et al. [13], where each domain in the NLU system contains a set of four models: (i) a Domain Classifier (DC), (ii) an Intent...

[6] Conclusion. OVA systems have been compared to a multi-class system for accuracy in several previous works [2]. In this work, we explore one specific property of OVA systems, where each OVA model can be updated asynchronous of other OVA models. Our application of choice in this paper are SLU tasks and we propose a metric that can quantify asynchrony am...
[7] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.

[8] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” Journal of machine learning research, vol. 5, no. Jan, pp. 101–141, 2004.

[9] A. Beygelzimer, J. Langford, and B. Zadrozny, “Weighted one-against-all,” in AAAI, 2005, pp. 720–725.

[10] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes,” Pattern Recognition, vol. 44, no. 8, pp. 1761–1776, 2011.

[11] V. Dinh, L. S. T. Ho, N. V. Cuong, D. Nguyen, and B. T. Nguyen, “Learning from non-iid data: Fast rates for the one-vs-all multi-class plug-in classifiers,” in International Conference on Theory and Applications of Models of Computation. Springer, 2015, pp. 375–387.

[12] H. Joutsijoki and M. Juhola, “Comparing the one-vs-one and one-vs-all methods in benthic macroinvertebrate image classification,” in International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 2011, pp. 399–413.

[13] A. Mathur and G. M. Foody, “Multiclass and binary svm classification: Implications for training and classification users,” IEEE Geoscience and remote sensing letters, vol. 5, no. 2, pp. 241–245, 2008.

[14] S. Ng, P. Tse, and K. Tsui, “A one-versus-all class binarization strategy for bearing diagnostics of concurrent defects,” Sensors, vol. 14, no. 1, pp. 1295–1321, 2014.

[15] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari, “Adapted one-versus-all decision trees for data stream classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, pp. 624–637, 2009.

[16] R. Sarikaya, “The technology behind personal digital assistants: An overview of the system architecture and key components,” IEEE Signal Processing Magazine, vol. 34, no. 1, pp. 67–81, 2017.

[17] J.-P. Robichaud, P. A. Crook, P. Xu, O. Z. Khan, and R. Sarikaya, “Hypotheses ranking for robust domain classification and tracking in dialogue systems,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[18] R. Sarikaya, P. A. Crook, A. Marin, M. Jeong, J.-P. Robichaud, A. Celikyilmaz, Y.-B. Kim, A. Rochette, O. Z. Khan, X. Liu et al., “An overview of end-to-end language understanding and dialog management for personal digital assistants,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 391–397.

[19] C. Su, R. Gupta, S. Ananthakrishnan, and S. Matsoukas, “A re-ranker scheme for integrating large scale nlu models,” arXiv preprint arXiv:1809.09605, 2018.

[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.

[21] B. Hu, Z. Lu, H. Li, and Q. Chen, “Convolutional neural network architectures for matching natural language sentences,” in Advances in neural information processing systems, 2014, pp. 2042–2050.

[22] J. Lindsey, “Comparison of probability distributions,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 36, no. 1, pp. 38–47, 1974.

[23] S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” City, vol. 1, no. 2, p. 1, 2007.
