
arxiv: 1906.08858 · v1 · pith:ZBYKCIVBnew · submitted 2019-06-20 · 💻 cs.LG · stat.ML

One-vs-All Models for Asynchronous Training: An Empirical Analysis

Pith reviewed 2026-05-25 19:23 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords one-vs-all · asynchronous training · dataset divergence metric · natural language understanding · spoken language understanding · empirical analysis · multi-class classification · model updates

The pith

A metric quantifying training dataset differences in One-vs-All systems correlates strongly with overall accuracy after asynchronous updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

One-vs-All models allow each class to have its own classifier that can be retrained on its own schedule. Asynchronous retraining creates mismatches in the data each model has seen. The paper defines a metric for these mismatches and tests its relation to system accuracy on natural language understanding tasks and a spoken language understanding system. The metric tracks accuracy changes more closely than the sheer number of classes or examples. This setup matters for large-scale systems that must incorporate new data frequently without full retraining.
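To make the setup concrete, the following is a minimal sketch of an OVA system in which one class is retrained on a newer data snapshot while the other keeps its old model. The token-frequency scorer, the class names `OvaModel` and `OvaSystem`, and the toy utterances are all illustrative assumptions; the paper's actual models are not described on this page.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class OvaModel:
    """Toy binary scorer for one class (an assumption, not the paper's model)."""
    label: str
    pos: Counter = field(default_factory=Counter)
    neg: Counter = field(default_factory=Counter)
    data_version: int = 0  # snapshot of the training data this model last saw

    def fit(self, data, version):
        # Rebuild token counts from scratch: own class is positive, the rest negative.
        self.pos, self.neg = Counter(), Counter()
        for text, y in data:
            target = self.pos if y == self.label else self.neg
            target.update(text.lower().split())
        self.data_version = version

    def score(self, text):
        # Confidence = positive token frequency minus negative token frequency.
        pos_total = sum(self.pos.values()) or 1
        neg_total = sum(self.neg.values()) or 1
        return sum(self.pos[t] / pos_total - self.neg[t] / neg_total
                   for t in text.lower().split())


class OvaSystem:
    """One model per class; each can be retrained on its own schedule."""

    def __init__(self, labels):
        self.models = {k: OvaModel(k) for k in labels}

    def retrain(self, label, data, version):
        self.models[label].fit(data, version)

    def predict(self, text):
        # The class whose model reports the highest confidence wins.
        return max(self.models, key=lambda k: self.models[k].score(text))


snapshot_v1 = [
    ("play some jazz", "music"),
    ("play the beatles", "music"),
    ("will it rain today", "weather"),
    ("weather forecast for tomorrow", "weather"),
]
system = OvaSystem(["music", "weather"])
for k in system.models:
    system.retrain(k, snapshot_v1, version=1)

# New data arrives; only the "weather" model is retrained (asynchronous update).
snapshot_v2 = snapshot_v1 + [("is it sunny outside", "weather")]
system.retrain("weather", snapshot_v2, version=2)

versions = {k: m.data_version for k, m in system.models.items()}
print(versions)
print(system.predict("play jazz"), system.predict("rain forecast"))
```

After the second `retrain` call the two models have seen different snapshots, which is the kind of mismatch the paper's metric is meant to quantify.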

Core claim

The paper shows that in One-vs-All classification, differences in the training datasets of the individual models caused by asynchronous updates can be measured by a proposed divergence metric, and this metric correlates strongly with the accuracy of the combined system in both natural language understanding and spoken language understanding experiments.

What carries the argument

A metric to quantify differences in training datasets across OVA models due to asynchronous updates.
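The exact definition of the paper's metric is not reproduced on this page. As a purely illustrative stand-in, the sketch below measures how far each OVA model's stored copy of another class's dataset has drifted from the current version of that dataset, using mean Jaccard distance. The function names and the choice of Jaccard distance are assumptions, not the authors' formula.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_copy_divergence(current, copies):
    """Average staleness of the dataset copies held by each OVA model.

    current: {class_l: examples in the up-to-date dataset D_l}
    copies:  {class_k: {class_l: examples of D_l that model M_k was trained on}}
    Hypothetical stand-in metric, not the paper's definition.
    """
    distances = [
        jaccard_distance(current[l], held_copy)
        for k, held in copies.items()
        for l, held_copy in held.items()
        if l != k  # only copies borrowed from other classes count
    ]
    return sum(distances) / len(distances) if distances else 0.0


current = {"music": {"a", "b", "c", "d"}, "weather": {"x", "y"}}
copies = {
    "music": {"weather": {"x", "y"}},   # up-to-date copy, distance 0.0
    "weather": {"music": {"a", "b"}},   # stale copy, distance 0.5
}
divergence = mean_copy_divergence(current, copies)
print(divergence)
```

A value of 0 means every model trained on the current data; larger values mean more of the negative examples each model saw are out of date.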

If this is right

  • The proposed dataset divergence metric is the dominant factor affecting OVA system accuracy compared to number of classes or data points.
  • Accuracy impact can be estimated from the level of asynchrony in training.
  • Increased asynchrony in SLU systems reduces accuracy in line with the metric.
  • Independent updates to OVA models are viable when dataset divergences remain small.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The metric could guide decisions on when to force a full system retrain versus allowing independent updates.
  • Similar divergence measures might apply to other modular training setups like ensemble methods.
  • Future work could test whether controlling the metric directly during data collection improves outcomes.
  • OVA with async updates may scale better than joint multi-class training in high-update environments.
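The first extension above, together with the paper excerpt under Figure 2 that mentions a baseline value for metric α, suggests a simple monitoring rule. The sketch below is one hypothetical way to express it: record the metric in a known-acceptable state and flag a full retrain once the current value exceeds that baseline by a chosen margin. `RetrainPolicy` and its `tolerance` parameter are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    baseline: float   # metric value measured in a known-acceptable state
    tolerance: float  # assumed allowed relative increase over the baseline

    def needs_full_retrain(self, current: float) -> bool:
        # Flag a synchronised retrain once divergence drifts past the margin.
        return current > self.baseline * (1.0 + self.tolerance)


policy = RetrainPolicy(baseline=0.10, tolerance=0.5)
print(policy.needs_full_retrain(0.12))  # within the margin
print(policy.needs_full_retrain(0.20))  # beyond the margin
```

How large the tolerance should be would have to be calibrated against observed accuracy loss; nothing on this page fixes a value.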

Load-bearing premise

Changes in accuracy are driven mainly by the dataset divergences captured by the metric rather than by unmeasured factors like optimization paths or model capacities.

What would settle it

An experiment that holds dataset contents fixed across OVA models but varies only the order or optimizer and measures whether accuracy variance matches the levels seen under asynchronous dataset changes.

Figures

Figures reproduced from arXiv: 1906.08858 by Aman Alok, Rahul Gupta, Shankar Ananthakrishnan.

Figure 1. Figures depicting absolute increase in OVA system classification error rate as compared to a multi-class system: (a) performance gap as an OVA model Mk is trained with fewer negative samples borrowed from Dl, l ≠ k, (b) performance gap with a fixed sub-sampling from other datasets Dl, l ≠ k but with increasing overall dataset size and, (c) performance gap as number of classes increase with a fixed sub-…
Figure 2. Relative SemER degradation in an OVA SLU system compared to a multi-class SLU system. Staleness is determined in terms of month time units.

… degradation is also high (0.84), suggesting the applicability of this metric in real world systems. In a real world setting, we suggest an operational model where a baseline value for metric α can be computed for a given acceptable state of the OVA models. Metric α ca…
Original abstract

Any given classification problem can be modeled using multi-class or One-vs-All (OVA) architecture. An OVA system consists of as many OVA models as the number of classes, providing the advantage of asynchrony, where each OVA model can be re-trained independent of other models. This is particularly advantageous in settings where scalable model training is a consideration (for instance in an industrial environment where multiple and frequent updates need to be made to the classification system). In this paper, we conduct empirical analysis on realizing independent updates to OVA models and its impact on the accuracy of the overall OVA system. Given that asynchronous updates lead to differences in training datasets for OVA models, we first define a metric to quantify the differences in datasets. Thereafter, using Natural Language Understanding as a task of interest, we estimate the impact of three factors: (i) number of classes, (ii) number of data points and, (iii) divergences in training datasets across OVA models; on the OVA system accuracy. Finally, we observe the accuracy impact of increased asynchrony in a Spoken Language Understanding system. We analyze the results and establish that the proposed metric correlates strongly with the model performances in both the experimental settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that asynchronous updates to One-vs-All (OVA) classifiers induce dataset divergences that can be quantified by a proposed metric; through controlled experiments on NLU and SLU tasks it shows that this metric, together with class count and data-point count, affects overall OVA accuracy, and that the metric correlates strongly with observed performance.

Significance. If the reported correlation survives proper isolation from confounding variables, the metric would supply a practical diagnostic for deciding when independent retraining of OVA components remains safe, which is directly relevant to large-scale industrial classification pipelines that require frequent model updates.

major comments (2)
  1. [Abstract] The central claim that the proposed divergence metric 'correlates strongly with the model performances' is presented without any indication that the correlation analysis controls for the other two studied factors (class count, data-point count) or for unmentioned variables such as per-model optimizer choice, learning-rate schedules, or capacity differences; without such isolation the observed correlation cannot be attributed primarily to dataset divergence.
  2. [Abstract] No statistical tests, confidence intervals, or baseline comparisons are mentioned for the reported correlations, so it is impossible to judge whether the claimed strong relationship exceeds what would be expected from the experimental design alone.
minor comments (1)
  1. [Abstract] The abstract states that three factors were studied but does not name the concrete datasets or the exact definition of the divergence metric; both should be stated explicitly in the opening paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments correctly note that the abstract is high-level and does not explicitly describe the controlled nature of the experiments or the statistical details. We address each point below and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the proposed divergence metric 'correlates strongly with the model performances' is presented without any indication that the correlation analysis controls for the other two studied factors (class count, data-point count) or for unmentioned variables such as per-model optimizer choice, learning-rate schedules, or capacity differences; without such isolation the observed correlation cannot be attributed primarily to dataset divergence.

    Authors: The full paper describes controlled experiments in which class count and data-point count are held fixed while divergence is varied (see experimental setup and results sections). All OVA models within each run use identical optimizer, learning-rate schedule, and model capacity to isolate the divergence effect. We agree the abstract does not state this explicitly and will revise it to note that the reported correlation is obtained from controlled experiments that account for the other two factors. revision: yes

  2. Referee: [Abstract] Abstract: no statistical tests, confidence intervals, or baseline comparisons are mentioned for the reported correlations, so it is impossible to judge whether the claimed strong relationship exceeds what would be expected from the experimental design alone.

    Authors: We agree the abstract omits these details. The results section reports Pearson correlations; we will add p-values, confidence intervals, and a brief baseline comparison (e.g., against random dataset divergence) to both the abstract and the results presentation in the revision. revision: yes
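The statistics promised in the second response can be sketched as follows: a Pearson correlation, an approximate 95% confidence interval via the Fisher z-transform, and a permutation test as a baseline for how strong a correlation chance alone would produce. The permutation test is a substitute chosen here for illustration, not necessarily the baseline the authors have in mind, and the numbers are synthetic placeholders rather than results from the paper.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959963984540054):
    """Approximate 95% CI for r using the Fisher z-transform (needs n >= 4)."""
    if n < 4:
        raise ValueError("need at least 4 points")
    r = max(min(r, 1 - 1e-12), -1 + 1e-12)  # keep atanh finite
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided p-value: how often shuffled pairings match or beat |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids p = 0


# Synthetic placeholder values, NOT measurements from the paper.
divergence = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
error_increase = [0.1, 0.4, 0.9, 1.1, 1.8, 2.0]

r = pearson_r(divergence, error_increase)
ci_low, ci_high = fisher_ci(r, len(divergence))
p_value = permutation_p_value(divergence, error_increase)
print(r, (ci_low, ci_high), p_value)
```

With only a handful of points the interval is wide even when r is high, which is why the referee's request for intervals alongside the correlation matters.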

Circularity Check

0 steps flagged

No significant circularity; purely empirical study

full rationale

The paper defines a dataset-divergence metric independently of performance, then reports empirical correlations with observed OVA accuracy across controlled factors (class count, data-point count). No derivation chain, fitted parameter renamed as prediction, or self-citation that reduces the central claim to its own inputs exists. The analysis is observational and falsifiable against external accuracy measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical analysis paper; no free parameters, axioms, or invented entities are invoked or required by the central claim.



Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

  1. [1]

    While similar modeling architectures can be used for both the setups, certain inherent differences exist between them

    Introduction. Any classification problem can be solved in either a multi-class setup or an One-vs-All (OVA) setup [1]. While similar modeling architectures can be used for both the setups, certain inherent differences exist between them. Firstly, a multi-class modeling with K classes yields a single model, while an OVA modeling yields K different mo...

  2. [2]

    We conduct two sets of experiments in this paper

    exist for such tasks, a comparison between the two systems with asynchrony capabilities added in the OVA setup has not been explored. We conduct two sets of experiments in this paper. First, a synthetic setup evaluates the impact of asynchronous training of OVA models on the overall performance of the OVA system. Using domain classification as the NLU...

  3. [3]

    One-vs-All Models for Asynchronous Training: An Empirical Analysis

    Setup of an OVA system. Consider a multi-class classification task with K classes {y1, y2, ..., yK}, where data-points belonging to the kth classes are part of the dataset Dk. An OVA classification setup for this task will consist of K models, where the model (Mk) is trained to predict confidence scores for the class k against all other classes. The model is...

  4. [4]

    Analysis on the behavior of one-vs-all systems. In this experiment, we compare the performance gap between an OVA system and a multi-class system, as more asynchrony is injected in the OVA system. We conduct an analysis on the behavior of OVA systems as a function of three factors: (i) the asynchrony between a dataset and its copies, (ii) the size of t...

  5. [5]

    We use the SLU model setup developed by Su et al

    Analysis on an One-vs-all SLU system. In this section, we conduct an analysis on a more complex real world OVA system: an SLU system consisting of multiple components operating together. We use the SLU model setup developed by Su et al. [13], where each domain in the NLU system contains a set of four models: (i) a Domain Classifier (DC), (ii) an Intent...

  6. [6]

    In this work, we explore one specific property of OVA systems, where each OVA model can be updated asynchronous of other OVA models

    Conclusion. OVA systems have been compared to a multi-class system for accuracy in several previous works [2]. In this work, we explore one specific property of OVA systems, where each OVA model can be updated asynchronous of other OVA models. Our application of choice in this paper are SLU tasks and we propose a metric that can quantify asynchrony am...

  7. [7]

    C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006

  8. [8]

    In defense of one-vs-all classification

    R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” Journal of Machine Learning Research, vol. 5, no. Jan, pp. 101–141, 2004

  9. [9]

    Weighted one-against-all

    A. Beygelzimer, J. Langford, and B. Zadrozny, “Weighted one-against-all,” in AAAI, 2005, pp. 720–725

  10. [10]

    An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes

    M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes,” Pattern Recognition, vol. 44, no. 8, pp. 1761–1776, 2011

  11. [11]

    Learning from non-iid data: Fast rates for the one-vs-all multi-class plug-in classifiers

    V. Dinh, L. S. T. Ho, N. V. Cuong, D. Nguyen, and B. T. Nguyen, “Learning from non-iid data: Fast rates for the one-vs-all multi-class plug-in classifiers,” in International Conference on Theory and Applications of Models of Computation. Springer, 2015, pp. 375–387

  12. [12]

    Comparing the one-vs-one and one-vs-all methods in benthic macroinvertebrate image classification

    H. Joutsijoki and M. Juhola, “Comparing the one-vs-one and one-vs-all methods in benthic macroinvertebrate image classification,” in International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 2011, pp. 399–413

  13. [13]

    Multiclass and binary SVM classification: Implications for training and classification users

    A. Mathur and G. M. Foody, “Multiclass and binary SVM classification: Implications for training and classification users,” IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 241–245, 2008

  14. [14]

    A one-versus-all class binarization strategy for bearing diagnostics of concurrent defects

    S. Ng, P. Tse, and K. Tsui, “A one-versus-all class binarization strategy for bearing diagnostics of concurrent defects,” Sensors, vol. 14, no. 1, pp. 1295–1321, 2014

  15. [15]

    Adapted one-versus-all decision trees for data stream classification

    S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari, “Adapted one-versus-all decision trees for data stream classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, pp. 624–637, 2009

  16. [16]

    The technology behind personal digital assistants: An overview of the system architecture and key components

    R. Sarikaya, “The technology behind personal digital assistants: An overview of the system architecture and key components,” IEEE Signal Processing Magazine, vol. 34, no. 1, pp. 67–81, 2017

  17. [17]

    Hypotheses ranking for robust domain classification and tracking in dialogue systems

    J.-P. Robichaud, P. A. Crook, P. Xu, O. Z. Khan, and R. Sarikaya, “Hypotheses ranking for robust domain classification and tracking in dialogue systems,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014

  18. [18]

    An overview of end-to-end language understanding and dialog management for personal digital assistants

    R. Sarikaya, P. A. Crook, A. Marin, M. Jeong, J.-P. Robichaud, A. Celikyilmaz, Y.-B. Kim, A. Rochette, O. Z. Khan, X. Liu et al., “An overview of end-to-end language understanding and dialog management for personal digital assistants,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 391–397

  19. [19]

    A Re-ranker Scheme for Integrating Large Scale NLU models

    C. Su, R. Gupta, S. Ananthakrishnan, and S. Matsoukas, “A re-ranker scheme for integrating large scale NLU models,” arXiv preprint arXiv:1809.09605, 2018

  20. [20]

    Distributed representations of words and phrases and their compositionality

    T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119

  21. [21]

    Convolutional neural network architectures for matching natural language sentences

    B. Hu, Z. Lu, H. Li, and Q. Chen, “Convolutional neural network architectures for matching natural language sentences,” in Advances in Neural Information Processing Systems, 2014, pp. 2042–2050

  22. [22]

    Comparison of probability distributions

    J. Lindsey, “Comparison of probability distributions,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 36, no. 1, pp. 38–47, 1974

  23. [23]

    Comprehensive survey on distance/similarity measures between probability density functions

    S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” City, vol. 1, no. 2, p. 1, 2007