
arxiv: 1906.11307 · v1 · pith:6XLSW4RGnew · submitted 2019-06-26 · 💻 cs.LG · cs.CV · cs.PF

One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-off in Machine Learning Cloud Service APIs via Tolerance Tiers

Pith reviewed 2026-05-25 15:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.PF
keywords Tolerance Tiers · MLaaS · accuracy-latency trade-off · cloud machine learning · automatic speech recognition · image classification · service APIs

The pith

Machine learning cloud services can outperform a single fixed version when users select from tolerance tiers that each expose a different accuracy and latency profile.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that today's cloud ML services use one deployment for all users, but real applications differ widely in how much accuracy they need versus how fast they must respond. Experiments with a production speech recognition engine and image classification networks reveal clear accuracy-latency trade-offs that make the uniform approach wasteful. The authors introduce Tolerance Tiers: service levels that each publish their own accuracy and responsiveness numbers, so users can pick the one that matches their needs. Evaluations on CPU-based ASR and on CPU and GPU vision models demonstrate that letting consumers choose tiers yields better results than forcing everyone onto the same version.

Core claim

Tolerance Tiers give each MLaaS instantiation a distinct accuracy/responsiveness characteristic that end users select programmatically, allowing the service to be tuned per consumer and thereby outperform the conventional one-size-fits-all deployment on both the production ASR engine and the evaluated neural networks for image classification.

What carries the argument

Tolerance Tiers: service levels that each expose an accuracy/responsiveness characteristic so consumers can select the tier that fits their requirements.
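
To make the selection mechanism concrete, here is a minimal Python sketch of a client choosing a tier from published accuracy and latency figures. The `Tier` class, the `select_tier` function, the tier names, and every number are hypothetical illustrations and do not come from the paper; the policy shown (the fastest tier that satisfies both bounds) is one assumed rule among many.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    accuracy: float     # published accuracy, as a fraction in [0, 1]
    latency_ms: float   # published response time in milliseconds


def select_tier(tiers: Sequence[Tier], min_accuracy: float,
                max_latency_ms: float) -> Optional[Tier]:
    """Return the fastest tier that meets both bounds, or None if none does."""
    feasible = [t for t in tiers
                if t.accuracy >= min_accuracy and t.latency_ms <= max_latency_ms]
    return min(feasible, key=lambda t: t.latency_ms, default=None)


# Hypothetical published numbers, for illustration only.
TIERS = [
    Tier("fast", accuracy=0.80, latency_ms=50.0),
    Tier("balanced", accuracy=0.90, latency_ms=120.0),
    Tier("accurate", accuracy=0.95, latency_ms=400.0),
]

choice = select_tier(TIERS, min_accuracy=0.85, max_latency_ms=200.0)
print(choice.name if choice else "no tier satisfies the request")
```

Returning `None` instead of silently falling back to some default tier keeps an unsatisfiable request visible to the caller.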

If this is right

  • API consumers can match service behavior to their application's accuracy and responsiveness needs without changing the underlying model.
  • Service providers can expose multiple versions simultaneously and let selection be done at request time.
  • The same tier mechanism works for both CPU-only ASR and for CPU/GPU image classification networks.
  • Quantifying the trade-off per tier makes the cost of accuracy explicit to the user.
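
The last point can be illustrated with a short sketch that reports how much extra latency each step up in accuracy costs. The function name and the tier figures are assumptions made for illustration, not measurements from the paper.

```python
def marginal_costs(tiers):
    """For tiers given as (name, accuracy_pct, latency_ms) tuples, return the
    extra latency paid per extra accuracy point between adjacent tiers."""
    ordered = sorted(tiers, key=lambda t: t[1])  # sort by accuracy
    costs = []
    for (lo_name, lo_acc, lo_lat), (hi_name, hi_acc, hi_lat) in zip(ordered, ordered[1:]):
        gain = hi_acc - lo_acc
        if gain <= 0:
            continue  # equal accuracy: no cost per point to report
        costs.append((lo_name, hi_name, (hi_lat - lo_lat) / gain))
    return costs


# Hypothetical tiers: (name, accuracy in percent, latency in milliseconds).
TIERS = [("fast", 80, 50), ("balanced", 90, 120), ("accurate", 95, 400)]

for lower, higher, ms_per_point in marginal_costs(TIERS):
    print(f"{lower} -> {higher}: {ms_per_point:.1f} ms per accuracy point")
```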

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If similar trade-offs exist in other domains such as natural language processing or recommendation systems, the tier model could be applied without new model training.
  • Providers might add automated tier recommendation based on past request patterns or declared application type.
  • Billing could be differentiated by tier, creating an economic incentive for users to accept lower accuracy when latency matters more.

Load-bearing premise

The accuracy-latency trade-offs measured for the speech recognition and image classification workloads are representative of those that would appear in other machine learning tasks and deployment settings.

What would settle it

A controlled experiment on a new ML workload or hardware platform in which no tier selection yields higher effective utility than the single best fixed version would falsify the claim that tiered selection improves on one-size-fits-all.
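
One way such an experiment could be scored is sketched below. The utility function, the consumer requirements, and the tier figures are all assumptions made for illustration; the paper's abstract does not define a utility measure. Under this idealized model, per-consumer selection can never score lower than the best fixed version, because a sum of per-consumer maxima is at least the maximum of the sums. The falsifying outcome is therefore a gap of zero, or a gap smaller than the overhead of operating several tiers, which the sketch does not model.

```python
def utility(tier, consumer):
    """Hypothetical utility: the tier's accuracy if it meets the consumer's
    accuracy floor and deadline, otherwise zero."""
    _name, accuracy, latency_ms = tier
    min_accuracy, deadline_ms = consumer
    if accuracy >= min_accuracy and latency_ms <= deadline_ms:
        return accuracy
    return 0


def tiered_total(tiers, consumers):
    """Total utility when every consumer picks its own best tier."""
    return sum(max(utility(t, c) for t in tiers) for c in consumers)


def best_fixed_total(tiers, consumers):
    """Total utility of the single version that serves all consumers best."""
    return max(sum(utility(t, c) for c in consumers) for t in tiers)


# Hypothetical tiers: (name, accuracy in percent, latency in milliseconds).
TIERS = [("fast", 80, 50), ("balanced", 90, 120), ("accurate", 95, 400)]
# Hypothetical consumers: (minimum accuracy in percent, deadline in milliseconds).
CONSUMERS = [(75, 100), (85, 200), (95, 1000)]

gap = tiered_total(TIERS, CONSUMERS) - best_fixed_total(TIERS, CONSUMERS)
print("utility gained by tier selection:", gap)
```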

Original abstract

Today's cloud service architectures follow a "one size fits all" deployment strategy where the same service version instantiation is provided to the end users. However, consumers are broad and different applications have different accuracy and responsiveness requirements, which as we demonstrate renders the "one size fits all" approach inefficient in practice. We use a production-grade speech recognition engine, which serves several thousands of users, and an open source computer vision based system, to explain our point. To overcome the limitations of the "one size fits all" approach, we recommend Tolerance Tiers where each MLaaS tier exposes an accuracy/responsiveness characteristic, and consumers can programmatically select a tier. We evaluate our proposal on the CPU-based automatic speech recognition (ASR) engine and cutting-edge neural networks for image classification deployed on both CPUs and GPUs. The results show that our proposed approach provides an MLaaS cloud service architecture that can be tuned by the end API user or consumer to outperform the conventional "one size fits all" approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that conventional MLaaS cloud services use a one-size-fits-all deployment that is inefficient given diverse user accuracy and latency needs. It proposes Tolerance Tiers, where each tier exposes a distinct accuracy/responsiveness profile that end users can select programmatically. The approach is evaluated on a production CPU-based ASR engine serving thousands of users and on open-source image classification networks deployed on both CPU and GPU; the abstract concludes that the tiered architecture can be tuned to outperform the single fixed instantiation.

Significance. If the empirical trade-offs generalize, Tolerance Tiers would offer a practical, user-tunable alternative to monolithic MLaaS deployments, potentially improving both user utility and provider efficiency. The use of a real production ASR workload and concrete CPU/GPU measurements on vision models supplies a concrete existence proof for the claimed trade-off structure in at least two domains.

major comments (2)
  1. [Abstract] Abstract: the claim that Tolerance Tiers 'can be tuned by the end API user or consumer to outperform the conventional one size fits all approach' rests on evaluations performed only on ASR and image-classification workloads. No results, discussion, or argument are supplied for other task families (e.g., sequence models, object detection, or reinforcement learning) whose accuracy-latency surfaces may be flatter or non-monotonic; this directly limits the scope of the architectural recommendation.
  2. [Abstract] Abstract / Evaluation description: the manuscript states that tiers 'expose an accuracy/responsiveness characteristic' yet supplies neither a formal definition of tier boundaries, a utility function used to select among tiers, nor quantitative evidence (e.g., net user benefit or Pareto improvement) that any tier selection rule beats the single best fixed configuration across the reported workloads.
minor comments (1)
  1. [Abstract] The abstract contains informal phrasing ('explain our point') that should be revised for a formal journal submission.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on scope and formalization. We address each point below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract: the claim that Tolerance Tiers 'can be tuned by the end API user or consumer to outperform the conventional one size fits all approach' rests on evaluations performed only on ASR and image-classification workloads. No results, discussion, or argument are supplied for other task families (e.g., sequence models, object detection, or reinforcement learning) whose accuracy-latency surfaces may be flatter or non-monotonic; this directly limits the scope of the architectural recommendation.

    Authors: We agree that the empirical results are confined to ASR and image classification. The manuscript provides no data or discussion for other task families. In revision we will temper the abstract claim, add an explicit limitations paragraph, and discuss why the Tolerance Tiers principle may still apply while noting that accuracy-latency surfaces could differ in other domains. revision: yes

  2. Referee: [Abstract] Abstract / Evaluation description: the manuscript states that tiers 'expose an accuracy/responsiveness characteristic' yet supplies neither a formal definition of tier boundaries, a utility function used to select among tiers, nor quantitative evidence (e.g., net user benefit or Pareto improvement) that any tier selection rule beats the single best fixed configuration across the reported workloads.

    Authors: The current manuscript relies on empirical demonstration rather than formal definitions. We will add a dedicated section that (1) formally defines tier boundaries via accuracy and latency thresholds, (2) introduces a simple utility function for tier selection, and (3) reports explicit Pareto-improvement metrics comparing tier selection against the single best fixed configuration on both workloads. revision: yes
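
As a rough illustration of the kind of Pareto check the rebuttal promises, the sketch below filters candidate configurations down to those that no other configuration beats on both accuracy and latency. It is an editorial sketch, not the authors' method; the function name and the configuration figures are hypothetical.

```python
def pareto_front(configs):
    """Keep the configurations that are not dominated, where higher accuracy
    and lower latency are both preferred.

    Each configuration is a (name, accuracy_pct, latency_ms) tuple.
    """
    def dominates(a, b):
        # a is at least as good as b on both axes and strictly better on one.
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical configurations, for illustration only.
CONFIGS = [("A", 80, 50), ("B", 90, 120), ("C", 88, 150), ("D", 95, 400)]

# "C" drops out because "B" is both more accurate and faster.
print([name for name, _acc, _lat in pareto_front(CONFIGS)])
```

Only configurations on this front are sensible candidates for tiers, since a dominated configuration is never the best choice for any consumer who prefers higher accuracy and lower latency.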

Circularity Check

0 steps flagged

No circularity; empirical systems proposal with direct measurements


The paper contains no mathematical derivations, equations, fitted parameters, or uniqueness theorems. Its central claim rests on direct empirical measurements of accuracy-latency trade-offs for a production ASR engine and image-classification networks, followed by a recommendation for Tolerance Tiers. No step reduces to its own inputs by construction, no self-citation is load-bearing for any derivation, and the work is self-contained against external benchmarks. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

No mathematical free parameters or axioms are invoked; the central claim rests on the empirical observation of accuracy-latency trade-offs and the assumption that exposing tiers is feasible in production cloud infrastructure. Tolerance Tiers is an invented architectural concept without independent falsifiable evidence outside the paper.

invented entities (1)
  • Tolerance Tiers (no independent evidence)
    purpose: Expose distinct accuracy/responsiveness characteristics so consumers can programmatically select a tier
    Introduced in the abstract as the recommended solution to the one-size-fits-all problem; no external validation or falsifiable prediction is provided.



