One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-off in Machine Learning Cloud Service APIs via Tolerance Tiers

Behzad Boroujerdian; Evelyn Duesterwald; Matthew Halpern; Todd Mummert; Vijay Janapa Reddi

arXiv: 1906.11307 · v1 · pith:6XLSW4RGnew · submitted 2019-06-26 · 💻 cs.LG · cs.CV · cs.PF

Pith reviewed 2026-05-25 15:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.PF

keywords Tolerance Tiers · MLaaS · accuracy-latency trade-off · cloud machine learning · automatic speech recognition · image classification · service APIs

The pith

Machine learning cloud services can outperform a single fixed version when users select from tolerance tiers that each expose a different accuracy and latency profile.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that today's cloud ML services use one deployment for all users, even though real applications differ widely in how much accuracy they need and how fast they must respond. Experiments with a production speech recognition engine and image classification networks reveal clear accuracy-latency trade-offs that make the uniform approach inefficient. The authors introduce Tolerance Tiers: service levels that each publish their own accuracy and responsiveness numbers, so users can pick the one that matches their needs. Evaluations on CPU-based ASR and on CPU and GPU vision models indicate that letting consumers choose tiers yields better results than forcing everyone onto the same version.

Core claim

Tolerance Tiers give each MLaaS instantiation a distinct accuracy/responsiveness characteristic that end users select programmatically, allowing the service to be tuned per consumer and thereby outperform the conventional one-size-fits-all deployment on both the production ASR engine and the evaluated neural networks for image classification.

What carries the argument

Tolerance Tiers: service levels that each expose an accuracy/responsiveness characteristic so consumers can select the tier that fits their requirements.
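
A minimal sketch of what programmatic tier selection might look like from the consumer's side. The `Tier` class, the `select_tier` function, and every number in `CATALOGUE` are hypothetical and chosen only for illustration; the paper's actual API and measurements are not reproduced here. The assumed policy is "fastest tier that meets both constraints", which is one reasonable reading among several.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    """One published service level (all numbers below are made up)."""
    name: str
    accuracy: float      # published accuracy, as a fraction in [0, 1]
    latency_ms: float    # published response time in milliseconds


def select_tier(tiers: Sequence[Tier],
                min_accuracy: float,
                max_latency_ms: float) -> Optional[Tier]:
    """Return the fastest tier that satisfies both consumer constraints.

    Returns None when no published tier fits, so the caller can relax a
    constraint instead of silently receiving a worse service level.
    """
    feasible = [t for t in tiers
                if t.accuracy >= min_accuracy and t.latency_ms <= max_latency_ms]
    return min(feasible, key=lambda t: t.latency_ms, default=None)


# Hypothetical catalogue; not measurements from the paper.
CATALOGUE = [
    Tier("fast", accuracy=0.80, latency_ms=40.0),
    Tier("balanced", accuracy=0.90, latency_ms=120.0),
    Tier("accurate", accuracy=0.95, latency_ms=400.0),
]

choice = select_tier(CATALOGUE, min_accuracy=0.85, max_latency_ms=200.0)
print(choice.name if choice else "no tier fits")
```

Returning `None` rather than falling back to a default tier keeps the mismatch between requirements and published characteristics visible to the caller.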

If this is right

API consumers can match service behavior to their application's accuracy and responsiveness needs without changing the underlying model.
Service providers can expose multiple versions simultaneously and let selection be done at request time.
The same tier mechanism works for both CPU-only ASR and for CPU/GPU image classification networks.
Quantifying the trade-off per tier makes the cost of accuracy explicit to the user.
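
The request-time selection mentioned in this list could be sketched as a small dispatcher in which each request names its tier. The backends `fast_model` and `accurate_model`, the `ROUTES` table, and the `tier` parameter are placeholders invented for this example; they stand in for separately deployed service versions and do not describe the paper's routing layer.

```python
from typing import Callable, Dict


# Hypothetical backends: each stands in for one deployed service version.
def fast_model(payload: str) -> str:
    return f"fast:{payload}"


def accurate_model(payload: str) -> str:
    return f"accurate:{payload}"


# Several versions are exposed at once; the request picks one by name.
ROUTES: Dict[str, Callable[[str], str]] = {
    "fast": fast_model,
    "accurate": accurate_model,
}


def handle_request(payload: str, tier: str = "accurate") -> str:
    """Dispatch one request to the version named by its tier field."""
    try:
        backend = ROUTES[tier]
    except KeyError:
        raise ValueError(
            f"unknown tier {tier!r}; expected one of {sorted(ROUTES)}"
        ) from None
    return backend(payload)


print(handle_request("hello", tier="fast"))
```

Defaulting to the most accurate version is an assumption of this sketch; a provider could equally default to the cheapest one.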

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If similar trade-offs exist in other domains such as natural language processing or recommendation systems, the tier model could be applied without new model training.
Providers might add automated tier recommendation based on past request patterns or declared application type.
Billing could be differentiated by tier, creating an economic incentive for users to accept lower accuracy when latency matters more.

Load-bearing premise

The accuracy-latency trade-offs measured for the speech recognition and image classification workloads are representative of those that would appear in other machine learning tasks and deployment settings.

What would settle it

A controlled experiment on a new ML workload or hardware platform in which no tier selection yields higher effective utility than the single best fixed version would falsify the claim that tiered selection improves on one-size-fits-all.
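
One way to make this test concrete, assuming a linear utility that the paper does not define: compare the mean utility when each consumer picks their best version against the mean utility of the single best fixed version. The `utility` function, the latency weights, and the values in `VERSIONS` are all illustrative assumptions. Under this model the gain can never be negative, so the informative outcome is whether it is distinguishable from zero.

```python
from typing import Dict, Sequence, Tuple

# name -> (accuracy, latency_ms). Hypothetical values, not the paper's data.
Versions = Dict[str, Tuple[float, float]]


def utility(accuracy: float, latency_ms: float, latency_weight: float) -> float:
    """Assumed linear utility: accuracy minus a consumer-specific latency penalty."""
    return accuracy - latency_weight * latency_ms


def tiered_gain(versions: Versions, latency_weights: Sequence[float]) -> float:
    """Mean utility under per-consumer selection minus that of the best fixed version."""
    n = len(latency_weights)
    tiered = sum(
        max(utility(acc, lat, w) for acc, lat in versions.values())
        for w in latency_weights
    ) / n
    best_fixed = max(
        sum(utility(acc, lat, w) for w in latency_weights) / n
        for acc, lat in versions.values()
    )
    return tiered - best_fixed


VERSIONS: Versions = {"fast": (0.80, 40.0), "accurate": (0.95, 400.0)}

# Two consumers who weigh latency differently, then two identical consumers.
print(tiered_gain(VERSIONS, [0.0001, 0.001]))
print(tiered_gain(VERSIONS, [0.001, 0.001]))
```

When all consumers share the same preferences, the gain collapses to zero, which is the situation the falsifying experiment would have to exhibit on a new workload.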

Original abstract

Today's cloud service architectures follow a "one size fits all" deployment strategy where the same service version instantiation is provided to the end users. However, consumers are broad and different applications have different accuracy and responsiveness requirements, which as we demonstrate renders the "one size fits all" approach inefficient in practice. We use a production-grade speech recognition engine, which serves several thousands of users, and an open source computer vision based system, to explain our point. To overcome the limitations of the "one size fits all" approach, we recommend Tolerance Tiers where each MLaaS tier exposes an accuracy/responsiveness characteristic, and consumers can programmatically select a tier. We evaluate our proposal on the CPU-based automatic speech recognition (ASR) engine and cutting-edge neural networks for image classification deployed on both CPUs and GPUs. The results show that our proposed approach provides an MLaaS cloud service architecture that can be tuned by the end API user or consumer to outperform the conventional "one size fits all" approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that conventional MLaaS cloud services use a one-size-fits-all deployment that is inefficient given diverse user accuracy and latency needs. It proposes Tolerance Tiers, where each tier exposes a distinct accuracy/responsiveness profile that end users can select programmatically. The approach is evaluated on a production CPU-based ASR engine serving thousands of users and on open-source image classification networks deployed on both CPU and GPU; the abstract concludes that the tiered architecture can be tuned to outperform the single fixed instantiation.

Significance. If the empirical trade-offs generalize, Tolerance Tiers would offer a practical, user-tunable alternative to monolithic MLaaS deployments, potentially improving both user utility and provider efficiency. The real production ASR workload and the CPU/GPU measurements on vision models supply an existence proof for the claimed trade-off structure in at least two domains.

major comments (2)

[Abstract] The claim that Tolerance Tiers 'can be tuned by the end API user or consumer to outperform the conventional one size fits all approach' rests on evaluations performed only on ASR and image-classification workloads. No results, discussion, or argument are supplied for other task families (e.g., sequence models, object detection, or reinforcement learning) whose accuracy-latency surfaces may be flatter or non-monotonic; this directly limits the scope of the architectural recommendation.
[Abstract / Evaluation description] The manuscript states that tiers 'expose an accuracy/responsiveness characteristic' yet supplies no formal definition of tier boundaries, no utility function for selecting among tiers, and no quantitative evidence (e.g., net user benefit or Pareto improvement) that any tier selection rule beats the single best fixed configuration across the reported workloads.

minor comments (1)

[Abstract] The abstract contains informal phrasing ('explain our point') that should be revised for a formal journal submission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on scope and formalization. We address each point below and will revise the manuscript accordingly.

Point-by-point responses

Referee: [Abstract] The claim that Tolerance Tiers 'can be tuned by the end API user or consumer to outperform the conventional one size fits all approach' rests on evaluations performed only on ASR and image-classification workloads. No results, discussion, or argument are supplied for other task families (e.g., sequence models, object detection, or reinforcement learning) whose accuracy-latency surfaces may be flatter or non-monotonic; this directly limits the scope of the architectural recommendation.

Authors: We agree that the empirical results are confined to ASR and image classification. The manuscript provides no data or discussion for other task families. In revision we will temper the abstract claim, add an explicit limitations paragraph, and discuss why the Tolerance Tiers principle may still apply while noting that accuracy-latency surfaces could differ in other domains.
Revision: yes

Referee: [Abstract / Evaluation description] The manuscript states that tiers 'expose an accuracy/responsiveness characteristic' yet supplies no formal definition of tier boundaries, no utility function for selecting among tiers, and no quantitative evidence (e.g., net user benefit or Pareto improvement) that any tier selection rule beats the single best fixed configuration across the reported workloads.

Authors: The current manuscript relies on empirical demonstration rather than formal definitions. We will add a dedicated section that (1) formally defines tier boundaries via accuracy and latency thresholds, (2) introduces a simple utility function for tier selection, and (3) reports explicit Pareto-improvement metrics comparing tier selection against the single best fixed configuration on both workloads.
Revision: yes
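
As a rough illustration of the Pareto analysis promised in item (3), and not the authors' planned method, the sketch below filters a set of candidate configurations down to those that no other configuration beats on both accuracy and latency. The `Config` tuple layout, the `pareto_front` function, and the values in `CONFIGS` are assumptions made for this example.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, accuracy, latency_ms)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates.

    A dominates B when A is at least as accurate and at least as fast,
    and strictly better on one of the two.
    """
    def dominates(a: Config, b: Config) -> bool:
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    front = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(front, key=lambda c: c[2])  # fastest first


# Hypothetical measurements; "b" is slower and less accurate than "a".
CONFIGS: List[Config] = [
    ("a", 0.80, 40.0),
    ("b", 0.78, 60.0),
    ("c", 0.90, 120.0),
    ("d", 0.95, 400.0),
]

print([name for name, _, _ in pareto_front(CONFIGS)])
```

Only configurations on this front are candidates for tiers; a dominated configuration would never be the right choice for any consumer whose utility rises with accuracy and falls with latency.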

Circularity Check

0 steps flagged

No circularity; empirical systems proposal with direct measurements

Full rationale

The paper contains no mathematical derivations, equations, fitted parameters, or uniqueness theorems. Its central claim rests on direct empirical measurements of accuracy-latency trade-offs for a production ASR engine and image-classification networks, followed by a recommendation for Tolerance Tiers. No step reduces to its own inputs by construction, no self-citation is load-bearing for any derivation, and the work is self-contained against external benchmarks. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

No mathematical free parameters or axioms are invoked; the central claim rests on the empirical observation of accuracy-latency trade-offs and the assumption that exposing tiers is feasible in production cloud infrastructure. Tolerance Tiers is an invented architectural concept without independent falsifiable evidence outside the paper.

invented entities (1)

Tolerance Tiers · no independent evidence
purpose: Expose distinct accuracy/responsiveness characteristics so consumers can programmatically select a tier
Introduced in the abstract as the recommended solution to the one-size-fits-all problem; no external validation or falsifiable prediction is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Tolerance Tiers ensembles multiple versions of a machine learning-based service... routing policies that dictate how... a service version will be used

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We evaluate our proposal on the CPU-based automatic speech recognition (ASR) engine and cutting-edge neural networks for image classification

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., “Deep speech 2: End-to- end speech recognition in english and mandarin,” arXiv preprint arXiv:1512.02595, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[2] [2]

The IBM 2015 English Conversational Telephone Speech Recognition System

G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, “The ibm 2015 english conversational telephone speech recognition system,” arXiv preprint arXiv:1505.05899, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[3] [3]

The microsoft 2016 conversational speech recognition system,

W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Y u, and G. Zweig, “The microsoft 2016 conversational speech recognition system,” arXiv preprint arXiv:1609.03528, 2016

work page arXiv 2016

[4] [4]

Parallelizing wfst speech decoders,

C. Mendis, J. Droppo, S. Maleki, M. Musuvathi, T. Mytkowicz, and G. Zweig, “Parallelizing wfst speech decoders,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5325–5329

work page 2016

[5] [5]

An ultra low-power hardware accelerator for automatic speech recognition,

R. Y azdani, A. Segura, J.-M. Arnau, and A. Gonzalez, “An ultra low-power hardware accelerator for automatic speech recognition,” 2016

work page 2016

[6] [6]

Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers,

J. Hauswald, M. A. Laurenzano, Y . Zhang, C. Li, A. Rovinski, A. Khurana, R. G. Dreslinski, T. Mudge, V . Petrucci, L. Tang et al., “Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers,” in ASPLOS, 2015

work page 2015

[7] [7]

Con- volutional, long short-term memory, fully connected deep neural networks,

T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Con- volutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584

work page 2015

[8] [8]

V oxforge,

“V oxforge,” http://www.voxforge.org/. [9] Y . LeCun, L. Bottou, Y . Bengio, and P . Haffner, “Gradient-based learning applied to document recogni- tion,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278– 2324, 1998

work page 1998

[9] [9]

ImageNet Large Scale Visual Recognition Challenge,

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015

work page 2015

[10] [10]

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

“Model zoo,” http://caffe.berkeleyvision.org/model zoo. html. [12] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,” arXiv:1602.07360

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Going Deeper with Convolutions

C. Szegedy, W. Liu, Y . Jia, P . Sermanet, S. Reed, D. Anguelov, D. Erhan, V . V anhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition (CVPR). [Online]. Available: http://arxiv.org/abs/1409.4842

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep resid- ual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [13]

Imagenet classiﬁcation with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁcation with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc. [Online]. Available: http://papers.nips.cc/paper/ 4824- imagenet-classiﬁcation-with-deep-convolu...

work page

[14] [14]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “V ery deep convolu- tional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[15] [15]

Google cloud platform,

“Google cloud platform,” https://cloud.google.com/ products/. [18] “Microsoft cognitive services,” https://www.microsoft. com/cogniti ve-services/en-us/apis. [19] “Ibm watson developer cloud,” https://www.ibm.com/ smarterplanet/us/en/ibmw atson/developercloud. [20] “Ibm bluemix pricing,” https://www.ibm.com/ cloud- computing/bluemix/pricing. [21] B. Efron...

work page 1992

[16] [16]

Ibm cloud services,

“Ibm cloud services,” https://www.ibm.com/ cloud- computing/. [23] “Amazon web services,” https://aws.amazon.com/. [24] “Docker,” https://www.docker.com/. [25] “Netﬂix zuul,” https://github.com/Netﬂix/zuul. [26] “Powered by netﬂix oss,” urlhttps://netﬂix.github.io/powered-by-netﬂix-oss.html

work page

[17] [17]

Mesos: A platform for ﬁne-grained resource sharing in the data center

“Nginx,” https://nginx.org/en/. [28] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, “Mesos: A platform for ﬁne-grained resource sharing in the data center.” in NSDI, 2011

work page 2011

[18] [18]

Quasar: Resource- Efﬁcient and QoS-Aware Cluster Management,

C. Delimitrou and C. Kozyrakis, “Quasar: Resource- Efﬁcient and QoS-Aware Cluster Management,” in AS- PLOS, 2014

work page 2014

[19] [19]

Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters,

C. Delimitrou, D. Sanchez, and C. Kozyrakis, “Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters,” in Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC), 2015

work page 2015

[20] [20]

Deconstructing amazon ec2 spot instance pricing,

O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster, and D. Tsafrir, “Deconstructing amazon ec2 spot instance pricing,” ACM Transactions on Economics and Computation, 2013

work page 2013

[21] [21]

HCloud: Resource-Efficient Provisioning in Shared Cloud Systems,

C. Delimitrou and C. Kozyrakis, “HCloud: Resource-Efficient Provisioning in Shared Cloud Systems,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016

work page 2016

[22] [22]

Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters,

C. Delimitrou and C. Kozyrakis, “Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013

work page 2013

[23] [23]

Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations,

J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa, “Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011

work page 2011

[24] [24]

Whare-map: Heterogeneity in “homogeneous” warehouse-scale computers,

J. Mars and L. Tang, “Whare-map: Heterogeneity in ‘homogeneous’ warehouse-scale computers,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2013

work page 2013

[25] [25]

Market mechanisms for managing datacenters with heterogeneous microarchitectures,

M. Guevara, B. Lubin, and B. C. Lee, “Market mechanisms for managing datacenters with heterogeneous microarchitectures,” ACM Transactions on Computer Systems (TOCS), 2014

work page 2014

[26] [26]

Power management of online data-intensive services,

D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch, “Power management of online data-intensive services,” in International Symposium on Computer Architecture, 2011

work page 2011

[27] [27]

Profiling a warehouse-scale computer,

S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a warehouse-scale computer,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015

work page 2015

[28] [28]

The mystery machine: End-to-end performance analysis of large-scale internet services,

M. Chow, D. Meisner, J. Flinn, D. Peek, and T. F. Wenisch, “The mystery machine: End-to-end performance analysis of large-scale internet services,” in OSDI, 2014

work page 2014

[29] [29]

Big/little deep neural network for ultra low power inference,

E. Park, D. Kim, S. Kim, Y.-D. Kim, G. Kim, S. Yoon, and S. Yoo, “Big/little deep neural network for ultra low power inference,” in Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015 International Conference on. IEEE, 2015, pp. 124–132

work page 2015

[30] [30]

Ensemble methods in machine learning,

T. G. Dietterich, “Ensemble methods in machine learning,” in International workshop on multiple classifier systems. Springer, 2000

work page 2000

[31] [31]

Branchynet: Fast inference via early exiting from deep neural networks,

S. Teerapittayanon, B. McDanel, and H. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” in Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016, pp. 2464–2469

work page 2016

[32] [32]

Conditional deep learning for energy-efficient and enhanced pattern recognition,

P. Panda, A. Sengupta, and K. Roy, “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016. IEEE, 2016, pp. 475–480

work page 2016

[33] [33]

Learning both weights and connections for efficient neural network,

S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143

work page 2015

[34] [34]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[35] [35]

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox

D. Crankshaw, P. Bailis, J. E. Gonzalez, H. Li, Z. Zhang, M. J. Franklin, A. Ghodsi, and M. I. Jordan, “The missing piece in complex analytics: Low latency, scalable model management and serving with velox,” CoRR, vol. abs/1409.3809, 2014. [Online]. Available: http://arxiv.org/abs/1409.3809

work page internal anchor Pith review Pith/arXiv arXiv 2014

[36] [36]

Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,

S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2016, pp. 123–136

work page 2016

[37] [37]

Clipper: A low-latency online prediction serving system,

D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, “Clipper: A low-latency online prediction serving system,” in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, 2017. [Online]. Available: https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw

work page 2017

[38] [38]

Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,

T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices

work page

[39] [39]

Minerva: Enabling low-power, highly-accurate deep neural network accelerators,

B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in International Symposium on Computer Architecture (ISCA), 2016

work page 2016

[40] [40]

Chisel: Reliability- and accuracy-aware optimization of approximate computational kernels,

S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard, “Chisel: Reliability- and accuracy-aware optimization of approximate computational kernels,” in ACM SIGPLAN Notices, vol. 49, no. 10. ACM, 2014, pp. 309–328

work page 2014

[41] [41]

Towards statistical guarantees in controlling quality tradeoffs for approximate acceleration,

D. Mahajan, A. Yazdanbakhsh, J. Park, B. Thwaites, and H. Esmaeilzadeh, “Towards statistical guarantees in controlling quality tradeoffs for approximate acceleration,” in Proceedings of the 43rd International Symposium on Computer Architecture, 2016

work page 2016

[42] [42]

Rumba: an online quality management system for approximate computing,

D. S. Khudia, B. Zamirai, M. Samadi, and S. Mahlke, “Rumba: an online quality management system for approximate computing,” in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM, 2015

work page 2015

[43] [43]

Input responsiveness: using canary inputs to dynamically steer approximation,

M. A. Laurenzano, P. Hill, M. Samadi, S. Mahlke, J. Mars, and L. Tang, “Input responsiveness: using canary inputs to dynamically steer approximation,” ACM SIGPLAN Notices, vol. 51, no. 6, pp. 161–176, 2016

work page 2016