pith. machine review for the scientific record.

arxiv: 1710.03740 · v3 · submitted 2017-10-10 · 💻 cs.AI · cs.LG · stat.ML

Recognition: 2 Lean theorem links

Mixed Precision Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 10:41 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · stat.ML
keywords mixed precision training · half precision · deep neural networks · loss scaling · memory reduction · FP16 · convolutional networks · recurrent networks

The pith

Deep neural networks can be trained in half precision using a full-precision weight master copy and loss scaling to achieve nearly 2x memory savings without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that storing weights, activations, and gradients in 16-bit floating point while keeping a 32-bit copy of the weights and scaling the loss allows training of large deep networks. This approach addresses the limited numerical range of half-precision numbers that would otherwise cause underflow or overflow in gradients. A sympathetic reader would care because it makes training bigger models feasible on current hardware and promises faster computation on specialized processors. The method is shown to work across convolutional, recurrent, and generative adversarial networks with over 100 million parameters.

Core claim

The central discovery is that mixed precision training with FP16 for most tensors and FP32 for weight accumulation, combined with dynamic loss scaling, enables training deep neural networks to the same accuracy as full precision while reducing memory footprint by almost half. The single-precision master weights prevent rounding errors from accumulating in updates, and loss scaling keeps small gradient values representable in FP16.

What carries the argument

Loss scaling combined with a single-precision master copy of the weights to handle the limited range of half-precision floating point.
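
As a concreteness check, here is a minimal NumPy sketch of that machinery on a toy least-squares problem: FP16 forward pass and gradients, an FP32 master copy of the weights, and a fixed loss scale applied before the backward computation. The objective, the scale value, and the learning rate are illustrative assumptions, not details taken from the paper.

    import numpy as np

    # Hypothetical minimal sketch: mixed-precision training steps on a toy
    # linear least-squares problem. FP16 carries the forward/backward pass,
    # FP32 holds the master weights, and the loss is scaled before the
    # backward computation so small gradient values stay representable.

    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 32)).astype(np.float16)   # FP16 activations
    y = rng.standard_normal(64).astype(np.float16)         # FP16 targets

    master_w = np.zeros(32, dtype=np.float32)               # FP32 master weights
    loss_scale = np.float16(1024.0)                         # illustrative scale
    lr = 0.01                                               # illustrative step size

    for step in range(200):
        w16 = master_w.astype(np.float16)                   # round master copy to FP16
        err = x @ w16 - y                                   # FP16 forward pass
        # Gradient of the scaled loss  loss_scale * 0.5/N * ||x w - y||^2,
        # computed entirely in FP16.
        grad16 = (x.T @ (loss_scale * err)) / np.float16(len(y))
        grad32 = grad16.astype(np.float32) / float(loss_scale)  # unscale in FP32
        master_w -= lr * grad32                              # FP32 weight update

    print("final loss:", float(np.mean((x @ master_w.astype(np.float16) - y) ** 2)))

The FP32 accumulation is what keeps many small per-step updates from being lost to FP16 rounding; the same structure carries over unchanged when a real optimizer replaces the plain gradient step.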

Load-bearing premise

An appropriately chosen loss-scaling factor, together with the FP32 master weight copy, prevents numerical issues in FP16 across many different models and datasets, without per-model hyperparameter retuning and without loss of accuracy.

What would settle it

Observing that, for some standard model and dataset, the mixed-precision version diverges or reaches lower accuracy than the FP32 baseline even after tuning the loss-scale factor.

read the original abstract

Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increases. We introduce a technique to train deep neural networks using half precision floating point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating numbers have limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. Firstly, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training. Secondly, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models including convolution neural networks, recurrent neural networks and generative adversarial networks. This technique works for large scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. In future processors, we can also expect a significant computation speedup using half-precision hardware units.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a mixed-precision training technique in which weights, activations, and gradients are stored in IEEE half-precision (FP16) format. An FP32 master copy of the weights is maintained and updated after each optimizer step, with the FP16 copy obtained by rounding. Loss scaling is applied to the loss before back-propagation to avoid underflow in FP16 gradients; the scale is adjusted dynamically on detection of Inf/NaN values. The authors claim that this combination preserves final accuracy while reducing memory consumption by nearly 2x, and they demonstrate the method on CNNs, RNNs, and GANs, including models exceeding 100 million parameters trained on large datasets.
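
For readers who want the adjustment rule spelled out, the following is a hedged sketch of an Inf/NaN-driven scaler of the kind summarized above; the initial scale, growth interval, and growth/backoff factors are illustrative defaults, not values asserted by this review.

    import numpy as np

    # Hedged sketch of an Inf/NaN-driven loss scaler of the kind summarized
    # above: skip the update and shrink the scale when a non-finite gradient
    # appears, grow the scale again after a run of clean steps. The initial
    # scale, growth interval, and factors are illustrative defaults.

    class DynamicLossScaler:
        def __init__(self, init_scale=2.0 ** 15, growth_interval=2000,
                     growth_factor=2.0, backoff_factor=0.5):
            self.scale = init_scale
            self.growth_interval = growth_interval
            self.growth_factor = growth_factor
            self.backoff_factor = backoff_factor
            self._clean_steps = 0

        def unscale(self, scaled_grads):
            """Return FP32 unscaled gradients, or None if the step must be skipped."""
            if any(not np.all(np.isfinite(g)) for g in scaled_grads):
                self.scale *= self.backoff_factor        # overflow seen: back off
                self._clean_steps = 0
                return None                              # caller skips this update
            self._clean_steps += 1
            unscaled = [g.astype(np.float32) / self.scale for g in scaled_grads]
            if self._clean_steps >= self.growth_interval:
                self.scale *= self.growth_factor         # long clean run: grow scale
                self._clean_steps = 0
            return unscaled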

Significance. If the reported accuracy results hold, the work is significant for enabling larger models or bigger batch sizes on existing hardware by halving memory footprint. The provision of an explicit dynamic loss-scaling algorithm and its empirical validation across three distinct architecture families constitute reproducible engineering contributions that have influenced subsequent practice in the field. The anticipation of FP16 hardware speedups is also noted as a forward-looking aspect.

minor comments (2)
  1. [Abstract] The phrase 'scaling the loss appropriately' is used without indicating that a dynamic adjustment procedure is supplied later in the text; a brief parenthetical reference to the Inf/NaN-based update rule would improve immediate clarity.
  2. The manuscript would benefit from explicit mention of whether multiple random seeds or error bars accompany the accuracy numbers, even if the central claim of 'matching accuracy' is already supported by the reported tables.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment. The review correctly identifies the core contributions of maintaining FP32 master weights, applying dynamic loss scaling, and validating the approach across CNNs, RNNs, and GANs while achieving nearly 2x memory reduction.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical engineering technique for mixed-precision training (FP16 weights/activations/gradients with FP32 master copy and dynamic loss scaling) and validates it experimentally across CNNs, RNNs, GANs and >100M-parameter models. No derivation chain, equations, fitted parameters, or self-citations are present that reduce any claim to its own inputs by construction. The central results rest on external experimental outcomes rather than internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical observation that FP16 underflow can be mitigated by loss scaling and master weights; no new physical or mathematical entities are postulated.

free parameters (1)
  • loss scaling factor
    A multiplier applied to the loss before back-propagation whose value must be chosen to keep FP16 gradients in representable range; the abstract states it is set 'appropriately' but gives no automatic selection rule.
axioms (1)
  • domain assumption IEEE half-precision numbers have limited dynamic range that can cause gradient underflow during training
    Explicitly invoked in the abstract as the reason the two mitigation techniques are needed.
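
A two-line numerical check of that domain assumption, with an example gradient value and scale factor chosen purely for illustration:

    import numpy as np

    # Illustrative check of the assumption: a gradient value that underflows
    # to zero in FP16 becomes representable once the loss (and hence the
    # gradient) is multiplied by a scale factor. Numbers are examples only.

    g = 2.0e-8                        # a small gradient value
    scale = 1024.0                    # example loss-scale factor

    print(np.float16(g))              # 0.0 -> underflows in FP16
    print(np.float16(g * scale))      # ~2.05e-05 -> representable (subnormal) in FP16
    print(np.finfo(np.float16).tiny)  # smallest normal FP16 magnitude, ~6.1e-05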

pith-pipeline@v0.9.0 · 5541 in / 1319 out tokens · 61105 ms · 2026-05-12T10:41:30.409344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    We introduce a technique to train deep neural networks using half precision floating point numbers... maintain a single-precision copy of the weights that accumulates the gradients... scaling the loss appropriately to handle the loss of information with half-precision gradients... reduce the memory consumption of deep learning models by nearly 2x.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem:

    This technique works for large scale models with more than 100 million parameters trained on large datasets.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  2. TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines

    cs.AR 2026-05 unverdicted novelty 7.0

    TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.

  3. Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods

    cs.CE 2026-04 unverdicted novelty 7.0

    Mass matrix assembly for implicit PIC methods can be exactly reformulated cell-by-cell as tensor-core matrix products, delivering up to 3x kernel speedup and 15% end-to-end runtime reduction in ECSIM simulations.

  4. From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU

    cs.AR 2026-04 unverdicted novelty 7.0

    A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.

  5. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  6. Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

    cs.CR 2026-04 unverdicted novelty 7.0

    Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.

  7. Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

    cs.AI 2026-04 unverdicted novelty 7.0

    PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

  8. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  9. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  10. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    cs.CV 2021-12 accept novelty 7.0

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  11. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  12. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  13. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

  14. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

  15. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  16. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  17. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  18. Training Time Prediction for Mixed Precision-based Distributed Training

    cs.LG 2026-04 unverdicted novelty 6.0

    A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.

  19. The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.

  20. SHARE: Social-Humanities AI for Research and Education

    cs.CL 2026-04 unverdicted novelty 6.0

    SHARE models are the first causal LMs pretrained exclusively for SSH and match general models like Phi-4 on SSH texts despite using 100 times fewer tokens, paired with a non-generative MIRROR interface to support scho...

  21. LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training

    cs.AR 2026-04 unverdicted novelty 6.0

    LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.

  22. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  23. FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

    cs.LG 2026-04 unverdicted novelty 6.0

    FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...

  24. Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.

  25. M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

    cs.LG 2026-03 unverdicted novelty 6.0

    M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

  26. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    cs.DC 2023-04 unverdicted novelty 6.0

    PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.

  27. FP8 Formats for Deep Learning

    cs.LG 2022-09 unverdicted novelty 6.0

    FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.

  28. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  29. Linformer: Self-Attention with Linear Complexity

    cs.LG 2020-06 conditional novelty 6.0

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.

  30. Probing Routing-Conditional Calibration in Attention-Residual Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.

  31. Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

    cs.CV 2026-05 unverdicted novelty 5.0

    Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

  32. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

  33. PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs

    cs.LG 2026-04 accept novelty 5.0

    PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybri...

  34. BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging

    eess.IV 2026-04 unverdicted novelty 5.0

    BAAI Cardiac Agent automates end-to-end cardiac MRI analysis for seven cardiovascular diseases, achieving AUC >0.93 internally and >0.81 externally with high correlation to expert measurements.

  35. Assessing Performance and Porting Strategies for Gravitational N-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™

    cs.DC 2026-05 unverdicted novelty 4.0

    Three scaling strategies for an N-body code on Tenstorrent Wormhole accelerators are compared via execution time and energy measurements, identifying the configuration with the best efficiency-performance balance.

  36. CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 35 Pith papers · 4 internal anchors

  1. [1]

    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

    D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of The 33rd International Conference on Machine Learning, pages 173--182, 2016

  2. [2]

    K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  3. [3]

    BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations

    M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3123--3131. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5647-binaryco...

  4. [4]

    Faster R-CNN GitHub repository

    R. Girshick. Faster r-cnn github repository. https://github.com/rbgirshick/py-faster-rcnn

  5. [5]

    Tensorflow tutorial: Sequence-to-sequence models

    Google. Tensorflow tutorial: Sequence-to-sequence models. URL https://www.tensorflow.org/tutorials/seq2seq

  6. [6]

    Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369--376. ACM, 2006

  7. [7]

    Deep Learning with Limited Numerical Precision

    S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737--1746, 2015

  8. [8]

    Deep Speech: Scaling up End-to-End Speech Recognition

    A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014

  9. [9]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016 a

  10. [10]

    K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016 b

  11. [11]

    Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176, 2016 c

  12. [12]

    Long Short-Term Memory

    S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735--1780, Nov. 1997. ISSN 0899-7667. doi:10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735

  13. [13]

    Binarized Neural Networks

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107--4115, 2016 a

  14. [14]

    Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016 b

  15. [15]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448--456. JMLR.org, 2015. URL http://dblp.uni-trier.de/db/conf/icml/icml2015.html#IoffeS15

  16. [16]

    Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014

  17. [17]

    Exploring the Limits of Language Modeling

    R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling, 2016. URL https://arxiv.org/pdf/1602.02410.pdf

  18. [18]

    ImageNet Classification with Deep Convolutional Neural Networks

    A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097--1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-de...

  19. [19]

    W. Liu. Ssd github repository. https://github.com/weiliu89/caffe/tree/ssd

  20. [20]

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. Ssd: Single shot multibox detector. CoRR, abs/1512.02325, 2015 a . URL http://dblp.uni-trier.de/db/journals/corr/corr1512.html#LiuAESR15

  21. [21]

    Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015 b

  22. [22]

    WRPN: Wide Reduced-Precision Networks

    A. Mishra, E. Nurvitadhi, J. Cook, and D. Marr. Wrpn: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017

  23. [23]

    Nvidia tesla v100 gpu architecture

    NVIDIA. Nvidia tesla v100 gpu architecture. https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf, 2017

  24. [24]

    J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902, 2016

  25. [25]

    Automatic Differentiation in PyTorch

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017

  26. [26]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://dblp.uni-trier.de/db/journals/corr/corr1511.html#RadfordMC15

  27. [27]

    XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

    M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, pages 525--542. Springer International Publishing, Cham, 2016. ISBN 978-3-319-46493-0. doi:10.1007/978-3-319-46493-0_32. URL https://doi.org/10.1007/978-3-319-46493-0_32

  28. [28]

    S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015

  29. [29]

    ImageNet Large Scale Visual Recognition Challenge

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211--252, 2015. doi:10.1007/s11263-015-0816-y

  30. [30]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  31. [31]

    Going Deeper with Convolutions

    C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015. URL http://arxiv.org/abs/1409.4842

  32. [32]

    Rethinking the Inception Architecture for Computer Vision

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  33. [33]

    Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

  34. [34]

    S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160