pith. machine review for the scientific record.

arxiv: 1710.03740 · v3 · submitted 2017-10-10 · 💻 cs.AI · cs.LG · stat.ML

Recognition: 2 Lean theorem links

Mixed Precision Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 10:41 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · stat.ML
keywords mixed precision training · half precision · deep neural networks · loss scaling · memory reduction · FP16 · convolutional networks · recurrent networks

The pith

Deep neural networks can be trained in half precision using a full-precision weight master copy and loss scaling to achieve nearly 2x memory savings without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that storing weights, activations, and gradients in 16-bit floating point while keeping a 32-bit copy of the weights and scaling the loss allows training of large deep networks. This approach addresses the limited numerical range of half-precision numbers that would otherwise cause underflow or overflow in gradients. A sympathetic reader would care because it makes training bigger models feasible on current hardware and promises faster computation on specialized processors. The method is shown to work across convolutional, recurrent, and generative adversarial networks with over 100 million parameters.

Core claim

The central discovery is that mixed precision training with FP16 for most tensors and FP32 for weight accumulation, combined with dynamic loss scaling, enables training deep neural networks to the same accuracy as full precision while reducing memory footprint by almost half. The single-precision master weights prevent rounding errors from accumulating in updates, and loss scaling keeps small gradient values representable in FP16.

What carries the argument

Loss scaling combined with a single-precision master copy of the weights to handle the limited range of half-precision floating point.
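
As a concreteness check, here is a minimal NumPy sketch of that machinery on a toy least-squares problem: FP16 forward pass and gradients, an FP32 master copy of the weights, and a fixed loss scale applied before the backward computation. The objective, the scale value, and the learning rate are illustrative assumptions, not details taken from the paper.

    import numpy as np

    # Hypothetical minimal sketch: mixed-precision training steps on a toy
    # linear least-squares problem. FP16 carries the forward/backward pass,
    # FP32 holds the master weights, and the loss is scaled before the
    # backward computation so small gradient values stay representable.

    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 32)).astype(np.float16)   # FP16 activations
    y = rng.standard_normal(64).astype(np.float16)         # FP16 targets

    master_w = np.zeros(32, dtype=np.float32)               # FP32 master weights
    loss_scale = np.float16(1024.0)                         # illustrative scale
    lr = 0.01                                               # illustrative step size

    for step in range(200):
        w16 = master_w.astype(np.float16)                   # round master copy to FP16
        err = x @ w16 - y                                   # FP16 forward pass
        # Gradient of the scaled loss  loss_scale * 0.5/N * ||x w - y||^2,
        # computed entirely in FP16.
        grad16 = (x.T @ (loss_scale * err)) / np.float16(len(y))
        grad32 = grad16.astype(np.float32) / float(loss_scale)  # unscale in FP32
        master_w -= lr * grad32                              # FP32 weight update

    print("final loss:", float(np.mean((x @ master_w.astype(np.float16) - y) ** 2)))

The FP32 accumulation is what keeps many small per-step updates from being lost to FP16 rounding; the same structure carries over unchanged when a real optimizer replaces the plain gradient step.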

Load-bearing premise

An appropriately chosen loss-scaling factor, together with the FP32 master weight copy, prevents numerical issues in FP16 across many different models and datasets, without per-model hyperparameter retuning and without loss of accuracy.

What would settle it

Observing that, for some standard model and dataset, the mixed-precision version diverges or reaches lower accuracy than the FP32 baseline even after tuning the loss-scale factor.

read the original abstract

Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increases. We introduce a technique to train deep neural networks using half precision floating point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating numbers have limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. Firstly, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training. Secondly, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models including convolution neural networks, recurrent neural networks and generative adversarial networks. This technique works for large scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. In future processors, we can also expect a significant computation speedup using half-precision hardware units.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a mixed-precision training technique in which weights, activations, and gradients are stored in IEEE half-precision (FP16) format. An FP32 master copy of the weights is maintained and updated after each optimizer step, with the FP16 copy obtained by rounding. Loss scaling is applied to the loss before back-propagation to avoid underflow in FP16 gradients; the scale is adjusted dynamically on detection of Inf/NaN values. The authors claim that this combination preserves final accuracy while reducing memory consumption by nearly 2x, and they demonstrate the method on CNNs, RNNs, and GANs, including models exceeding 100 million parameters trained on large datasets.
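
For readers who want the adjustment rule spelled out, the following is a hedged sketch of an Inf/NaN-driven scaler of the kind summarized above; the initial scale, growth interval, and growth/backoff factors are illustrative defaults, not values asserted by this review.

    import numpy as np

    # Hedged sketch of an Inf/NaN-driven loss scaler of the kind summarized
    # above: skip the update and shrink the scale when a non-finite gradient
    # appears, grow the scale again after a run of clean steps. The initial
    # scale, growth interval, and factors are illustrative defaults.

    class DynamicLossScaler:
        def __init__(self, init_scale=2.0 ** 15, growth_interval=2000,
                     growth_factor=2.0, backoff_factor=0.5):
            self.scale = init_scale
            self.growth_interval = growth_interval
            self.growth_factor = growth_factor
            self.backoff_factor = backoff_factor
            self._clean_steps = 0

        def unscale(self, scaled_grads):
            """Return FP32 unscaled gradients, or None if the step must be skipped."""
            if any(not np.all(np.isfinite(g)) for g in scaled_grads):
                self.scale *= self.backoff_factor        # overflow seen: back off
                self._clean_steps = 0
                return None                              # caller skips this update
            self._clean_steps += 1
            unscaled = [g.astype(np.float32) / self.scale for g in scaled_grads]
            if self._clean_steps >= self.growth_interval:
                self.scale *= self.growth_factor         # long clean run: grow scale
                self._clean_steps = 0
            return unscaled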

Significance. If the reported accuracy results hold, the work is significant for enabling larger models or bigger batch sizes on existing hardware by halving memory footprint. The provision of an explicit dynamic loss-scaling algorithm and its empirical validation across three distinct architecture families constitute reproducible engineering contributions that have influenced subsequent practice in the field. The anticipation of FP16 hardware speedups is also noted as a forward-looking aspect.

minor comments (2)
  1. [Abstract] The phrase 'scaling the loss appropriately' is used without indicating that a dynamic adjustment procedure is supplied later in the text; a brief parenthetical reference to the Inf/NaN-based update rule would improve immediate clarity.
  2. The manuscript would benefit from explicit mention of whether multiple random seeds or error bars accompany the accuracy numbers, even if the central claim of 'matching accuracy' is already supported by the reported tables.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment. The review correctly identifies the core contributions of maintaining FP32 master weights, applying dynamic loss scaling, and validating the approach across CNNs, RNNs, and GANs while achieving nearly 2x memory reduction.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical engineering technique for mixed-precision training (FP16 weights/activations/gradients with FP32 master copy and dynamic loss scaling) and validates it experimentally across CNNs, RNNs, GANs and >100M-parameter models. No derivation chain, equations, fitted parameters, or self-citations are present that reduce any claim to its own inputs by construction. The central results rest on external experimental outcomes rather than internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical observation that FP16 underflow can be mitigated by loss scaling and master weights; no new physical or mathematical entities are postulated.

free parameters (1)
  • loss scaling factor
    A multiplier applied to the loss before back-propagation whose value must be chosen to keep FP16 gradients in representable range; the abstract states it is set 'appropriately' but gives no automatic selection rule.
axioms (1)
  • domain assumption IEEE half-precision numbers have limited dynamic range that can cause gradient underflow during training
    Explicitly invoked in the abstract as the reason the two mitigation techniques are needed.
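
A two-line numerical check of that domain assumption, with an example gradient value and scale factor chosen purely for illustration:

    import numpy as np

    # Illustrative check of the assumption: a gradient value that underflows
    # to zero in FP16 becomes representable once the loss (and hence the
    # gradient) is multiplied by a scale factor. Numbers are examples only.

    g = 2.0e-8                        # a small gradient value
    scale = 1024.0                    # example loss-scale factor

    print(np.float16(g))              # 0.0 -> underflows in FP16
    print(np.float16(g * scale))      # ~2.05e-05 -> representable (subnormal) in FP16
    print(np.finfo(np.float16).tiny)  # smallest normal FP16 magnitude, ~6.1e-05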

pith-pipeline@v0.9.0 · 5541 in / 1319 out tokens · 61105 ms · 2026-05-12T10:41:30.409344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    We introduce a technique to train deep neural networks using half precision floating point numbers... maintain a single-precision copy of the weights that accumulates the gradients... scaling the loss appropriately to handle the loss of information with half-precision gradients... reduce the memory consumption of deep learning models by nearly 2x.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem:

    This technique works for large scale models with more than 100 million parameters trained on large datasets.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  2. TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines

    cs.AR 2026-05 unverdicted novelty 7.0

    TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.

  3. Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods

    cs.CE 2026-04 unverdicted novelty 7.0

    Mass matrix assembly for implicit PIC methods can be exactly reformulated cell-by-cell as tensor-core matrix products, delivering up to 3x kernel speedup and 15% end-to-end runtime reduction in ECSIM simulations.

  4. From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU

    cs.AR 2026-04 unverdicted novelty 7.0

    A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.

  5. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  6. Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

    cs.CR 2026-04 unverdicted novelty 7.0

    Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.

  7. Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

    cs.AI 2026-04 unverdicted novelty 7.0

    PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

  8. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  9. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  10. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    cs.CV 2021-12 accept novelty 7.0

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  11. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  12. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  13. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

  14. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

  15. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  16. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  17. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  18. Training Time Prediction for Mixed Precision-based Distributed Training

    cs.LG 2026-04 unverdicted novelty 6.0

    A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.

  19. The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.

  20. SHARE: Social-Humanities AI for Research and Education

    cs.CL 2026-04 unverdicted novelty 6.0

    SHARE models are the first causal LMs pretrained exclusively for SSH and match general models like Phi-4 on SSH texts despite using 100 times fewer tokens, paired with a non-generative MIRROR interface to support scho...

  21. LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training

    cs.AR 2026-04 unverdicted novelty 6.0

    LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.

  22. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  23. FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

    cs.LG 2026-04 unverdicted novelty 6.0

    FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...

  24. Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.

  25. M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

    cs.LG 2026-03 unverdicted novelty 6.0

    M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

  26. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    cs.DC 2023-04 unverdicted novelty 6.0

    PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.

  27. FP8 Formats for Deep Learning

    cs.LG 2022-09 unverdicted novelty 6.0

    FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.

  28. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  29. Linformer: Self-Attention with Linear Complexity

    cs.LG 2020-06 conditional novelty 6.0

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.

  30. Probing Routing-Conditional Calibration in Attention-Residual Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.

  31. Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

    cs.CV 2026-05 unverdicted novelty 5.0

    Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

  32. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

  33. PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs

    cs.LG 2026-04 accept novelty 5.0

    PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybri...

  34. BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging

    eess.IV 2026-04 unverdicted novelty 5.0

    BAAI Cardiac Agent automates end-to-end cardiac MRI analysis for seven cardiovascular diseases, achieving AUC >0.93 internally and >0.81 externally with high correlation to expert measurements.

  35. Assessing Performance and Porting Strategies for Gravitational N-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™

    cs.DC 2026-05 unverdicted novelty 4.0

    Three scaling strategies for an N-body code on Tenstorrent Wormhole accelerators are compared via execution time and energy measurements, identifying the configuration with the best efficiency-performance balance.

  36. CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 35 Pith papers · 4 internal anchors

  1. [1]

    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

    D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of The 33rd International Conference on Machine Learning, pages 173--182, 2016

  2. [2]

    K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  3. [3]

    BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations

    M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3123--3131. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5647-binaryco...

  4. [4]

    Faster R-CNN GitHub repository

    R. Girshick. Faster r-cnn github repository. https://github.com/rbgirshick/py-faster-rcnn

  5. [5]

    Tensorflow tutorial: Sequence-to-sequence models

    Google. Tensorflow tutorial: Sequence-to-sequence models. URL https://www.tensorflow.org/tutorials/seq2seq

  6. [6]

    Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369--376. ACM, 2006

  7. [7]

    Deep Learning with Limited Numerical Precision

    S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737--1746, 2015

  8. [8]

    Deep Speech: Scaling up End-to-End Speech Recognition

    A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014

  9. [9]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016 a

  10. [10]

    K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016 b

  11. [11]

    Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176, 2016 c

  12. [12]

    Long Short-Term Memory

    S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735--1780, Nov. 1997. ISSN 0899-7667. doi:10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735

  13. [13]

    Binarized Neural Networks

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107--4115, 2016 a

  14. [14]

    Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

    I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016 b

  15. [15]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448--456. JMLR.org, 2015. URL http://dblp.uni-trier.de/db/conf/icml/icml2015.html#IoffeS15

  16. [16]

    Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014

  17. [17]

    Exploring the Limits of Language Modeling

    R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling, 2016. URL https://arxiv.org/pdf/1602.02410.pdf

  18. [18]

    ImageNet Classification with Deep Convolutional Neural Networks

    A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097--1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-de...

  19. [19]

    W. Liu. Ssd github repository. https://github.com/weiliu89/caffe/tree/ssd

  20. [20]

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. Ssd: Single shot multibox detector. CoRR, abs/1512.02325, 2015 a . URL http://dblp.uni-trier.de/db/journals/corr/corr1512.html#LiuAESR15

  21. [21]

    Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015 b

  22. [22]

    WRPN: Wide Reduced-Precision Networks

    A. Mishra, E. Nurvitadhi, J. Cook, and D. Marr. Wrpn: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017

  23. [23]

    Nvidia tesla v100 gpu architecture

    NVIDIA. Nvidia tesla v100 gpu architecture. https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf, 2017

  24. [24]

    J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902, 2016

  25. [25]

    Automatic Differentiation in PyTorch

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017

  26. [26]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://dblp.uni-trier.de/db/journals/corr/corr1511.html#RadfordMC15

  27. [27]

    XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

    M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, pages 525--542. Springer International Publishing, Cham, 2016. ISBN 978-3-319-46493-0. doi:10.1007/978-3-319-46493-0_32. URL https://doi.org/10.1007/978-3-319-46493-0_32

  28. [28]

    S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015

  29. [29]

    ImageNet Large Scale Visual Recognition Challenge

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211--252, 2015. doi:10.1007/s11263-015-0816-y

  30. [30]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  31. [31]

    Going Deeper with Convolutions

    C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015. URL http://arxiv.org/abs/1409.4842

  32. [32]

    Rethinking the Inception Architecture for Computer Vision

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  33. [33]

    Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

  34. [34]

    S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160