{"work":{"id":"01efb355-7c12-4b89-bc42-91ee46ee276b","openalex_id":null,"doi":null,"arxiv_id":"1609.04836","raw_key":null,"title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima","authors":null,"authors_text":"Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang","year":2016,"venue":"cs.LG","abstract":"The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. \nWe discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.","external_url":"https://arxiv.org/abs/1609.04836","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T18:56:08.830155+00:00","pith_arxiv_id":"1609.04836","created_at":"2026-05-08T20:09:09.818129+00:00","updated_at":"2026-05-25T18:56:08.830155+00:00","title_quality_ok":true,"display_title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima","render_title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima"},"hub":{"state":{"work_id":"01efb355-7c12-4b89-bc42-91ee46ee276b","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":42,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2019-04-01T16:53:35+00:00","last_pith_cited_at":"2026-05-20T10:23:03+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-02T19:55:00.693912+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":8}],"polarity_counts":[{"context_polarity":"background","n":8}],"runs":{},"summary":{},"graph":{},"authors":[]}}