hub Mixed citations

Regularizing Neural Networks by Penalizing Confident Output Distributions

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton · 2017 · cs.NE · arXiv 1701.06548

Mixed citation behavior. Most common role is background (60%).

15 Pith papers citing it

Background 60% of classified citations

abstract

We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
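
The two regularizers named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single example with softmax outputs, a uniform prior u over K classes, and hypothetical weights `beta` and `eps` that do not come from the paper. Under those assumptions the confidence penalty subtracts a multiple of the output entropy from the negative log-likelihood, which amounts to adding KL(p || u) up to a constant, while label smoothing amounts to adding KL(u || p) up to a constant and a rescaling. That reversal is the "direction of the KL divergence" the abstract refers to.

```python
import math

def softmax(logits):
    # Shift by the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy H(p) in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl(p, q):
    # KL(p || q) in nats; q is assumed strictly positive.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def nll(p, target):
    # Negative log-likelihood of the correct class.
    return -math.log(p[target])

def confidence_penalty_loss(logits, target, beta=0.1):
    # NLL minus beta * H(p): low-entropy (confident) outputs are penalized.
    # beta=0.1 is an illustrative value, not one taken from the paper.
    p = softmax(logits)
    return nll(p, target) - beta * entropy(p)

def label_smoothing_loss(logits, target, eps=0.1):
    # Cross-entropy against a target mixed with the uniform distribution.
    # eps=0.1 is an illustrative value, not one taken from the paper.
    p = softmax(logits)
    k = len(p)
    smoothed = [(1.0 - eps) * (1.0 if i == target else 0.0) + eps / k
                for i in range(k)]
    return -sum(t * math.log(pi) for t, pi in zip(smoothed, p))
```

With `beta=0` or `eps=0` both functions reduce to the plain negative log-likelihood, so either regularizer can be switched off without touching the rest of a training loop.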

citation-role summary

background: 3 · method: 2

citation-polarity summary

background: 3 · use method: 2

representative citing papers

Genetic Programming with Transformer-Based Mutation for Approximate Circuit Design

cs.NE · 2026-05-20 · unverdicted · novelty 7.0

A hybrid CGP scheme with a transformer mutation operator evolves approximate multipliers that achieve better error-power trade-offs than the EvoApproxLib library for several target constraints.

DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series

cs.LG · 2026-05-11 · unverdicted · novelty 7.0 · 3 refs

DeepLévy learns mixtures of Lévy stable distributions for heavy-tailed time series forecasting by minimizing discrepancies between empirical and parametric characteristic functions, outperforming prior methods on tail risk metrics under extreme volatility.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

cs.CL · 2019-10-29 · accept · novelty 7.0

BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

A Ridge Too Far: Correcting Over-Shrinkage via Negative Regularization

cs.LG · 2025-08-24 · unverdicted · novelty 6.0

Negative-capable ridge regression uses controlled negative regularization as anti-shrinkage to increase effective complexity along weak eigendirections and mitigate underfitting in small-data regression.

Unsupervised Domain Adaptation via Calibrating Uncertainties

cs.LG · 2019-07-25 · unverdicted · novelty 6.0

A new regularization approach for unsupervised domain adaptation that calibrates Renyi entropy of uncertainties estimated via variational Bayes.

AugLabel: Exploiting Word Representations to Augment Labels for Face Attribute Classification

cs.CV · 2019-07-15 · unverdicted · novelty 6.0

Augmenting face attribute labels with word2vec embeddings improves deep classifier performance on CelebA and LFWA and reaches comparable accuracy with 50% less labeled data.

Annotations Mitigate Post-Training Mode Collapse

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

Can LLMs Learn to Reason Robustly under Noisy Supervision?

cs.LG · 2026-04-05 · conditional · novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.

UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts

cs.LG · 2026-05-17 · unverdicted · novelty 5.0

UniAlign improves robustness of deep learning NTC models under distribution shifts via domain alignment fine-tuning and stable ensembling, yielding 2.51% accuracy and 2.71% F1 gains over standard training on three public datasets.

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

cs.LG · 2026-02-13 · unverdicted · novelty 5.0

CUD reshapes the teacher's predictive distribution before distillation so that students receive calibrated uncertainty signals alongside accuracy, yielding more robust and better-calibrated models on high-cardinality and distribution-shift benchmarks.

Condensation Transition in Entropy-Constrained Probability Spaces

cond-mat.stat-mech · 2026-05-09 · unverdicted · novelty 5.0

Below a critical entropy H_c ≈ log K - 1 + γ in the large-K limit, the typical fixed-entropy distribution on the probability simplex condenses so that one component holds a macroscopic probability fraction while the rest form a uniform background.

A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

cs.CV · 2026-04-06 · unverdicted · novelty 5.0

A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to preserve benign generation quality.

Non-Intrusive Automatic Speech Recognition Refinement: A Survey

eess.AS · 2025-08-10 · accept · novelty 4.0

A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

eess.AS · 2019-07-13 · unverdicted · novelty 4.0

Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.

Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction

cs.CL · 2019-07-06 · unverdicted · novelty 3.0

A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.

citing papers explorer

Showing 15 of 15 citing papers.

Genetic Programming with Transformer-Based Mutation for Approximate Circuit Design cs.NE · 2026-05-20 · unverdicted · none · ref 26 · internal anchor
A hybrid CGP scheme with a transformer mutation operator evolves approximate multipliers that achieve better error-power trade-offs than the EvoApproxLib library for several target constraints.
DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series cs.LG · 2026-05-11 · unverdicted · none · ref 23 · 3 links · internal anchor
DeepLévy learns mixtures of Lévy stable distributions for heavy-tailed time series forecasting by minimizing discrepancies between empirical and parametric characteristic functions, outperforming prior methods on tail risk metrics under extreme volatility.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension cs.CL · 2019-10-29 · accept · none · ref 16
BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
A Ridge Too Far: Correcting Over-Shrinkage via Negative Regularization cs.LG · 2025-08-24 · unverdicted · none · ref 29 · internal anchor
Negative-capable ridge regression uses controlled negative regularization as anti-shrinkage to increase effective complexity along weak eigendirections and mitigate underfitting in small-data regression.
Unsupervised Domain Adaptation via Calibrating Uncertainties cs.LG · 2019-07-25 · unverdicted · none · ref 24 · internal anchor
A new regularization approach for unsupervised domain adaptation that calibrates Renyi entropy of uncertainties estimated via variational Bayes.
AugLabel: Exploiting Word Representations to Augment Labels for Face Attribute Classification cs.CV · 2019-07-15 · unverdicted · none · ref 24 · internal anchor
Augmenting face attribute labels with word2vec embeddings improves deep classifier performance on CelebA and LFWA and reaches comparable accuracy with 50% less labeled data.
Annotations Mitigate Post-Training Mode Collapse cs.CL · 2026-05-11 · unverdicted · none · ref 37
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
Can LLMs Learn to Reason Robustly under Noisy Supervision? cs.LG · 2026-04-05 · conditional · none · ref 20
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts cs.LG · 2026-05-17 · unverdicted · none · ref 44 · internal anchor
UniAlign improves robustness of deep learning NTC models under distribution shifts via domain alignment fine-tuning and stable ensembling, yielding 2.51% accuracy and 2.71% F1 gains over standard training on three public datasets.
Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty cs.LG · 2026-02-13 · unverdicted · none · ref 13 · internal anchor
CUD reshapes the teacher's predictive distribution before distillation so that students receive calibrated uncertainty signals alongside accuracy, yielding more robust and better-calibrated models on high-cardinality and distribution-shift benchmarks.
Condensation Transition in Entropy-Constrained Probability Spaces cond-mat.stat-mech · 2026-05-09 · unverdicted · none · ref 26
Below a critical entropy H_c ≈ log K - 1 + γ in the large-K limit, the typical fixed-entropy distribution on the probability simplex condenses so that one component holds a macroscopic probability fraction while the rest form a uniform background.
A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models cs.CV · 2026-04-06 · unverdicted · none · ref 79
A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to preserve benign generation quality.
Non-Intrusive Automatic Speech Recognition Refinement: A Survey eess.AS · 2025-08-10 · accept · none · ref 112 · internal anchor
A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.
Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition eess.AS · 2019-07-13 · unverdicted · none · ref 24 · internal anchor
Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.
Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction cs.CL · 2019-07-06 · unverdicted · none · ref 9 · internal anchor
A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.
