mixup: Beyond Empirical Risk Minimization

David Lopez-Paz; Hongyi Zhang; Moustapha Cisse; Yann N. Dauphin

arxiv: 1710.09412 · v2 · submitted 2017-10-25 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-12 15:28 UTC · model grok-4.3

keywords: mixup, data augmentation, regularization, generalization, adversarial robustness, neural networks, empirical risk minimization, generative adversarial networks

The pith

Training neural networks on convex combinations of example pairs and their labels encourages linear behavior between points and improves generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes mixup as a training principle that creates virtual examples by taking convex combinations of pairs of real inputs and their labels. This regularizes the network to favor simple linear interpolations in the regions between training data rather than complex or memorized functions. A sympathetic reader would care because deep networks often memorize training details and remain fragile to small input changes despite high capacity. If the approach succeeds, it offers a lightweight way to boost performance on image, speech, and tabular tasks while addressing memorization and adversarial sensitivity.

Core claim

Mixup trains a neural network on convex combinations of pairs of examples and their labels, which regularizes the model to exhibit simple linear behavior in between training examples. Experiments on ImageNet-2012, CIFAR-10, CIFAR-100, Google commands, and UCI datasets demonstrate that this yields better generalization than standard empirical risk minimization, reduces memorization of corrupt labels, increases robustness to adversarial examples, and stabilizes generative adversarial network training.

What carries the argument

The mixup procedure, which forms each virtual training example as lambda times one input plus (1 - lambda) times another input, with the two labels mixed by the same weights.
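
The mixing step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' reference implementation: the function name `mixup_batch`, the choice to pair each example with a random permutation of the same batch, and the use of a single lambda per batch are assumptions made for brevity. Labels are assumed to be one-hot (or soft) arrays so that convex combinations are well-defined.

```python
import numpy as np


def mixup_batch(x, y, lam, rng):
    """Return convex combinations of a batch with a shuffled copy of itself.

    x:   array of shape (batch, ...) holding inputs
    y:   array of shape (batch, num_classes) holding one-hot or soft labels
    lam: mixing weight in [0, 1]
    rng: numpy.random.Generator used to pick the partner of each example
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    perm = rng.permutation(x.shape[0])        # partner index for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]   # mixed inputs
    y_mix = lam * y + (1.0 - lam) * y[perm]   # labels mixed with the same weights
    return x_mix, y_mix, perm


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                   # four toy inputs with three features
y = np.eye(2)[[0, 1, 0, 1]]                   # one-hot labels for two classes
x_mix, y_mix, perm = mixup_batch(x, y, lam=0.7, rng=rng)
```

Each row of `y_mix` still sums to one, so it can be read as a soft label; with `lam=1.0` the function returns the original batch unchanged.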

If this is right

State-of-the-art networks achieve higher accuracy on large-scale image classification benchmarks.
Models become less prone to fitting noisy or corrupted training labels.
Networks exhibit greater tolerance to small adversarial perturbations in their inputs.
Training dynamics for generative adversarial networks become more stable across runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The regularization implicitly smooths the learned function over the convex hull of the training data.
Mixup could serve as a drop-in replacement for other vicinal risk minimization strategies that operate only in input space.
The same mixing principle may extend naturally to tasks beyond classification, such as regression or sequence modeling, where linear label combinations remain well-defined.

Load-bearing premise

That linear interpolation between training examples in input space corresponds to a meaningful linear interpolation in label space that improves the learned function's generalization.

What would settle it

An experiment on a synthetic classification task with deliberately non-linear class boundaries between examples where mixup produces lower test accuracy than standard training.
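
One way such a conflict could arise is sketched below on a hypothetical two-ring dataset. The radii, the decision boundary, and the helper `true_class` are illustrative assumptions, not an experiment from the paper. Mixing two antipodal points of the outer ring with lambda = 0.5 lands at the origin, inside the inner class, while the mixed label still puts all of its mass on the outer class.

```python
import numpy as np


def true_class(point, boundary_radius=1.5):
    """Hypothetical ground truth: class 0 inside the boundary, class 1 outside."""
    return 0 if np.linalg.norm(point) < boundary_radius else 1


# Two antipodal points on the outer ring (radius 2), both labeled class 1.
a = np.array([2.0, 0.0])
b = np.array([-2.0, 0.0])
label_a = np.array([0.0, 1.0])
label_b = np.array([0.0, 1.0])

lam = 0.5
x_mix = lam * a + (1.0 - lam) * b               # lands at the origin
y_mix = lam * label_a + (1.0 - lam) * label_b   # still entirely class 1

mixed_label_class = int(np.argmax(y_mix))
ground_truth_class = true_class(x_mix)
conflict = mixed_label_class != ground_truth_class
```

The sketch only shows that a mixed label can disagree with the assumed ground truth at the mixed point; whether such disagreements lower test accuracy in practice is the empirical question the proposed experiment would have to answer.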

Original abstract

Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. In this work, we propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mixup is a simple label-mixing regularization trick that delivers consistent empirical gains on image and other benchmarks, even if the linearity story remains informal.

Mixup trains networks on convex combinations of example pairs and their labels, with the mixing weight drawn from a beta distribution. The paper shows this change improves test accuracy on CIFAR-10, CIFAR-100, and ImageNet, while also cutting error under label noise and raising robustness to adversarial examples. The core idea is new in its direct application of vicinal risk minimization to both inputs and labels at once, and the implementation is trivial to add to existing pipelines. Experiments cover multiple architectures and datasets, including some non-image tasks, which gives the results decent breadth. The gains on noisy labels and GAN stability are useful side observations that go beyond the main generalization claim. The soft spot is the explanation: the paper argues that the procedure pushes the model toward linear behavior between training points, but supplies only an intuitive regularization argument rather than bounds or a derivation that would pin this down. For image data the interpolated inputs often leave the natural manifold, so the mixed labels are artificial targets; the observed benefits could stem from generic augmentation effects instead. That concern is reasonable, yet the held-out test improvements are reproducible enough that the practical payoff does not depend on the precise mechanism being correct. The citation pattern is appropriate and does not lean on self-reference for the main claims. This paper is aimed at anyone training deep networks who wants a low-cost way to regularize without changing the architecture. It is worth bringing to a reading group because the method is easy to reimplement and the empirical claims are straightforward to check. It deserves peer review: the experimental evidence is solid and the idea is distinct from prior augmentation work, so referees can usefully weigh in on the strength of the results and the informal theory.

Referee Report

1 major / 2 minor

Summary. The paper claims that training neural networks on convex combinations of pairs of examples and their labels (mixup) regularizes the model to exhibit linear behavior between training examples, thereby improving generalization, reducing memorization of corrupt labels, and increasing robustness to adversarial examples. This is demonstrated through experiments on ImageNet-2012, CIFAR-10, CIFAR-100, Google commands, and UCI datasets using various neural network architectures.

Significance. If the results hold, mixup provides a straightforward and computationally efficient regularization technique that extends empirical risk minimization in a novel way. The paper is to be credited for its extensive empirical evaluation across multiple domains and tasks, including the additional findings on label noise robustness and GAN stabilization. These aspects make the contribution significant for practical deep learning applications.

major comments (1)

§2: The regularization argument that mixup favors 'simple linear behavior in-between training examples' is presented heuristically via the vicinal distribution construction without a formal derivation, generalization bound, or analysis showing why label-space interpolation is semantically appropriate. This is load-bearing for the central claim of going 'beyond empirical risk minimization' in a principled way, as opposed to a generic augmentation effect.

minor comments (2)

The algorithm description would benefit from explicit pseudocode to aid reproducibility.
Figure captions and axis labels in the experimental sections could be expanded for standalone clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive review, the recognition of the empirical contributions across multiple domains, and the recommendation for minor revision. We address the single major comment below.

Referee: §2: The regularization argument that mixup favors 'simple linear behavior in-between training examples' is presented heuristically via the vicinal distribution construction without a formal derivation, generalization bound, or analysis showing why label-space interpolation is semantically appropriate. This is load-bearing for the central claim of going 'beyond empirical risk minimization' in a principled way, as opposed to a generic augmentation effect.

Authors: We agree that the motivation in §2 is heuristic and does not include a formal generalization bound. The argument extends the vicinal risk minimization framework of Chapelle et al. (2000), where the vicinal distribution is instantiated via convex combinations of training examples and labels; this is a deliberate design choice rather than generic augmentation. Label interpolation is semantically motivated for classification because one-hot (or soft) labels represent class probabilities, and linear interpolation in label space encourages the network to output probabilities that vary smoothly between classes, consistent with the assumption that the underlying data manifold is locally linear. We will revise the manuscript to (i) cite the vicinal risk minimization literature more explicitly, (ii) clarify that the linear-behavior inductive bias is the core modeling assumption, and (iii) distinguish mixup from standard augmentation by emphasizing the joint interpolation of inputs and labels. A full theoretical analysis with generalization bounds is beyond the scope of the current work, which prioritizes broad empirical validation.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; mixup defines an augmentation procedure whose effects are measured empirically on held-out data.

full rationale

The paper introduces mixup by defining a vicinal distribution over convex combinations of input-label pairs and then applies standard ERM to samples from that distribution. The claimed regularization toward linear behavior between examples is the direct, definitional consequence of minimizing loss on those constructed pairs; it is not derived as a separate prediction. Generalization, robustness, and stability improvements are reported via experiments on independent test sets (ImageNet, CIFAR, etc.), with no fitted parameters, self-citations, or uniqueness theorems invoked to support the central claim. The derivation chain is therefore self-contained and externally falsifiable.
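
A small numerical check can make the "standard ERM on constructed pairs" step concrete. The sketch below assumes cross-entropy as the loss, which the passage above does not specify; the helper names and toy logits are likewise illustrative. Because cross-entropy is linear in the target distribution, the loss on a mixed label equals the lambda-weighted sum of the losses on the two original labels.

```python
import numpy as np


def log_softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, target):
    # target is a probability vector (one-hot or soft label).
    return -(target * log_softmax(logits)).sum(axis=-1)


# Toy logits standing in for a network's output on one mixed input.
logits = np.array([2.0, -1.0, 0.5])
y_i = np.array([1.0, 0.0, 0.0])
y_j = np.array([0.0, 0.0, 1.0])
lam = 0.3

loss_on_mixed_label = cross_entropy(logits, lam * y_i + (1.0 - lam) * y_j)
weighted_loss = (lam * cross_entropy(logits, y_i)
                 + (1.0 - lam) * cross_entropy(logits, y_j))
```

The two quantities agree up to floating-point error, so under this assumed loss an implementation may either mix the labels or mix the two per-label losses.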

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one tunable hyper-parameter for the mixing distribution and the domain assumption that input-space linear interpolation is a useful vicinal distribution for labels.

free parameters (1)

alpha
Controls the Beta(alpha, alpha) distribution from which the mixing coefficient lambda is sampled; chosen per dataset.
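
The effect of alpha can be seen by sampling lambda directly. The sketch below uses NumPy's `Generator.beta`; the alpha values, the sample size, and the 0.1 margin are arbitrary choices for illustration, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000


def fraction_near_endpoints(alpha, margin=0.1):
    """Share of lambda samples from Beta(alpha, alpha) within `margin` of 0 or 1."""
    lam = rng.beta(alpha, alpha, size=n)
    return np.mean((lam < margin) | (lam > 1.0 - margin))


near_small = fraction_near_endpoints(0.2)    # mass piles up near 0 and 1
near_uniform = fraction_near_endpoints(1.0)  # Beta(1, 1) is uniform on [0, 1]
near_large = fraction_near_endpoints(5.0)    # mass concentrates around 0.5
```

Small alpha keeps most virtual examples close to one of the two originals, while large alpha pushes lambda toward 0.5 and produces more evenly blended examples.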

axioms (1)

domain assumption: Training on vicinal distributions formed by linear interpolations improves generalization.
Invoked to justify why convex combinations of examples and labels should be used as training data.

pith-pipeline@v0.9.0 · 5430 in / 1106 out tokens · 30378 ms · 2026-05-12T15:28:40.009016+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · Jcost_one_plus_eps_quadratic · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples."

Foundation.DiscretenessForcing · J_log_quadratic_approx · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "mixup regularizes the neural network to favor simple linear behavior in-between training examples."

Foundation.InevitabilityStructure · inevitability · unclear
Relation between the paper passage and the cited Recognition theorem.
Passage: "Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
cs.CV 2021-03 accept novelty 8.0

Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
cs.CV 2026-05 unverdicted novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
Efficient and provably convergent end-to-end training of deep neural networks with linear constraints
math.OC 2026-05 unverdicted novelty 7.0

An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
cs.AI 2026-05 conditional novelty 7.0

LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
cs.CV 2026-05 conditional novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...
Domain Generalization through Spatial Relation Induction over Visual Primitives
cs.CV 2026-05 unverdicted novelty 7.0

PARSE improves domain generalization accuracy by factoring recognition into visual primitives and their spatial relational compositions learned end-to-end with differentiable predicates.
LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection
cs.CV 2026-05 unverdicted novelty 7.0

LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.
SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition
cs.CV 2026-05 unverdicted novelty 7.0

SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames a...
Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection
cs.CV 2026-04 unverdicted novelty 7.0

A replay method for continual face forgery detection condenses real-fake distribution discrepancies into compact maps and synthesizes compatible samples from current real faces to reduce forgetting under tight memory ...
Is your algorithm unlearning or untraining?
cs.LG 2026-04 conditional novelty 7.0

Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
cs.CV 2026-03 unverdicted novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
ST-BCP: Tightening Coverage Bound for Backward Conformal Prediction via Non-Conformity Score Transformation
stat.ML 2026-02 conditional novelty 7.0

ST-BCP tightens the coverage bound in Backward Conformal Prediction by applying a computable data-dependent transformation to nonconformity scores, reducing the average gap from 4.20% to 1.12% on benchmarks while prov...
Chronos: Learning the Language of Time Series
cs.LG 2024-03 conditional novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
The DeepFake Detection Challenge (DFDC) Dataset
cs.CV 2020-06 accept novelty 7.0

The DFDC dataset is the largest public collection of face-swapped videos and supports detectors that generalize to in-the-wild deepfakes.
GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels
cs.CV 2026-05 unverdicted novelty 6.0

GAMR introduces geometric-aware manifold regularization via virtual outlier synthesis to enhance intra-class compactness and inter-class separation, improving robustness to noisy labels beyond passive sample filtering.
The General Theory of Localization Methods
cs.LG 2026-05 unverdicted novelty 6.0

The localization method unifies kernel methods, local learning algorithms, MeanShift, Hopfield networks, and Transformers through local models, localization tricks, and hierarchical extensions.
HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels
cs.CV 2026-05 unverdicted novelty 6.0

HamBR uses Spherical HMC to probe ambiguous regions and synthesize virtual outliers with energy-based repulsion to restore decision boundaries degraded by noisy labels, achieving SOTA on CIFAR and real-world benchmarks.
LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection
cs.LG 2026-05 unverdicted novelty 6.0

LiBaGS scores and selects synthetic data near decision boundaries using proximity, uncertainty, density, and validity, with boundary-gap allocation and marginal stopping to improve training accuracy.
Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning
cs.CV 2026-05 unverdicted novelty 6.0

CORF unifies domain generalization and class-incremental learning via selective sample refinement with spatial maps and confidence weighting plus cascaded relational distillation.
ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
cs.CV 2026-05 accept novelty 6.0

CircleID introduces a controlled dataset of 46,155 circles from 66 writers and 8 pens, with competition results showing top accuracies of 64.8% for open-set writer identification and 92.7% for pen classification.
ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
cs.CV 2026-05 accept novelty 6.0

A new dataset of hand-drawn circles from 66 writers and 8 pens yields competition results of 64.8% top-1 accuracy for open-set writer identification and 92.7% for pen classification.
Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
cs.CV 2026-05 conditional novelty 6.0

Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
Cheeger--Hodge Contrastive Learning for Structurally Robust Graph Representation Learning
cs.LG 2026-04 unverdicted novelty 6.0

CHCL aligns a Cheeger-Hodge joint signature across graph augmentations to produce embeddings that remain stable under local structural changes.
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
cs.CV 2026-04 unverdicted novelty 6.0

TranCLR models continuous skeleton action spaces with transitional anchors and multi-level manifold calibration, yielding smoother and more accurate representations than binary contrastive methods.
PAC-Bayes Bounds for Gibbs Posteriors via Singular Learning Theory
stat.ML 2026-04 unverdicted novelty 6.0

PAC-Bayes bounds for Gibbs posteriors are obtained via singular learning theory, producing explicit and tighter posterior-averaged risk bounds that adapt to data structure in overparameterized models.
Human Gaze-based Dual Teacher Guidance Learning for Semi-Supervised Medical Image Segmentation
eess.IV 2026-04 unverdicted novelty 6.0

HG-DTGL integrates human gaze as an extra teacher in mean-teacher learning via GazeMix, MGP module and Gaze Loss, reporting superior segmentation across ten organs on multiple modalities.
Feature-Aware Anisotropic Local Differential Privacy for Utility-Preserving Graph Representation Learning in Metal Additive Manufacturing
cs.LG 2026-04 unverdicted novelty 6.0

FI-LDP-HGAT applies feature-importance-aware anisotropic local differential privacy to a hierarchical graph attention network, recovering 81.5% utility at epsilon=4 and 0.762 defect recall at epsilon=2 on a DED porosi...
OASIC: Occlusion-Agnostic and Severity-Informed Classification
cs.CV 2026-04 conditional novelty 6.0

OASIC uses anomaly-based masking and severity estimation to select occlusion-matched models, improving AUC on occluded images by up to 23.7 points.
Can LLMs Learn to Reason Robustly under Noisy Supervision?
cs.LG 2026-04 conditional novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...
Rényi Attention Entropy for Patch Pruning
cs.CV 2026-04 unverdicted novelty 6.0

Rényi entropy of attention maps serves as a tunable criterion for pruning redundant patches in vision transformers, reducing compute with preserved accuracy on image recognition.
Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study
cs.CV 2026-02 unverdicted novelty 6.0

Benchmark shows that combining data rebalancing with feature disentanglement mitigates shortcut learning more effectively than rebalancing alone in medical imaging models.
A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks
cs.CV 2025-12 unverdicted novelty 6.0

NN-RAG extracts 1,289 candidate neural modules from 19 PyTorch repositories, validates 941 of them, and supplies roughly 72% of the novel structures in the LEMUR dataset while enabling cross-repository migration.
LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios
cs.LG 2025-09 unverdicted novelty 6.0

LoFT uses parameter-efficient fine-tuning of foundation models for long-tailed semi-supervised learning, supported by proofs that this reduces hypothesis complexity to minimize balanced posterior error and compresses ...
Normalized Matching Transformer
cs.CV 2025-03 unverdicted novelty 6.0

Normalized Matching Transformer enforces unit-norm embeddings at every Transformer layer and trains with InfoNCE plus hyperspherical uniformity loss, reaching new state-of-the-art accuracy on PascalVOC and SPair-71k w...
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks
cs.CV 2025-01 unverdicted novelty 6.0

SpikeDet reaches 52.2% AP on COCO 2017 with spiking networks by optimizing firing patterns via MDSNet and SMFM, using half the energy of prior SNN detectors.
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Sharpness-Aware Minimization for Efficiently Improving Generalization
cs.LG 2020-10 conditional novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
Learning Data Augmentation Strategies for Object Detection
cs.CV 2019-06 unverdicted novelty 6.0

Learned data augmentation policies optimized for object detection improve COCO mAP by more than 2.3 and transfer to other datasets and models.
Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label
cs.CV 2026-05 unverdicted novelty 5.0

HRP decouples annotation reliability (alpha) and pseudo-label reliability (beta) via bilevel meta-learning and routes them to distinct objectives in reliability-aware Mixup and contrastive learning for improved noisy-...
Axiomatizing Neural Networks via Pursuit of Subspaces
cs.LG 2026-05 unverdicted novelty 5.0

Authors introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic geometric framework that unifies explanations for representation, computation, and generalization in shallow and deep neural networks.
Graph Transductive Sharpening: Leveraging Unlabeled Predictions in Node Classification
cs.LG 2026-05 unverdicted novelty 5.0

Transductive Sharpening adds an entropy-minimization term on unlabeled-node predictions to the training objective for graph node classification.
LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection
cs.LG 2026-05 unverdicted novelty 5.0

LiBaGS is a lightweight method that picks synthetic data near decision boundaries while checking density and validity to improve training accuracy over standard oversampling or uncertainty sampling.
CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition
cs.CV 2026-05 unverdicted novelty 5.0

CAST achieves 80.5% Top-1 accuracy on radar-only sign language recognition by fusing physics-aware CVD and RTM representations through channel-aware spatial attention and asymmetric cross-attention.
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
cs.LG 2026-05 unverdicted novelty 5.0

Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
HiMix: Hierarchical Artifact-aware Mixup for Generalized Synthetic Image Detection
cs.CV 2026-04 unverdicted novelty 5.0

HiMix combines mixup augmentation to create transitional real-fake samples with hierarchical global-local artifact feature fusion to achieve better generalization in detecting AI-generated images from unseen generators.
Investigating Bias and Fairness in Appearance-based Gaze Estimation
cs.CV 2026-04 unverdicted novelty 5.0

First large-scale fairness audit of gaze estimators reveals sizable accuracy disparities by ethnicity and gender, with existing mitigation methods providing only marginal fairness gains.
Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
cs.CV 2026-04 unverdicted novelty 5.0

Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.
Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization
cs.CL 2026-04 unverdicted novelty 5.0

MaKD distills pre-trained language models by deeply mimicking self-attention and feed-forward modules across aspects using low-rank factorization, matching strong baselines at the same parameter budget and extending t...
Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
eess.IV 2026-04 unverdicted novelty 5.0

MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels
cs.CV 2025-09 unverdicted novelty 5.0

ORDAC adaptively corrects noisy ordinal labels via dynamic label distribution adjustments, yielding lower error and higher recall on noisy Adience and Diabetic Retinopathy benchmarks.
Two-Stage Framework for Efficient UAV-Based Wildfire Video Analysis with Adaptive Compression and Fire Source Detection
cs.CV 2025-08 unverdicted novelty 5.0

A two-stage UAV framework prunes redundant wildfire video clips via a policy network with station point mechanism and detects fire sources in real time using an improved YOLOv8 model.
i-WiViG: Interpretable Window Vision GNN
cs.CV 2025-03 unverdicted novelty 5.0

i-WiViG is an interpretable window vision GNN that constrains nodes to disjoint local windows and applies learnable sparse attention to identify relevant subgraphs, delivering competitive performance on scene classifi...
Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond
cs.AI 2024-08 unverdicted novelty 5.0

The ADC method automates the creation of large image classification datasets using LLMs and search engines, achieving 79% human agreement and reducing label noise on a 1 million image clothing dataset, while also rele...
Know Yourself Better: Diverse Object-Related Features Improve Open Set Recognition
cs.CV 2024-04 unverdicted novelty 5.0

Diverse discriminative features correlate with and can be leveraged to improve open set recognition performance over prior methods.
YOLOv4: Optimal Speed and Accuracy of Object Detection
cs.CV 2020-04 unverdicted novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
Annotation-Free Cardiac Vessel Segmentation via Knowledge Transfer from Retinal Images
eess.IV 2019-07 unverdicted novelty 5.0

SC-GAN performs annotation-free coronary artery segmentation by transferring shape-consistent knowledge from retinal vessel annotations via a GAN trained on 1092 DSA images.
The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification
cs.LG 2019-07 unverdicted novelty 5.0

Tuning receptive field sizes in ResNet and DenseNet enables them to outperform VGG models on acoustic scene classification across three datasets.
Efficient data augmentation using graph imputation neural networks
stat.ML 2019-06 unverdicted novelty 5.0

Graph imputation neural networks augment semi-supervised datasets up to 10x by reconstructing heavily damaged samples on a similarity graph, improving over fully-supervised baselines on benchmarks.
An interpretable vision transformer framework for automated brain tumor classification
cs.CV 2026-04 unverdicted novelty 4.0

Vision Transformer with CLAHE preprocessing, two-stage fine-tuning, MixUp/CutMix, EMA, TTA, and attention rollout achieves 99.29% accuracy and 99.25% macro F1 on four-class brain tumor MRI classification from 7023 scans.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 65 Pith papers

[1]

D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016

work page 2016
[2]

D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. A closer look at memorization in deep networks. ICML, 2017

work page 2017
[3]

P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. NIPS, 2017

work page 2017
[4]

O. Chapelle, J. Weston, L. Bottou, and V. Vapnik. Vicinal risk minimization. NIPS, 2000

work page 2000
[5]

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002

work page 2002
[6]

C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv, 2013

work page 2013
[7]

M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. ICML, 2017

work page 2017
[8]

W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Świrszcz, and R. Pascanu. Sobolev training for neural networks. NIPS, 2017

work page 2017
[9]

T. DeVries and G. W. Taylor. Dataset augmentation in feature space. ICLR Workshops, 2017

work page 2017
[10]

H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992

work page 1992
[11]

I. Goodfellow. Tutorial: Generative adversarial networks. NIPS, 2016

work page 2016
[12]

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, 2014

work page 2014
[13]

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015

work page 2015
[14]

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv, 2017

work page 2017
[15]

A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP. IEEE, 2013

work page 2013
[16]

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. NIPS, 2017

work page 2017
[17]

N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. JMLR, 2017

work page 2017
[18]

K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. ECCV, 2016

work page 2016
[19]

M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. NIPS, 2017

work page 2017
[20]

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012

work page 2012
[21]

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. CVPR, 2017

work page 2017
[22]

D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015

work page 2015
[23]

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012

work page 2012
[24]

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of IEEE, 2001

work page 2001
[25]

M. Lichman. UCI machine learning repository, 2013

work page 2013
[26]

K. Liu, 2017. URL https://github.com/kuangliu/pytorch-cifar

work page 2017
[27]

G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. ICLR Workshops, 2017

work page 2017
[28]

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015

work page 2015
[29]

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016

work page 2016
[30]

P. Simard, Y. LeCun, J. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. Neural networks: tricks of the trade, 1998

work page 1998
[31]

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015

work page 2015
[32]

J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. ICLR Workshops, 2015

work page 2015
[33]

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014

work page 2014
[34]

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ICLR, 2014

work page 2014
[35]

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016

work page 2016
[36]

V. N. Vapnik. Statistical learning theory. J. Wiley, 1998

work page 1998
[37]

V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 1971

work page 1971
[38]

A. Veit, 2017. URL https://github.com/andreasveit

work page 2017
[39]

P. Warden, 2017. URL https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html

work page 2017
[40]

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. CVPR, 2016

work page 2016
[41]

S. Zagoruyko and N. Komodakis. Wide residual networks. BMVC, 2016a

work page 2016
[42]

S. Zagoruyko and N. Komodakis, 2016b. URL https://github.com/szagoruyko/wide-residual-networks

work page 2016
[43]

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017

work page 2017
[44]

C. Zhang, 2017. URL https://github.com/pluskid/fitting-random-labels

work page 2017
[45]

Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv, 2017

work page 2017
