Mixup of Feature Maps in a Hidden Layer for Training of Convolutional Neural Network
Pith reviewed 2026-05-25 17:59 UTC · model grok-4.3
The pith
Mixing feature maps in the first hidden layer of a CNN improves generalization more than mixing the input images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that mixup applied to the feature maps produced by the first convolution layer yields better recognition accuracy on out-of-distribution samples than the original mixup performed on the input images themselves.
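One way to write down the two operations being compared is sketched below. The notation is assumed here for illustration and is not taken from the paper: x_i and x_j are two training images with one-hot labels y_i and y_j, λ in [0, 1] is the mixing coefficient (drawn from a Beta(α, α) distribution in the original mixup), and W, b, and φ stand for the first convolution's kernel, bias, and activation.

```latex
\begin{align}
\text{image mixup:}\quad
  & \tilde{x} = \lambda x_i + (1-\lambda)\, x_j,
  \qquad \tilde{y} = \lambda y_i + (1-\lambda)\, y_j \\
\text{first-layer mixup:}\quad
  & \tilde{h} = \lambda\, \phi(W * x_i + b) + (1-\lambda)\, \phi(W * x_j + b),
  \qquad \tilde{y} \text{ as above} \\
\text{if } \phi = \mathrm{id}:\quad
  & \tilde{h} = W * \tilde{x} + b
\end{align}
```

The last line follows because a convolution plus bias is affine. Under these assumptions the two schemes can differ only when the mixing happens after a nonlinearity such as ReLU or pooling. The text here does not say whether the paper mixes before or after the activation.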
What carries the argument
A Siamese or triplet network architecture whose weight-sharing branches let corresponding feature maps be mixed at a chosen hidden layer while the CNN is trained.
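The weight-sharing idea can be illustrated with a minimal sketch. This is not the authors' implementation: a 1-D convolution followed by ReLU stands in for the first layer, and all function names, the kernel, and the inputs are hypothetical. It assumes the mixing happens after the activation.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1-D cross-correlation, a toy stand-in for the first conv layer."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

def relu(h):
    return [max(0.0, v) for v in h]

def first_layer(x, kernel):
    # Every branch calls this with the SAME kernel: that is the weight sharing.
    return relu(conv1d(x, kernel))

def mix(a, b, lam):
    """Element-wise linear interpolation lam * a + (1 - lam) * b."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

def siamese_feature_mixup(x_a, x_b, kernel, lam):
    """Two weight-sharing branches, mixed at the first-layer output."""
    h_a = first_layer(x_a, kernel)  # branch 1
    h_b = first_layer(x_b, kernel)  # branch 2, same weights
    return mix(h_a, h_b, lam)       # would feed the remaining shared trunk

def image_mixup(x_a, x_b, kernel, lam):
    """Baseline: mix the raw inputs, then run a single branch."""
    return first_layer(mix(x_a, x_b, lam), kernel)

# Hypothetical toy inputs (assumptions for illustration only).
kernel = [1.0, -1.0]
x_a = [0.0, 1.0, 3.0, 2.0]
x_b = [4.0, 1.0, 0.0, 5.0]
lam = 0.5

feat = siamese_feature_mixup(x_a, x_b, kernel, lam)
img = image_mixup(x_a, x_b, kernel, lam)
print(feat)  # [1.5, 0.5, 0.5]
print(img)   # [1.0, 0.0, 0.0]
```

On this toy input the two mixing sites give different first-layer outputs, because ReLU is applied to each branch before the interpolation. A triplet variant would add a third call to `first_layer` with the same kernel.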
If this is right
- Mixup performed after the first convolution outperforms mixup performed on the raw pixels.
- The improvement holds across multiple standard image-classification datasets.
- The method requires no change to the underlying loss function beyond the standard mixup interpolation.
- Later convolutional layers can also receive mixup but yield smaller gains than the first layer.
Where Pith is reading between the lines
- The same architecture could be used to test whether mixing at multiple layers simultaneously compounds the benefit.
- The approach might combine with other feature-space regularizers such as dropout or batch-norm statistics.
- If the first-layer mixup effect generalizes, it could be inserted into any CNN without redesigning the data loader.
Load-bearing premise
The Siamese or triplet structure isolates the benefit of mixing feature maps without changing the loss landscape or training dynamics in other ways.
What would settle it
A controlled experiment that applies the identical linear mixing coefficients directly to feature maps without using a Siamese or triplet network and measures whether the accuracy gain disappears.
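A sketch of how such a control could be set up is given below. It is an assumption about the experiment, not something the paper describes: a single network processes the batch once and each feature map is then mixed with that of a partner sample from the same batch. The function names, the 1-D toy layer, and the data are hypothetical.

```python
def conv_relu(x, kernel):
    """Toy first layer: valid 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(kernel[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]

def mix(a, b, lam):
    """Element-wise linear interpolation lam * a + (1 - lam) * b."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

def single_network_feature_mixup(batch, partner, kernel, lam):
    """One pass over the batch, then mix each map with its partner's map."""
    feats = [conv_relu(x, kernel) for x in batch]  # no second branch
    return [mix(feats[i], feats[partner[i]], lam) for i in range(len(batch))]

def two_branch_feature_mixup(batch, partner, kernel, lam):
    """Siamese-style reference: one weight-sharing pass per branch."""
    return [mix(conv_relu(batch[i], kernel),
                conv_relu(batch[partner[i]], kernel), lam)
            for i in range(len(batch))]

# Hypothetical toy batch; partner[i] is the index sample i is mixed with.
batch = [[0.0, 1.0, 3.0, 2.0], [4.0, 1.0, 0.0, 5.0], [2.0, 2.0, 1.0, 0.0]]
partner = [1, 2, 0]
kernel = [1.0, -1.0]
lam = 0.25

single = single_network_feature_mixup(batch, partner, kernel, lam)
siamese = two_branch_feature_mixup(batch, partner, kernel, lam)
print(single == siamese)  # True
```

In this toy setting the mixed feature maps coincide, since the branches share weights. The sketch covers only the forward computation. It says nothing about gradients, batch statistics, or any loss terms specific to the Siamese or triplet setup, which is where the proposed control would have to look.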
Original abstract
The deep Convolutional Neural Network (CNN) became very popular as a fundamental technique for image classification and objects recognition. To improve the recognition accuracy for the more complex tasks, deeper networks have being introduced. However, the recognition accuracy of the trained deep CNN drastically decreases for the samples which are obtained from the outside regions of the training samples. To improve the generalization ability for such samples, Krizhevsky et al. proposed to generate additional samples through transformations from the existing samples and to make the training samples richer. This method is known as data augmentation. Hongyi Zhang et al. introduced data augmentation method called mixup which achieves state-of-the-art performance in various datasets. Mixup generates new samples by mixing two different training samples. Mixing of the two images is implemented with simple image morphing. In this paper, we propose to apply mixup to the feature maps in a hidden layer. To implement the mixup in the hidden layer we use the Siamese network or the triplet network architecture to mix feature maps. From the experimental comparison, it is observed that the mixup of the feature maps obtained from the first convolution layer is more effective than the original image mixup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes extending the mixup data-augmentation technique from input images to feature maps inside a CNN hidden layer. It implements the hidden-layer mixing via Siamese or triplet network branches that share weights, and reports that mixing the feature maps from the first convolutional layer yields higher accuracy than standard image-level mixup.
Significance. If the superiority can be isolated to the mixing location rather than to the altered architecture and loss, the result would supply a concrete, testable extension of mixup to internal representations and could motivate further work on where in the network to perform interpolation-based augmentation.
major comments (1)
- [Abstract / Method] Abstract and method description: the headline claim that first-convolution feature-map mixup is more effective than image mixup rests on comparisons performed inside Siamese or triplet architectures; no ablation is described that keeps the network topology, loss, and training dynamics fixed while moving only the site of mixing. Consequently the reported gain cannot be unambiguously attributed to the hidden-layer location.
minor comments (2)
- [Abstract] Abstract: the sentence 'deeper networks have being introduced' contains a grammatical error.
- [Abstract] Abstract: the experimental comparison is asserted without any mention of the datasets, architectures, metrics, or number of runs, making it impossible for a reader to assess the strength of the evidence from the provided text alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make.
Point-by-point responses
- Referee: [Abstract / Method] Abstract and method description: the headline claim that first-convolution feature-map mixup is more effective than image mixup rests on comparisons performed inside Siamese or triplet architectures; no ablation is described that keeps the network topology, loss, and training dynamics fixed while moving only the site of mixing. Consequently the reported gain cannot be unambiguously attributed to the hidden-layer location.
  Authors: We agree that the current experimental design does not fully isolate the effect of the mixing location from the change in architecture. The Siamese/triplet structure is necessary to enable feature-map mixing with shared weights, but this introduces a confound relative to standard image mixup on a conventional CNN. To resolve this, we will add a controlled ablation that applies image-level mixup inside the identical Siamese/triplet architecture (same topology, loss, and training dynamics) and directly compares it to feature-map mixup within that same architecture. The abstract and method sections will be updated to present this comparison and clarify the attribution of gains.
  Revision: yes
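One way to keep such an ablation paired is for both arms to consume the same sequence of sample pairs and mixing coefficients, so that only the mixing site differs. The sketch below is a hypothetical protocol, not the authors' code: the arm names, the function names, and the Beta(α, α) draw (borrowed from the original mixup) are assumptions.

```python
import random

ARMS = ("input", "conv1")  # hypothetical labels for the two mixing sites

def make_mixup_schedule(n_samples, n_steps, alpha, seed):
    """Draw (i, j, lam) triples: two sample indices and a Beta(alpha, alpha) weight."""
    rng = random.Random(seed)  # private generator, so the draw is reproducible
    schedule = []
    for _ in range(n_steps):
        i = rng.randrange(n_samples)
        j = rng.randrange(n_samples)
        lam = rng.betavariate(alpha, alpha)
        schedule.append((i, j, lam))
    return schedule

def plan_ablation(n_samples, n_steps, alpha, seed):
    """Give every arm an identical schedule so only the mixing site changes."""
    return {arm: make_mixup_schedule(n_samples, n_steps, alpha, seed)
            for arm in ARMS}

plan = plan_ablation(n_samples=100, n_steps=5, alpha=1.0, seed=0)
print(plan["input"] == plan["conv1"])  # True
```

Reusing one seed per repetition and varying it across repetitions would also give the number of runs that the referee's minor comment asks about.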
Circularity Check
No circularity; empirical proposal with no derivation reducing to inputs
The paper proposes applying mixup to hidden-layer feature maps via Siamese/triplet architectures and reports experimental accuracy comparisons. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. The central claim rests on direct empirical results rather than any self-referential construction. The architecture change noted in the referee report is a methodological limitation but does not constitute circularity under the defined patterns.
Reference graph
Works this paper leans on
- [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Proc. Conf. Neural Information Processing Systems, pp. 1097-1105, 2012.
- [2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv:1312.6199, 2014.
- [3] P. Simard, Y. LeCun, J. Denker, and B. Victorri, “Transformation invariance in pattern recognition tangent distance and tangent propagation,” in Neural Networks: Tricks of the Trade, 1998.
- [4] T. Kurita, H. Asoh, S. Umeyama, S. Akaho, and A. Hosomi, “A structural learning by adding independent noises to hidden units,” Proc. of IEEE Inter. Conf. on Neural Networks (ICNN’94), pp. 275-278, 1994.
- [5] M. Sabri and T. Kurita, “Effect of additive noise for multi-layered Perceptron with autoencoders,” IEICE Trans. Information and Systems, Vol. E100D, No. 7, pp. 1494-1504, 2017.
- [6] H. Inayoshi and T. Kurita, “Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers,” IEEE Workshop on Machine Learning for Signal Processing, pp. 141-146, 2005.
- [7] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” Proc. of 2018 International Conference on Learning Representations (ICLR2018), 2018.
- [8] Y. Tokozume, Y. Ushiku, and T. Harada, “Learning from Between-class Examples for Deep Sound Recognition,” Proc. of 2018 International Conference on Learning Representations (ICLR2018), 2018.
- [9] Y. Tokozume, Y. Ushiku, and T. Harada, “Between-class Learning for Image Classification,” Proc. of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2018), 2018.
- [10] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a siamese time delay neural network,” in Advances in Neural Information Processing Systems, Vol. 6 (NIPS 1993), 1993.
- [11] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2005), Vol. 1, pp. 539-546, 2005.
- [12] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” Proc. of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2006), Vol. 2, pp. 1735-1742, 2006.
- [13]
- [14] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio, “Manifold Mixup: Better Representations by Interpolating Hidden States,” Proc. of 2019 International Conference on Machine Learning (ICML2019), 2019.