Mixup of Feature Maps in a Hidden Layer for Training of Convolutional Neural Network

Hideki Oki; Takio Kurita

arxiv: 1906.09739 · v1 · pith:IWU44F6Dnew · submitted 2019-06-24 · 💻 cs.CV

Pith reviewed 2026-05-25 17:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords mixup · data augmentation · convolutional neural network · feature maps · hidden layer · Siamese network · generalization · image classification

The pith

Mixing feature maps in the first hidden layer of a CNN improves generalization more than mixing the input images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes extending the mixup data augmentation technique from raw images to feature maps inside a convolutional neural network. It implements this by feeding image pairs or triples through a Siamese or triplet architecture so that feature maps extracted after the first convolution can be linearly combined before further processing. Experiments indicate that this hidden-layer mixing, especially at the earliest convolutional stage, produces higher accuracy on samples drawn from outside the training distribution than standard image-level mixup. The goal is to enrich the internal representations rather than only the pixel-level training set, thereby strengthening robustness without adding new labeled data.
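To make the difference between the two mixing sites concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 1-D signals, the two-tap kernel, and the function names (`conv1d`, `image_mixup_features`, `conv1_mixup_features`) are illustrative assumptions, and it assumes the feature maps are mixed after a ReLU. Both branches use the same kernel, which is the role the weight sharing of a Siamese arrangement plays.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D cross-correlation, standing in for a first conv layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def mix(a, b, lam):
    """Element-wise convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def image_mixup_features(x_a, x_b, lam, kernel):
    # Conventional mixup: blend the inputs, then run a single branch.
    return relu(conv1d(mix(x_a, x_b, lam), kernel))


def conv1_mixup_features(x_a, x_b, lam, kernel):
    # Hidden-layer mixup: run two weight-sharing branches, then blend.
    h_a = relu(conv1d(x_a, kernel))
    h_b = relu(conv1d(x_b, kernel))
    return mix(h_a, h_b, lam)


kernel = [1.0, -1.0]            # toy edge detector (hypothetical weights)
x_a = [0.0, 1.0, 0.0, 1.0]      # toy "images" with opposite edge patterns
x_b = [1.0, 0.0, 1.0, 0.0]
lam = 0.5

from_images = image_mixup_features(x_a, x_b, lam, kernel)
from_features = conv1_mixup_features(x_a, x_b, lam, kernel)
print(from_images)    # [0.0, 0.0, 0.0]
print(from_features)  # [0.5, 0.5, 0.5]
```

With these toy inputs, blending the signals first flattens them to a constant, so the edge detector responds with zeros, whereas blending after the nonlinearity keeps half of each branch's edge response. Without the ReLU the layer is linear and both orders of operation give the same result, so any difference between the two schemes has to come from the nonlinearity or from whatever else sits between the input and the mixing site.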

Core claim

The central claim is that mixup applied to the feature maps produced by the first convolution layer yields better recognition accuracy on out-of-distribution samples than the original mixup performed on the input images themselves.

What carries the argument

Siamese or triplet network architecture that enables mixing of corresponding feature maps at a chosen hidden layer while the CNN is trained.
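For the triplet case, the mixing step can be sketched as a three-way convex combination of the branch outputs. How the paper actually draws its three weights is not stated on this page, so the symmetric Dirichlet sampling below (normalized gamma draws) is an assumption, as are the names `sample_convex_weights`, `mix_feature_maps`, and `mix_labels`. The branch outputs are hard-coded stand-ins for what three weight-sharing sub-networks would produce.

```python
import random


def sample_convex_weights(n, alpha, rng):
    """Draw n non-negative weights that sum to 1 (symmetric Dirichlet)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def mix_feature_maps(feature_maps, weights):
    """Weighted sum of equally shaped 2-D feature maps."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fm[r][c] for w, fm in zip(weights, feature_maps))
             for c in range(cols)] for r in range(rows)]


def mix_labels(one_hot_labels, weights):
    """Apply the same weights to the one-hot targets."""
    return [sum(w * y[k] for w, y in zip(weights, one_hot_labels))
            for k in range(len(one_hot_labels[0]))]


rng = random.Random(0)
weights = sample_convex_weights(3, alpha=1.0, rng=rng)

# Stand-ins for the outputs of three weight-sharing branches.
branch_outputs = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

mixed_map = mix_feature_maps(branch_outputs, weights)
mixed_label = mix_labels(labels, weights)
```

The pair (Siamese) case is the same computation with two branches and weights `[lam, 1 - lam]`.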

If this is right

Mixup performed after the first convolution outperforms mixup performed on the raw pixels.
The improvement holds across multiple standard image-classification datasets.
The method requires no change to the underlying loss function beyond the standard mixup interpolation.
Later convolutional layers can also receive mixup but yield smaller gains than the first layer.
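The point about the loss can be illustrated with a short sketch. It assumes a softmax classifier trained with cross-entropy, which this page does not confirm for the paper; the helper names and the example logits are hypothetical. Because cross-entropy is linear in the target, interpolating the two per-label losses gives the same value as evaluating one loss against the interpolated soft label, so the mixing site can move without touching the loss code.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy against a (possibly soft) target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def mixup_loss(probs, y_a, y_b, lam):
    """Interpolate the two per-label losses with the mixing coefficient."""
    return lam * cross_entropy(probs, y_a) + (1 - lam) * cross_entropy(probs, y_b)


probs = softmax([2.0, 0.5, -1.0])        # network output for one mixed sample
y_a, y_b, lam = [1, 0, 0], [0, 1, 0], 0.7
soft_target = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]

loss_two_terms = mixup_loss(probs, y_a, y_b, lam)
loss_soft_label = cross_entropy(probs, soft_target)
```

The two values agree up to floating-point rounding, whichever layer produced the mixed representation.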

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture could be used to test whether mixing at multiple layers simultaneously compounds the benefit.
The approach might combine with other feature-space regularizers such as dropout or batch-norm statistics.
If the first-layer mixup effect generalizes, it could be inserted into any CNN without redesigning the data loader.

Load-bearing premise

The Siamese or triplet structure isolates the benefit of mixing feature maps without changing the loss landscape or training dynamics in other ways.

What would settle it

A controlled experiment that applies the identical linear mixing coefficients directly to feature maps without using a Siamese or triplet network and measures whether the accuracy gain disappears.
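One way such a control could be set up is to mix within a single batch of an ordinary network, pairing each sample with a permuted partner, instead of instantiating separate branches. The sketch below is an assumption about how that might look, not something the paper describes; `first_layer`, `siamese_mix`, and `single_network_mix` are hypothetical names, and the toy layer is a 1-D convolution followed by ReLU.

```python
import random


def first_layer(x, kernel):
    """Toy first layer: valid 1-D convolution followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]


def mix(a, b, lam):
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


def siamese_mix(batch_a, batch_b, lam, kernel):
    # Two weight-sharing branches, one per member of each pair.
    return [mix(first_layer(a, kernel), first_layer(b, kernel), lam)
            for a, b in zip(batch_a, batch_b)]


def single_network_mix(batch, perm, lam, kernel):
    # One pass through an ordinary network, then pair each sample with
    # a permuted partner drawn from the same batch.
    hidden = [first_layer(x, kernel) for x in batch]
    return [mix(hidden[i], hidden[perm[i]], lam) for i in range(len(batch))]


rng = random.Random(0)
kernel = [1.0, -1.0]
batch = [[rng.random() for _ in range(6)] for _ in range(4)]
perm = list(range(len(batch)))
rng.shuffle(perm)
lam = 0.3

partners = [batch[j] for j in perm]
via_branches = siamese_mix(batch, partners, lam, kernel)
via_single_pass = single_network_mix(batch, perm, lam, kernel)
```

In this sketch the two routes produce identical mixed feature maps, which only shows that the forward computation can be matched. Whether accuracy also matches once the rest of the paper's training setup is held fixed is what the proposed experiment would have to measure.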

Figures

Figures reproduced from arXiv: 1906.09739 by Hideki Oki, Takio Kurita.

**Figure 1.** Mixup applied to training samples. Mixup was introduced by Hongyi Zhang et al. [7] as a data augmentation method. The samples are generated by mixing two different training samples with a simple weighted average. It is reported that this simple method achieves state-of-the-art performance on various datasets [7]. A similar method was introduced by Yuji Tokozume et al. [8,9]. They conducted a detail…

**Figure 2.** Siamese network. The Siamese network [10,11,12] consists of two identical sub-networks joined at their outputs, as shown in…

**Figure 3.** Conventional mixup.

**Figure 4.** Mixup applied to the first convolution layer. The mixup proposed by H. Zhang et al. generates intermediate images by mixing pairs of the original training images, as shown in…

**Figure 5.** Classification accuracy. From the comparison experiments, we observed that the effect of mixup increases for this dataset as the value of α becomes larger. Namely, the accuracies for the cases where the mixing parameter α was less than 0.7 were lower than those for α = 0.7 or α = 1.0. …

**Figure 6.** Feature maps extracted by the first convolution layer. From this figure, we can see that the feature maps obtained by the "cnn (mixup3)" model are influenced by the brightness of the original image. On the other hand, shape edges are extracted in the feature maps obtained by the "cnn (conv1-mixup 3)" model regardless of the brightness of the original image. This result shows some improvements of the…

Original abstract

The deep Convolutional Neural Network (CNN) became very popular as a fundamental technique for image classification and objects recognition. To improve the recognition accuracy for the more complex tasks, deeper networks have being introduced. However, the recognition accuracy of the trained deep CNN drastically decreases for the samples which are obtained from the outside regions of the training samples. To improve the generalization ability for such samples, Krizhevsky et al. proposed to generate additional samples through transformations from the existing samples and to make the training samples richer. This method is known as data augmentation. Hongyi Zhang et al. introduced data augmentation method called mixup which achieves state-of-the-art performance in various datasets. Mixup generates new samples by mixing two different training samples. Mixing of the two images is implemented with simple image morphing. In this paper, we propose to apply mixup to the feature maps in a hidden layer. To implement the mixup in the hidden layer we use the Siamese network or the triplet network architecture to mix feature maps. From the experimental comparison, it is observed that the mixup of the feature maps obtained from the first convolution layer is more effective than the original image mixup.
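For reference, the two mixing rules the abstract contrasts can be written out as follows. The first pair of equations is the standard mixup formulation of Zhang et al. [7]. The hidden-layer form is a notational sketch: the symbols $f_1$ (the first convolution stage, shared across branches) and $g$ (the remainder of the network) are introduced here and are not the paper's notation, and whether the α of Figure 5 is the Beta parameter shown below is an assumption.

```latex
% Input-level mixup (Zhang et al. [7]):
\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j, \qquad
\tilde{y} = \lambda y_i + (1 - \lambda)\, y_j, \qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha)

% Hidden-layer mixup after the first convolution (notation assumed):
\tilde{h} = \lambda f_1(x_i) + (1 - \lambda)\, f_1(x_j), \qquad
\hat{y} = g(\tilde{h})
```

If $f_1$ were purely linear, $\tilde{h}$ would equal $f_1(\tilde{x})$ and the two schemes would coincide, so the comparison is only meaningful when $f_1$ includes a nonlinearity.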

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends mixup to first-conv feature maps via Siamese/triplet nets but the architecture change prevents cleanly attributing gains to the mix location.

The core idea is to run mixup on feature maps from the first convolution layer instead of on raw images, implemented by feeding pairs or triplets through weight-shared branches. This is a straightforward extension of the original mixup work and the only concrete novelty here. The abstract reports that this version outperforms standard image mixup on the tasks they tried, which is the main empirical observation on offer. That observation is worth noting for anyone already working on internal-representation regularization in CNNs, because it suggests the benefit might appear early in the network rather than only at the input. The paper is otherwise conventional: it cites the usual data-augmentation and mixup references and frames the motivation around generalization outside the training distribution.

The soft spot is exactly the one flagged in the stress test. Because the method switches to a Siamese or triplet architecture to enable the hidden-layer mixing, any accuracy difference could come from the altered loss landscape or the duplicated forward passes rather than from mixing at that particular depth. No control experiment is described that keeps the network topology and objective fixed while moving only the mixup site, so the headline claim cannot be read as isolated evidence for hidden-layer mixup. The abstract supplies no numbers, datasets, or variance estimates, which leaves the size of the reported edge unknown.

For a reader who wants to try new places to apply mixup inside a standard CNN, the paper is a quick pointer to an idea worth testing with proper ablations. It does not yet supply the controls needed to treat the result as settled. I would send it to review if the full manuscript contains those controls and reproducible numbers; otherwise the central comparison remains ambiguous.

Referee Report

1 major / 2 minor

Summary. The paper proposes extending the mixup data-augmentation technique from input images to feature maps inside a CNN hidden layer. It implements the hidden-layer mixing via Siamese or triplet network branches that share weights, and reports that mixing the feature maps from the first convolutional layer yields higher accuracy than standard image-level mixup.

Significance. If the superiority can be isolated to the mixing location rather than to the altered architecture and loss, the result would supply a concrete, testable extension of mixup to internal representations and could motivate further work on where in the network to perform interpolation-based augmentation.

major comments (1)

[Abstract / Method] Abstract and method description: the headline claim that first-convolution feature-map mixup is more effective than image mixup rests on comparisons performed inside Siamese or triplet architectures; no ablation is described that keeps the network topology, loss, and training dynamics fixed while moving only the site of mixing. Consequently the reported gain cannot be unambiguously attributed to the hidden-layer location.

minor comments (2)

[Abstract] Abstract: the sentence 'deeper networks have being introduced' contains a grammatical error.
[Abstract] Abstract: the experimental comparison is asserted without any mention of the datasets, architectures, metrics, or number of runs, making it impossible for a reader to assess the strength of the evidence from the provided text alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make.

Referee: [Abstract / Method] Abstract and method description: the headline claim that first-convolution feature-map mixup is more effective than image mixup rests on comparisons performed inside Siamese or triplet architectures; no ablation is described that keeps the network topology, loss, and training dynamics fixed while moving only the site of mixing. Consequently the reported gain cannot be unambiguously attributed to the hidden-layer location.

Authors: We agree that the current experimental design does not fully isolate the effect of the mixing location from the change in architecture. The Siamese/triplet structure is necessary to enable feature-map mixing with shared weights, but this introduces a confound relative to standard image mixup on a conventional CNN. To resolve this, we will add a controlled ablation that applies image-level mixup inside the identical Siamese/triplet architecture (same topology, loss, and training dynamics) and directly compares it to feature-map mixup within that same architecture. The abstract and method sections will be updated to present this comparison and clarify the attribution of gains.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical proposal with no derivation reducing to inputs

The paper proposes applying mixup to hidden-layer feature maps via Siamese/triplet architectures and reports experimental accuracy comparisons. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. The central claim rests on direct empirical results rather than any self-referential construction. The architecture change noted in the skeptic headline is a methodological limitation but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new mathematical axioms or invented entities are introduced; the work is an empirical extension of existing mixup and CNN training assumptions.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Proc. Conf. Neural Information Processing Systems, pp. 1097-1105, 2012.

[2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv:1312.6199, 2014.

[3] P. Simard, Y. LeCun, J. Denker, and B. Victorri, "Transformation invariance in pattern recognition tangent distance and tangent propagation," in Neural Networks: Tricks of the Trade, 1998.

[4] T. Kurita, H. Asoh, S. Umeyama, S. Akaho, and A. Hosomi, "A structural learning by adding independent noises to hidden units," Proc. of IEEE Inter. Conf. on Neural Networks (ICNN'94), pp. 275-278, 1994.

[5] M. Sabri and T. Kurita, "Effect of additive noise for multi-layered Perceptron with autoencoders," IEICE Trans. Information and Systems, Vol. E100D, No. 7, pp. 1494-1504, 2017.

[6] H. Inayoshi and T. Kurita, "Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers," IEEE Workshop on Machine Learning for Signal Processing, pp. 141-146, 2005.

[7] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," Proc. of 2018 International Conference on Learning Representations (ICLR2018), 2018.

[8] Y. Tokozume, Y. Ushiku, and T. Harada, "Learning from Between-class Examples for Deep Sound Recognition," Proc. of 2018 International Conference on Learning Representations (ICLR2018), 2018.

[9] Y. Tokozume, Y. Ushiku, and T. Harada, "Between-class Learning for Image Classification," Proc. of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2018), 2018.

[10] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a siamese time delay neural network," in Advances in Neural Information Processing Systems, Vol. 6 (NIPS 1993), 1993.

[11] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2005), Vol. 1, pp. 539-546, 2005.

[12] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," Proc. of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2006), Vol. 2, pp. 1735-1742, 2006.

[13] E. Hoffer and N. Ailon, "Deep Metric Learning Using Triplet Network," in A. Feragen, M. Pelillo, and M. Loog (eds.), Similarity-Based Pattern Recognition, Lecture Notes in Computer Science, vol. 9370, Springer, Cham, 2015.

[14] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio, "Manifold Mixup: Better Representations by Interpolating Hidden States," Proc. of 2019 International Conference on Machine Learning (ICML2019), 2019.
