
arxiv: 1906.09739 · v1 · pith:IWU44F6Dnew · submitted 2019-06-24 · 💻 cs.CV

Mixup of Feature Maps in a Hidden Layer for Training of Convolutional Neural Network

Pith reviewed 2026-05-25 17:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords mixup · data augmentation · convolutional neural network · feature maps · hidden layer · Siamese network · generalization · image classification

The pith

Mixing feature maps in the first hidden layer of a CNN improves generalization more than mixing the input images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes extending the mixup data augmentation technique from raw images to feature maps inside a convolutional neural network. It implements this by feeding image pairs or triples through a Siamese or triplet architecture so that feature maps extracted after the first convolution can be linearly combined before further processing. Experiments indicate that this hidden-layer mixing, especially at the earliest convolutional stage, produces higher accuracy on samples drawn from outside the training distribution than standard image-level mixup. The goal is to enrich the internal representations rather than only the pixel-level training set, thereby strengthening robustness without adding new labeled data.

Core claim

The central claim is that mixup applied to the feature maps produced by the first convolution layer yields better recognition accuracy on out-of-distribution samples than the original mixup performed on the input images themselves.
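
The two mixing sites can be written side by side. This is a notational sketch, not the paper's own formulation: the symbols W (shared first-layer kernel), b (bias), φ (activation) and λ (mixing weight) are assumptions introduced here, and it assumes the feature maps are taken after the activation.

```latex
% Image-level mixup (Zhang et al. [7]): blend inputs and labels.
\[
\tilde{x} = \lambda x_i + (1-\lambda)\, x_j, \qquad
\tilde{y} = \lambda y_i + (1-\lambda)\, y_j, \qquad \lambda \in [0,1]
\]
% First-layer feature-map mixup: blend after the shared convolution and activation.
\[
\tilde{h} = \lambda\, \phi(W * x_i + b) + (1-\lambda)\, \phi(W * x_j + b)
\]
% Image-level mixup seen at the same layer: convolution is affine, so the blend moves inside \phi.
\[
\phi(W * \tilde{x} + b) = \phi\big(\lambda\,(W * x_i + b) + (1-\lambda)\,(W * x_j + b)\big)
\]
```

Because the convolution is affine and the weights sum to one, the two expressions coincide whenever φ is the identity. Under these assumptions, any difference between the two methods at the first layer comes from the nonlinear operations (activation, pooling, normalization) applied before the feature maps are mixed.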

What carries the argument

Siamese or triplet network architecture that enables mixing of corresponding feature maps at a chosen hidden layer while the CNN is trained.
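
A minimal sketch of the two-branch idea follows, using a toy 1-D convolution in place of the first convolution layer. The function names, the 2-tap kernel, and the choice of ReLU are illustrative assumptions; the paper's actual layers are not described on this page.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1-D cross-correlation, standing in for the first convolution layer."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def relu(v):
    return [max(0.0, a) for a in v]


def mix(a, b, lam):
    """Element-wise weighted average with weight lam on a."""
    return [lam * p + (1.0 - lam) * q for p, q in zip(a, b)]


def image_mixup_features(x_a, x_b, lam, kernel):
    """Conventional mixup: blend the inputs, then apply conv + ReLU once."""
    return relu(conv1d(mix(x_a, x_b, lam), kernel))


def conv1_mixup_features(x_a, x_b, lam, kernel):
    """Hidden-layer mixup: two branches share one kernel; blend their feature maps."""
    f_a = relu(conv1d(x_a, kernel))
    f_b = relu(conv1d(x_b, kernel))
    return mix(f_a, f_b, lam)


# Two toy "images" whose edges sit in opposite positions.
x_a = [0.0, 1.0, 0.0, 1.0]
x_b = [1.0, 0.0, 1.0, 0.0]
kernel = [1.0, -1.0]  # hypothetical edge-like filter

print(image_mixup_features(x_a, x_b, 0.5, kernel))  # [0.0, 0.0, 0.0]
print(conv1_mixup_features(x_a, x_b, 0.5, kernel))  # [0.5, 0.5, 0.5]
```

In this toy case the pixel-level blend is flat, so the filter responds with zeros, while the feature-level blend keeps the edge responses of both inputs. It is only an illustration of how the two sites can differ, not evidence for the paper's accuracy claim.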

If this is right

  • Mixup performed after the first convolution outperforms mixup performed on the raw pixels.
  • The improvement holds across multiple standard image-classification datasets.
  • The method requires no change to the underlying loss function beyond the standard mixup interpolation.
  • Later convolutional layers can also receive mixup but yield smaller gains than the first layer.
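
The point about the unchanged loss can be illustrated with a short sketch. It assumes a softmax cross-entropy objective, which this page does not confirm for the paper; the function names are hypothetical. Training against the interpolated label is then the same as taking the λ-weighted sum of the two ordinary cross-entropy terms.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, num_classes):
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


def mixup_loss(logits, label_a, label_b, lam):
    """Cross-entropy against the interpolated label lam*y_a + (1-lam)*y_b."""
    n = len(logits)
    soft = [lam * a + (1.0 - lam) * b
            for a, b in zip(one_hot(label_a, n), one_hot(label_b, n))]
    return cross_entropy(softmax(logits), soft)


logits = [2.0, 0.5, -1.0]  # output for one mixed sample (made-up numbers)
probs = softmax(logits)
combined = mixup_loss(logits, 0, 1, 0.7)
separate = (0.7 * cross_entropy(probs, one_hot(0, 3))
            + 0.3 * cross_entropy(probs, one_hot(1, 3)))
print(abs(combined - separate) < 1e-12)
```

The identity holds because cross-entropy is linear in the target, so the same loss code can serve whether the mixing happens on pixels or on feature maps.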

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same architecture could be used to test whether mixing at multiple layers simultaneously compounds the benefit.
  • The approach might combine with other feature-space regularizers such as dropout or batch-norm statistics.
  • If the first-layer mixup effect generalizes, it could be inserted into any CNN without redesigning the data loader.

Load-bearing premise

The Siamese or triplet structure isolates the benefit of mixing feature maps without changing the loss landscape or training dynamics in other ways.

What would settle it

A controlled experiment that applies the identical linear mixing coefficients directly to feature maps without using a Siamese or triplet network and measures whether the accuracy gain disappears.
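
One way such a control could be set up is sketched below: a single network computes feature maps for a batch once, and each sample is mixed with a permuted partner from the same batch. The toy layer, the function names, and the permutation scheme are assumptions made for illustration, not the paper's procedure.

```python
def first_layer(x, kernel):
    """Toy conv + ReLU standing in for the first convolution block."""
    k = len(kernel)
    return [max(0.0, sum(kernel[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]


def mix(a, b, lam):
    return [lam * p + (1.0 - lam) * q for p, q in zip(a, b)]


def two_branch_mix(x_a, x_b, lam, kernel):
    """Siamese-style: two weight-sharing branches, one per input."""
    return mix(first_layer(x_a, kernel), first_layer(x_b, kernel), lam)


def in_batch_mix(batch, perm, lam, kernel):
    """Single network: compute feature maps once, then mix each with its partner."""
    feats = [first_layer(x, kernel) for x in batch]
    return [mix(feats[i], feats[perm[i]], lam) for i in range(len(batch))]


batch = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
perm = [1, 2, 0]  # partner index for each sample (hypothetical pairing)
kernel = [1.0, -1.0]

single = in_batch_mix(batch, perm, 0.3, kernel)
paired = [two_branch_mix(batch[i], batch[perm[i]], 0.3, kernel)
          for i in range(len(batch))]
print(single == paired)
```

With shared weights the two forward passes produce the same mixed feature maps, so this sketch only shows that the mixing itself does not need a second branch. It says nothing about whether batch composition, normalization statistics, or other training details differ between the two setups, which is what the proposed experiment would have to measure.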

Figures

Figures reproduced from arXiv: 1906.09739 by Hideki Oki, Takio Kurita.

Figure 1: mixup applied for training samples. Mixup was introduced by Hongyi Zhang et al. [7] as a data augmentation method. The samples are generated by mixing two different training samples by simple weighted average. It is reported that this simple method achieves state-of-the-art performance on various datasets [7]. A similar method was introduced by Yuji Tokozume et al. [8,9]. They conducted a detail…
Figure 2: Siamese Network. The Siamese Network [10,11,12] consists of two identical sub-networks joined at their outputs as shown in …
Figure 3: conventional mixup
Figure 4: Apply mixup to 1st convolution layer. The mixup proposed by H. Zhang et al. generates intermediate images by mixing the pairs of the original training images as shown in …
Figure 5: accuracy of classification. From the comparison experiments, we observed that the effect of mixup increases for this dataset as the value of α becomes larger. Namely, the accuracies for the cases where the mixing parameter α was less than 0.7 were lower than those for the cases with α = 0.7 or α = 1.0. In the this …
Figure 6: feature map extracted by the first convolution layer. From this figure, we can find that the feature maps obtained by the "cnn (mixup3)" model are influenced by the brightness of the original image. On the other hand, the shape edges are extracted in the feature maps obtained by the "cnn (conv1-mixup 3)" model regardless of the brightness of the original image. This result shows some improvements of the…
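
The Figure 5 caption refers to a mixing parameter α. Assuming α plays the role it has in Zhang et al. [7], where the mixing weight is drawn as λ ~ Beta(α, α), the sketch below shows how α shapes the weights. That reading of α, the helper names, and the 0.25 to 0.75 window are assumptions made for illustration.

```python
import random


def sample_lambda(alpha, rng):
    """Draw one mixing weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def fraction_near_half(alpha, n=10000, seed=0):
    """Share of draws that blend both samples substantially (0.25 < lambda < 0.75)."""
    rng = random.Random(seed)
    draws = [sample_lambda(alpha, rng) for _ in range(n)]
    return sum(0.25 < d < 0.75 for d in draws) / n


for alpha in (0.2, 0.7, 1.0):
    print(alpha, round(fraction_near_half(alpha), 2))
```

Small α concentrates λ near 0 or 1, so most mixed samples stay close to one of the originals; α = 1.0 gives a uniform weight. Under this reading, larger α means stronger mixing on average.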
Original abstract

The deep Convolutional Neural Network (CNN) became very popular as a fundamental technique for image classification and objects recognition. To improve the recognition accuracy for the more complex tasks, deeper networks have being introduced. However, the recognition accuracy of the trained deep CNN drastically decreases for the samples which are obtained from the outside regions of the training samples. To improve the generalization ability for such samples, Krizhevsky et al. proposed to generate additional samples through transformations from the existing samples and to make the training samples richer. This method is known as data augmentation. Hongyi Zhang et al. introduced data augmentation method called mixup which achieves state-of-the-art performance in various datasets. Mixup generates new samples by mixing two different training samples. Mixing of the two images is implemented with simple image morphing. In this paper, we propose to apply mixup to the feature maps in a hidden layer. To implement the mixup in the hidden layer we use the Siamese network or the triplet network architecture to mix feature maps. From the experimental comparison, it is observed that the mixup of the feature maps obtained from the first convolution layer is more effective than the original image mixup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes extending the mixup data-augmentation technique from input images to feature maps inside a CNN hidden layer. It implements the hidden-layer mixing via Siamese or triplet network branches that share weights, and reports that mixing the feature maps from the first convolutional layer yields higher accuracy than standard image-level mixup.

Significance. If the superiority can be isolated to the mixing location rather than to the altered architecture and loss, the result would supply a concrete, testable extension of mixup to internal representations and could motivate further work on where in the network to perform interpolation-based augmentation.

major comments (1)
  1. [Abstract / Method] Abstract and method description: the headline claim that first-convolution feature-map mixup is more effective than image mixup rests on comparisons performed inside Siamese or triplet architectures; no ablation is described that keeps the network topology, loss, and training dynamics fixed while moving only the site of mixing. Consequently the reported gain cannot be unambiguously attributed to the hidden-layer location.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'deeper networks have being introduced' contains a grammatical error.
  2. [Abstract] Abstract: the experimental comparison is asserted without any mention of the datasets, architectures, metrics, or number of runs, making it impossible for a reader to assess the strength of the evidence from the provided text alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make.

  1. Referee: [Abstract / Method] Abstract and method description: the headline claim that first-convolution feature-map mixup is more effective than image mixup rests on comparisons performed inside Siamese or triplet architectures; no ablation is described that keeps the network topology, loss, and training dynamics fixed while moving only the site of mixing. Consequently the reported gain cannot be unambiguously attributed to the hidden-layer location.

    Authors: We agree that the current experimental design does not fully isolate the effect of the mixing location from the change in architecture. The Siamese/triplet structure is necessary to enable feature-map mixing with shared weights, but this introduces a confound relative to standard image mixup on a conventional CNN. To resolve this, we will add a controlled ablation that applies image-level mixup inside the identical Siamese/triplet architecture (same topology, loss, and training dynamics) and directly compares it to feature-map mixup within that same architecture. The abstract and method sections will be updated to present this comparison and clarify the attribution of gains. revision: yes
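
The ablation described in this response could be organized around a single forward function in which only the mixing site changes. The sketch below is an assumption-laden illustration: `forward_with_mix`, the `mix_site` index, and the two toy layers are hypothetical and stand in for the real network.

```python
def mix(a, b, lam):
    return [lam * p + (1.0 - lam) * q for p, q in zip(a, b)]


def forward_with_mix(x_a, x_b, lam, layers, mix_site):
    """Run two weight-sharing branches through `layers`, merging at `mix_site`.

    mix_site = 0 blends the inputs (image-level mixup);
    mix_site = k blends the outputs of layers[k - 1].
    """
    if not 0 <= mix_site <= len(layers):
        raise ValueError("mix_site must be between 0 and len(layers)")
    h_a, h_b = x_a, x_b
    for layer in layers[:mix_site]:   # shared layers applied to each branch
        h_a, h_b = layer(h_a), layer(h_b)
    h = mix(h_a, h_b, lam)            # the only step that differs across conditions
    for layer in layers[mix_site:]:   # remaining layers applied to the merged signal
        h = layer(h)
    return h


def edge_relu(x):
    """Toy first block: 2-tap difference filter followed by ReLU."""
    return [max(0.0, x[i] - x[i + 1]) for i in range(len(x) - 1)]


def total(h):
    """Toy linear head: sum of activations."""
    return [sum(h)]


layers = [edge_relu, total]
x_a = [0.0, 1.0, 0.0, 1.0]
x_b = [1.0, 0.0, 1.0, 0.0]
for site in range(len(layers) + 1):
    print(site, forward_with_mix(x_a, x_b, 0.5, layers, site))
```

Keeping the layers, inputs, and λ fixed while sweeping `mix_site` is what lets any difference in the output be attributed to the mixing location alone.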

Circularity Check

0 steps flagged

No circularity; empirical proposal with no derivation reducing to inputs


The paper proposes applying mixup to hidden-layer feature maps via Siamese/triplet architectures and reports experimental accuracy comparisons. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. The central claim rests on direct empirical results rather than any self-referential construction. The architecture change noted in the skeptic headline is a methodological limitation but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new mathematical axioms or invented entities are introduced; the work is an empirical extension of existing mixup and CNN training assumptions.

pith-pipeline@v0.9.0 · 5738 in / 849 out tokens · 29111 ms · 2026-05-25T17:59:38.108702+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    ImageNet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Proc. Conf. Neural Information Processing Systems, pp.1097-1105, 2012

  2. [2]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv:1312.6199, 2014

  3. [3]

    Transformation invariance in pattern recognition tangent distance and tangent propagation,

    P. Simard, Y. LeCun, J. Denker, and B. Victorri, “Transformation invariance in pattern recognition tangent distance and tangent propagation,” in Neural networks: tricks of the trade, 1998

  4. [4]

    A structural learning by adding independent noises to hidden units,

    T. Kurita, H. Asoh, S. Umeyama, S. Akaho, and A. Hosomi, “A structural learning by adding independent noises to hidden units,” Proc. of IEEE Inter. Conf. on Neural Networks (ICNN’94), pp. 275-278, 1994

  5. [5]

    Effect of additive noise for multi-layered Perceptron with autoencoders,

    M. Sabri and T. Kurita, “Effect of additive noise for multi-layered Perceptron with autoencoders,” IEICE Trans. Information and Systems, Vol. E100D, No. 7, pp. 1494-1504, 2017

  6. [6]

    Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers

    H. Inayoshi and T. Kurita, “Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers,” IEEE Workshop on Machine Learning for Signal Processing, pp. 141-146, 2005

  7. [7]

    mixup: Beyond Empirical Risk Minimization,

    H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” Proc. of 2018 International Conference on Learning Representations (ICLR2018), 2018

  8. [8]

    Learning from Between-class Examples for Deep Sound Recognition

    Y. Tokozume, Y. Ushiku, and T. Harada, “Learning from Between-class Examples for Deep Sound Recognition,” Proc. of 2018 International Conference on Learning Representations (ICLR2018), 2018

  9. [9]

    Between-class Learning for Image Classification

    Y. Tokozume, Y. Ushiku, and T. Harada, “Between-class Learning for Image Classification,” Proc. of 2018 IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR2018), 2018

  10. [10]

    Signature verification using a siamese time delay neural network,

    J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a siamese time delay neural network,” in Advances in Neural Information Processing Systems, Vol. 6 (NIPS 1993), 1993

  11. [11]

    Learning a similarity metric discriminatively, with application to face verification,

    S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2005), Vol. 1, pp. 539-546, 2005

  12. [12]

    Dimensionality reduction by learning an invariant mapping,

    R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” Proc. of 2006 IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR2006), Vol. 2, pp. 1735-1742, 2006

  13. [13]

    Deep Metric Learning Using Triplet Network

    E. Hoffer and N. Ailon, “Deep Metric Learning Using Triplet Network,” in A. Feragen, M. Pelillo, and M. Loog (eds), Similarity-Based Pattern Recognition, Lecture Notes in Computer Science, vol. 9370, Springer, Cham, 2015

  14. [14]

    Manifold Mixup: Better Representations by Interpolating Hidden States

    V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio, “Manifold Mixup: Better Representations by Interpolating Hidden States,” Proc. of 2019 International Conference on Machine Learning (ICML2019), 2019