arxiv: 2505.22226 · v2 · submitted 2025-05-28 · 💻 cs.CV

Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products

Pith reviewed 2026-05-19 13:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords Hadamard product · feature expansion · adaptive module · neural architecture search · efficient CNN · image classification · vision models · softsign normalization

The pith

The Adaptive Cross-Hadamard module expands image features expressively without adding convolutional parameters by using differentiable sampling and softsign normalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent theoretical work shows that Hadamard products can create nonlinear representations and implicit high-dimensional mappings in deep networks, but their practical use in resource-limited vision models has remained limited. The paper addresses this with the Adaptive Cross-Hadamard module, which adds learnability through differentiable discrete sampling of cross terms and dynamic softsign normalization. These additions enable efficient feature reuse while avoiding extra convolutional parameters and keeping gradients stable. When the module is placed in Hadaptive-Net, a network family found by neural architecture search, the experiments report state-of-the-art accuracy/speed trade-offs on image classification benchmarks. The work positions Hadamard operations as concrete, efficient building blocks for vision architectures.

Core claim

The paper presents the Adaptive Cross-Hadamard (ACH) module as a novel operator that embeds learnability through differentiable discrete sampling and dynamic softsign normalization. This facilitates highly efficient feature reuse without incurring additional convolutional parameters, while ensuring stable gradient flow. Integrated into Hadaptive-Net via neural architecture search, the approach achieves state-of-the-art accuracy/speed trade-offs on image classification tasks, establishing Hadamard operations as specific building blocks for efficient vision models.

What carries the argument

The Adaptive Cross-Hadamard (ACH) module, which embeds learnability into Hadamard products via differentiable discrete sampling of cross terms and dynamic softsign normalization to enable parameter-free feature expansion.
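To make the idea of parameter-free expansion concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes that a "cross term" is the element-wise product of two distinct selected channels and that normalization is the standard softsign x / (1 + |x|); the paper's dynamic softsign is not specified in this summary. The names `softsign` and `cross_hadamard_expand` are hypothetical.

```python
from itertools import combinations


def softsign(x):
    # Standard softsign (an assumption; the paper's "dynamic" variant may differ).
    # Maps any real value into the open interval (-1, 1).
    return x / (1.0 + abs(x))


def cross_hadamard_expand(channels, selected):
    """Append normalized pairwise products of the selected channels.

    channels: list of equal-length lists, one flattened feature map per channel.
    selected: indices of the channels that take part in cross terms.
    """
    expanded = [list(ch) for ch in channels]  # keep the original features
    for i, j in combinations(selected, 2):  # every unordered pair of channels
        product = [a * b for a, b in zip(channels[i], channels[j])]
        expanded.append([softsign(v) for v in product])
    return expanded


features = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]]
out = cross_hadamard_expand(features, selected=[0, 1, 2])
print(len(out))  # 3 original channels plus 3 cross terms
```

The expansion step itself multiplies existing feature maps and so uses no weights; any learnable parameters would sit in the channel selection and normalization, which this sketch leaves out.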

If this is right

  • Hadamard operations become viable as specific building blocks for efficient vision models.
  • Feature expansion can occur with high expressivity while avoiding extra convolutional parameters.
  • Neural architecture search can automatically place such adaptive modules for optimal efficiency.
  • Stable gradient flow supports reliable training of networks that incorporate these operators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other vision tasks such as object detection or segmentation where parameter efficiency matters.
  • Similar differentiable sampling and normalization tricks might improve other product-based operators in neural networks.
  • Automated discovery of integration points via NAS could lower the manual effort needed to design efficient architectures.
  • Gains might increase further when the module is combined with orthogonal efficiency methods like pruning or quantization.

Load-bearing premise

That the ACH module's differentiable discrete sampling and dynamic softsign normalization can be integrated into standard CNN architectures via NAS without adding convolutional parameters, while delivering measurable efficiency gains and stable training that exceed conventional feature expansion methods.

What would settle it

An experiment that places a conventional feature expansion method into the identical NAS-searched architecture and shows equal or superior accuracy-speed trade-offs on image classification benchmarks would falsify the superiority claim for ACH.

Figures

Figures reproduced from arXiv: 2505.22226 by Hao Shi, Liang Chen, Qingshan Guo, Xi Zhang, Xuyang Zhang.

Figure 1
Figure 1: The trade-off between FLOPs/latency and top-1 accuracy. These diagrams compare the efficiency of different state-of-the-art models with our Hadaptive-Net on the image classification task. Detailed experimental configurations are provided in Section 5.3.
Figure 2
Figure 2: Illustration of the ACH module. Input features X undergo linear transformation and batch normalization. An ECA module generates channel-wise scores, with Gumbel-Topk sampling (training) or top-k selection (inference) determining active channels. Selected features Z undergo cross-Hadamard product, are normalized by dynamic softsign, and are then concatenated with the original features.
Figure 3
Figure 3: Component-wise ablation. Illustration of component-wise ablation variations with component-accuracy table. (1) and (2) represent removal of pointwise convolution and ECA module, respectively. (3) represents the replacement of learnable selection with fixed channel combinations, and (4) represents the substitution of cross-Hadamard normalization with standard batch normalization.
Figure 4
Figure 4: Hadaptive-Net architecture overview.
Figure 5
Figure 5: Comparison of computational efficiency under different input channel sizes.
Figure 6
Figure 6: Comparison of computational efficiency under different expansion ratios.
Figure 7
Figure 7: Normalized difference heatmap of optimization approaches runtime. Color-coded visualization of relative performance between direct-indexing (A) and parity-balanced (B) approaches using (A − B)/(A + B + ϵ), where red indicates A is slower (B more efficient) and blue indicates the opposite. (a) Batch size versus spatial dimensions scaling. (b) Channel count versus spatial dimensions scaling.
Figure 8
Figure 8: Network visualization via Grad-CAM across layers (1). Simple scenario: ladybug. Downward arrows denote downsampling layers.
Figure 9
Figure 9: Network visualization via Grad-CAM across layers (2). Complex scenario: mushroom. Downward arrows denote downsampling layers. The visualization examines the changes brought by the ACH module compared to a conventional convolutional network; for a clearer and more intuitive comparison, the baseline is a modified version of Hadaptive-Net-S in which all ACH modules are replaced with Ghost modules.
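The relative-runtime score named in the Figure 7 caption, (A − B)/(A + B + ϵ), can be sketched as a small helper. The function name and the default value of ϵ are assumptions made for illustration; the caption does not give the value used in the paper.

```python
def normalized_difference(a, b, eps=1e-12):
    """Relative runtime gap between approach A and approach B.

    Positive when A is slower than B, negative when A is faster,
    and bounded in magnitude by 1 for non-negative runtimes.
    eps (an assumed value) guards against division by zero.
    """
    return (a - b) / (a + b + eps)


# Hypothetical timings in milliseconds, for illustration only.
print(normalized_difference(3.0, 1.0))
```

Normalizing by the sum keeps cells comparable across tensor sizes whose absolute runtimes differ by orders of magnitude.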
Original abstract

Recent theoretical advances reveal that the Hadamard product induces nonlinear representations and implicit high-dimensional mappings for the field of deep learning, yet their practical deployment in resource-constrained vision models remains largely unexplored. To address this gap, we introduce the Adaptive Cross-Hadamard (ACH) module, a novel operator that embeds learnability through differentiable discrete sampling and dynamic softsign normalization. This facilitates highly efficient feature reuse without incurring additional convolutional parameters, while ensuring stable gradient flow. Integrated into Hadaptive-Net (Hadamard Adaptive Network) via neural architecture search, our approach achieves unprecedented efficiency. Comprehensive experiments demonstrate state-of-the-art accuracy/speed trade-offs on image classification tasks, establishing Hadamard operations as specific building blocks for efficient vision models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Adaptive Cross-Hadamard (ACH) module, which embeds learnability via differentiable discrete sampling and dynamic softsign normalization to enable efficient feature reuse and expansion in CNNs without additional convolutional parameters while maintaining stable gradient flow. The module is incorporated into Hadaptive-Net through neural architecture search, with claims of state-of-the-art accuracy/speed trade-offs on image classification tasks that position Hadamard operations as effective building blocks for efficient vision models.

Significance. If the zero-additional-convolutional-parameter claim and the resulting efficiency gains are rigorously verified with quantitative baselines, this could offer a practical operator for resource-constrained vision models, extending theoretical insights on Hadamard products into deployable architectures. The NAS integration and emphasis on stable training represent potential strengths if supported by reproducible experiments.

major comments (2)
  1. [Abstract] The central efficiency claim that the ACH module enables 'highly efficient feature reuse without incurring additional convolutional parameters' is load-bearing for the SOTA accuracy/speed results but is not accompanied by any parameter-count breakdown, comparison to standard expansion baselines (e.g., 1x1 convolutions or depthwise separable layers), or explicit accounting for the learnable sampling parameters and softsign normalization weights. This omission prevents verification that the overhead is truly negligible or zero.
  2. [Abstract] The assertion of 'state-of-the-art accuracy/speed trade-offs' and 'unprecedented efficiency' lacks any reported quantitative metrics, error bars, dataset names, or ablation studies on the contribution of the differentiable discrete sampling versus conventional feature expansion methods. Without these, the empirical support for the Hadaptive-Net results cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract would benefit from naming the specific image classification datasets and baseline models used in the comprehensive experiments.
  2. Clarify the exact implementation of 'differentiable discrete sampling' to ensure it does not implicitly rely on additional convolutional layers for the sampling logits.
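For context on the second minor comment, the Figure 2 caption names Gumbel-Topk sampling during training and plain top-k selection at inference. The sketch below shows only that forward selection step under stated assumptions: scores are treated as logits, the temperature divides the scores before noise is added, and the gradient path (for example a straight-through estimator) is omitted entirely. The function name and its parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def gumbel_topk(scores, k, temperature=1.0, training=True, rng=random):
    """Return the sorted indices of k channels chosen from per-channel scores."""
    if training:
        keys = []
        for s in scores:
            u = max(rng.random(), 1e-12)  # avoid log(0)
            gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
            # Assumed placement of the temperature: a lower value makes the
            # scores dominate the noise, approaching deterministic top-k.
            keys.append(s / temperature + gumbel)
    else:
        keys = list(scores)  # inference: plain top-k, no noise
    ranked = sorted(range(len(scores)), key=lambda i: keys[i], reverse=True)
    return sorted(ranked[:k])


channel_scores = [0.1, 0.9, 0.5, 0.7]
print(gumbel_topk(channel_scores, k=2, training=False))
```

In this sketch the selection logic itself holds no weights; whether the logits come from a parameter-free source is exactly what the comment asks the authors to clarify.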

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major comment below and will revise the manuscript to provide greater clarity and empirical support for the claims.

Point-by-point responses
  1. Referee: [Abstract] The central efficiency claim that the ACH module enables 'highly efficient feature reuse without incurring additional convolutional parameters' is load-bearing for the SOTA accuracy/speed results but is not accompanied by any parameter-count breakdown, comparison to standard expansion baselines (e.g., 1x1 convolutions or depthwise separable layers), or explicit accounting for the learnable sampling parameters and softsign normalization weights. This omission prevents verification that the overhead is truly negligible or zero.

    Authors: We agree that an explicit parameter breakdown would strengthen the presentation. The ACH module achieves feature expansion exclusively through Hadamard products and does not introduce any new convolutional layers or convolutional kernel weights, which is the precise meaning of the 'additional convolutional parameters' claim. The differentiable discrete sampling operates on a small set of per-channel selection parameters, and the dynamic softsign normalization uses lightweight per-channel scaling factors; neither component adds convolutional parameters. In the revised manuscript we will insert a dedicated table (in the methods or experiments section) that reports total parameter counts and FLOPs for Hadaptive-Net versus standard expansion baselines such as 1x1 convolutions and depthwise-separable blocks, together with an itemized accounting of the non-convolutional learnable parameters introduced by ACH. revision: yes

  2. Referee: [Abstract] The assertion of 'state-of-the-art accuracy/speed trade-offs' and 'unprecedented efficiency' lacks any reported quantitative metrics, error bars, dataset names, or ablation studies on the contribution of the differentiable discrete sampling versus conventional feature expansion methods. Without these, the empirical support for the Hadaptive-Net results cannot be assessed.

    Authors: The abstract is intentionally concise, but the full manuscript already contains the requested information: quantitative accuracy and latency results on CIFAR-10/100 and ImageNet, multiple-run error bars in the main tables, and ablations isolating the differentiable sampling and softsign normalization. To address the referee's concern directly, we will augment the abstract with the key headline numbers (e.g., top-1 accuracy and throughput on ImageNet) and will add a short sentence referencing the ablation study that quantifies the contribution of the differentiable discrete sampling relative to conventional expansion operators. revision: yes
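The kind of accounting the referee asks for can be illustrated with simple counting, offered here as a rough sketch and not as the paper's figures. It assumes that k selected channels yield one cross term per unordered pair, and it compares only against the weight count of a bias-free 1x1 convolution. It deliberately ignores the linear transformation, batch normalization, and ECA module that the Figure 2 caption places in front of the product, so it says nothing about the module's total cost. All function names are hypothetical.

```python
from math import comb


def cross_terms(k):
    # Number of unordered pairs among k selected channels.
    return comb(k, 2)


def pointwise_conv_params(c_in, c_out, bias=False):
    # A 1x1 convolution has one weight per (input, output) channel pair.
    return c_in * c_out + (c_out if bias else 0)


def min_selected_for(extra_channels):
    # Smallest k whose pairwise products give at least extra_channels maps.
    k = 2
    while cross_terms(k) < extra_channels:
        k += 1
    return k


# Illustrative sizes only: expand a 64-channel input by 64 extra channels.
k = min_selected_for(64)
print(k, cross_terms(k), pointwise_conv_params(64, 64))
```

Because the number of pairs grows quadratically in k, a handful of selected channels is enough to generate many expanded channels, which is why an itemized count of the remaining learnable parameters matters for the comparison.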

Circularity Check

0 steps flagged

No significant circularity in ACH module or Hadaptive-Net derivation

full rationale

The paper presents the Adaptive Cross-Hadamard (ACH) module as an independent operator whose learnability is introduced via differentiable discrete sampling and dynamic softsign normalization, enabling feature reuse without additional convolutional parameters. This construction is then integrated into Hadaptive-Net through neural architecture search, with performance claims resting on empirical evaluation across image classification benchmarks rather than any self-referential equations or fitted quantities. No load-bearing step reduces by construction to the paper's own inputs, no self-citation chain justifies a uniqueness theorem, and no ansatz is smuggled through prior work. The derivation chain is self-contained against external benchmarks and experimental results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that Hadamard products induce useful nonlinear mappings and on the modeling choice that learnable sampling plus softsign normalization can be realized without extra convolutional cost. No new invented entities are postulated.

free parameters (1)
  • learnable sampling parameters
    Differentiable discrete sampling introduces parameters that are optimized during training to select feature combinations.
axioms (1)
  • domain assumption Hadamard product induces nonlinear representations and implicit high-dimensional mappings
    Invoked in the opening sentence as the theoretical foundation for the work.
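A short worked identity shows why this assumption is plausible in the simplest case. It is a standard algebraic fact and not a derivation taken from the paper; the symbols $W_1$, $W_2$, $w_{1,k}$, and $w_{2,k}$ are introduced here only for illustration and denote two linear maps and their $k$-th rows.

```latex
% Hadamard product of two linear maps of the same input x \in \mathbb{R}^d
z = (W_1 x) \odot (W_2 x), \qquad
z_k = (w_{1,k}^{\top} x)\,(w_{2,k}^{\top} x)
    = x^{\top} \bigl( w_{1,k}\, w_{2,k}^{\top} \bigr) x
    = \sum_{i=1}^{d} \sum_{j=1}^{d} w_{1,k,i}\, w_{2,k,j}\, x_i x_j
```

Each output $z_k$ is therefore a linear function of the $d^2$ degree-2 monomials $x_i x_j$, so the product acts on an implicit quadratic feature space without forming it explicitly.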


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 8 internal anchors

  1. [2]

    Layer Normalization

    URL http://arxiv.org/abs/1607.06450. Yoshua Bengio. Estimating or propagating gradients through stochastic neurons. arXiv preprint arXiv:1305.2982,

  2. [3]

    Gary Chan

    Published as a conference paper at ICLR 2026. Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, and S.-H. Gary Chan. Run, don’t walk: Chasing higher flops for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12021–12031, June

  3. [4]

    URL https://doi.org/10.1109/TPAMI.2025.3560423

    doi: 10.1109/TPAMI.2025.3560423. URL https://doi.org/10.1109/TPAMI.2025.3560423. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,

  4. [5]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

    Version: 1.20.1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tec...

  5. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423. Xiaohan Ding, X. Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13728–13737,

  6. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

  7. [8]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752,

  8. [9]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    doi: 10.48550/ARXIV.2312.00752. URL https://doi.org/10.48550/arXiv.2312.00752. Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. Nat. Bur. Standards Appl. Math. Ser. 33,

  9. [10]

    Searching for mobilenetv3

    Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1314–1324,

  10. [11]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861,

  11. [12]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. ArXiv, abs/1602.07360,

  12. [13]

    Hadamard product for low-rank bilinear pooling

    Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,

  13. [14]

    Siyuan Li, Zedong Wang, Zicheng Liu, Cheng Tan, Haitao Lin, Di Wu, Zhiyuan Chen, Jiangbin Zheng, and Stan Z. Li. Moganet: Multi-order gated aggregation network. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  14. [16]

    Microsoft COCO: Common Objects in Context

    URL http://arxiv.org/abs/1405.0312. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single Shot MultiBox Detector. arXiv e-prints, art. arXiv:1512.02325, December

  15. [17]

    SSD: Single Shot MultiBox Detector

    doi: 10.48550/arXiv.1512.02325. Zhenhua Liu, Zhiwei Hao, Kai Han, Yehui Tang, and Yunhe Wang. Ghostnetv3: Exploring the training strategies for compact models. ArXiv, abs/2404.11202,

  16. [18]

    Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer

    Sachin Mehta and Mohammad Rastegari. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

  17. [19]

    Manning, Andrew Y

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washingto...

  18. [20]

    URL https://doi.org/10.18653/v1/d13-1170

    doi: 10.18653/V1/D13-1170. URL https://doi.org/10.18653/v1/d13-1170. Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR,

  19. [21]

    Ghostnetv2: Enhance cheap operation with long-range attention. ArXiv, abs/2211.12905,

    Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chaoting Xu, and Yunhe Wang. Ghostnetv2: Enhance cheap operation with long-range attention. ArXiv, abs/2211.12905,

  20. [22]

    Pavan Kumar Anasosalu Vasu, James Gregory Gabriel, Jeff J

    URL https://api.semanticscholar.org/CorpusID:253801665. Pavan Kumar Anasosalu Vasu, James Gregory Gabriel, Jeff J. Zhu, Oncel Tuzel, and Anurag Ranjan. Mobileone: An improved one millisecond mobile backbone. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7907–7917,

  21. [23]

    Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn A. Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orl...

  22. [24]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

    doi: 10.18653/V1/N18-1101. URL https://doi.org/10.18653/v1/n18-1101. Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pa...

  23. [25]

    Vision mamba: Efficient visual representation learning with bidirectional state space model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27,

  24. [26]

    URL https://openreview.net/forum?id=YbHCqn4qF4. A APPENDIX. A.1 TRAINING MECHANISM. τ adjustment: We implement distinct temperature control mechanisms for ACH modules versus NAS due to fundamental differences in their training paradigms. For ACH modules distributed across network layers, which process heterogeneous features and semantics, we deliberately desig...

  25. [27]

    In the process of ACH training and inference, we compute a standard cross-Hadamard product $Z_i \odot Z_j$

    and LayerNorm (Ba et al., 2016), have an a priori assumption that the statistical mean and statistical variance of the tensors they receive are knowable and traceable, which constitutes the basis of model convergence. In the process of ACH training and inference, we compute a standard cross-Hadamard product $Z_i \odot Z_j$. In previous machine learni...

  26. [28]

    If the self-referring Hadamard product is deformed, for example into $\phi_1(Z) \odot \phi_2(Z)$, let $\phi$ be a linear transformation operator whose corresponding matrix forms are $X_1, X_2$ ($X \in \mathbb{R}^{m \times n}$) and whose bias vectors are $b_1, b_2$ ($b \in \mathbb{R}^{m}$). Then: $\mathbb{E}[\phi(Z)] = \mathbb{E}[XZ + b] = \mu \cdot \frac{\sum_{i}^{m} \sum_{j}^{n} X_{i,j}}{m} + \mathbb{E}[b]$. For the variance, since $Z$ can approximate a normal distribution, here we assume that its elements are i.i.d...

  27. [29]

    The net- Table 10: Neural Architecture Search Result (a). Compared with different kernel sizes

    To implement Ghost and ACH module with adaptability, we design the Adaptive Bottleneck that can decide the expansion layer of the bottleneck manually. The net- Table 10: Neural Architecture Search Result (a). Compared with different kernel sizes. Reaching 67.55% top1-acc as result. Channels Ghost Conf. ACH Con...

  28. [30]

    Object Detection - Training Protocol: The base learning rate of 0.02 corresponds to a batch size of 64 distributed across 5 GPUs, scaled linearly according to the batch size

    All tests used ONNX Runtime 1.16.0 with default execution providers. Object Detection - Training Protocol: The base learning rate of 0.02 corresponds to a batch size of 64 distributed across 5 GPUs, scaled linearly according to the batch size. We apply 3-epoch linear warmup and reduce the learning rate to 1e-5 via cosine scheduling. Data augmentation incl...

  29. [31]

    To systematically evaluate these methods under varying tensor configurations (batch/channel dimensions versus spatial sizes), we conducted comparative experiments using square matrices (same sized height & width). See fig. 7 for the experiment details and results. Both algorithms demonstrate relatively stable performance across varying batch sizes, indi...

  30. [32]

    and 10% MNLI datasets (Williams et al., 2018). The models were evaluated on two standard natural language understanding benchmarks: the Stanford Sentiment Treebank (SST-2) for binary sentiment classification and the Multi-Genre Natural Language Inference (MNLI) dataset for textual entailment. For SST-2, the model was trained and evaluated on the full da...

  31. [33]

    The models, which followed a BERT-base architecture (12 layers, 12 attention heads, 768-dimensional hidden states), were initialized with random weights

    The optimization used a learning rate of 2e-5 with a linear warmup over the first 10% of the training steps and weight decay of 0.01. The models, which followed a BERT-base architecture (12 layers, 12 attention heads, 768-dimensional hidden states), were initialized with random weights. Input sequences were tokenized using the 'bert-base-uncased' tokenize...