pith. machine review for the scientific record.

arxiv: 2006.03654 · v6 · submitted 2020-06-05 · 💻 cs.CL · cs.LG

Recognition: 3 theorem links

· Lean Theorem

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:45 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords DeBERTa · disentangled attention · masked language modeling · SuperGLUE · pre-trained language models · natural language understanding · BERT · RoBERTa

The pith

DeBERTa uses separate vectors for word content and position to compute attention, plus absolute positions in the mask decoder, yielding better NLP performance than RoBERTa with less data and the first single-model SuperGLUE score above the human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeBERTa, a pre-trained language model that refines BERT and RoBERTa through two changes: each token is encoded with distinct content and position vectors, and attention is split into separate matrices for contents and relative positions; an enhanced mask decoder then adds absolute position information when predicting masked tokens. These modifications, combined with virtual adversarial training at fine-tuning time, produce consistent gains on downstream tasks. A DeBERTa model trained on half the data used for RoBERTa-Large improves accuracy on MNLI, SQuAD v2.0, and RACE; when scaled to 1.5 billion parameters the single model reaches a SuperGLUE macro-average of 89.9, exceeding the human baseline of 89.8.

Core claim

DeBERTa represents each word with two vectors that separately encode its content and its position, computes attention weights through disentangled matrices over contents and relative positions, and adds an enhanced mask decoder that injects absolute positions into the prediction of masked tokens. The paper claims these changes improve both pre-training efficiency and downstream accuracy on natural language understanding and generation tasks. When scaled, the model achieves a macro-average score of 89.9 on SuperGLUE, surpassing the human baseline of 89.8 for the first time with a single model.

What carries the argument

Disentangled attention, in which each word is represented by separate content and position vectors and attention weights are computed with distinct matrices for content and relative-position information.
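A minimal NumPy sketch of the disentangled score, assuming the paper's decomposition into content-to-content, content-to-position, and position-to-content terms with the clipped relative distance δ(i, j); the dimensions, random weights, and single head are illustrative stand-ins, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8   # hidden size per head (illustrative)
n = 5   # sequence length
k = 4   # maximum relative distance

# Content projections for one head (random weights stand in for learned ones)
H = rng.normal(size=(n, d))                 # token content states
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Qc, Kc = H @ Wq, H @ Wk

# Shared relative-position embedding table, projected separately for Q and K
P = rng.normal(size=(2 * k, d))             # embeddings for clipped distances
Wqr, Wkr = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Qr, Kr = P @ Wqr, P @ Wkr

def delta(i, j, k):
    """Clipped relative distance: 0 if i-j <= -k, 2k-1 if i-j >= k, else i-j+k."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k

# Score = content-to-content + content-to-position + position-to-content,
# scaled by 1/sqrt(3d) because three terms are summed.
A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        c2c = Qc[i] @ Kc[j]
        c2p = Qc[i] @ Kr[delta(i, j, k)]
        p2c = Kc[j] @ Qr[delta(j, i, k)]
        A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

# Softmax over keys; the full model would then mix content value vectors.
weights = np.exp(A - A.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
```

The 1/√(3d) scaling reflects the three summed score terms; everything downstream of the softmax proceeds as in standard attention.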

If this is right

  • A model trained on half the data can still exceed the accuracy of prior RoBERTa models on MNLI, SQuAD v2.0, and RACE.
  • Scaling the architecture to 48 layers and 1.5 billion parameters produces a single-model SuperGLUE macro-average above the human baseline.
  • Adding virtual adversarial training during fine-tuning further improves generalization on the same benchmarks.
  • An ensemble of DeBERTa models widens the margin over the human baseline on the SuperGLUE leaderboard.
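The virtual adversarial training that the fine-tuning points lean on can be sketched generically. The sketch follows the Miyato et al. recipe the paper builds on: find a small input perturbation that maximally shifts the model's output distribution, then penalize that shift. The toy linear model, finite-difference gradient, and radii below are illustrative stand-ins; the paper's SiFT variant additionally perturbs normalized embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL divergence between two distributions (eps guards the logs)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

# Toy "model": logits are a linear function of an input embedding.
W = rng.normal(size=(8, 3))
def model(x):
    return softmax(x @ W)

x = rng.normal(size=(8,))
p = model(x)                        # prediction on the clean input

# Estimate the perturbation direction that most increases
# KL(model(x) || model(x + r)), then normalize to a small radius.
d0 = rng.normal(size=x.shape)
d0 /= np.linalg.norm(d0)
xi, eps_radius = 1e-3, 0.1

# Finite-difference gradient of the KL w.r.t. the perturbation (illustrative;
# a real implementation would backpropagate instead).
grad = np.zeros_like(x)
base = kl(p, model(x + xi * d0))
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = 1e-5
    grad[i] = (kl(p, model(x + xi * d0 + e)) - base) / 1e-5

r_adv = eps_radius * grad / (np.linalg.norm(grad) + 1e-12)

# The VAT loss term regularizes the model toward consistency under r_adv.
vat_loss = kl(p, model(x + r_adv))
```

During fine-tuning this consistency term is added to the task loss, which is how the reviewed claims about improved generalization would be exercised.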

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit separation of content and relative-position signals may reduce the amount of pre-training data needed for competitive performance in future transformer variants.
  • The same disentanglement pattern could be tested in non-language sequence tasks such as protein folding or time-series forecasting to see whether similar efficiency gains appear.
  • If the absolute-position injection in the decoder proves critical, future mask-prediction objectives in other architectures might benefit from making position information available only at the final decoding step.
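To make the last point concrete, here is a toy contrast between BERT-style input-layer position injection and decoder-only injection in the spirit of the enhanced mask decoder; the layer stand-in, dimensions, and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, vocab = 6, 8, 20

H = rng.normal(size=(n, d))        # states from layers that saw only relative positions
abs_pos = rng.normal(size=(n, d))  # absolute position embeddings
W_dec = rng.normal(size=(d, vocab))

def decode_layer(x):
    # stand-in for a transformer block; the real enhanced mask decoder
    # reuses transformer layers rather than a pointwise nonlinearity
    return np.tanh(x)

# EMD-style: absolute positions enter only right before masked-token
# prediction, instead of being added to the input embeddings as in BERT.
logits = decode_layer(H + abs_pos) @ W_dec
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

Only the injection point differs from a standard masked-language-model head; the question the editorial extension raises is whether deferring absolute positions this way preserves the benefits of purely relative attention in earlier layers.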

Load-bearing premise

The reported accuracy gains come from the disentangled attention and enhanced mask decoder rather than from unreported differences in training data volume, optimizer settings, or other implementation details.

What would settle it

Retraining a standard RoBERTa-Large model on exactly the same data and with the same hyperparameters as the reported DeBERTa model, but without the disentangled attention or enhanced mask decoder, and checking whether its scores on MNLI, SQuAD v2.0, RACE, and SuperGLUE still fall short.

read the original abstract

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, out performing the human baseline by a decent margin (90.3 versus 89.8).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes DeBERTa, a new pre-trained language model architecture that augments BERT/RoBERTa with two techniques: (1) a disentangled attention mechanism in which each token is represented by separate content and position vectors and attention weights are computed via disentangled matrices on content and relative positions, and (2) an enhanced mask decoder that injects absolute position information when predicting masked tokens during pre-training. It further applies virtual adversarial training at fine-tuning time. The authors report that a DeBERTa model trained on half the data used for RoBERTa-Large outperforms it on MNLI (+0.9%), SQuAD v2.0 (+2.3%), and RACE (+3.6%), and that a 1.5-billion-parameter, 48-layer DeBERTa model achieves a SuperGLUE macro-average of 89.9, exceeding the human baseline of 89.8 (with the ensemble at 90.3).

Significance. If the performance gains are shown to stem from the disentangled attention and enhanced mask decoder rather than model scale, data volume, or unreported hyper-parameter differences, the work would offer a concrete architectural improvement in how transformers handle positional information and would mark a notable milestone by being the first single model to surpass human performance on SuperGLUE. The empirical results on standard NLU benchmarks are a strength, but their attribution to the proposed mechanisms remains the central open question.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): The headline comparisons (MNLI 91.1%, SQuAD 90.7%, RACE 86.8%) pit a 1.5B-parameter DeBERTa against RoBERTa-Large (355M parameters) trained on more data; without a same-size, same-data baseline that uses standard attention and mask decoding, the reported deltas cannot be unambiguously attributed to the disentangled attention or enhanced mask decoder.
  2. [§4.3] §4.3 (SuperGLUE evaluation): The claim that the single 1.5B DeBERTa model surpasses human performance (89.9 vs. 89.8) is load-bearing for the paper’s significance, yet no ablation is presented that trains an otherwise identical 1.5B model with conventional BERT attention and mask decoder on the same pre-training mixture to test whether the architectural changes are necessary for exceeding the human baseline.
  3. [Methods and §3] Methods and §3 (Model Architecture): The training recipe for the 1.5B model (exact data mixture, number of tokens, optimizer schedule, and whether virtual adversarial training is used in pre-training) is not fully specified, preventing verification that the observed gains exceed what would be expected from scaling laws alone.
minor comments (3)
  1. [Abstract] Abstract: “natural langauge generation” contains a typo and should read “natural language generation.”
  2. [Abstract] The SuperGLUE leaderboard snapshot is dated “January 6, 2021”; the manuscript should clarify whether this reflects the state at submission or a later update.
  3. [§4] No error bars, standard deviations, or number of random seeds are reported for any benchmark score, which weakens confidence in the small margins (e.g., 89.9 vs. 89.8).

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental controls, additional details on the training setup, and commitments to revise the manuscript accordingly. Our responses emphasize the evidence from controlled smaller-scale experiments while acknowledging computational constraints on large-scale ablations.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The headline comparisons (MNLI 91.1%, SQuAD 90.7%, RACE 86.8%) pit a 1.5B-parameter DeBERTa against RoBERTa-Large (355M parameters) trained on more data; without a same-size, same-data baseline that uses standard attention and mask decoding, the reported deltas cannot be unambiguously attributed to the disentangled attention or enhanced mask decoder.

    Authors: We acknowledge the value of matched baselines at the 1.5B scale. However, §4.1 and §4.2 report controlled experiments with DeBERTa-base (comparable parameter count to RoBERTa-base) trained on the same data volume and mixture as RoBERTa, where disentangled attention and the enhanced mask decoder yield consistent gains (e.g., +1.5% MNLI, +2.5% SQuAD v2.0). These isolate the architectural contributions independent of scale. The 1.5B results build on this foundation, and we will revise the abstract and §4 to foreground the base-scale controls while noting that full 1.5B matched baselines are left for future work due to resource limits. revision: partial

  2. Referee: [§4.3] §4.3 (SuperGLUE evaluation): The claim that the single 1.5B DeBERTa model surpasses human performance (89.9 vs. 89.8) is load-bearing for the paper’s significance, yet no ablation is presented that trains an otherwise identical 1.5B model with conventional BERT attention and mask decoder on the same pre-training mixture to test whether the architectural changes are necessary for exceeding the human baseline.

    Authors: An identical 1.5B ablation with standard attention would directly test necessity for the SuperGLUE result. Unfortunately, the compute cost of training a second 1.5B model on the same mixture makes this infeasible in the current study. We instead demonstrate the mechanisms' effectiveness through base-scale ablations where DeBERTa outperforms RoBERTa equivalents under matched conditions, with gains that compound at larger scales. In revision we will add a limitations paragraph, qualify the attribution language in §4.3, and emphasize the base-model evidence supporting the architectural improvements. revision: no

  3. Referee: [Methods and §3] Methods and §3 (Model Architecture): The training recipe for the 1.5B model (exact data mixture, number of tokens, optimizer schedule, and whether virtual adversarial training is used in pre-training) is not fully specified, preventing verification that the observed gains exceed what would be expected from scaling laws alone.

    Authors: We will expand the methods section and add a dedicated appendix with complete pre-training details for the 1.5B model. Virtual adversarial training is applied exclusively at fine-tuning time and not during pre-training. The appendix will list the precise data mixture proportions, total tokens processed, AdamW optimizer settings (betas, weight decay, epsilon), learning-rate schedule with warmup steps and decay, batch size, and other hyperparameters. This will enable readers to situate the results relative to scaling-law expectations. revision: yes

standing simulated objections not resolved
  • Ablation training an otherwise identical 1.5B-parameter model using conventional BERT attention and mask decoder on the exact same pre-training mixture to verify whether the architectural changes are required to exceed the human baseline on SuperGLUE.

Circularity Check

0 steps flagged

No circularity: DeBERTa claims rest on empirical training results, not derivations reducing to fitted inputs or self-citations.

full rationale

The manuscript introduces disentangled attention (content/position vectors with separate matrices) and an enhanced mask decoder as architectural proposals, then reports benchmark scores from pre-training and fine-tuning runs. No equations, uniqueness theorems, or first-principles derivations appear that equate the reported gains (e.g., SuperGLUE 89.9) to quantities already present in the training data, RoBERTa baselines, or prior self-citations. The central performance claims are direct outcomes of model training and evaluation on held-out benchmarks, not statistical predictions forced by parameter fitting or renamed empirical patterns. External citations (BERT, RoBERTa, SuperGLUE) supply independent baselines rather than load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces two new architectural components whose effectiveness is supported only by the reported benchmark numbers; no independent evidence or formal justification is given in the abstract.

axioms (1)
  • standard math Transformer layers with self-attention can be stacked to form effective language models
    The model is built directly on the BERT transformer backbone.
invented entities (2)
  • Disentangled attention mechanism no independent evidence
    purpose: Compute attention weights using separate content and relative-position matrices
    New component introduced to replace standard attention.
  • Enhanced mask decoder no independent evidence
    purpose: Incorporate absolute positions when predicting masked tokens during pre-training
    New decoding component added to the pre-training objective.

pith-pipeline@v0.9.0 · 5690 in / 1358 out tokens · 59488 ms · 2026-05-13T04:45:07.449100+00:00 · methodology

discussion (0)


Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

    cs.CL 2026-04 accept novelty 8.0

    ViLegalNLI is the first 42k-pair Vietnamese legal NLI dataset built via semi-automatic LLM-assisted generation and validation.

  2. RoFormer: Enhanced Transformer with Rotary Position Embedding

    cs.CL 2021-04 accept novelty 8.0

    RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

  3. Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

    cs.CL 2026-05 unverdicted novelty 7.0

    DSR uses transformer models to detect sentiment targets in text and score them along three theory-motivated axes, with validation showing correlations to existing social science datasets.

  4. RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

    cs.CL 2026-04 conditional novelty 7.0

    RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.

  5. RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

    cs.CL 2026-04 unverdicted novelty 7.0

    RSAT makes 1-8B language models produce faithful table reasoning by training them to output structured steps with cell citations, using SFT followed by GRPO with an NLI-based faithfulness reward.

  6. Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

    cs.CL 2026-04 unverdicted novelty 7.0

    JPT enables bidirectional token classification in causal LLMs for zero-shot NER via input concatenation plus definition-guided embeddings, delivering +7.9 F1 gains and over 20x speedup on benchmarks.

  7. The Indra Representation Hypothesis for Multimodal Alignment

    cs.CV 2026-04 unverdicted novelty 7.0

    Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...

  8. Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

    cs.CL 2026-03 unverdicted novelty 7.0

    IQA is a pragmatically difficult task where multilingual models achieve low performance and overfit severely, even for English, and GPT-4o-mini cannot generate high-quality training data for it.

  9. Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data

    cs.CR 2026-05 conditional novelty 6.0

    Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.

  10. An Information-theoretic Propagation Denoising and Fusion Framework for Fake News Detection

    cs.CL 2026-05 unverdicted novelty 6.0

    InfoPDF uses mutual information to suppress noise in LLM-generated synthetic propagation graphs and adaptively fuse them with real data, yielding more discriminative representations for fake news detection.

  11. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  12. ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...

  13. EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

    cs.CL 2026-04 unverdicted novelty 6.0

    EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...

  14. Beyond Importance Sampling: Rejection-Gated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

  15. RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.

  16. Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking

    cs.IR 2026-04 conditional novelty 6.0

    Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.

  17. Million Tutoring Moves (MTM): An Open Multimodal Dataset for the Science of Tutoring

    cs.CY 2026-04 accept novelty 6.0

    MTM v1 releases 4,654 open math tutoring transcripts as the first step toward a large-scale multimodal repository for studying and improving tutoring.

  18. Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

    cs.CV 2026-04 conditional novelty 6.0

    Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.

  19. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  20. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    cs.CL 2023-02 unverdicted novelty 6.0

    Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.

  21. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

  22. Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis

    cs.CL 2026-05 unverdicted novelty 5.0

    A new encoder-based SRL system with dependency-informed analysis delivers 10x faster inference and comparable or better F1 scores using BERT, RoBERTa, and DeBERTa while supporting multilingual projection.

  23. MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

    cs.CL 2026-05 unverdicted novelty 4.0

    MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.

  24. BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

    cs.CL 2026-04 unverdicted novelty 4.0

    BiMind outperforms existing methods in incorrect information detection by disentangling content and knowledge reasoning with attention geometry adaptation and self-retrieval.

  25. Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models

    cs.CR 2026-04 unverdicted novelty 4.0

    Encoder-based LLMs detect SDN intrusions with decisions driven by meaningful traffic behaviors, as validated by attribution analysis aligning with established intrusion principles.

  26. LLMs Struggle with Abstract Meaning Comprehension More Than Expected

    cs.CL 2026-04 unverdicted novelty 3.0

    LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.

  27. Predicting User Satisfaction in Online Education Platforms: A Large Language Model Based Multi-Modal Review Mining Framework

    cs.GR 2026-04 unverdicted novelty 3.0

    An LLM multi-modal system integrates topic modeling, transformer sentiment, and behavioral features to predict MOOC learner satisfaction more accurately than single-modality baselines.

  28. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 27 Pith papers · 8 internal anchors

  1. [1]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  2. [2]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

  3. [3]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,

  4. [4]

    SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,

  5. [5]

    Natural-to formal-language generation using tensor product representations

    Kezhen Chen, Qiuyuan Huang, Hamid Palangi, Paul Smolensky, Kenneth D Forbus, and Jianfeng Gao. Natural-to formal-language generation using tensor product representations. arXiv preprint arXiv:1910.02339,

  6. [6]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

  7. [7]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019,

  8. [8]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186,

  9. [9]

    Automatically constructing a corpus of sentential paraphrases

Published as a conference paper at ICLR 2021. William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005),

  10. [10]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961,

  11. [11]

    A hybrid neural network model for commonsense reasoning

    Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning. arXiv preprint arXiv:1907.11983, 2019a. Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. X-sql: reinforce schema representa- tion with context. arXiv preprint arXiv:1908.08113, 2019b. Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Us...

  12. [12]

SpanBERT: Improving Pre-training by Representing and Predicting Spans

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77,

  13. [13]

    Small-bench nlp: Benchmark for small single gpu trained models in natural language processing

    Kamal Raj Kanakarajan, Bhuvana Kundumani, and Malaikannan Sankarasubbu. Small-bench nlp: Benchmark for small single gpu trained models in natural language processing. ArXiv, abs/2109.10847,

  14. [14]

    Looking beyond the surface: A challenge set for reading comprehension over multiple sentences

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262,

  15. [15]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  16. [16]

    Race: Large-scale reading comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794,

  17. [17]

    The Winograd schema challenge

Hector J Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, pp. 47,

  18. [18]

    Adversarial training for large neural language models

    Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2019a. Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceed...

  19. [19]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019c. Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam

  20. [20]

    Deep learning based text classification: A comprehensive review

Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705,

  21. [21]

    Virtual adversarial training: a regularization method for supervised and semi-supervised learning

    Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993,

  22. [22]

    Wic: the word-in-context dataset for evaluating context-sensitive meaning representations

    Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1267–1273,

  23. [23]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang

    URL http://jmlr.org/papers/v21/20-074.html. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, November

  24. [24]

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series,

  25. [25]

    Enhancing the transformer with explicit relational encoding for math problem solving, 2019, 1910.06611 http://arxiv.org/abs/1910.06611

    Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611,

  26. [26]

    Self-attention with relative position representations

    12 Published as a conference paper at ICLR 2021 Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468,

  27. [27]

    Ex- ploiting structured knowledge in text via graph-guided representation learning

    Tao Shen, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, and Weizhu Chen. Ex- ploiting structured knowledge in text via graph-guided representation learning. arXiv preprint arXiv:2004.14224,

  28. [28]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053,

  29. [29]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642,

  30. [30]

    Ernie: Enhanced representation through knowledge integration

    Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223,

  31. [31]

    Trinh and Quoc V

    Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847,

  32. [32]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in neural information processing systems, pp. 3266–3280, 2019a. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samue...

  33. [33]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics,

  34. [34]

    Swag: A large-scale adversarial dataset for grounded commonsense inference

    Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 93–104,

  35. [35]

    ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

    13 Published as a conference paper at ICLR 2021 Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885,

  36. [36]

    A Appendix

    A.1 Dataset

    General Language Understanding Evaluation (GLUE):

    Corpus  Task           #Train  #Dev  #Test  #Label  Metrics
    CoLA    Acceptability  8.5k    1k    1k     2       Matthews corr
    SST     Sentiment      67k     872   1.8k   2       Accuracy
    MNLI    NLI            393k    20k   20k    3       Accuracy
    RTE     NLI            2.5k    276   3k     2       Accuracy
    WNLI    NLI            634     71    146    2       Accuracy
    QQP     Paraphrase …

  37. [37]

    … and word sense disambiguation (Pilehvar & Camacho-Collados, 2019).

    - RACE is a large-scale machine reading comprehension dataset, collected from English examinations in China, which are designed for middle school and high school students (Lai et al., 2017).

    - SQuAD v1.1/v2.0 is the Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0 (Rajpurkar et al., 2016 …

  38. [38]

    … are popular machine reading comprehension benchmarks. Their passages come from approximately 500 Wikipedia articles and the questions and answers are obtained by crowdsourcing. The SQuAD v2.0 dataset includes unanswerable questions about the same paragraphs.

    - SWAG is a large-scale adversarial dataset for …

  39. [39]

    … (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB) and STORIES (a subset of CommonCrawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) …
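The deduplication step mentioned above can be illustrated with a minimal sketch. Exact-match hashing over normalized text is an assumption made here for illustration; the fragment only names the procedure of Shoeybi et al. (2019), which is more involved (e.g. it also handles near-duplicates):

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by hashing whitespace-normalized, lowercased
    text. Illustrative only: corpus-level deduplication as in Shoeybi et
    al. (2019) is more elaborate than this exact-match sketch."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Hashing keeps memory proportional to the number of distinct documents rather than their total size, which matters at the ~78GB corpus scale discussed here.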

  40. [40]

    … as the optimizer with weight decay (Loshchilov & Hutter, 2018). For fine-tuning, even though we can get better and more robust results with RAdam (Liu et al., 2019a) on some tasks, e.g. CoLA, RTE and RACE, we use Adam (Kingma & Ba, …
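The optimizer choice above can be made concrete with a minimal sketch of one Adam update with decoupled weight decay in the style of Loshchilov & Hutter (2018); the hyperparameter defaults are illustrative, not the paper's fine-tuning settings:

```python
def adamw_step(w, g, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-6, weight_decay=0.01):
    """One scalar AdamW update at step t. The weight-decay term is applied
    directly to the weight (decoupled) instead of being folded into the
    gradient, so it is not rescaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * w)
    return w, m, v
```

A full optimizer applies the same update elementwise to every parameter tensor; the decoupling is the point of the cited fix, since L2 regularization folded into the Adam gradient would be divided by the adaptive denominator.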

  41. [41]

    The model selection is based on the performance on the task-specific development sets. Our code is implemented based on Huggingface Transformers, FairSeq and Megatron (Shoeybi et al., 2019).

    A.3.1 Pre-training Efficiency

    To investigate the efficiency of model pre-training, we plot the performance of the fine-tuned model on downstream tasks as a function …

  42. [42]

    Sequence length  Middle  High  Accuracy
    512              88.8    85.0  86.3
    768              88.7    86.3  86.8

    Table 11: The effect of handling long sequence input for the RACE task with DeBERTa.

    Long sequence handling is an active research area. There have been many studies in which the Transformer architecture is extended for long sequence handling (Beltagy et al., 2020; Kitaev et al., 2019; C…

  43. [43]

    … in EMD.

    A.8 Additional Details of Enhanced Mask Decoder

    The structure of EMD is shown in Figure 2b. There are two inputs for EMD (i.e., I and H). H denotes the hidden states from the previous Transformer layer, and I can be any necessary information for decoding, e.g., H, the absolute position embedding, or the output from the previous EMD layer. n denotes n stacked layers …
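The flow just described (queries drawn from I, states H, n stacked layers sharing weights) can be sketched as follows. This is a single-head toy rendering under stated assumptions: the actual EMD layers are full Transformer layers with shared parameters, whereas here each layer is reduced to one attention step, and all weight shapes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def emd_layer(I, H, Wq, Wk, Wv):
    """One toy decoding layer: queries come from I, keys/values from H."""
    Q, K, V = I @ Wq, H @ Wk, H @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def enhanced_mask_decoder(H, abs_pos, weights, n=2):
    """Stack n layers sharing one weight set: the first I is H plus the
    absolute position embeddings, and each layer's output becomes the I
    of the next. The final output would feed the masked-LM head."""
    I = H + abs_pos  # absolute positions injected only at decoding time
    for _ in range(n):
        I = emd_layer(I, H, *weights)  # same weights reused at each layer
    return I
```

Because the n layers share parameters, EMD adds absolute-position information for mask prediction without materially increasing model size.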

  44. [44]

    Figure 4: Comparison of attention patterns of the last layer between DeBERTa and RoBERTa.

    Figure 5: Comparison of attention patterns of the last layer between DeBERTa and its variants (i.e., DeBERTa without EMD, C2P, and P2C, respectively).
    20 Published as a conference paper at ICLR 2021 (a) (b) (c) Figure 4: Comparison on attention patterns of the last layer between DeBERTa and RoBERTa. 21 Published as a conference paper at ICLR 2021 (a) (b) (c) Figure 5: Comparison on attention patterns of last layer between DeBERTa and its variants (i.e. DeBERTa without EMD, C2P and P2C respectively). A.1...