DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration

Dongting Li; Jianyi Zhang; Martin Kuo; Yiran Chen

arXiv: 2311.04799 · v2 · submitted 2023-11-08 · cs.CL · cs.AI

Pith reviewed 2026-05-24 05:29 UTC · model grok-4.3

keywords: dependency agreement · language model pretraining · efficient pretraining · BERT-style models · chunk-level embeddings · semantic information · dual-stage workflow · cost-effective training

The pith

Integrating chunk-level dependency agreements via four submodels during pretraining improves cost-effective BERT-style language models over prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DA-Cramming as an extension of the Cramming approach for training BERT-style models on a single GPU in one day. It adds a dual-stage workflow that uses four dedicated submodels to identify representative dependency agreements at the chunk level and converts those agreements into embeddings supplied to the main pretraining objective. The goal is to strengthen foundational language understanding with this semantic information from the start, rather than only at fine-tuning time. If the added signals prove non-redundant, the result is higher performance on downstream tasks without increasing the already low compute budget. The work therefore tests whether structured linguistic relations can be injected early to make cheap pretraining more effective.

Core claim

DA-Cramming captures representative dependency agreements at the chunk level with four dedicated submodels, transforms the agreements into embeddings, and supplies them inside a dual-stage pretraining workflow so that the resulting BERT-style models outperform previous cost-effective pretraining methods across various tasks.

What carries the argument

Dual-stage pretraining workflow that runs four dedicated submodels to extract chunk-level dependency agreements and converts those agreements into embeddings for the main model.
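
One way to picture the second stage is a lookup that gives every token the agreement vector of the chunk it belongs to. The sketch below is illustrative only: the abstract does not say how agreement embeddings are combined with token embeddings, so the element-wise addition, the table sizes, and every name here (`make_table`, `embed_with_agreements`, `agreement_ids`) are assumptions rather than the authors' implementation.

```python
import random


def make_table(num_rows, dim, seed):
    """Build a small random embedding table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed_with_agreements(token_ids, chunk_ids, agreement_ids,
                          token_table, agreement_table):
    """Add the agreement vector of each token's chunk to its token vector.

    ASSUMPTION: element-wise addition; the source does not specify the
    combination rule.
    """
    combined = []
    for tok, chunk in zip(token_ids, chunk_ids):
        tok_vec = token_table[tok]
        agree_vec = agreement_table[agreement_ids[chunk]]
        combined.append([t + a for t, a in zip(tok_vec, agree_vec)])
    return combined


DIM = 4
token_table = make_table(10, DIM, seed=0)      # hypothetical 10-token vocabulary
agreement_table = make_table(4, DIM, seed=1)   # hypothetical 4 agreement types

token_ids = [1, 5, 2, 7, 3]
chunk_ids = [0, 0, 1, 1, 1]    # token position -> chunk index
agreement_ids = [2, 0]         # chunk index -> agreement type (stage 1 output)

inputs = embed_with_agreements(token_ids, chunk_ids, agreement_ids,
                               token_table, agreement_table)
print(len(inputs), len(inputs[0]))
```

In this toy setup the agreement type per chunk is hard-coded; in the described workflow it would come from the stage 1 submodels.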

If this is right

The resulting models achieve higher accuracy on downstream tasks than earlier one-GPU pretraining baselines.
Semantic dependency information improves foundational representations when supplied during pretraining rather than only later.
The four-submodel design can be run inside the same single-GPU daily budget as plain Cramming.
Chunk-level agreement embeddings can be produced and reused without changing the core masked-language-modeling loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar structured signals such as coreference or discourse relations might be added through the same four-submodel pattern.
The approach raises the question of whether the performance gain comes mainly from the extra parameters or from the specific dependency content.
If the submodels can be distilled or shared, the method could be tested on models larger than BERT-base while still keeping total compute modest.

Load-bearing premise

The chunk-level dependency agreements extracted by the four submodels supply information that is not already learned or redundant under standard masked language modeling.

What would settle it

Train the same BERT-style model with the original Cramming recipe versus the DA-Cramming recipe on identical data and hardware, then measure downstream task accuracy; equal or lower scores for DA-Cramming would falsify the benefit.
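
The comparison described above can be made concrete with a paired test over matched random seeds. The following is a minimal sketch, assuming each recipe is trained once per seed and scored with a single downstream accuracy; the exact sign-flip test is a choice made here for illustration, and the numbers are made-up placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(baseline, treatment):
    """Exact two-sided sign-flip test on paired score differences.

    Returns the mean difference (treatment - baseline) and the fraction of
    sign assignments whose absolute mean is at least as large as observed.
    """
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:
            hits += 1
    return mean(diffs), hits / total


# PLACEHOLDER scores for five seeds; NOT taken from the paper.
cramming_scores = [78.1, 77.9, 78.4, 78.0, 78.2]
da_cramming_scores = [78.9, 78.6, 78.8, 78.7, 79.1]

mean_diff, p_value = paired_sign_flip_test(cramming_scores, da_cramming_scores)
print(f"mean difference: {mean_diff:.2f}, p-value: {p_value:.4f}")
```

With only five seeds the smallest attainable two-sided p-value is 2/32, so a real study would need more seeds or more tasks to support a strong claim either way.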

Figures

Figures reproduced from arXiv: 2311.04799 by Dongting Li, Jianyi Zhang, Martin Kuo, Yiran Chen.

**Figure 2.** Dependency structure example. Top: Example 1. Bottom: Example 2.

**Figure 3.** Stage 1: Training of Dependency Agreement Submodels. Train each agreement submodel using its …

**Figure 4.** Dependency Agreement embedding generation in Stage 2.

**Figure 5.** Stage 2 model is based on Crammed BERT architecture.

**Figure 6.** Summary of Pretraining Stage 2. Left: MLM accuracy. Right: MLM loss.

**Figure 7.** Dependency Agreement Submodel Ablation Studies.

**Figure 8.** Sentence Dependency Agreement Ablation. "evacuated" and "who is evacuated" are the most essential parts for determining whether the first sentence entails the second sentence. As shown in …

Original abstract

Pretraining language models is still a challenge for many researchers due to its substantial computational costs. As such, there is growing interest in developing more affordable pretraining methods. One notable advancement in this area is the Cramming technique (Geiping and Goldstein, 2022), which enables the pretraining of BERT-style language models using just one GPU in a single day. Building on this innovative approach, we introduce the Dependency Agreement Cramming (DA-Cramming), an efficient framework that integrates information about dependency agreements into the pretraining process. Unlike existing methods that leverage similar semantic information during finetuning, our approach represents a pioneering effort focusing on enhancing the foundational language understanding with semantic information during pretraining. We meticulously design a dual-stage pretraining work flow with four dedicated submodels to capture representative dependency agreements at the chunk level, effectively transforming these agreements into embeddings to benefit the pretraining. Extensive empirical results demonstrate that our method significantly outperforms previous methods across various tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DA-Cramming adds a dual-stage dependency workflow to the Cramming pipeline, but the abstract supplies no numbers or ablations, so the claim that the new component helps cannot be checked from the abstract alone.

The paper takes the Cramming setup for one-GPU BERT pretraining and layers on a dual-stage process with four submodels that extract chunk-level dependency agreements and turn them into embeddings fed to the main model. That specific design is not in the cited Cramming work, so the pipeline itself is new. The authors correctly note that most prior attempts to use syntactic or semantic signals happen at fine-tuning rather than during the initial pretraining, and they try to move the signal earlier. That is a reasonable direction for people who want small models to carry more structure from the start.

The abstract states that extensive results show significant gains across tasks, which at least signals they ran the experiments. The main weakness is that none of those results appear in the abstract: no tables, no baselines, no error bars, no ablation that isolates the dependency submodels from extra capacity or training signal. Masked language modeling already induces a fair amount of syntax, so the key open question is whether the added embeddings supply non-redundant information or just more parameters. Without that check, any downstream improvement could come from unrelated factors.

The paper is aimed at groups that want to pretrain modest BERT-style models on limited hardware and are willing to experiment with linguistic priors. A reader working on efficient pretraining might pick up the workflow idea, but anyone looking for a verified improvement will need the full numbers and controls. If the manuscript contains those details plus ablations, it is worth sending to peer review for a workshop or small conference track; the topic is practical and the framing is straightforward. From the abstract alone the evidence is too thin to judge.

Referee Report

2 major / 1 minor

Summary. The paper introduces DA-Cramming as an extension of the Cramming technique for low-cost BERT-style pretraining. It proposes a dual-stage workflow employing four dedicated submodels to extract chunk-level dependency agreements, convert them into embeddings, and integrate this information during pretraining (rather than only at finetuning) with the goal of improving foundational language understanding; the authors assert that extensive empirical results show significant outperformance over prior methods on various tasks.

Significance. If the empirical gains are robust and the dependency-agreement embeddings supply non-redundant signal beyond standard masked language modeling, the work would offer a concrete route to strengthen syntactic awareness in cost-effective pretraining regimes. The focus on pretraining-stage integration rather than post-hoc finetuning is a clear point of novelty relative to existing dependency-augmented approaches.

major comments (2)

[Abstract] Abstract: the central claim that the method 'significantly outperforms previous methods across various tasks' is stated without any quantitative results, baselines, metrics, error bars, dataset sizes, or statistical tests. Because the abstract supplies none of the evidence needed to evaluate the claim, the soundness of the primary empirical assertion cannot be assessed from the provided text.
The manuscript does not report ablations that isolate the contribution of the four submodel dependency-agreement embeddings from the effects of extra model capacity, additional training signal, or the dual-stage workflow itself. Without such controls it remains possible that any observed downstream gains arise from factors unrelated to the claimed dependency integration (see the weakest-assumption concern).
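
The controls requested in the second major comment can be laid out as an explicit run grid. This is a sketch under stated assumptions: the submodel names are hypothetical placeholders (the abstract does not name the four submodels), and the extra "random" run is one possible capacity control suggested here, not an experiment from the manuscript.

```python
from itertools import product

# HYPOTHETICAL names; the abstract does not name the four submodels.
SUBMODELS = ("submodel_a", "submodel_b", "submodel_c", "submodel_d")


def ablation_grid(submodels=SUBMODELS, include_capacity_control=True):
    """Enumerate every on/off combination of submodels, plus one control run."""
    runs = []
    for flags in product((False, True), repeat=len(submodels)):
        enabled = tuple(name for name, on in zip(submodels, flags) if on)
        source = "agreement" if enabled else "none"
        runs.append({"enabled": enabled, "embedding_source": source})
    if include_capacity_control:
        # Same extra embedding parameters, but fed random agreement ids:
        # separates "more capacity" from "dependency content".
        runs.append({"enabled": tuple(submodels), "embedding_source": "random"})
    return runs


runs = ablation_grid()
leave_one_out = [r for r in runs
                 if len(r["enabled"]) == len(SUBMODELS) - 1]
print(len(runs), len(leave_one_out))
```

The grid contains the plain baseline (nothing enabled), the full model, every partial combination, and the random-input control, so a gain that survives only in the "agreement" runs would point to the dependency content rather than the added parameters.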

minor comments (1)

[Abstract] The phrase 'work flow' should be written as the single word 'workflow'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting our results.

Referee: [Abstract] Abstract: the central claim that the method 'significantly outperforms previous methods across various tasks' is stated without any quantitative results, baselines, metrics, error bars, dataset sizes, or statistical tests. Because the abstract supplies none of the evidence needed to evaluate the claim, the soundness of the primary empirical assertion cannot be assessed from the provided text.

Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. The full manuscript reports specific downstream results (e.g., GLUE scores and other benchmarks) against the Cramming baseline; we will revise the abstract to summarize key metrics, improvements, and dataset details while remaining within length constraints.
Revision: yes

Referee: The manuscript does not report ablations that isolate the contribution of the four submodel dependency-agreement embeddings from the effects of extra model capacity, additional training signal, or the dual-stage workflow itself. Without such controls it remains possible that any observed downstream gains arise from factors unrelated to the claimed dependency integration (see the weakest-assumption concern).

Authors: The comment correctly notes the absence of explicit ablation experiments. Our design uses four lightweight, dedicated submodels whose outputs are converted to embeddings and injected into a main model whose parameter count and training compute are matched to the Cramming baseline; the dual-stage workflow is required to produce the chunk-level signals but does not increase the main model's capacity. We will add a dedicated limitations/discussion paragraph clarifying these controls and the rationale for the architecture. New compute-intensive ablations are not feasible within the current resource envelope, but the added discussion will directly address the concern.
Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical method with no self-referential derivations or fitted predictions

The paper presents DA-Cramming as an empirical extension of the external Cramming technique (Geiping and Goldstein, 2022). It describes a dual-stage workflow with four submodels to capture chunk-level dependency agreements and convert them to embeddings for pretraining. No equations, fitted parameters, or 'predictions' are defined that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on downstream empirical outperformance rather than any mathematical derivation chain. This is a standard empirical methods paper with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training objectives, or implementation details, so the ledger cannot be populated with concrete free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 9 internal anchors

[3] Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.

[4] Jiangang Bai, Yujing Wang, Yiren Chen, Yaming Yang, Jing Bai, Jing Yu, and Yunhai Tong. 2021. Syntax-BERT: Improving pre-trained transformers with syntax trees. arXiv preprint arXiv:2103.04350.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

[6] Michael Cysouw. 2011. Very atypical agreement indeed. Theoretical Linguistics, 37(3-4):153–160. https://doi.org/doi:10.1515/thli.2011.011

[7] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.

[8] Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham, Phuc Le Khac, Luke Melas, and Ritobrata Ghosh. 2021. DALL·E mini. https://doi.org/10.5281/zenodo.5146400

[9] Marie-Catherine De Marneffe and Joakim Nivre. 2019. Dependency grammar. Annual Review of Linguistics, 5:197–218.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[11] Wikimedia Foundation. Wikimedia downloads. https://dumps.wikimedia.org

[12] Jonas Geiping and Tom Goldstein. 2022. Cramming: Training a language model on a single GPU in one day. arXiv preprint arXiv:2212.14034.

[13] Geoff Hart. 2002. The five W’s of online help systems. Geoff Hart.

[14] Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378.

[15] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

[16] Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. 2022. Transformer quality in linear time. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9099–9117. PMLR. https://proceedings.mlr.press/v162/hua22a.html

[17] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

[18] Michael Lewis. 1993. The lexical approach, volume 1. Language Teaching Publications, Hove.

[19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[20] Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 99–106, Ann Arbor, Michigan. Association for Computational Linguistics. https://doi.org/10.3115/1219840.1219853

[21] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. OpenAI blog.

[22] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

[24] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

[25] Norbert Schmitt. 2000. Key concepts in ELT. ELT Journal, 54:400–401. https://doi.org/10.1093/elt/54.4.400

[26] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

[27] Scott Thornbury. 2019. Learning language in chunks. Cambridge: Cambridge University Press.

[28] Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714.

[29] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

[30] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

[31] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR.

[32] Fabio Massimo Zanzotto, Andrea Santilli, Leonardo Ranaldi, Dario Onorati, Pierfrancesco Tommasino, and Francesca Fallucchi. 2020. KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 256–267.

[33] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV).
