XLNet: Generalized Autoregressive Pretraining for Language Understanding
Pith reviewed 2026-05-18 01:24 UTC · model grok-4.3
The pith
XLNet learns bidirectional context by maximizing expected likelihood over all permutations of factorization order.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
XLNet is a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
What carries the argument
Permutation language modeling objective that maximizes the expected log-likelihood of a sequence over all possible permutations of its factorization order.
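As a rough illustration of the quantity this objective averages, the brute-force sketch below enumerates every factorization order of a very short sequence and averages the per-order log-likelihood. The joint table `JOINT`, its probabilities, and the function names are assumptions made up for the example, not taken from the paper. Because the conditionals here are derived exactly from one joint distribution, every order yields the same total by the chain rule; a learned model with shared parameters may score orders differently, which is the case the objective is meant to train over.

```python
import itertools
import math


def permutation_objective(tokens, cond_logprob):
    """Average over all orders z of sum_t log p(x_{z_t} | x_{z_<t})."""
    totals = []
    for order in itertools.permutations(range(len(tokens))):
        context = {}  # position -> token already conditioned on in this order
        total = 0.0
        for pos in order:
            total += cond_logprob(pos, tokens[pos], context)
            context[pos] = tokens[pos]
        totals.append(total)
    return sum(totals) / len(totals)


# Hypothetical joint distribution over 2-token sequences (illustrative numbers).
JOINT = {
    ("New", "York"): 0.5,
    ("New", "Delhi"): 0.2,
    ("Old", "York"): 0.1,
    ("Old", "Delhi"): 0.2,
}


def exact_cond_logprob(pos, tok, context):
    """Exact conditional obtained from JOINT by marginalisation."""
    def consistent(seq):
        return all(seq[p] == t for p, t in context.items())

    denom = sum(p for seq, p in JOINT.items() if consistent(seq))
    numer = sum(p for seq, p in JOINT.items() if consistent(seq) and seq[pos] == tok)
    return math.log(numer / denom)


value = permutation_objective(("New", "York"), exact_cond_logprob)
print(value, math.log(JOINT[("New", "York")]))
```

Enumeration is only feasible for toy lengths, since the number of orders grows factorially with sequence length.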
Load-bearing premise
Averaging the likelihood over all permutations of the factorization order teaches effective bidirectional context without new optimization problems or sampling biases.
What would settle it
Train an XLNet variant that uses only one fixed factorization order instead of averaging over permutations and measure whether its advantage over BERT disappears on the reported tasks.
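One way to see why this single-order ablation is informative is to count which (target, context) position pairs ever occur as "predict the target given the context". The sketch below compares a single left-to-right order with all orders; the helper `covered_pairs` and the sequence length are hypothetical choices for illustration, and the count says nothing about downstream accuracy.

```python
import itertools


def covered_pairs(orders):
    """Collect (target, context) position pairs that appear under the given orders."""
    pairs = set()
    for order in orders:
        for step, target in enumerate(order):
            for context in order[:step]:
                pairs.add((target, context))
    return pairs


n = 4  # hypothetical sequence length
fixed = covered_pairs([tuple(range(n))])
permuted = covered_pairs(itertools.permutations(range(n)))

print(len(fixed))     # 6: only pairs where the context precedes the target
print(len(permuted))  # 12: every ordered pair of distinct positions
```

Under the fixed order, position 0 is never predicted from position 1, whereas averaging over orders covers both directions.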
Original abstract
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes XLNet, a generalized autoregressive pretraining method that maximizes the expected likelihood over all permutations of the factorization order to enable bidirectional context modeling. It targets two weaknesses of BERT, namely that BERT neglects dependencies between masked positions and suffers from a pretrain-finetune discrepancy, while integrating Transformer-XL's segment recurrence and relative positional encodings. Under comparable settings, XLNet is shown to outperform BERT on 20 downstream tasks spanning question answering, natural language inference, sentiment analysis, and document ranking.
Significance. If the central empirical claims hold after verification of controls, the work would be significant as a new pretraining paradigm that retains autoregressive tractability while achieving bidirectional context. The explicit permutation objective and its integration with established components like Transformer-XL provide a clear alternative to denoising autoencoders, with potential for broader adoption in language model pretraining.
major comments (2)
- §3.2 (Two-stream self-attention): The mechanism is presented as preventing position-content leakage, yet the paper provides no formal argument or targeted ablation demonstrating that the query stream fully isolates content from positional information across all sampled permutations. This is load-bearing for the claim that gains derive from the permutation objective rather than architectural side effects.
- §4.1 (Experimental setup): The number of permutations sampled per sequence during training is described only at a high level; without an ablation relating sample count to downstream performance, it remains unclear whether the Monte Carlo approximation is sufficient to realize the full expected-likelihood objective or whether reported gains partly reflect sampling bias or the two-stream/Transformer-XL additions.
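As a rough illustration of the masking that the §3.2 comment concerns, the sketch below builds two attention masks for one factorization order: a content-stream mask, where a position may attend to itself and to positions earlier in the order, and a query-stream mask, where it may attend to earlier positions only and never to itself. The function name `two_stream_masks` and the example order are assumptions for illustration. This is plain Python rather than the authors' implementation, and it ignores memory and relative positional encodings.

```python
def two_stream_masks(order):
    """Return (content_mask, query_mask) for one factorization order.

    order[t] is the sequence position predicted at step t.
    mask[i][j] is True when position i may attend to position j.
    """
    n = len(order)
    rank = [0] * n
    for step, pos in enumerate(order):
        rank[pos] = step  # how early each position appears in the order

    # Content stream: may see itself and everything earlier in the order.
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: may see only strictly earlier positions, never its own content.
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query


order = [2, 1, 3, 0]  # hypothetical order over a 4-token sequence (0-indexed)
content_mask, query_mask = two_stream_masks(order)

for row in query_mask:
    print(row)
```

With this order, position 2 is predicted first, so its query row is entirely False, while position 0 is predicted last and its query row allows positions 1, 2, and 3. The diagonal of the query mask is always False, which is the property the leakage argument depends on.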
minor comments (2)
- Table 1 and associated text: clarify whether all baselines use identical data splits and hyperparameter budgets; a single additional column reporting matched hyperparameter counts would strengthen comparability.
- Figure 2: the legend for the two-stream attention diagram is underspecified; explicitly label the query and content streams in the caption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below with clarifications and indicate where we will revise the manuscript to incorporate additional explanations and analyses.
Point-by-point responses
- Referee: §3.2 (Two-stream self-attention): The mechanism is presented as preventing position-content leakage, yet the paper provides no formal argument or targeted ablation demonstrating that the query stream fully isolates content from positional information across all sampled permutations. This is load-bearing for the claim that gains derive from the permutation objective rather than architectural side effects.
  Authors: We appreciate the referee highlighting this aspect of the two-stream self-attention. The query stream is constructed so that its representation at position z_t attends exclusively to the content representations of the preceding positions in the permutation (z_1 to z_{t-1}), using the attention mask defined in Equations (3)–(4) together with relative positional encodings. This ensures the query never accesses the content embedding of the token being predicted, thereby preventing leakage while still incorporating order information. Although the original manuscript did not supply a standalone formal isolation argument or a dedicated ablation, the separation follows directly from the stream-specific parameterization and masking. In the revision we will expand Section 3.2 with a concise derivation showing the conditional independence property and add a targeted ablation (in the appendix) that compares the full two-stream model against a content-leaking variant on a representative task.
  Revision: yes
- Referee: §4.1 (Experimental setup): The number of permutations sampled per sequence during training is described only at a high level; without an ablation relating sample count to downstream performance, it remains unclear whether the Monte Carlo approximation is sufficient to realize the full expected-likelihood objective or whether reported gains partly reflect sampling bias or the two-stream/Transformer-XL additions.
  Authors: We agree that the description of the Monte Carlo approximation in §4.1 is high-level. In our implementation we draw a fixed number of permutations (K = 6) per sequence to approximate the expectation; this choice is stated in the experimental details but without sensitivity analysis. To demonstrate that the approximation is adequate and that gains are not artifacts of sampling bias or the auxiliary architectural components, we will add an ablation that varies K (1, 3, 6, 12) while holding the two-stream attention and Transformer-XL recurrence fixed, reporting downstream performance on a subset of tasks. The results will be included in the revised experimental section or appendix.
  Revision: yes
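The statistical side of such an ablation can be sketched on a toy problem: sampling orders uniformly gives an unbiased estimate of the average over all orders, and the spread of that estimate shrinks as more orders are drawn. In the sketch below, `order_score` is a made-up stand-in for one order's log-likelihood, `WEIGHTS` and `num_orders` are hypothetical, and the sample counts simply mirror the values named in the response. It illustrates estimator variance only and makes no claim about downstream performance.

```python
import itertools
import random
import statistics

WEIGHTS = [0.5, 1.0, 1.5, 2.0]  # hypothetical per-position weights


def order_score(order):
    """Toy stand-in for one order's log-likelihood; its value depends on the order."""
    return -sum(WEIGHTS[pos] / (step + 1) for step, pos in enumerate(order))


def exact_expectation(n):
    """Average the score over every one of the n! orders."""
    scores = [order_score(o) for o in itertools.permutations(range(n))]
    return sum(scores) / len(scores)


def mc_estimate(n, num_orders, rng):
    """Average the score over `num_orders` uniformly sampled orders."""
    total = 0.0
    for _ in range(num_orders):
        order = list(range(n))
        rng.shuffle(order)
        total += order_score(order)
    return total / num_orders


rng = random.Random(0)
exact = exact_expectation(len(WEIGHTS))
means, spread = {}, {}
for k in (1, 3, 6, 12):
    estimates = [mc_estimate(len(WEIGHTS), k, rng) for _ in range(2000)]
    means[k] = statistics.mean(estimates)
    spread[k] = statistics.stdev(estimates)
    print(k, round(means[k], 3), round(spread[k], 3))
print("exact", round(exact, 3))
```

In this toy setting the estimate is centred on the exact average for every sample count, so the choice of count affects noise rather than bias; whether that carries over to pretraining is the empirical question the ablation would address.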
Circularity Check
Minor self-citation to Transformer-XL architecture; core permutation LM objective is independently defined and evaluated on external tasks.
Full rationale
The paper defines its central contribution as maximizing the expected log-likelihood over all permutations of the factorization order, an explicit new objective that is not constructed from or equivalent to any fitted parameter or prior result within the paper. Performance is measured against external benchmarks (20 tasks including QA and NLI) rather than reducing to internal fits. Integration of Transformer-XL ideas is cited for the autoregressive backbone and memory mechanism, but this is an architectural choice whose contribution is separable from the permutation objective and does not bear the load of the bidirectional-context claim. No self-definitional loop, fitted-input-as-prediction, or uniqueness theorem imported from the same authors appears in the derivation chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- permutation sampling distribution
axioms (1)
- domain assumption: The attention mechanism in Transformer-XL can be applied to arbitrary factorization orders without architectural change.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.PhiForcing phi_equation
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "XLNet... enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order"
- IndisputableMonolith.Foundation.DimensionForcing eight_tick_forces_D3
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "XLNet integrates ideas from Transformer-XL... into pretraining"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...
- SecureRouter: Encrypted Routing for Efficient Secure Inference
  SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- GraphCodeBERT: Pre-training Code Representations with Data Flow
  GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
- ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation
  ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Scaling Laws for Transfer
  Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages
  CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
- How Much Knowledge Can You Pack Into the Parameters of a Language Model?
  Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
- Compressive Transformers for Long-Range Sequence Modelling
  Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.
- HuggingFace's Transformers: State-of-the-art Natural Language Processing
  Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.