arxiv: 1906.08237 · v2 · pith:L7AEHQT4new · submitted 2019-06-19 · 💻 cs.CL · cs.LG

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

Pith reviewed 2026-05-18 01:24 UTC · model grok-4.3


keywords XLNet · autoregressive pretraining · permutation language modeling · bidirectional context · Transformer-XL · language understanding · BERT comparison


The pith

XLNet learns bidirectional context by maximizing expected likelihood over all permutations of factorization order.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes XLNet to fix two problems in existing pretraining methods. Autoregressive models such as standard language models see context in only one direction, while masked models such as BERT ignore dependencies between masked tokens and create a mismatch between pretraining and fine-tuning, because the mask symbol never appears in downstream data. XLNet keeps the autoregressive form but trains by maximizing the expected log-likelihood over every possible factorization order of the tokens in a sequence. This produces bidirectional context without masks. The resulting models outperform BERT on 20 downstream tasks under comparable experimental settings.
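To make the notion of a factorization order concrete, the following minimal sketch lists which tokens are visible when each token is predicted under one order. The helper name `prediction_contexts` and the example sentence are illustrative assumptions, not taken from the paper's code.

```python
import random

def prediction_contexts(tokens, order):
    """For one factorization order, pair each target token with the
    tokens that precede it in that order (its visible context)."""
    steps = []
    for t, pos in enumerate(order):
        # Context = positions earlier in the factorization order,
        # reported in original sentence order for readability.
        visible = sorted(order[:t])
        steps.append((tokens[pos], [tokens[i] for i in visible]))
    return steps

tokens = ["New", "York", "is", "a", "city"]

# Left-to-right order: each token sees only what is to its left.
left_to_right = prediction_contexts(tokens, [0, 1, 2, 3, 4])

# A randomly sampled order: some tokens may now see context from their right.
rng = random.Random(0)
order = list(range(len(tokens)))
rng.shuffle(order)
sampled = prediction_contexts(tokens, order)

for target, context in sampled:
    print(f"predict {target!r} given {context}")
```

Only the order of prediction changes here. The tokens keep their original positions, which is why the visible context is printed in sentence order.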

Core claim

XLNet is a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

What carries the argument

Permutation language modeling objective that maximizes the expected log-likelihood of a sequence over all possible permutations of its factorization order.
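Written out, the objective described above takes roughly the following form. The notation is an assumption chosen for this sketch: $\mathcal{Z}_T$ denotes the set of all permutations of the index sequence $[1, \dots, T]$, $z_t$ is the $t$-th element of a permutation $\mathbf{z}$, and $\mathbf{z}_{<t}$ is its first $t-1$ elements.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

The parameters $\theta$ are shared across all orders, so in expectation each position is trained with context drawn from both its left and its right.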

Load-bearing premise

Averaging the likelihood over all permutations of the factorization order teaches effective bidirectional context without new optimization problems or sampling biases.

What would settle it

Train an XLNet variant that uses only one fixed factorization order instead of averaging over permutations and measure whether its advantage over BERT disappears on the reported tasks.
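The contrast behind this proposed experiment can be sketched by counting which (target, context) position pairs a model is ever trained on. This is a toy illustration under stated assumptions: `covered_pairs` is a hypothetical helper, and the count ignores everything about the real model except which positions are visible.

```python
from itertools import permutations

def covered_pairs(order):
    """Return the (target, context) position pairs that one
    factorization order trains on: a target may condition on
    every position that precedes it in the order."""
    return {(order[t], c) for t in range(len(order)) for c in order[:t]}

T = 3
fixed = covered_pairs((0, 1, 2))           # single left-to-right order
all_orders = set()
for order in permutations(range(T)):       # every factorization order
    all_orders |= covered_pairs(order)

print(sorted(fixed))       # [(1, 0), (2, 0), (2, 1)]
print(len(all_orders))     # 6
```

With one fixed order, only half of the ordered pairs are covered. Across all orders every pair appears, which is the sense in which the permutation objective supplies bidirectional context.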

Original abstract

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes XLNet, a generalized autoregressive pretraining method that maximizes the expected likelihood over all permutations of the factorization order to enable bidirectional context modeling. It addresses BERT's mask dependency and pretrain-finetune discrepancy while integrating Transformer-XL's segment recurrence and relative positional encodings. Under comparable settings, XLNet is shown to outperform BERT on 20 downstream tasks spanning question answering, natural language inference, sentiment analysis, and document ranking.

Significance. If the central empirical claims hold after verification of controls, the work would be significant as a new pretraining paradigm that retains autoregressive tractability while achieving bidirectional context. The explicit permutation objective and its integration with established components like Transformer-XL provide a clear alternative to denoising autoencoders, with potential for broader adoption in language model pretraining.

major comments (2)

§3.2 (Two-stream self-attention): The mechanism is presented as preventing position-content leakage, yet the paper provides no formal argument or targeted ablation demonstrating that the query stream fully isolates content from positional information across all sampled permutations. This is load-bearing for the claim that gains derive from the permutation objective rather than architectural side effects.
§4.1 (Experimental setup): The number of permutations sampled per sequence during training is described only at a high level; without an ablation relating sample count to downstream performance, it remains unclear whether the Monte Carlo approximation is sufficient to realize the full expected-likelihood objective or whether reported gains partly reflect sampling bias or the two-stream/Transformer-XL additions.

minor comments (2)

Table 1 and associated text: clarify whether all baselines use identical data splits and hyperparameter budgets; a single additional column reporting matched hyperparameter counts would strengthen comparability.
Figure 2: the legend for the two-stream attention diagram is underspecified; explicitly label the query and content streams in the caption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below with clarifications and indicate where we will revise the manuscript to incorporate additional explanations and analyses.

Point-by-point responses

Referee: §3.2 (Two-stream self-attention): The mechanism is presented as preventing position-content leakage, yet the paper provides no formal argument or targeted ablation demonstrating that the query stream fully isolates content from positional information across all sampled permutations. This is load-bearing for the claim that gains derive from the permutation objective rather than architectural side effects.

Authors: We appreciate the referee highlighting this aspect of the two-stream self-attention. The query stream is constructed so that its representation at position z_t attends exclusively to the content representations of the preceding positions in the permutation (z_1 to z_{t-1}), using the attention mask defined in Equations (3)–(4) together with relative positional encodings. This ensures the query never accesses the content embedding of the token being predicted, thereby preventing leakage while still incorporating order information. Although the original manuscript did not supply a standalone formal isolation argument or a dedicated ablation, the separation follows directly from the stream-specific parameterization and masking. In the revision we will expand Section 3.2 with a concise derivation showing the conditional independence property and add a targeted ablation (in the appendix) that compares the full two-stream model against a content-leaking variant on a representative task.
revision: yes
Referee: §4.1 (Experimental setup): The number of permutations sampled per sequence during training is described only at a high level; without an ablation relating sample count to downstream performance, it remains unclear whether the Monte Carlo approximation is sufficient to realize the full expected-likelihood objective or whether reported gains partly reflect sampling bias or the two-stream/Transformer-XL additions.

Authors: We agree that the description of the Monte Carlo approximation in §4.1 is high-level. In our implementation we draw a fixed number of permutations (K = 6) per sequence to approximate the expectation; this choice is stated in the experimental details but without sensitivity analysis. To demonstrate that the approximation is adequate and that gains are not artifacts of sampling bias or the auxiliary architectural components, we will add an ablation that varies K (1, 3, 6, 12) while holding the two-stream attention and Transformer-XL recurrence fixed, reporting downstream performance on a subset of tasks. The results will be included in the revised experimental section or appendix.
revision: yes
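The masking rule described in the response on §3.2 can be sketched as two boolean attention masks built from one factorization order. This is a simplified illustration, not the authors' implementation: the function name `two_stream_masks` is hypothetical, and the sketch leaves out relative positional encodings and Transformer-XL memory.

```python
def two_stream_masks(order):
    """Build boolean attention masks for one factorization order.

    order[t] is the position predicted at step t. mask[i][j] is True
    when position i may attend to the content of position j.
    """
    n = len(order)
    rank = [0] * n
    for step, pos in enumerate(order):
        rank[pos] = step  # step at which each position is predicted

    # Content stream: sees positions at or before itself in the order.
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: sees strictly earlier positions only, never itself.
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query

content, query = two_stream_masks([2, 0, 3, 1])
for row in query:
    print(["x" if allowed else "." for allowed in row])
```

The two masks differ only on the diagonal. The query stream for a position never sees that position's own content, which is the leakage the referee asks about, while the content stream does.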

Circularity Check

0 steps flagged

Minor self-citation to Transformer-XL architecture; core permutation LM objective is independently defined and evaluated on external tasks

full rationale

The paper defines its central contribution as maximizing the expected log-likelihood over all permutations of the factorization order, an explicit new objective that is not constructed from or equivalent to any fitted parameter or prior result within the paper. Performance is measured against external benchmarks (20 tasks including QA and NLI) rather than reducing to internal fits. Integration of Transformer-XL ideas is cited for the autoregressive backbone and memory mechanism, but this is an architectural choice whose contribution is separable from the permutation objective and does not bear the load of the bidirectional-context claim. No self-definitional loop, fitted-input-as-prediction, or uniqueness theorem imported from the same authors appears in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard transformer attention mechanism and the assumption that uniform sampling over permutations yields an unbiased estimator of bidirectional dependencies; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

permutation sampling distribution
The paper must choose how to sample or enumerate permutations; this choice is a modeling decision that affects training dynamics.
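How much the sampling choice matters can be illustrated on a toy problem where the exact expectation over all orders is still computable. Everything here is an assumption made for the sketch: `toy_log_prob` is a hand-written scoring function standing in for a trained model's conditional log-probability, and orders are drawn uniformly at random.

```python
import itertools
import math
import random

def toy_log_prob(target, context):
    """Hypothetical stand-in for log p(x_target | x_context).
    A token is easier to predict when its neighbours are visible."""
    neighbours = sum(1 for c in context if abs(c - target) == 1)
    return math.log(0.2 + 0.3 * neighbours)

def order_log_likelihood(order):
    """Sum of conditional log-probabilities along one factorization order."""
    return sum(toy_log_prob(order[t], order[:t]) for t in range(len(order)))

def exact_expectation(n):
    """Average over all n! orders (feasible only for tiny n)."""
    orders = list(itertools.permutations(range(n)))
    return sum(order_log_likelihood(o) for o in orders) / len(orders)

def monte_carlo_estimate(n, k, rng):
    """Average over k orders drawn uniformly at random."""
    total = 0.0
    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)
        total += order_log_likelihood(order)
    return total / k

rng = random.Random(0)
exact = exact_expectation(5)
for k in (1, 10, 100, 1000):
    # Print the estimation error for an increasing number of sampled orders.
    print(k, round(monte_carlo_estimate(5, k, rng) - exact, 3))
```

Uniform sampling gives an unbiased estimate of the average, so the error shrinks as more orders are drawn. A non-uniform sampling distribution would instead converge to a different weighted objective.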

axioms (1)

domain assumption: The attention mechanism in Transformer-XL can be applied to arbitrary factorization orders without architectural change.
The integration of Transformer-XL is invoked to handle long-range dependencies under permutation orders.

pith-pipeline@v0.9.0 · 5701 in / 1219 out tokens · 31002 ms · 2026-05-18T01:24:35.604794+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.PhiForcing · phi_equation
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "XLNet... enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order"

IndisputableMonolith.Foundation.DimensionForcing · eight_tick_forces_D3
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "XLNet integrates ideas from Transformer-XL... into pretraining"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
cs.CL 2019-08 unverdicted novelty 8.0

Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...

SecureRouter: Encrypted Routing for Efficient Secure Inference
cs.CR 2026-04 unverdicted novelty 7.0

SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

GraphCodeBERT: Pre-training Code Representations with Data Flow
cs.SE 2020-09 accept novelty 7.0

GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
cs.CL 2019-10 accept novelty 7.0

BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 6.0

ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

CodeBERT: A Pre-Trained Model for Programming and Natural Languages
cs.CL 2020-02 unverdicted novelty 6.0

CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

How Much Knowledge Can You Pack Into the Parameters of a Language Model?
cs.CL 2020-02 accept novelty 6.0

Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.

Compressive Transformers for Long-Range Sequence Modelling
cs.LG 2019-11 unverdicted novelty 6.0

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

HuggingFace's Transformers: State-of-the-art Natural Language Processing
cs.CL 2019-10 accept novelty 6.0

Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

RoBERTa: A Robustly Optimized BERT Pretraining Approach
cs.CL 2019-07 accept novelty 5.0

With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
