The Falcon Series of Open Language Models
Pith reviewed 2026-05-16 09:42 UTC · model grok-4.3
The pith
Falcon-180B, trained on over 3.5 trillion tokens of predominantly web-derived data, nears PaLM-2-Large performance at lower pretraining and inference cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Falcon-180B significantly outperforms models such as PaLM or Chinchilla, improves upon concurrently developed models such as LLaMA 2 or Inflection-1, and nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it one of the three best language models in the world along with GPT-4 and PaLM-2-Large.
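For scale, a back-of-the-envelope compute estimate using the common C ≈ 6·N·D approximation for dense decoder-only pretraining (a sketch derived here for context; these figures are not reported in the paper):

```python
# Rough compute estimate for Falcon-180B via the common C ~= 6 * N * D
# approximation for dense decoder-only transformers. Illustrative only;
# these are not figures reported by the authors.
N = 180e9        # parameters
D = 3.5e12       # pretraining tokens
train_flops = 6 * N * D          # ~3.8e24 FLOPs for the full run
infer_flops_per_token = 2 * N    # ~3.6e11 FLOPs per generated token

print(f"training compute  ~ {train_flops:.2e} FLOPs")
print(f"inference compute ~ {infer_flops_per_token:.2e} FLOPs/token")
```

The "reduced cost" comparison to PaLM-2-Large cannot be made concrete this way, since PaLM-2-Large's parameter count and token budget are not public.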
What carries the argument
Causal decoder-only transformer models trained on diverse high-quality web corpora using a custom distributed training codebase that scales efficiently to 4,096 A100 GPUs on cloud infrastructure with limited interconnect.
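For orientation, a minimal causal decoder-only language model in PyTorch (a generic sketch of this architecture class with toy dimensions; it deliberately omits Falcon-specific design choices such as its attention variant, positional encoding, and block layout, which the paper documents):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm causal decoder block: masked self-attention followed by an MLP."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True entries are blocked, so position t sees only positions <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.ln2(x))

class TinyCausalLM(nn.Module):
    """Embedding -> stacked decoder blocks -> LM head (positional encoding omitted for brevity)."""
    def __init__(self, vocab=32000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(DecoderBlock(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, ids):                      # ids: (batch, seq)
        x = self.embed(ids)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))           # next-token logits

logits = TinyCausalLM()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)                              # torch.Size([1, 16, 32000])
```

The custom distributed-training machinery that carries the 4,096-GPU claim (parallelism and sharding strategies) is described in the paper and not sketched here.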
If this is right
- Open release of the 600B-token web data extract and the models under a permissive license enables community replication and extension of the training approach.
- Custom distributed training on limited-interconnect cloud hardware demonstrates a practical path for large-scale pretraining without specialized clusters.
- High performance from filtered web data indicates that scale and quality curation can substitute for exclusive data sources in building competitive models.
- Lower inference cost relative to peers supports broader deployment of near-frontier capabilities in open settings.
Where Pith is reading between the lines
- The emphasis on web-data filtering suggests that careful curation may matter more than proprietary data sources for frontier-level performance.
- Releasing both models and training data at this scale could accelerate independent verification of scaling laws in open environments.
- Efficiency gains on commodity cloud hardware might lower barriers for academic or smaller-team reproduction of similar models.
Load-bearing premise
The reported benchmark results reflect genuine capability gains rather than differences in evaluation protocols, data contamination, or undisclosed advantages in testing conditions.
What would settle it
Independent re-evaluation of Falcon-180B on the same benchmarks as PaLM-2-Large, using identical protocols and explicit checks for data overlap or contamination, would show whether performance truly nears that level.
Original abstract
We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Falcon series of open causal decoder-only language models (7B, 40B, and 180B parameters) trained on up to 3.5 trillion tokens of predominantly web-derived data. It details a custom distributed training framework enabling efficient pretraining on up to 4,096 A100 GPUs with limited interconnect, reports benchmark results claiming Falcon-180B outperforms PaLM and Chinchilla while approaching PaLM-2-Large at lower cost, and releases the models plus a 600B-token data extract under a permissive license.
Significance. If the benchmark comparisons prove robust, the work would be significant for open LLM research by documenting one of the largest openly detailed pretraining runs, providing a competitive 180B model, and releasing tooling and data that could accelerate reproducible scaling studies and reduce dependence on closed models.
major comments (2)
- [Evaluation] Evaluation section (main results tables): the direct comparisons to closed models such as PaLM-2-Large and GPT-4 do not specify the exact few-shot templates, answer normalization procedures, or decontamination filters applied to the baselines. This detail is load-bearing for the central claim that Falcon-180B 'nears the performance of PaLM-2-Large' given the web-crawled training corpus.
- [Data] Data section (corpus construction): while the 3.5T-token web corpus is described at a high level, the manuscript provides no quantitative overlap statistics or explicit decontamination pipeline for standard benchmarks (MMLU, HellaSwag, etc.). Without these, the reported gains cannot be confidently attributed to capability rather than leakage.
minor comments (2)
- [Training] Figure captions in the training infrastructure section could more clearly label scaling curves with exact token counts and hardware configurations for reproducibility.
- [Abstract] The abstract's phrasing 'one of the three best language models in the world' is subjective; a more precise qualifier such as 'among the highest-performing openly documented models' would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps strengthen the clarity and rigor of our work. We address each major comment below and will revise the manuscript to incorporate additional details where appropriate.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section (main results tables): the direct comparisons to closed models such as PaLM-2-Large and GPT-4 do not specify the exact few-shot templates, answer normalization procedures, or decontamination filters applied to the baselines. This detail is load-bearing for the central claim that Falcon-180B 'nears the performance of PaLM-2-Large' given the web-crawled training corpus.
Authors: We agree that explicit specification of evaluation protocols is essential for reproducibility and fair comparison. While high-level descriptions appear in the evaluation section and appendix, we will expand the main text in the revised manuscript to list the precise few-shot templates, answer normalization procedures (e.g., log-likelihood vs. probability normalization), and decontamination filters applied to all baselines including PaLM-2-Large and GPT-4. This will directly support the performance claims. revision: yes
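The normalization choice the authors reference can change multiple-choice rankings; a minimal sketch of raw vs. length-normalized log-likelihood scoring, using a small generic Hugging Face causal LM as a stand-in (this is not the paper's evaluation harness, and the model name is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # illustrative stand-in; any causal LM checkpoint works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def option_scores(context: str, option: str):
    """Return (sum log p(option | context), per-token average), teacher-forced."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    opt_ids = tok(option, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, opt_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # prediction for token i+1 sits at row i
    start = ctx_ids.size(1) - 1
    target = ids[0, ctx_ids.size(1):]
    ll = logprobs[start:start + target.size(0)].gather(1, target[:, None]).sum().item()
    return ll, ll / target.size(0)                          # raw vs. length-normalized

context = "Question: What is the capital of France?\nAnswer:"
for option in [" Paris", " Marseille"]:
    raw, norm = option_scores(context, option)
    print(f"{option!r}: loglik={raw:.2f}, per-token={norm:.2f}")
```

Whether the raw sum, a per-token average, or a per-character normalization is used, and whether the same convention was applied to the PaLM-2-Large numbers, is exactly the protocol detail the referee asks to have pinned down.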
-
Referee: [Data] Data section (corpus construction): while the 3.5T-token web corpus is described at a high level, the manuscript provides no quantitative overlap statistics or explicit decontamination pipeline for standard benchmarks (MMLU, HellaSwag, etc.). Without these, the reported gains cannot be confidently attributed to capability rather than leakage.
Authors: We acknowledge the value of quantitative decontamination evidence. In the revision, we will add a dedicated subsection detailing our decontamination pipeline (including n-gram overlap filtering against common benchmarks) and report overlap statistics (e.g., 13-gram contamination rates) for MMLU, HellaSwag, and similar suites. The released 600B-token data extract will further enable independent verification, allowing readers to confirm that gains reflect capability rather than leakage. revision: yes
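A minimal sketch of the kind of n-gram overlap check described here (the 13-gram window comes from the response above; the whitespace tokenization and the any-collision rule are simplifying assumptions, not the paper's exact pipeline):

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lower-cased, whitespace-tokenized n-grams; real pipelines normalize more aggressively."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[Tuple[str, ...]]:
    index: Set[Tuple[str, ...]] = set()
    for t in benchmark_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(train_doc: str, index: Set[Tuple[str, ...]], n: int) -> bool:
    """Flag a training document if any of its n-grams collides with a benchmark n-gram."""
    return not ngrams(train_doc, n).isdisjoint(index)

# Toy usage with n=8 only because the example strings are short; 13 is the window cited above.
bench = ["which of the following is the powerhouse of the cell a mitochondria b nucleus c ribosome"]
idx = build_benchmark_index(bench, n=8)
print(is_contaminated("students often ask which of the following is the powerhouse of the cell", idx, n=8))  # True
print(is_contaminated("falcon was pretrained on filtered and deduplicated web text", idx, n=8))               # False
```

A reported contamination rate would then be the fraction of benchmark items (or training documents) flagged by such a check.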
Circularity Check
No circularity: purely empirical training and benchmark reporting
full rationale
The paper reports the training of decoder-only models on a 3.5T-token web corpus and their benchmark scores against external models. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct training runs and standard benchmark comparisons rather than any step that reduces by construction to inputs defined inside the paper. No self-citation chain, ansatz smuggling, or renaming of known results is present. The work is self-contained as an empirical description.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Causal decoder-only transformer architecture supports next-token prediction at scale
- [domain assumption] High-quality web data filtered appropriately yields capable language models
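The second assumption leans on the data-curation pipeline; a toy sketch of the document-level heuristic filtering that web-curation pipelines of this kind typically apply (the specific rules and thresholds below are invented for illustration and are not the paper's values):

```python
import re

def passes_heuristic_filters(doc: str) -> bool:
    """Toy document-level quality filter in the spirit of web-data curation.
    Every threshold here is an illustrative placeholder, not a Falcon value."""
    words = doc.split()
    if len(words) < 50:                                    # too short to carry useful signal
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):                     # gibberish or machine-generated dumps
        return False
    symbol_ratio = len(re.findall(r"[#{}<>|]", doc)) / max(len(doc), 1)
    if symbol_ratio > 0.05:                                # markup / boilerplate residue
        return False
    lines = [l for l in doc.splitlines() if l.strip()]
    if lines and sum(l.strip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False                                       # truncated listing / preview pages
    return True

print(passes_heuristic_filters("plain sentence " * 60))                 # True: long, ordinary prose
print(passes_heuristic_filters("{menu} <nav> | login | signup " * 40))  # False: markup-heavy chrome
```

Deduplication (exact and fuzzy) is the other half of such pipelines and is not shown here.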
Forward citations
Cited by 20 Pith papers
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
-
From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums
A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despi...
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
-
Language models recognize dropout and Gaussian noise applied to their activations
Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
-
Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting
CoT2Edit trains LLMs to reason over edited knowledge using agent-generated CoTs, SFT, GRPO, and RAG, achieving generalization across six editing scenarios on three models.
-
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference
SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...
-
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.
-
Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems
An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.
-
SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs
SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball...
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
InternLM2 Technical Report
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
-
AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval
AgriIR is a configurable RAG framework using modular stages and 1B-parameter models to deliver grounded, citable answers for Indian agricultural information access.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.