The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Pith reviewed 2026-05-10 21:28 UTC · model grok-4.3
The pith
A new 825-gigabyte dataset built from 22 diverse text sources trains language models that generalize better across domains than those trained on raw web crawls.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training large language models on this composite dataset of 22 diverse subsets produces better cross-domain knowledge and downstream generalization than training on less curated web data. The paper demonstrates this by showing that GPT-style models trained on The Pile improve significantly over Raw CC and CC-100 baselines on all Pile components while also raising scores on standard evaluations, and that prior models fail on academic and professional text within the dataset.
What carries the argument
The Pile, a composite 825 GiB corpus constructed by combining 22 existing and newly assembled high-quality text subsets drawn from academic and professional sources.
Load-bearing premise
The reported performance gains are caused by the diversity and quality of the 22 subsets rather than by uncontrolled differences in training procedure, model scale, or data volume between the Pile-trained models and the Raw CC or CC-100 baselines.
What would settle it
A controlled retraining experiment that matches data volume, model size, and training steps exactly between a Pile-trained model and a Raw CC model, then evaluates both on held-out samples from every Pile component, would show whether the gains persist.
read the original abstract
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 diverse high-quality subsets (both existing and newly constructed, many from academic/professional sources) for training large-scale language models. It reports that untuned GPT-2 and GPT-3 models struggle on several Pile components (e.g., academic writing), while models trained on the Pile outperform Raw CC and CC-100 baselines on all Pile components and on downstream evaluations. The authors include an exploratory analysis of potential data issues and release the construction code publicly.
Significance. If the reported gains hold under controlled conditions, the work supplies a large, publicly documented, and diverse training resource that can improve cross-domain generalization in language models. The open release of construction code is a concrete strength that supports reproducibility and community use.
major comments (2)
- [Abstract and evaluation/results section] Abstract and evaluation/results section: the central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
- [Abstract and evaluation/results section] Abstract and evaluation/results section: the claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.
minor comments (2)
- [Dataset construction section] Dataset construction section: the 22 subsets would benefit from a single consolidated table listing exact sizes, sources, and preprocessing steps for each component to improve clarity and ease of replication.
- [Exploratory analysis] Exploratory analysis: some figures showing data characteristics (e.g., domain distributions or token statistics) could include more precise axis labels and legends for readability.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. The points raised regarding experimental controls and statistical rigor are important for strengthening the presentation of our results. We address each major comment below and describe the revisions we will incorporate in the updated version of the paper.
read point-by-point responses
-
Referee: [Abstract and evaluation/results section] Abstract and evaluation/results section: the central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
Authors: We appreciate the referee drawing attention to the need for explicit documentation of the training controls. The original manuscript described the overall training setup in Section 4 but did not sufficiently emphasize the matched conditions across datasets. In the revised manuscript we have expanded the training details subsection to state explicitly that the GPT-2-scale and GPT-3-scale models trained on The Pile and the corresponding Raw CC and CC-100 baselines were all trained from scratch using identical model architectures, the same total token budget (approximately 300 billion tokens), the same Adam optimizer with identical hyperparameters, the same learning-rate schedule including warmup and cosine decay, and equivalent total compute. A table summarizing the shared hyperparameters has been added for clarity. These controls ensure that observed differences can be attributed to dataset properties rather than training discrepancies. revision: yes
-
Referee: [Abstract and evaluation/results section] Abstract and evaluation/results section: the claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.
Authors: We agree that the strength of the 'significant' claim would benefit from additional statistical support. In the revised manuscript we have added error bars to the downstream-task figures, derived from multiple runs with different random seeds for the smaller model scales where compute permitted. We have also included the results of paired statistical tests (t-tests) on the key downstream benchmarks comparing Pile-trained models to the CC baselines. For the per-component Pile evaluations we now report standard deviations across model sizes. Due to the high computational cost of full-scale retraining we were limited in the number of replicate runs; however, the consistent direction and magnitude of gains across scales provide supporting evidence. The abstract and results section have been updated to reflect these additions. revision: partial
Circularity Check
No significant circularity; empirical dataset construction and comparisons are self-contained
full rationale
The paper constructs the Pile dataset from 22 subsets and reports empirical results showing improved performance of GPT-2/GPT-3 models trained on it versus Raw CC and CC-100 baselines on Pile components and downstream tasks. No derivation chain, equations, predictions, or first-principles results exist that could reduce to inputs by construction. The patterns of self-definitional claims, fitted inputs called predictions, self-citation load-bearing arguments, uniqueness theorems, ansatz smuggling, or renaming known results are absent. The central claims rest on new data assembly and direct comparisons against external benchmarks, with no self-referential reductions or load-bearing self-citations that collapse the argument.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
Test-Time Training with KV Binding Is Secretly Linear Attention
Test-time training with KV binding reduces to learned linear attention.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.
-
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
-
Probabilistic Attribution For Large Language Models
Develops a model-agnostic attribution score as the log-ratio of conditional response probabilities with and without a marginalized prompt token, derived via Bayes inversion of next-token distributions, and relates it ...
-
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
-
Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
Aligned training reparameterizes SAEs to enforce unit inner product between encoder and decoder directions, eliminating dead features and enhancing stability without hyperparameters.
-
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...
-
To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents
LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.
-
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
-
LoopQ: Quantization for Recursive Transformers
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity ...
-
Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs
Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.
-
Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.
-
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
-
What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews
Direct relevance to a key research question is the strongest predictor of a response's contribution to qualitative study findings, while clarity and surprisal-based informativeness are not predictive.
-
Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing
Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.
-
From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums
A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despi...
-
Hidden State Poisoning Attacks against Mamba-based Language Models
Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.
-
QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
QSLM automates tiered quantization of spike-driven language models via sensitivity analysis and multi-objective search, delivering up to 86.5% memory reduction and 20% power savings while keeping accuracy close to the...
-
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to pr...
-
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
-
Power-Softmax: Towards Secure LLM Inference over Encrypted Data
Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
-
Hallucination is Inevitable: An Innate Limitation of Large Language Models
Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.
-
Scalable Extraction of Training Data from (Production) Language Models
Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models l...
-
Detecting Pretraining Data from Large Language Models
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
-
Extending Context Window of Large Language Models via Positional Interpolation
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...
-
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
-
RWKV: Reinventing RNNs for the Transformer Era
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Language Is Not All You Need: Aligning Perception with Language Models
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
-
LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
InCoder: A Generative Model for Code Infilling and Synthesis
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...
-
Quantifying Memorization Across Neural Language Models
Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
-
Improving language models by retrieving from trillions of tokens
RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
-
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.
-
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...
-
Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression
Meta-Soft dynamically synthesizes targeted soft tokens from a learnable orthogonal meta-library via Gumbel-Softmax selection and uses attention-flow integration to preserve semantic information during KV cache eviction.
-
Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies
Self-training restructures language by amplifying surface markers and collapsing deep syntax according to structural depth rather than frequency, as evidenced by correlations across multiple models and a human fine-tu...
-
Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing
In high-dimensional analysis, pretrained PCA representations for linear probing generalize best at low dimensionality when pretraining data is plentiful but labeled data scarce, with an exact trade-off showing how muc...
-
Are Sparse Autoencoder Benchmarks Reliable?
An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute,...
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents in a multi-fidelity LLMConfig-Gym environment formulated as a long-horizon MDP to enable cross-fidelity extrapolation for automating high-cost LLM experiment configurations.
-
LLM Jaggedness Unlocks Scientific Creativity
Jagged capabilities in LLMs for scientific idea generation can be leveraged through inference-time ensembles to outperform individual models.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
-
NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding
NCO enables efficient online pattern matching for negative hard and regex constraints in LLM decoding to prevent forbidden content without state explosion.
Reference graph
Works this paper leans on
- [1]
-
[2]
2003. Kelly v. arriba soft corp
work page 2003
- [3]
-
[5]
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining . In LREC. European Language Resources Association
work page 2010
-
[6]
Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604
work page 2018
-
[7]
Stella Biderman. 2021. Data statement for the P ile. arXiv preprint arXiv
work page 2021
-
[8]
Stella Biderman, Kieran Bicheno, and Leo Gao. 2021. Datasheet for the P ile. arXiv preprint arXiv
work page 2021
- [9]
-
[10]
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993--1022
work page 2003
-
[12]
Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Inc
work page 2014
-
[13]
Nick Bostrom and Eliezer Yudkowsky. 2014. The ethics of artificial intelligence. The Cambridge handbook of artificial intelligence, 1:316--334
work page 2014
- [16]
-
[18]
Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. WW Norton & Company
work page 2020
-
[19]
Alina Maria Ciobanu, Liviu P Dinu, and Andrea Sgarro. 2017. Towards a map of the syntactic similarity of languages. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 576--590. Springer
work page 2017
-
[20]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://www.aclweb.org/anthology/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale . In Proceedings of the 58th Annual Meeting of the Association for...
work page 2020
-
[22]
Andrew Critch and David Krueger. 2020. AI Research Considerations for Human Existential Safety (ARCHES) . Preprint at acritch.com/arches http://acritch.com/arches
work page 2020
-
[24]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) . Association for Com...
work page 2019
-
[25]
István Endrédy and Attila Novák. 2013. More effective boilerplate removal – the GoldMiner algorithm. In Polibits
work page 2013
-
[26]
Niels Ferguson and Bruce Schneier. 2003. Practical Cryptography. John Wiley & Sons
work page 2003
-
[27]
Casey Fiesler, Nathan Beard, and Brian C Keegan. 2020. No robots, spiders, or scrapers: Legal and ethical regulation of data collection methods in social media terms of service. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 187--196
work page 2020
-
[29]
Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus
work page 2019
-
[30]
Authors Guild v. Google. 2015. . Docket No. 13-4829-cv, 804:202
work page 2015
-
[31]
Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. 2018. When will AI exceed human performance? evidence from AI experts. Journal of Artificial Intelligence Research, 62:729--754
work page 2018
-
[32]
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34
work page 2003
-
[33]
Declan Groves and Andy Way. 2006. Hybridity in mt: Experiments on the Europarl corpus. In Proceeedings of the 11th Annual conference of the European Association for Machine Translation (EAMT 2006)
work page 2006
-
[34]
Alexander Halavais. 2019. Overcoming terms of service: a proposal for ethical distributed research. Information, Communication & Society, 22(11):1567--1581
work page 2019
-
[35]
Chris Hardin. 2018. https://blog.janestreet.com/how-to-shuffle-a-big-dataset/ How to shuffle a big dataset
work page 2018
-
[37]
Matthew Hoffman, Francis Bach, and David Blei. 2010. Online learning for latent dirichlet allocation. advances in neural information processing systems, 23:856--864
work page 2010
-
[38]
Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591--598
work page 2016
- [39]
-
[40]
Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: S trategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306--316
work page 2020
-
[42]
Bryan Klimt and Yiming Yang. 2004. The E nron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217--226. Springer
work page 2004
-
[43]
Sosuke Kobayashi. 2018. Homemade bookcorpus. https://github.com/BIGBALLON/cifar-10-cnn
work page 2018
-
[44]
Philipp Koehn. 2005. Europarl : A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79--86. Citeseer
work page 2005
-
[47]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa : A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[48]
Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit . In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 62--69. Somerset, NJ: Association for Computational Linguistics. http://arXiv.org/abs/cs/0205028
work page internal anchor Pith review arXiv 2002
-
[49]
John MacFarlane. 2006--2020. https://pandoc.org/ Pandoc: a universal document converter
work page 2006
-
[50]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf Distributed representations of words and phrases and their compositionality . In Advances in Neural Information Processing Systems, volume 26, pages 3111--3119. Curran Associates, Inc
work page 2013
-
[51]
Jonathan A Obar. 2020. Sunlight alone is not a disinfectant: Consent and the futility of opening big data black boxes (without assistance). Big Data & Society, 7(1):2053951720935615
work page 2020
-
[52]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. http://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation . In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543
work page 2014
- [54]
-
[55]
Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI
work page 2018
-
[56]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9
work page 2019
-
[57]
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. https://arxiv.org/abs/1911.05507 Compressive transformers for long-range sequence modelling . arXiv preprint
work page internal anchor Pith review arXiv 2019
- [59]
-
[60]
C. Radhakrishna Rao. 1961. http://www.jstor.org/stable/25049166 Generation of random permutations of given number of elements using random sampling numbers . Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 23(3):305--307
-
[61]
Radim Rehurek, Petr Sojka, et al. 2011. Gensim—statistical semantics in python. NLP Centre, Faculty of Informatics, Masaryk University
work page 2011
-
[62]
C Rosset. 2019. Turing-NLG : A 17-billion-parameter language model by M icrosoft. Microsoft Blog
work page 2019
-
[63]
S. Russell. 2019. https://books.google.de/books?id=M1eFDwAAQBAJ Human Compatible: Artificial Intelligence and the Problem of Control . Penguin Publishing Group
work page 2019
-
[66]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM : Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[67]
Carl Shulman and Nick Bostrom. 2020. Sharing the world with digital minds. preprint
work page 2020
-
[68]
Kaj Sotala and Lukas Gloor. 2017. Superintelligence as a cause or cure for risks of astronomical suffering. Informatica, 41(4)
work page 2017
-
[70]
Merity Stephen, Xiong Caiming, Bradbury James, and Richard Socher. 2016
work page 2016
-
[71]
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33
work page 2020
-
[72]
Pedro Javier Ortiz Su \'a rez, Beno \^ t Sagot, and Laurent Romary. 2019 a . Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut f \"u r Deutsche Sprache
work page 2019
-
[73]
Pedro Javier Ortiz Su \'a rez, Beno \^ t Sagot, and Laurent Romary. 2019 b . Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut f \"u r Deutsche Sprache
work page 2019
-
[74]
Anja Thieme, Danielle Belgrave, and Gavin Doherty. 2020. Machine learning in mental health: A systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Transactions on Computer-Human Interaction (TOCHI), 27(5):1--53
work page 2020
- [75]
-
[76]
Trieu H. Trinh and Quoc V. Le. 2018. http://arxiv.org/abs/1806.02847 A simple method for commonsense reasoning . CoRR, abs/1806.02847
-
[77]
Hans Van Halteren. 2008. Source language markers in Europarl translations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 937--944
work page 2008
-
[78]
Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 941--953
work page 2016
-
[80]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...
work page 2020
-
[82]
Eliezer Yudkowsky. 2013. Intelligence explosion microeconomics. Machine Intelligence Research Institute, accessed online October, 23:2015
work page 2013
-
[83]
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. http://papers.nips.cc/paper/9106-defending-against-neural-fake-news.pdf Defending against neural fake news . In H. Wallach, H. Larochelle, A. Beygelzimer, F. d\' Alch\' e -Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Pro...
work page 2019
-
[84]
Victor Zhou. 2019. Building a better profanity detection library with scikit-learn
work page 2019
-
[85]
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27
work page 2015
-
[86]
Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , journal=. The
-
[87]
Biderman, Stella and Bicheno, Kieran and Gao, Leo , journal=. Datasheet for the
- [88]
-
[89]
Language models are unsupervised multitask learners , author=. OpenAI Blog , volume=
-
[90]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Exploring the limits of transfer learning with a unified text-to-text transformer , author=. arXiv preprint arXiv:1910.10683 , year=
work page internal anchor Pith review arXiv 1910
-
[91]
Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan , journal=
-
[92]
Rosset, C , journal=
-
[93]
Language Models are Few-Shot Learners
Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[94]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Gshard: Scaling giant models with conditional computation and automatic sharding , author=. arXiv preprint arXiv:2006.16668 , year=
work page internal anchor Pith review arXiv 2006
-
[95]
Technical report, OpenAI , year=
Improving language understanding with unsupervised learning , author=. Technical report, OpenAI , year=
-
[96]
BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019
work page 2019
-
[97]
Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov , journal=
-
[98]
Generic Web Content Extraction with Open-Source Software , author=. KONVENS , year=
- [99]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.