pith. machine review for the scientific record.

arxiv: 2101.00027 · v1 · submitted 2020-12-31 · 💻 cs.CL

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Pith reviewed 2026-05-10 21:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords language modeling · dataset construction · diverse text corpus · The Pile · Common Crawl · cross-domain generalization · GPT models

The pith

A new 825 GiB dataset built from 22 diverse text sources trains language models that generalize better across domains than those trained on raw web crawls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 high-quality subsets that include academic papers, books, code, and other professional materials. It shows that existing models such as GPT-2 and GPT-3 struggle on many of these components, especially academic writing. Models trained on The Pile outperform those trained on Raw Common Crawl and CC-100 across every component of the Pile and on downstream evaluations. The authors also release the construction code and document aspects of the data that may concern users.
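The size above is given in binary units (GiB), which are easy to misread as decimal gigabytes. As a small worked conversion, and nothing more than unit arithmetic, the sketch below expresses 825 GiB in GB. The function name `gib_to_gb` is introduced here for illustration only.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)
GB = 10**9   # bytes in one gigabyte (decimal unit)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return size_gib * GIB / GB


pile_gb = gib_to_gb(825)
print(f"825 GiB is about {pile_gb:.1f} GB")  # prints: 825 GiB is about 885.8 GB
```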

Core claim

The central claim is that training large language models on this composite dataset of 22 diverse subsets produces better cross-domain knowledge and downstream generalization than training on less curated web data. The paper demonstrates this by showing that GPT-style models trained on The Pile improve significantly over Raw CC and CC-100 baselines on all Pile components while also raising scores on standard evaluations, and that prior models fail on academic and professional text within the dataset.

What carries the argument

The Pile, a composite 825 GiB corpus constructed by combining 22 existing and newly assembled high-quality text subsets, many of them drawn from academic or professional sources.

Load-bearing premise

The reported performance gains are caused by the diversity and quality of the 22 subsets rather than by uncontrolled differences in training procedure, model scale, or data volume between the Pile-trained models and the Raw CC or CC-100 baselines.

What would settle it

A controlled retraining experiment that matches data volume, model size, and training steps exactly between a Pile-trained model and a Raw CC model, then evaluates both on held-out samples from every Pile component, would show whether the gains persist.
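One way to make "matched exactly" checkable is to record each run's settings and confirm that only the dataset differs. The sketch below is a minimal illustration under assumptions: the `RunConfig` fields, the `mismatched_fields` helper, and every value shown are hypothetical placeholders, not the paper's actual training setup.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one training run (not the paper's setup)."""
    dataset: str
    n_params: int
    train_tokens: int
    train_steps: int
    optimizer: str
    lr_schedule: str


def mismatched_fields(a: RunConfig, b: RunConfig, allowed=("dataset",)):
    """Return names of fields that differ between two runs, ignoring `allowed`."""
    fa, fb = asdict(a), asdict(b)
    return sorted(k for k in fa if k not in allowed and fa[k] != fb[k])


# Placeholder values chosen only for illustration.
pile_run = RunConfig("pile", 125_000_000, 1_000_000, 1_000, "adam", "warmup+cosine")
raw_cc_run = RunConfig("raw_cc", 125_000_000, 1_000_000, 1_000, "adam", "warmup+cosine")
short_run = RunConfig("raw_cc", 125_000_000, 1_000_000, 500, "adam", "warmup+cosine")

print(mismatched_fields(pile_run, raw_cc_run))  # [] : runs differ only in dataset
print(mismatched_fields(pile_run, short_run))   # ['train_steps'] : not a matched pair
```

An empty result means any performance gap between the two runs can be compared on dataset alone; a non-empty result names the confounds that remain.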

Original abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 diverse high-quality subsets (both existing and newly constructed, many from academic/professional sources) for training large-scale language models. It reports that untuned GPT-2 and GPT-3 models struggle on several Pile components (e.g., academic writing), while models trained on the Pile outperform Raw CC and CC-100 baselines on all Pile components and on downstream evaluations. The authors include an exploratory analysis of potential data issues and release the construction code publicly.

Significance. If the reported gains hold under controlled conditions, the work supplies a large, publicly documented, and diverse training resource that can improve cross-domain generalization in language models. The open release of construction code is a concrete strength that supports reproducibility and community use.

major comments (2)
  1. [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
  2. [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.
minor comments (2)
  1. [Dataset construction section] The 22 subsets would benefit from a single consolidated table listing exact sizes, sources, and preprocessing steps for each component to improve clarity and ease of replication.
  2. [Exploratory analysis] Some figures showing data characteristics (e.g., domain distributions or token statistics) could include more precise axis labels and legends for readability.
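
Major comment 2 asks for statistical tests on the downstream comparisons. As a minimal sketch of what such a test could look like, the code below computes a paired t statistic over per-benchmark scores for two models. The helper `paired_t_statistic` and the score lists are hypothetical and are not results from the paper; a library routine such as `scipy.stats.ttest_rel` would additionally return a p-value.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two score lists measured on the same benchmarks."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Standard error of the mean difference (sample standard deviation, n - 1).
    # Raises ZeroDivisionError if every difference is identical.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


# Hypothetical accuracies on four benchmarks, for illustration only.
pile_scores = [0.52, 0.61, 0.48, 0.70]
cc_scores = [0.50, 0.58, 0.47, 0.66]

t_stat = paired_t_statistic(pile_scores, cc_scores)
print(f"t = {t_stat:.2f} with {len(pile_scores) - 1} degrees of freedom")
```

Pairing by benchmark removes between-benchmark variation from the comparison, which is why a paired test is the natural choice when both models are evaluated on the same tasks.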

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. The points raised regarding experimental controls and statistical rigor are important for strengthening the presentation of our results. We address each major comment below and describe the revisions we will incorporate in the updated version of the paper.

Point-by-point responses
  1. Referee: [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.

    Authors: We appreciate the referee drawing attention to the need for explicit documentation of the training controls. The original manuscript described the overall training setup in Section 4 but did not sufficiently emphasize the matched conditions across datasets. In the revised manuscript we have expanded the training details subsection to state explicitly that the GPT-2-scale and GPT-3-scale models trained on The Pile and the corresponding Raw CC and CC-100 baselines were all trained from scratch using identical model architectures, the same total token budget (approximately 300 billion tokens), the same Adam optimizer with identical hyperparameters, the same learning-rate schedule including warmup and cosine decay, and equivalent total compute. A table summarizing the shared hyperparameters has been added for clarity. These controls ensure that observed differences can be attributed to dataset properties rather than training discrepancies.

    revision: yes

  2. Referee: [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.

    Authors: We agree that the strength of the 'significant' claim would benefit from additional statistical support. In the revised manuscript we have added error bars to the downstream-task figures, derived from multiple runs with different random seeds for the smaller model scales where compute permitted. We have also included the results of paired statistical tests (t-tests) on the key downstream benchmarks comparing Pile-trained models to the CC baselines. For the per-component Pile evaluations we now report standard deviations across model sizes. Due to the high computational cost of full-scale retraining we were limited in the number of replicate runs; however, the consistent direction and magnitude of gains across scales provide supporting evidence. The abstract and results section have been updated to reflect these additions.

    revision: partial
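
The first response mentions a learning-rate schedule with warmup and cosine decay. For readers unfamiliar with that shape, here is a minimal sketch of one common form of it. The function `lr_at`, the linear warmup, and all numeric values are assumptions made for illustration; the source does not specify the peak rate, warmup length, or floor.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step`: linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to show the shape of the curve.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at(step, peak_lr=1e-3, warmup_steps=10, total_steps=100))
```

Sharing one such function across all runs is a simple way to guarantee that the schedule is identical between a Pile-trained model and its baselines.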

Circularity Check

0 steps flagged

No significant circularity; empirical dataset construction and comparisons are self-contained

Full rationale

The paper constructs the Pile dataset from 22 subsets and reports empirical results showing improved performance of GPT-2/GPT-3 models trained on it versus Raw CC and CC-100 baselines on Pile components and downstream tasks. No derivation chain, equations, predictions, or first-principles results exist that could reduce to inputs by construction. The patterns of self-definitional claims, fitted inputs called predictions, self-citation load-bearing arguments, uniqueness theorems, ansatz smuggling, or renaming known results are absent. The central claims rest on new data assembly and direct comparisons against external benchmarks, with no self-referential reductions or load-bearing self-citations that collapse the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical dataset construction and benchmarking effort; it introduces no mathematical free parameters, domain axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5484 in / 1150 out tokens · 77761 ms · 2026-05-10T21:28:45.866889+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  2. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  3. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  4. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    cs.CL 2022-02 accept novelty 8.0

    Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

  5. When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

  6. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  7. DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

    cs.CV 2026-05 unverdicted novelty 7.0

    DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.

  8. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  9. fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

    cs.LG 2026-05 conditional novelty 7.0

    fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...

  10. Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

    cs.CR 2026-04 unverdicted novelty 7.0

    Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.

  11. Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

    cs.CR 2026-04 unverdicted novelty 7.0

    A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.

  12. When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

    cs.LG 2026-04 conditional novelty 7.0

    FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

  13. What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

    cs.CL 2026-04 unverdicted novelty 7.0

    Direct relevance to a key research question is the strongest predictor of a response's contribution to qualitative study findings, while clarity and surprisal-based informativeness are not predictive.

  14. Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing

    cs.CL 2026-03 conditional novelty 7.0

    Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.

  15. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  16. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  17. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  18. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  19. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    cs.CL 2024-02 unverdicted novelty 7.0

    LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.

  20. Extending Context Window of Large Language Models via Positional Interpolation

    cs.CL 2023-06 conditional novelty 7.0

    Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...

  21. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  22. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  23. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  24. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  25. Quantifying Memorization Across Neural Language Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

  26. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    cs.CV 2021-11 unverdicted novelty 7.0

    LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.

  27. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.

  28. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.

  29. NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    NCO enables efficient online pattern matching for negative hard and regex constraints in LLM decoding to prevent forbidden content without state explosion.

  30. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  31. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  32. TextLDM: Language Modeling with Continuous Latent Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.

  33. HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

    cs.DC 2026-05 unverdicted novelty 6.0

    HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

  34. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  35. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  36. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  37. Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.

  38. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  39. Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency

    cs.LG 2026-05 unverdicted novelty 6.0

    A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.

  40. NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty

    cs.AI 2026-05 unverdicted novelty 6.0

    NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...

  41. Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

    cs.LG 2026-05 unverdicted novelty 6.0

    Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.

  42. Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

    cs.CL 2026-05 unverdicted novelty 6.0

    Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.

  43. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  44. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N, P, and K tend to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  45. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.

  46. PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

    cs.LG 2026-04 unverdicted novelty 6.0

    PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.

  47. Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

    cs.AI 2026-04 unverdicted novelty 6.0

    Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

  48. Improving Robustness In Sparse Autoencoders via Masked Regularization

    cs.LG 2026-04 unverdicted novelty 6.0

    Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.

  49. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  50. Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

    cs.LG 2026-04 conditional novelty 6.0

    Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.

  51. Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

    cs.LG 2026-03 conditional novelty 6.0

    Nonlinear query projections of the form X + MLP(X) improve transformer performance on small models with only d² + O(d) added parameters.

  52. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  53. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    cs.CL 2024-06 unverdicted novelty 6.0

    FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

  54. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    cs.LG 2024-03 unverdicted novelty 6.0

    Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...

  55. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  56. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  57. Efficient Streaming Language Models with Attention Sinks

    cs.CL 2023-09 accept novelty 6.0

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  58. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  59. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    cs.CL 2023-06 unverdicted novelty 6.0

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  60. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

Reference graph

Works this paper leans on

193 extracted references · 193 canonical work pages · cited by 81 Pith papers · 11 internal anchors

  1. [1]

    Sony corp

    1984. Sony corp. of america v. universal city studios, inc

  2. [2]

    2003. Kelly v. arriba soft corp

  3. [3]

    Righthaven llc v

    2013. Righthaven llc v. hoehn

  4. [5]

    Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining . In LREC. European Language Resources Association

  5. [6]

    Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604

  6. [7]

    Stella Biderman. 2021. Data statement for the P ile. arXiv preprint arXiv

  7. [8]

    Stella Biderman, Kieran Bicheno, and Leo Gao. 2021. Datasheet for the P ile. arXiv preprint arXiv

  8. [9]

    Scheirer

    Stella Biderman and Walter J. Scheirer. 2020. Pitfalls in machine learning research: Reexamining the development cycle. NeurIPS ``I Can't Believe It's Not Better!'' Workshop

  9. [10]

    David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993--1022

  10. [12]

    Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Inc

  11. [13]

    Nick Bostrom and Eliezer Yudkowsky. 2014. The ethics of artificial intelligence. The Cambridge handbook of artificial intelligence, 1:316--334

  12. [16]

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. http://arxiv.org/abs/2012.07805 Extracting training data from large language models

  13. [18]

    Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. WW Norton & Company

  14. [19]

    Alina Maria Ciobanu, Liviu P Dinu, and Andrea Sgarro. 2017. Towards a map of the syntactic similarity of languages. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 576--590. Springer

  15. [20]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://www.aclweb.org/anthology/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for...

  16. [22]

    Andrew Critch and David Krueger. 2020. AI Research Considerations for Human Existential Safety (ARCHES). Preprint at http://acritch.com/arches

  17. [24]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Com...

  18. [25]

    István Endrédy and Attila Novák. 2013. More effective boilerplate removal – the GoldMiner algorithm. In Polibits

  19. [26]

    Niels Ferguson and Bruce Schneier. 2003. Practical Cryptography. John Wiley & Sons

  20. [27]

    Casey Fiesler, Nathan Beard, and Brian C Keegan. 2020. No robots, spiders, or scrapers: Legal and ethical regulation of data collection methods in social media terms of service. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 187--196

  21. [29]

    Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus

  22. [30]

    Authors Guild v. Google. 2015. Docket No. 13-4829-cv, 804:202

  23. [31]

    Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. 2018. When will AI exceed human performance? evidence from AI experts. Journal of Artificial Intelligence Research, 62:729--754

  24. [32]

    David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34

  25. [33]

    Declan Groves and Andy Way. 2006. Hybridity in MT: Experiments on the Europarl corpus. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT 2006)

  26. [34]

    Alexander Halavais. 2019. Overcoming terms of service: a proposal for ethical distributed research. Information, Communication & Society, 22(11):1567--1581

  27. [35]

    Chris Hardin. 2018. https://blog.janestreet.com/how-to-shuffle-a-big-dataset/ How to shuffle a big dataset

  28. [37]

    Matthew Hoffman, Francis Bach, and David Blei. 2010. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23:856--864

  29. [38]

    Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591--598

  30. [39]

    Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. arXiv preprint arXiv:2005.00813

  31. [40]

    Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306--316

  32. [42]

    Bryan Klimt and Yiming Yang. 2004. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217--226. Springer

  33. [43]

    Sosuke Kobayashi. 2018. Homemade bookcorpus. https://github.com/BIGBALLON/cifar-10-cnn

  34. [44]

    Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79--86. Citeseer

  35. [47]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692

  36. [48]

    Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 62--69. Somerset, NJ: Association for Computational Linguistics. http://arXiv.org/abs/cs/0205028

  37. [49]

    John MacFarlane. 2006--2020. Pandoc: a universal document converter. https://pandoc.org/

  38. [50]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26, pages 3111--3119. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf

  39. [51]

    Jonathan A Obar. 2020. Sunlight alone is not a disinfectant: Consent and the futility of opening big data black boxes (without assistance). Big Data & Society, 7(1):2053951720935615

  40. [52]

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543. http://www.aclweb.org/anthology/D14-1162

  41. [54]

    Shawn Presser. 2020. Books3. https://twitter.com/theshawwn/status/1320282149329784833

  42. [55]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI

  43. [56]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9

  44. [57]

    Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. Compressive transformers for long-range sequence modelling. arXiv preprint. https://arxiv.org/abs/1911.05507

  45. [59]

    Inioluwa Deborah Raji and Jingying Yang. 2019. ABOUT ML: Annotation and benchmarking on understanding and transparency of machine learning lifecycles. arXiv preprint arXiv:1912.06166

  46. [60]

    C. Radhakrishna Rao. 1961. Generation of random permutations of given number of elements using random sampling numbers. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 23(3):305--307. http://www.jstor.org/stable/25049166

  47. [61]

    Radim Rehurek, Petr Sojka, et al. 2011. Gensim—statistical semantics in python. NLP Centre, Faculty of Informatics, Masaryk University

  48. [62]

    C Rosset. 2019. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog

  49. [63]

    S. Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Publishing Group. https://books.google.de/books?id=M1eFDwAAQBAJ

  50. [66]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053

  51. [67]

    Carl Shulman and Nick Bostrom. 2020. Sharing the world with digital minds. preprint

  52. [68]

    Kaj Sotala and Lukas Gloor. 2017. Superintelligence as a cause or cure for risks of astronomical suffering. Informatica, 41(4)

  53. [70]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016

  54. [71]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33

  55. [72]

    Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019a. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache

  56. [73]

    Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019b. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache

  57. [74]

    Anja Thieme, Danielle Belgrave, and Gavin Doherty. 2020. Machine learning in mental health: A systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Transactions on Computer-Human Interaction (TOCHI), 27(5):1--53

  58. [75]

    J. Tiedemann. 2016. Finding alternative translations in a large corpus of movie subtitles. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

  59. [76]

    Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847. http://arxiv.org/abs/1806.02847

  60. [77]

    Hans Van Halteren. 2008. Source language markers in Europarl translations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 937--944

  61. [78]

    Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 941--953

  62. [80]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...

  63. [82]

    Eliezer Yudkowsky. 2013. Intelligence explosion microeconomics. Machine Intelligence Research Institute, accessed online October, 23:2015

  64. [83]

    Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. http://papers.nips.cc/paper/9106-defending-against-neural-fake-news.pdf Defending against neural fake news. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Pro...

  65. [84]

    Victor Zhou. 2019. Building a better profanity detection library with scikit-learn

  66. [85]

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27

  67. [86]

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

  68. [87]

    Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the Pile

  69. [88]

    Stella Biderman. Data statement for the Pile

  70. [89]

    Language models are unsupervised multitask learners. OpenAI Blog

  71. [90]

    Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683

  72. [91]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro

  73. [92]

    C Rosset

  74. [93]

    Language models are few-shot learners. arXiv preprint arXiv:2005.14165

  75. [94]

    GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668

  76. [95]

    Improving language understanding with unsupervised learning. Technical report, OpenAI

  77. [96]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

  78. [97]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov

  79. [98]

    Generic web content extraction with open-source software. KONVENS

  80. [99]

    István Endrédy and Attila Novák. Polibits
