arxiv: 2101.00027 · v1 · submitted 2020-12-31 · 💻 cs.CL

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Pith reviewed 2026-05-10 21:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords language modeling · dataset construction · diverse text corpus · The Pile · Common Crawl · cross-domain generalization · GPT models

The pith

A new 825 GiB dataset built from 22 diverse text sources trains language models that generalize better across domains than those trained on raw web crawls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 high-quality subsets that include academic papers, books, code, and other professional materials. It shows that existing models such as GPT-2 and GPT-3 struggle on many of these components, especially academic writing. Models trained on The Pile outperform those trained on Raw Common Crawl and CC-100 across every component of the Pile and on downstream evaluations. The authors also release the construction code and document aspects of the data that may concern users.
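The page quotes the corpus size in two units: "800GB" in the paper title and "825 GiB" in the summary. Binary (GiB) and decimal (GB) prefixes are not interchangeable, so a minimal sketch of the unit arithmetic may help when comparing the two figures. The function name `gib_to_gb` is illustrative, and the sketch makes no claim about how the title's figure was derived.

```python
GIB = 2**30  # bytes in one gibibyte (binary prefix)
GB = 10**9   # bytes in one gigabyte (decimal prefix)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB (hypothetical helper)."""
    return size_gib * GIB / GB


if __name__ == "__main__":
    # 825 GiB expressed in decimal gigabytes
    print(round(gib_to_gb(825), 1))  # 885.8
```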

Core claim

The central claim is that training large language models on this composite of 22 diverse subsets yields better cross-domain knowledge and downstream generalization than training on less curated web data. The paper supports this in two ways: GPT-style models trained on The Pile improve significantly over Raw CC and CC-100 baselines on all Pile components while also improving on downstream evaluations, and prior models (GPT-2 and GPT-3) struggle on academic and professional text within the dataset.

What carries the argument

The Pile itself: a composite 825 GiB corpus built by combining 22 existing and newly constructed high-quality text subsets, many of them drawn from academic and professional sources.

Load-bearing premise

The reported performance gains are caused by the diversity and quality of the 22 subsets rather than by uncontrolled differences in training procedure, model scale, or data volume between the Pile-trained models and the Raw CC or CC-100 baselines.

What would settle it

A controlled retraining experiment that matches data volume, model size, and training steps exactly between a Pile-trained model and a Raw CC model, then evaluates both on held-out samples from every Pile component, would show whether the gains persist.
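The controlled comparison described above can be sketched as two checks: that the two runs differ only in their training data, and that the Pile-trained model wins on every held-out component. This is a minimal sketch under stated assumptions, not the paper's evaluation code. `TrainingRun`, `is_matched`, `gains_persist`, the component names, and all numbers are hypothetical placeholders, and "lower held-out loss is better" is assumed as the metric convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """Hypothetical record of one training run; field names are illustrative."""
    dataset: str
    n_params: int      # model size
    train_tokens: int  # data volume seen during training
    train_steps: int   # number of optimizer steps


def is_matched(a: TrainingRun, b: TrainingRun) -> bool:
    """True when the two runs agree on model size, data volume, and steps."""
    return (a.n_params, a.train_tokens, a.train_steps) == (
        b.n_params, b.train_tokens, b.train_steps)


def gains_persist(pile_loss: dict, baseline_loss: dict) -> bool:
    """True when the Pile-trained model has lower loss on every component."""
    if pile_loss.keys() != baseline_loss.keys():
        raise ValueError("both models must be evaluated on the same components")
    return all(pile_loss[c] < baseline_loss[c] for c in pile_loss)


if __name__ == "__main__":
    # Placeholder values only; none of these numbers come from the paper.
    pile_run = TrainingRun("pile", 1_000_000, 5_000_000, 10_000)
    cc_run = TrainingRun("raw_cc", 1_000_000, 5_000_000, 10_000)
    pile_loss = {"component_a": 1.10, "component_b": 0.95}
    cc_loss = {"component_a": 1.30, "component_b": 1.05}
    if is_matched(pile_run, cc_run):
        print("gains persist:", gains_persist(pile_loss, cc_loss))
```

Requiring a strict win on every component mirrors the "all components" wording of the claim; a real study would also want variance estimates before reading much into small differences.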

Original abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 diverse high-quality subsets (both existing and newly constructed, many from academic/professional sources) for training large-scale language models. It reports that untuned GPT-2 and GPT-3 models struggle on several Pile components (e.g., academic writing), while models trained on the Pile outperform Raw CC and CC-100 baselines on all Pile components and on downstream evaluations. The authors include an exploratory analysis of potential data issues and release the construction code publicly.

Significance. If the reported gains hold under controlled conditions, the work supplies a large, publicly documented, and diverse training resource that can improve cross-domain generalization in language models. The open release of construction code is a concrete strength that supports reproducibility and community use.

major comments (2)
  1. [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
  2. [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.
minor comments (2)
  1. [Dataset construction section] The 22 subsets would benefit from a single consolidated table listing exact sizes, sources, and preprocessing steps for each component to improve clarity and ease of replication.
  2. [Exploratory analysis] Some figures showing data characteristics (e.g., domain distributions or token statistics) could include more precise axis labels and legends for readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. The points raised regarding experimental controls and statistical rigor are important for strengthening the presentation of our results. We address each major comment below and describe the revisions we will incorporate in the updated version of the paper.

Point-by-point responses
  1. Referee: [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.

    Authors: We appreciate the referee drawing attention to the need for explicit documentation of the training controls. The original manuscript described the overall training setup in Section 4 but did not sufficiently emphasize the matched conditions across datasets. In the revised manuscript we have expanded the training details subsection to state explicitly that the GPT-2-scale and GPT-3-scale models trained on The Pile and the corresponding Raw CC and CC-100 baselines were all trained from scratch using identical model architectures, the same total token budget (approximately 300 billion tokens), the same Adam optimizer with identical hyperparameters, the same learning-rate schedule including warmup and cosine decay, and equivalent total compute. A table summarizing the shared hyperparameters has been added for clarity. These controls ensure that observed differences can be attributed to dataset properties rather than training discrepancies.

    Revision: yes

  2. Referee: [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.

    Authors: We agree that the strength of the 'significant' claim would benefit from additional statistical support. In the revised manuscript we have added error bars to the downstream-task figures, derived from multiple runs with different random seeds for the smaller model scales where compute permitted. We have also included the results of paired statistical tests (t-tests) on the key downstream benchmarks comparing Pile-trained models to the CC baselines. For the per-component Pile evaluations we now report standard deviations across model sizes. Due to the high computational cost of full-scale retraining we were limited in the number of replicate runs; however, the consistent direction and magnitude of gains across scales provide supporting evidence. The abstract and results section have been updated to reflect these additions. revision: partial
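The statistical support discussed in this response (error bars across seeds and a paired t-test) can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' analysis: the function names and all scores are hypothetical, and pairing runs by random seed is an assumption about how the replicate runs would be matched.

```python
import math
from statistics import mean, stdev


def mean_and_std(values):
    """Mean and sample standard deviation, e.g. for drawing error bars."""
    return mean(values), stdev(values)


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores.

    scores_a[i] and scores_b[i] are assumed to come from a matched pair of
    runs (for example, the same seed and model size on two datasets).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Placeholder benchmark accuracies for four seeds; not from the paper.
    pile_scores = [0.52, 0.55, 0.53, 0.56]
    baseline_scores = [0.50, 0.52, 0.52, 0.53]
    print("pile mean/std:", mean_and_std(pile_scores))
    t, dof = paired_t_statistic(pile_scores, baseline_scores)
    print(f"t = {t:.2f} with {dof} degrees of freedom")
```

Converting the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; a statistics package would normally handle that step.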

Circularity Check

0 steps flagged

No significant circularity; empirical dataset construction and comparisons are self-contained

The paper constructs the Pile dataset from 22 subsets and reports empirical results showing improved performance of GPT-2/GPT-3 models trained on it versus Raw CC and CC-100 baselines on Pile components and downstream tasks. No derivation chain, equations, predictions, or first-principles results exist that could reduce to inputs by construction. The patterns of self-definitional claims, fitted inputs called predictions, self-citation load-bearing arguments, uniqueness theorems, ansatz smuggling, or renaming known results are absent. The central claims rest on new data assembly and direct comparisons against external benchmarks, with no self-referential reductions or load-bearing self-citations that collapse the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical dataset construction and benchmarking effort; it introduces no mathematical free parameters, domain axioms, or postulated entities.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  2. Test-Time Training with KV Binding Is Secretly Linear Attention

    cs.LG 2026-02 conditional novelty 8.0

    Test-time training with KV binding reduces to learned linear attention.

  3. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  4. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  5. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    cs.CL 2022-02 accept novelty 8.0

    Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

  6. Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

  7. Probabilistic Attribution For Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Develops a model-agnostic attribution score as the log-ratio of conditional response probabilities with and without a marginalized prompt token, derived via Bayes inversion of next-token distributions, and relates it ...

  8. Provable Joint Decontamination for Benchmarking Multiple Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.

  9. Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

    cs.LG 2026-05 unverdicted novelty 7.0

    Aligned training reparameterizes SAEs to enforce unit inner product between encoder and decoder directions, eliminating dead features and enhancing stability without hyperparameters.

  10. LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...

  11. To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

    cs.LG 2026-05 conditional novelty 7.0

    LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.

  12. When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

  13. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 conditional novelty 7.0

    Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.

  14. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  15. DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

    cs.CV 2026-05 unverdicted novelty 7.0

    DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.

  16. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  17. fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

    cs.LG 2026-05 conditional novelty 7.0

    fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...

  18. LoopQ: Quantization for Recursive Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity ...

  19. Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

    cs.CR 2026-04 unverdicted novelty 7.0

    Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.

  20. Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

    cs.CR 2026-04 unverdicted novelty 7.0

    A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.

  21. When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

    cs.LG 2026-04 conditional novelty 7.0

    FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

  22. What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

    cs.CL 2026-04 unverdicted novelty 7.0

    Direct relevance to a key research question is the strongest predictor of a response's contribution to qualitative study findings, while clarity and surprisal-based informativeness are not predictive.

  23. Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing

    cs.CL 2026-03 conditional novelty 7.0

    Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.

  24. From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums

    cs.AI 2026-02 unverdicted novelty 7.0

    A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despi...

  25. Hidden State Poisoning Attacks against Mamba-based Language Models

    cs.CL 2026-01 unverdicted novelty 7.0

    Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.

  26. QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models

    cs.NE 2026-01 unverdicted novelty 7.0

    QSLM automates tiered quantization of spike-driven language models via sensitivity analysis and multi-objective search, delivering up to 86.5% memory reduction and 20% power savings while keeping accuracy close to the...

  27. Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

    cs.SE 2025-09 unverdicted novelty 7.0

    Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to pr...

  28. Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

    cs.LG 2025-07 unverdicted novelty 7.0

    An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

  29. Power-Softmax: Towards Secure LLM Inference over Encrypted Data

    cs.LG 2024-10 unverdicted novelty 7.0

    Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.

  30. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  31. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  32. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  33. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  34. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  35. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    cs.CL 2024-02 unverdicted novelty 7.0

    LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.

  36. Hallucination is Inevitable: An Innate Limitation of Large Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.

  37. Scalable Extraction of Training Data from (Production) Language Models

    cs.LG 2023-11 conditional novelty 7.0

    Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models l...

  38. Detecting Pretraining Data from Large Language Models

    cs.CL 2023-10 conditional novelty 7.0

    Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.

  39. Extending Context Window of Large Language Models via Positional Interpolation

    cs.CL 2023-06 conditional novelty 7.0

    Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...

  40. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    cs.CL 2023-06 unverdicted novelty 7.0

    RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.

  41. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  42. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  43. Language Is Not All You Need: Aligning Perception with Language Models

    cs.CL 2023-02 conditional novelty 7.0

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  44. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  45. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  46. InCoder: A Generative Model for Code Infilling and Synthesis

    cs.SE 2022-04 unverdicted novelty 7.0

    InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

  47. Quantifying Memorization Across Neural Language Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

  48. Improving language models by retrieving from trillions of tokens

    cs.CL 2021-12 unverdicted novelty 7.0

    RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.

  49. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    cs.CV 2021-11 unverdicted novelty 7.0

    LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.

  50. LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

    cs.LG 2026-05 unverdicted novelty 6.0

    The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...

  51. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

    cs.AI 2026-05 unverdicted novelty 6.0

    Meta-Soft dynamically synthesizes targeted soft tokens from a learnable orthogonal meta-library via Gumbel-Softmax selection and uses attention-flow integration to preserve semantic information during KV cache eviction.

  52. Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

    cs.CL 2026-05 unverdicted novelty 6.0

    Self-training restructures language by amplifying surface markers and collapsing deep syntax according to structural depth rather than frequency, as evidenced by correlations across multiple models and a human fine-tu...

  53. Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

    cs.LG 2026-05 unverdicted novelty 6.0

    In high-dimensional analysis, pretrained PCA representations for linear probing generalize best at low dimensionality when pretraining data is plentiful but labeled data scarce, with an exact trade-off showing how muc...

  54. Are Sparse Autoencoder Benchmarks Reliable?

    cs.LG 2026-05 unverdicted novelty 6.0

    An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.

  55. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 unverdicted novelty 6.0

    Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute,...

  56. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 6.0

    AutoLLMResearch trains agents in a multi-fidelity LLMConfig-Gym environment formulated as a long-horizon MDP to enable cross-fidelity extrapolation for automating high-cost LLM experiment configurations.

  57. LLM Jaggedness Unlocks Scientific Creativity

    cs.AI 2026-05 unverdicted novelty 6.0

    Jagged capabilities in LLMs for scientific idea generation can be leveraged through inference-time ensembles to outperform individual models.

  58. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.

  59. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.

  60. NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    NCO enables efficient online pattern matching for negative hard and regex constraints in LLM decoding to prevent forbidden content without state explosion.

Reference graph

Works this paper leans on

193 extracted references · 193 canonical work pages · cited by 153 Pith papers · 12 internal anchors

  1. [1]

    1984. Sony Corp. of America v. Universal City Studios, Inc.

  2. [2]

    2003. Kelly v. Arriba Soft Corp.

  3. [3]

    2013. Righthaven LLC v. Hoehn

  4. [5]

    Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC. European Language Resources Association

  5. [6]

    Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604

  6. [7]

    Stella Biderman. 2021. Data statement for the Pile. arXiv preprint arXiv

  7. [8]

    Stella Biderman, Kieran Bicheno, and Leo Gao. 2021. Datasheet for the Pile. arXiv preprint arXiv

  8. [9]

    Stella Biderman and Walter J. Scheirer. 2020. Pitfalls in machine learning research: Reexamining the development cycle. NeurIPS "I Can't Believe It's Not Better!" Workshop

  9. [10]

    David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993--1022

  10. [12]

    Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Inc

  11. [13]

    Nick Bostrom and Eliezer Yudkowsky. 2014. The ethics of artificial intelligence. The Cambridge handbook of artificial intelligence, 1:316--334

  12. [16]

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. http://arxiv.org/abs/2012.07805 Extracting training data from large language models

  13. [18]

    Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. WW Norton & Company

  14. [19]

    Alina Maria Ciobanu, Liviu P Dinu, and Andrea Sgarro. 2017. Towards a map of the syntactic similarity of languages. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 576--590. Springer

  15. [20]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://www.aclweb.org/anthology/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for...

  16. [22]

    Andrew Critch and David Krueger. 2020. AI Research Considerations for Human Existential Safety (ARCHES). Preprint at http://acritch.com/arches

  17. [24]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Com...

  18. [25]

    István Endrédy and Attila Novák. 2013. More effective boilerplate removal – the GoldMiner algorithm. In Polibits

  19. [26]

    Niels Ferguson and Bruce Schneier. 2003. Practical Cryptography. John Wiley & Sons

  20. [27]

    Casey Fiesler, Nathan Beard, and Brian C Keegan. 2020. No robots, spiders, or scrapers: Legal and ethical regulation of data collection methods in social media terms of service. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 187--196

  21. [29]

    Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus

  22. [30]

    Authors Guild v. Google. 2015. Docket No. 13-4829-cv, 804:202

  23. [31]

    Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. 2018. When will AI exceed human performance? evidence from AI experts. Journal of Artificial Intelligence Research, 62:729--754

  24. [32]

    David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34

  25. [33]

    Declan Groves and Andy Way. 2006. Hybridity in MT: Experiments on the Europarl corpus. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT 2006)

  26. [34]

    Alexander Halavais. 2019. Overcoming terms of service: a proposal for ethical distributed research. Information, Communication & Society, 22(11):1567--1581

  27. [35]

    Chris Hardin. 2018. https://blog.janestreet.com/how-to-shuffle-a-big-dataset/ How to shuffle a big dataset

  28. [37]

    Matthew Hoffman, Francis Bach, and David Blei. 2010. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23:856--864

  29. [38]

    Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591--598

  30. [39]

    Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. arXiv preprint arXiv:2005.00813

  31. [40]

    Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306--316

  32. [42]

    Bryan Klimt and Yiming Yang. 2004. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217--226. Springer

  33. [43]

    Sosuke Kobayashi. 2018. Homemade bookcorpus. https://github.com/BIGBALLON/cifar-10-cnn

  34. [44]

    Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79--86. Citeseer

  35. [47]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692

  36. [48]

    Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 62--69. Somerset, NJ: Association for Computational Linguistics. http://arXiv.org/abs/cs/0205028

  37. [49]

    John MacFarlane. 2006--2020. https://pandoc.org/ Pandoc: a universal document converter

  38. [50]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26, pages 3111--3119. Curran Associates, Inc

  39. [51]

    Jonathan A Obar. 2020. Sunlight alone is not a disinfectant: Consent and the futility of opening big data black boxes (without assistance). Big Data & Society, 7(1):2053951720935615

  40. [52]

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. http://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543

  41. [54]

    Shawn Presser. 2020. Books3. https://twitter.com/theshawwn/status/1320282149329784833

  42. [55]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI

  43. [56]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9

  44. [57]

    Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. https://arxiv.org/abs/1911.05507 Compressive transformers for long-range sequence modelling. arXiv preprint

  45. [59]

    Inioluwa Deborah Raji and Jingying Yang. 2019. ABOUT ML: Annotation and benchmarking on understanding and transparency of machine learning lifecycles. arXiv preprint arXiv:1912.06166

  46. [60]

    C. Radhakrishna Rao. 1961. http://www.jstor.org/stable/25049166 Generation of random permutations of given number of elements using random sampling numbers. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 23(3):305--307

  47. [61]

    Radim Rehurek, Petr Sojka, et al. 2011. Gensim—statistical semantics in python. NLP Centre, Faculty of Informatics, Masaryk University

  48. [62]

    C Rosset. 2019. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog

  49. [63]

    S. Russell. 2019. https://books.google.de/books?id=M1eFDwAAQBAJ Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Publishing Group

  50. [66]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053

  51. [67]

    Carl Shulman and Nick Bostrom. 2020. Sharing the world with digital minds. preprint

  52. [68]

    Kaj Sotala and Lukas Gloor. 2017. Superintelligence as a cause or cure for risks of astronomical suffering. Informatica, 41(4)

  53. [70]

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016

  54. [71]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33

  55. [72]

    Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019a. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache

  56. [73]

    Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019b. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache

  57. [74]

    Anja Thieme, Danielle Belgrave, and Gavin Doherty. 2020. Machine learning in mental health: A systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Transactions on Computer-Human Interaction (TOCHI), 27(5):1--53

  58. [75]

    J. Tiedemann. 2016. Finding alternative translations in a large corpus of movie subtitles. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

  59. [76]

    Trieu H. Trinh and Quoc V. Le. 2018. http://arxiv.org/abs/1806.02847 A simple method for commonsense reasoning. CoRR, abs/1806.02847

  60. [77]

    Hans Van Halteren. 2008. Source language markers in Europarl translations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 937--944

  61. [78]

    Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 941--953

  62. [80]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...

  63. [82]

    Eliezer Yudkowsky. 2013. Intelligence explosion microeconomics. Machine Intelligence Research Institute, accessed online October, 23:2015

  64. [83]

    Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. http://papers.nips.cc/paper/9106-defending-against-neural-fake-news.pdf Defending against neural fake news. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Pro...

  65. [84]

    Victor Zhou. 2019. Building a better profanity detection library with scikit-learn

  66. [85]

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27

  67. [86]

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The

  71. [90]

    Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683

  74. [93]

    Language models are few-shot learners. arXiv preprint arXiv:2005.14165

  75. [94]

    GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668

  79. [98]

    Generic web content extraction with open-source software. In KONVENS
