The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Transactions of the Association for Computational Linguistics
3 Pith papers cite this work.
citing papers explorer
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
- Evaluating Multi-turn Human-AI Interaction
  Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.
- The Proxy Presumption: From Semantic Embeddings to Valid Social Measures
  The paper introduces the Construct Validity Protocol to validate semantic embeddings for social constructs and proposes Counterfactual Neutralization using LLMs to reduce confounding.