The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
3 Pith papers cite this work.
Fields: cs.CL (3)
Citation roles: background (1)
Citation polarities: background (1)
citing papers explorer
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
- DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.
- The Proxy Presumption: From Semantic Embeddings to Valid Social Measures
The paper introduces the Construct Validity Protocol to validate semantic embeddings for social constructs and proposes Counterfactual Neutralization using LLMs to reduce confounding.