The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Journal of Machine Learning Research
8 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
- AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
  AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.
- BoolXLLM: LLM-Assisted Explainability for Boolean Models
  BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.
- Learning Mixtures of Nonparametric and Convolutional Measures on Effectively Low-dimensional Affine Spaces
  Mixtures of convolutional measures on low-dimensional affine spaces admit unique identifiability in semi-parametric settings and posterior contraction rates under convex polytope support assumptions in a well-specified Bayesian regime.
- ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
  ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
- Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
  Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
- Graph-Augmented LLMs for Swiss MP Ideology Prediction
  Graph-augmented LLMs using a political knowledge graph improve ideology prediction accuracy for Swiss MPs by incorporating relational data beyond text alone.
- The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison
  Large-scale computational comparison of two major Holocaust oral history collections shows both expected differences and significant overlaps in interview structure, yielding a replicable framework for archive analysis.