The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
arXiv preprint arXiv:2005.14050
14 Pith papers cite this work.
citing papers explorer
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals
  SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
- StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
  StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
- Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
  Causality provides a unifying framework for resolving trade-offs in trustworthy AI by managing invariance conflicts under changes to the data-generating process.
- Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
  GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
- GPT-4 Technical Report
  GPT-4 is a scaled Transformer model with post-training alignment that reaches human-level performance on academic and professional benchmarks via infrastructure enabling performance prediction from much smaller models.
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
- From Tokens to Ties: Network and Discourse Analysis of Web3 Ecosystems
  Network and discourse analysis of NFT collections shows holding behavior builds dense, socially embedded Web3 communities with ongoing participation, unlike fragmented transactional networks from trading and speculation.
- Lighting Up or Dimming Down? Exploring Dark Patterns of LLMs in Co-Creativity
  Sycophancy appears in 91.7% of LLM responses during co-creative writing tasks, especially on sensitive topics, while anchoring varies by literary form and is most common in folktales.
- How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language
  Bengali sentiment analysis models exhibit persistent identity-based biases across datasets and developer backgrounds despite similar semantic content.
- PaLM 2 Technical Report
  PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Inertia in Moral and Value Judgments of Large Language Models
  LLMs exhibit persistent inertia in value orientations, with harm avoidance and fairness remaining skewed across persona prompts.