BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Pith reviewed 2026-05-12 00:38 UTC · model grok-4.3
The pith
A 176B-parameter decoder-only language model trained on text from 59 languages is built through open collaboration and released publicly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BLOOM is a 176B-parameter decoder-only Transformer language model trained on the ROOTS corpus comprising hundreds of sources in 46 natural and 13 programming languages. It achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. The model and associated code are released publicly under the Responsible AI License to facilitate future research and applications using large language models.
What carries the argument
The BLOOM decoder-only Transformer, trained on the ROOTS multilingual corpus, which supplies the data diversity and scale needed for broad language coverage and benchmark performance.
If this is right
- Public release allows researchers without large compute budgets to study and adapt a 176B-scale multilingual model (a loading sketch follows this list).
- Multitask prompted finetuning can be applied to the released model to improve results on targeted tasks.
- The multilingual training data supports work on non-English and programming-language tasks at scale.
- The open license enables community inspection and modification of the model for specific applications.
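As a concrete illustration of the first point above, the sketch below loads one of the released checkpoints through the Hugging Face transformers library and runs a short multilingual prompt. It is a minimal sketch, assuming the published Hub identifiers (the small bigscience/bloom-560m variant here; the full 176B model is released as bigscience/bloom but requires multi-GPU serving), not a recipe taken from the paper itself.

```python
# Minimal sketch: inspecting a publicly released BLOOM checkpoint.
# Assumes the Hugging Face Hub identifiers ("bigscience/bloom-560m" for a small
# variant; the full 176B model is "bigscience/bloom" and needs multi-GPU serving).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloom-560m"  # swap for a larger released variant if resources allow
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The same weights serve all 46 natural languages; prompt in French as an example.
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same loading path is what makes the adaptation scenarios above (prompted finetuning, audits, task-specific modification) possible without retraining from scratch.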
Where Pith is reading between the lines
- Wider availability may encourage development of language tools for languages that have historically had fewer resources.
- The collaborative construction process could serve as a template for other large open models in different domains.
- Public access creates opportunities for independent safety and bias audits that closed models do not permit.
Load-bearing premise
That the training procedure, data filtering, and evaluation setup, including details not fully reported, produce general capabilities that hold up outside the specific benchmarks evaluated.
What would settle it
A clear drop in performance on a new multilingual benchmark or real-world task that was not part of the original evaluation set, even after prompted finetuning.
read the original abstract
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BLOOM, a 176B-parameter decoder-only Transformer language model trained on the ROOTS corpus, which aggregates hundreds of sources across 46 natural languages and 13 programming languages. The authors claim that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after multitask prompted finetuning, and release the model weights and code under the Responsible AI License to democratize access to large language models.
Significance. If the results hold, this is a significant contribution as one of the largest open-access multilingual LLMs, developed via broad collaboration. The public release of weights, code, and training details under a responsible license enables wider research and applications. The empirical focus on diverse language coverage and prompted finetuning provides a valuable resource for the field, particularly if benchmark claims are supported by rigorous decontamination.
major comments (2)
- [Evaluation section and appendices] No systematic n-gram overlap analysis or membership-inference decontamination is reported against the specific test splits of the benchmarks (e.g., MMLU, BIG-bench) used to support the 'competitive performance' claim. Given §3's description of ROOTS as an aggregate of web and curated sources, this is load-bearing for distinguishing generalization from potential leakage or memorization.
- [§4] The multitask prompted finetuning results lack details on prompt templates, the exact tasks/datasets used for finetuning, hyperparameters, and quantitative deltas (with error bars or statistical tests) relative to the base BLOOM model on the reported benchmarks.
minor comments (2)
- [Abstract] States competitive benchmark results without numerical scores, error bars, baseline comparisons, or evaluation protocol details, reducing the summary's informativeness despite the full paper containing tables.
- [Throughout] Ensure all evaluation protocols (few-shot settings, data splits, exact metrics) are stated explicitly in the main text with references to appendices, and verify figure/table captions are self-contained.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have carefully considered each major comment and revised the manuscript to address the concerns regarding evaluation rigor and finetuning transparency.
read point-by-point responses
-
Referee: [Evaluation section and appendices] No systematic n-gram overlap analysis or membership-inference decontamination is reported against the specific test splits of the benchmarks (e.g., MMLU, BIG-bench) used to support the 'competitive performance' claim. Given §3's description of ROOTS as an aggregate of web and curated sources, this is load-bearing for distinguishing generalization from potential leakage or memorization.
Authors: We agree that systematic decontamination analysis is critical for validating generalization claims, especially given the web-sourced components of ROOTS. In the revised manuscript, we have added a dedicated n-gram overlap analysis in the Evaluation section and appendices, reporting overlap statistics specifically against the test splits of MMLU, BIG-bench, and other benchmarks used in our evaluations. For membership inference, we have included a discussion of the computational infeasibility at 176B scale along with available proxy analyses and leakage mitigation steps; while full membership inference experiments remain challenging, the added n-gram results and discussion provide stronger evidence distinguishing memorization from generalization. revision: yes
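For readers unfamiliar with what such an analysis involves, the sketch below shows one common form of n-gram overlap decontamination: flag any benchmark test example that shares an n-gram with the training corpus. It is a hedged illustration, not the procedure from the paper or the promised revision; the 13-gram length, whitespace tokenization, and the corpus/benchmark placeholders are assumptions.

```python
# Illustrative n-gram overlap check (not the paper's exact procedure): flag test
# examples whose 13-grams also appear in training text, a common leakage proxy.
N = 13  # n-gram length; a typical choice in decontamination studies, assumed here

def ngrams(text: str, n: int = N) -> set:
    toks = text.lower().split()  # whitespace tokenization, assumed for simplicity
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(corpus_docs) -> set:
    """corpus_docs: iterable of training documents (placeholder for ROOTS shards)."""
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc)
    return index

def contamination_rate(test_examples, index) -> float:
    """Fraction of test examples sharing at least one n-gram with the corpus."""
    hits = sum(1 for ex in test_examples if ngrams(ex) & index)
    return hits / max(len(test_examples), 1)

# Toy usage with placeholder strings standing in for corpus shards and a test split.
train_index = build_index(["one training document " * 20, "another training document " * 20])
test_split = ["this question has plenty of tokens but shares no thirteen gram with any training document in the index"]
print(contamination_rate(test_split, train_index))
```

At ROOTS scale the index would be sharded and hashed rather than held as an in-memory set, but the accounting is the same.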
-
Referee: [§4] The multitask prompted finetuning results lack details on prompt templates, the exact tasks/datasets used for finetuning, hyperparameters, and quantitative deltas (with error bars or statistical tests) relative to the base BLOOM model on the reported benchmarks.
Authors: We have expanded §4 substantially in the revision to include the full set of prompt templates, the precise list of tasks and datasets used for multitask prompted finetuning, all relevant hyperparameters, and direct quantitative comparisons (including deltas) between the base BLOOM model and the finetuned version. Error bars are reported where multiple runs were feasible, and we have added statistical significance tests for the observed improvements on the benchmarks. These details enable better reproducibility and assessment of the finetuning gains. revision: yes
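The response commits to deltas with error bars and significance tests without naming a protocol. One standard choice, sketched below, is a paired bootstrap over per-example scores; the function, the resample count, and the toy score arrays are illustrative assumptions, not values from the paper.

```python
# Illustrative paired bootstrap (not the paper's stated protocol) for the
# finetuned-vs-base accuracy delta on a benchmark, given paired per-example scores.
import numpy as np

def paired_bootstrap_delta(base_scores, finetuned_scores, n_resamples=10_000, seed=0):
    """Return the mean delta, a 95% bootstrap CI, and P(delta <= 0) over resamples."""
    base = np.asarray(base_scores, dtype=float)
    tuned = np.asarray(finetuned_scores, dtype=float)
    assert base.shape == tuned.shape, "scores must be paired per example"
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(base), size=(n_resamples, len(base)))
    deltas = tuned[idx].mean(axis=1) - base[idx].mean(axis=1)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return tuned.mean() - base.mean(), (lo, hi), float((deltas <= 0).mean())

# Toy usage: 0/1 correctness on a hypothetical 200-example task.
rng = np.random.default_rng(1)
base = rng.binomial(1, 0.55, size=200)
tuned = np.clip(base + rng.binomial(1, 0.10, size=200), 0, 1)
delta, (lo, hi), p_leq_zero = paired_bootstrap_delta(base, tuned)
print(f"delta={delta:.3f}, 95% CI=({lo:.3f}, {hi:.3f}), P(delta<=0)={p_leq_zero:.4f}")
```

A delta whose bootstrap interval excludes zero is the kind of evidence the referee asks for.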
Circularity Check
No circularity: empirical model training and release paper
full rationale
This is a standard empirical paper describing the architecture, training data (ROOTS corpus), training procedure, and benchmark results for the BLOOM 176B model. There are no mathematical derivations, first-principles predictions, or claimed results that reduce by construction to fitted parameters, self-citations, or input data. Performance claims rest on direct evaluation against public benchmarks rather than any tautological loop. The skeptic concern about possible benchmark contamination is a validity issue, not a circularity issue in any derivation chain. The paper is self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 46 Pith papers
-
Instruction Tuning with GPT-4
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
-
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining
Multilingual pretraining develops translation in two phases: early copying driven by surface similarities, followed by generalizing mechanisms while copying is refined.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...
-
From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
RWKV: Reinventing RNNs for the Transformer Era
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...
-
RUQuant: Towards Refining Uniform Quantization for Large Language Models
RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
-
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.
-
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
A production AI HPC system using fully open Ethernet networking achieves top-100 performance while documenting typical single-tenant LLM workload patterns of many small jobs consuming little time and few large jobs do...
-
SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models
SEPTQ simplifies LLM post-training quantization to two steps via static global importance scoring and mask-guided column-wise weight updates, claiming superior results over baselines in low-bit settings.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
-
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
- Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models