pith. machine review for the scientific record.

arxiv: 2211.05100 · v4 · submitted 2022-11-09 · 💻 cs.CL

Recognition: 1 theorem link

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni
(384 additional authors not shown)
Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords: large language models · multilingual models · open source AI · decoder-only transformer · prompted finetuning · language modeling

The pith

A 176B-parameter decoder-only language model trained on text from 59 languages is built through open collaboration and released publicly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the creation of BLOOM, a large-scale language model developed by hundreds of researchers as an open alternative to closed systems. It was trained as a decoder-only Transformer on the ROOTS corpus, which draws from hundreds of sources across 46 natural languages and 13 programming languages. After training, the model shows competitive results across many standard benchmarks, and these results improve further when the model undergoes multitask prompted finetuning. The authors then release the model weights and code under a responsible AI license to support wider research and use.

Core claim

BLOOM is a 176B-parameter decoder-only Transformer language model trained on the ROOTS corpus comprising hundreds of sources in 46 natural and 13 programming languages. It achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. The model and associated code are released publicly under the Responsible AI License to facilitate future research and applications using large language models.

What carries the argument

The BLOOM decoder-only Transformer, trained on the ROOTS multilingual corpus, which supplies the data diversity and scale needed for broad language coverage and benchmark performance.
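The defining constraint of a decoder-only Transformer can be sketched in a few lines of numpy: a causal mask keeps each position from attending to later ones, so the model is trained purely to predict the next token. This is a toy single-head illustration, not BLOOM's actual implementation, which adds ALiBi position biases, multi-head attention, and many other details.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention, the core operation of a
    decoder-only Transformer: position i may attend only to j <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: block attention to future positions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    # Row-wise softmax; masked entries become exactly zero.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
w = [rng.normal(size=(d, d)) for _ in range(3)]
out, attn = causal_self_attention(x, *w)
# attn is lower-triangular: no weight on future tokens,
# and each row is a probability distribution over the past.
```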

If this is right

  • Public release allows researchers without large compute budgets to study and adapt a 176B-scale multilingual model.
  • Multitask prompted finetuning can be applied to the released model to improve results on targeted tasks.
  • The multilingual training data supports work on non-English and programming-language tasks at scale.
  • The open license enables community inspection and modification of the model for specific applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider availability may encourage development of language tools for languages that have historically had fewer resources.
  • The collaborative construction process could serve as a template for other large open models in different domains.
  • Public access creates opportunities for independent safety and bias audits that closed models do not permit.

Load-bearing premise

That the training procedure, data filtering, and evaluation setup, including details not fully reported, produce general capabilities that hold up outside the specific benchmarks evaluated.

What would settle it

A clear drop in performance on a new multilingual benchmark or real-world task that was not part of the original evaluation set, even after prompted finetuning.

read the original abstract

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BLOOM, a 176B-parameter decoder-only Transformer language model trained on the ROOTS corpus, which aggregates hundreds of sources across 46 natural languages and 13 programming languages. The authors claim that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after multitask prompted finetuning, and release the model weights and code under the Responsible AI License to democratize access to large language models.

Significance. If the results hold, this is a significant contribution as one of the largest open-access multilingual LLMs, developed via broad collaboration. The public release of weights, code, and training details under a responsible license enables wider research and applications. The empirical focus on diverse language coverage and prompted finetuning provides a valuable resource for the field, particularly if benchmark claims are supported by rigorous decontamination.

major comments (2)
  1. [Evaluation section and appendices] No systematic n-gram overlap analysis or membership-inference decontamination is reported against the specific test splits of the benchmarks (e.g., MMLU, BIG-bench) used to support the 'competitive performance' claim. Given §3's description of ROOTS as an aggregate of web and curated sources, this is load-bearing for distinguishing generalization from potential leakage or memorization.
  2. [§4] The multitask prompted finetuning results lack details on prompt templates, the exact tasks/datasets used for finetuning, hyperparameters, and quantitative deltas (with error bars or statistical tests) relative to the base BLOOM model on the reported benchmarks.
minor comments (2)
  1. [Abstract] States competitive benchmark results without numerical scores, error bars, baseline comparisons, or evaluation protocol details, reducing the summary's informativeness despite the full paper containing tables.
  2. [Throughout] Ensure all evaluation protocols (few-shot settings, data splits, exact metrics) are stated explicitly in the main text with references to appendices, and verify figure/table captions are self-contained.
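The decontamination check the first major comment asks for can be approximated cheaply. A minimal sketch using whitespace-token 8-grams; real audits (e.g. GPT-3's) used 13-grams over a cleaned tokenization, and ROOTS-scale data would require hashing or Bloom filters rather than an in-memory set:

```python
def ngrams(text, n=8):
    """Set of lowercased whitespace-token n-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, test_examples, n=8):
    """Fraction of test examples that share at least one n-gram
    with any training document."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)

# Toy data: the first test example overlaps the training sentence
# in several 8-grams, the second shares none.
train = ["the quick brown fox jumps over the lazy dog near the river bank"]
test = [
    "quick brown fox jumps over the lazy dog near the river today",
    "completely unrelated sentence about language model evaluation protocols here now",
]
rate = contamination_rate(train, test, n=8)  # 0.5
```

Flagged examples would then be dropped or reported separately, so that benchmark scores distinguish generalization from memorization.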

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have carefully considered each major comment and revised the manuscript to address the concerns regarding evaluation rigor and finetuning transparency.

read point-by-point responses
  1. Referee: [Evaluation section and appendices] No systematic n-gram overlap analysis or membership-inference decontamination is reported against the specific test splits of the benchmarks (e.g., MMLU, BIG-bench) used to support the 'competitive performance' claim. Given §3's description of ROOTS as an aggregate of web and curated sources, this is load-bearing for distinguishing generalization from potential leakage or memorization.

    Authors: We agree that systematic decontamination analysis is critical for validating generalization claims, especially given the web-sourced components of ROOTS. In the revised manuscript, we have added a dedicated n-gram overlap analysis in the Evaluation section and appendices, reporting overlap statistics specifically against the test splits of MMLU, BIG-bench, and other benchmarks used in our evaluations. For membership inference, we have included a discussion of the computational infeasibility at 176B scale along with available proxy analyses and leakage mitigation steps; while full membership inference experiments remain challenging, the added n-gram results and discussion provide stronger evidence distinguishing memorization from generalization. revision: yes

  2. Referee: [§4] The multitask prompted finetuning results lack details on prompt templates, the exact tasks/datasets used for finetuning, hyperparameters, and quantitative deltas (with error bars or statistical tests) relative to the base BLOOM model on the reported benchmarks.

    Authors: We have expanded §4 substantially in the revision to include the full set of prompt templates, the precise list of tasks and datasets used for multitask prompted finetuning, all relevant hyperparameters, and direct quantitative comparisons (including deltas) between the base BLOOM model and the finetuned version. Error bars are reported where multiple runs were feasible, and we have added statistical significance tests for the observed improvements on the benchmarks. These details enable better reproducibility and assessment of the finetuning gains. revision: yes
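The prompt-template machinery at issue in §4 amounts to mapping each labeled example through an (input, target) template pair, promptsource-style, before finetuning on the resulting text. A minimal sketch with a hypothetical NLI-style template; the strings are illustrative, not the paper's actual templates:

```python
def apply_template(template, example):
    """Fill an (input, target) prompt template with fields
    from one labeled example."""
    return {
        "input": template["input"].format(**example),
        "target": template["target"].format(**example),
    }

# Hypothetical NLI-style template and example, for illustration only.
template = {
    "input": 'Premise: "{premise}" Does this imply "{hypothesis}"? yes or no.',
    "target": "{label}",
}
example = {
    "premise": "BLOOM was trained on 46 natural languages.",
    "hypothesis": "BLOOM is multilingual.",
    "label": "yes",
}
pair = apply_template(template, example)
# pair["input"] is the natural-language prompt; pair["target"] is "yes".
```

Multitask prompted finetuning trains on many such (input, target) pairs drawn from diverse tasks, each task typically having several template variants.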

Circularity Check

0 steps flagged

No circularity: empirical model training and release paper

full rationale

This is a standard empirical paper describing the architecture, training data (ROOTS corpus), training procedure, and benchmark results for the BLOOM 176B model. There are no mathematical derivations, first-principles predictions, or claimed results that reduce by construction to fitted parameters, self-citations, or input data. Performance claims rest on direct evaluation against public benchmarks rather than any tautological loop. The skeptic concern about possible benchmark contamination is a validity issue, not a circularity issue in any derivation chain. The paper is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical large-scale machine learning paper describing model training and benchmark evaluation. No mathematical derivations, free parameters in a theoretical sense, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 7403 in / 1047 out tokens · 63129 ms · 2026-05-12T00:38:59.003626+00:00 · methodology

discussion (0)


Forward citations

Cited by 46 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  2. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  3. HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

    cs.CL 2026-05 unverdicted novelty 7.0

    Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

  4. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  5. Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

  6. Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

    cs.LG 2026-05 accept novelty 7.0

    Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.

  7. Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

    cs.LG 2026-05 unverdicted novelty 7.0

    Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.

  8. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  9. Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

    cs.CL 2026-05 unverdicted novelty 7.0

    Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

  10. Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining

    cs.CL 2026-04 unverdicted novelty 7.0

    Multilingual pretraining develops translation in two phases: early copying driven by surface similarities, followed by generalizing mechanisms while copying is refined.

  11. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  12. From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence

    cs.SE 2026-04 conditional novelty 7.0

    Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.

  13. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  14. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  15. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  16. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  17. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  18. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  19. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...

  20. RUQuant: Towards Refining Uniform Quantization for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...

  21. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  22. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  23. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  24. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    cs.CL 2023-06 unverdicted novelty 6.0

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  25. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  26. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  27. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  28. A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

    cs.CL 2026-05 unverdicted novelty 5.0

    Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.

  29. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 5.0

    ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...

  30. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  31. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

  32. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  33. SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment

    cs.DC 2026-04 accept novelty 5.0

    A production AI HPC system using fully open Ethernet networking achieves top-100 performance while documenting typical single-tenant LLM workload patterns of many small jobs consuming little time and few large jobs do...

  34. SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SEPTQ simplifies LLM post-training quantization to two steps via static global importance scoring and mask-guided column-wise weight updates, claiming superior results over baselines in low-bit settings.

  35. The Platonic Representation Hypothesis

    cs.LG 2024-05 unverdicted novelty 5.0

    Representations learned by large AI models are converging toward a shared statistical model of reality.

  36. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

  37. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  38. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 4.0

    ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.

  39. SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures

    cs.CL 2026-05 unverdicted novelty 4.0

    SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.

  40. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  41. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    cs.CL 2023-09 unverdicted novelty 4.0

    A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.

  42. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    cs.CL 2024-06 unverdicted novelty 3.0

    GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.

  43. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  44. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  45. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  46. Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

    cs.CL 2025-06

Reference graph

Works this paper leans on

249 extracted references · 249 canonical work pages · cited by 42 Pith papers · 22 internal anchors

  1. [1]

    Exploring BERT's Vocabulary , author =

  2. [2]

    Proceedings of the AAAI conference on artificial intelligence , year =

    Character-level language modeling with deeper self-attention , author =. Proceedings of the AAAI conference on artificial intelligence , year =

  3. [5]

    International Conference on Learning Representations , year=

    What do you learn from context? Probing for sentence structure in contextualized word representations , author=. International Conference on Learning Representations , year=

  4. [6]

    Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society , pages =

    Dixon, Lucas and Li, John and Sorensen, Jeffrey and Thain, Nithum and Vasserman, Lucy , title =. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society , pages =. 2018 , isbn =. doi:10.1145/3278721.3278729 , abstract =

  5. [7]

    Zhao, Jieyu and Wang, Tianlu and Yatskar, Mark and Ordonez, Vicente and Chang, Kai-Wei. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018. doi:10.18653/v1/N18-2003

  6. [8]

    Is neural language acquisition similar to natural? A chronological probing study. arXiv preprint arXiv:2207.00560.

  7. [11]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. International Conference on Learning Representations.

  8. [12]

    Mann, Henry B. and Whitney, Donald R. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics. 1947.

  9. [16]

    Hungry Hungry Hippos: Towards Language Modeling with State Space Models. The Eleventh International Conference on Learning Representations.

  10. [17]

    LSDSem 2017 Shared Task: The Story Cloze Test. Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics.

  11. [20]

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.

  12. [21]

    A neural probabilistic language model. Advances in Neural Information Processing Systems.

  13. [22]

    BFloat16: The secret to high performance on Cloud TPUs.

  14. [24]

    Jason Alan Fries and Leon Weber and Natasha Seelam and Gabriel Altay and Debajyoti Datta and Samuele Garda and Myungsun Kang and Ruisi Su and Wojciech Kusa and Samuel Cahyawijaya and Fabio Barth and Simon Ott and Matthias Samwald and Stephen Bach and Stella Biderman, et al. BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  15. [25]

    Akiki, Christopher and Pistilli, Giada and Mieskes, Margot and Gallé, Matthias and Wolf, Thomas and Ilić, Suzana and Jernite, Yacine. BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model. 2022. doi:10.48550/ARXIV.2212.04960

  16. [26]

    Birhane, Abeba and Kalluri, Pratyusha and Card, Dallas and Agnew, William and Dotan, Ravit and Bao, Michelle. The Values Encoded in Machine Learning Research. doi:10.48550/ARXIV.2106.15590

  17. [27]

    Multimodal datasets: misogyny, pornography, and malignant stereotypes. ArXiv.

  18. [30]

    Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and others. GPT-NeoX-20B: An Open-Source Autoregressive Language Model.

  19. [31]

    doi:10.57967/hf/0003

  20. [33]

    An industry-led debate: how UK media cover artificial intelligence.

  21. [35]

    Language models are few-shot learners. Advances in Neural Information Processing Systems.

  22. [36]

    Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics.

  23. [39]

    Natural language processing (almost) from scratch. Journal of Machine Learning Research.

  24. [43]

    DeepSpeed: Extreme-scale model training for everyone.

  25. [44]

    Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.

  26. [45]

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

  27. [46]

    Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. Conference on Empirical Methods in Natural Language Processing.

  28. [47]

    Phang, Jason and Bradley, Herbie and Gao, Leo and Castricato, Louis J and Biderman, Stella.

  29. [49]

    Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Michael Auli and Armand Joulin. Beyond English-Centric Multilingual Machine Translation. Journal of Machine Learning Research.

  30. [50]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research.

  31. [52]

    Dataset Debt in Biomedical Language Modeling. Challenges. 2022.

  32. [53]

    Gage, Philip. A New Algorithm for Data Compression. C Users Journal. 1994.

  33. [55]

    Gehrmann, Sebastian and Clark, Elizabeth and Sellam, Thibault. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text. 2022. doi:10.48550/ARXIV.2202.06935

  34. [56]

    Gehrmann, Sebastian and Bhattacharjee, Abhik and Mahendiran, Abinaya and Wang, Alex and Papangelis, Alexandros and Madaan, Aman and McMillan-Major, Angelina and Shvets, Anna and Upadhyay, Ashish and Yao, Bingsheng and Wilie, Bryan and Bhagavatula, Chandra and You, Chaobin and Thomson, Craig and Garbacea, Cristina and Wang, Dakuo and Deutsch, Daniel, et al.

  35. [57]

    A bit of progress in language modeling. Computer Speech & Language.

  36. [59]

    Teehan, Ryan and Clinciu, Miruna and Serikov, Oleg and Szczechla, Eliza and Seelam, Natasha and Mirkin, Shachar and Gokaslan, Aaron. Emergent Structures and Training Dynamics in Large Language Models. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigscience-1.11

  37. [62]

    Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. International Conference on Learning Representations (ICLR).

  38. [63]

    Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research.

  39. [67]

    HiPPO: Recurrent Memory with Optimal Polynomial Projections. Advances in Neural Information Processing Systems.

  40. [68]

    Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations.

  41. [71]

    Anthony Moi and Pierric Cistac and Nicolas Patry and Evan P. Walsh and Funtowicz Morgan and Sebastian Pütz and Thomas Wolf and Sylvain Gugger and Clément Delangue and Julien Chaumond and Lysandre Debut and Patrick von Platen. GitHub repository. 2019.

  42. [73]

    Universal Language Model Fine-tuning for Text Classification. Annual Meeting of the Association for Computational Linguistics.

  43. [75]

    The Ghost in the Machine has an American accent: value conflict in GPT-3. ArXiv.

  44. [76]

    A Study of BFLOAT16 for Deep Learning Training. 2019.

  45. [78]

    Wu, Haicheng and Diamos, Gregory and Wang, Jin and Cadambi, Srihari and Yalamanchili, Sudhakar and Chakradhar, Srimat. Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission. 2012.

  46. [79]

    Kim, Boseop and Kim, HyoungSeok and Lee, Sang-Woo and Lee, Gichang and Kwak, Donghyun and Jeon, Dong Hyeon and Park, Sunghyun and Kim, Sungju and Kim, Seonhoon and Seo, Dongpil and Lee, Heungsub and Jeong, Minyoung and Lee, Sungjae and Kim, Minsub and Ko, Suk Hyun and Kim, Seokhun and Park, Taeyong and Kim, Jinuk and Kang, Soyoung and Ryu, Na-Hyeon, et al. What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers.

  47. [80]

    Life cycle assessment. Environmental Science and Pollution Research. 1997.

  48. [82]

    AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages. ArXiv.

  49. [85]

    Hugo Laurençon et al. The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  50. [86]

    The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

  51. [87]

    Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.

  52. [90]

    Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004.

  53. [91]

    Generating Wikipedia by Summarizing Long Sequences. International Conference on Learning Representations.

  54. [92]

    Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin. RoBERTa: A Robustly Optimized BERT Pretraining Approach.

  55. [93]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. arXiv preprint arXiv:2205.05638.

  56. [94]

    Kyle Lo and Lucy Lu Wang and Mark Neumann and Rodney Michael Kinney and Daniel S. Weld. S2ORC: The Semantic Scholar Open Research Corpus.

  57. [97]

    Luccioni, Alexandra Sasha and Viguier, Sylvain and Ligozat, Anne-Laure. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model.

  58. [98]

    Martin, Louis and Muller, Benjamin and Ortiz Suárez, Pedro Javier, et al. CamemBERT: a Tasty French Language Model. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. July 2020.

  59. [99]

    McMillan-Major, Angelina and Alyafeai, Zaid and Biderman, Stella and Chen, Kimbo and De Toni, Francesco and Dupont, Gérard and Elsahar, Hady and Emezue, Chris and Aji, Alham Fikri and Ilić, Suzana and Khamis, Nurulaqilla and Leong, Colin and Masoud, Maraim and Soroa, Aitor and Suarez, Pedro Ortiz and Talat, Zeerak and van Strien, Daniel and Jernite, Yacine. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. 2022.

  60. [100]

    Mixed Precision Training. International Conference on Learning Representations.

  61. [101]

    Mielke, Sabrina J. and Alyafeai, Zaid and Salesky, Elizabeth and Raffel, Colin and Dey, Manan and Gallé, Matthias and Raja, Arun and Si, Chenglei and Lee, Wilson Y. and Sagot, Benoît and Tan, Samson. Between Words and Characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. 2021.

  62. [102]

    Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science.

  63. [103]

    Recurrent neural network based language model. Interspeech.

  64. [104]

    Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.

  65. [107]

    Muennighoff, Niklas. SGPT: GPT Sentence Embeddings for Semantic Search.

  66. [108]

    Narayanan, Deepak and Shoeybi, Mohammad and Casper, Jared and LeGresley, Patrick and Patwary, Mostofa and Korthikanti, Vijay and Vainbrand, Dmitri and Kashinkunti, Prethvi and Bernauer, Julie and Catanzaro, Bryan and Phanishayee, Amar and Zaharia, Matei. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.

  67. [109]

    Wilhelmina Nekoto and Vukosi Marivate and Tshinondiwa Matsila and Timi E. Fasubaa and T Kolawole and Taiwo Helen Fagbohungbe and Solomon Oluwole Akinola and Shamsuddeen Hassan Muhammad and Salomon Kabongo Kabenamualu and Salomey Osei and Sackey Freshia and Rubungo Andre Niyongabo and Ricky Macharm and Perez Ogayo and Orevaoghene Ahia and Musie Meressa, et al. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages.

  68. [110]

    Nivre, Joakim and de Marneffe, Marie-Catherine and Ginter, Filip and Goldberg, Yoav and Hajič, Jan, et al. Universal Dependencies v1: A Multilingual Treebank Collection. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). 2016.

  69. [111]

    Nivre, Joakim and Zeman, Daniel and Ginter, Filip and Tyers, Francis. 2017.

  70. [116]

    Deep Contextualized Word Representations. Conference of the North American Chapter of the Association for Computational Linguistics.

  71. [117]

    Post, Matt. A Call for Clarity in Reporting BLEU Scores. 2018. doi:10.18653/v1/W18-6319

  72. [118]

    Improving language understanding by generative pre-training.

  73. [119]

    Language models are unsupervised multitask learners.

  74. [120]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446.

  75. [121]

    Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.

  76. [122]

    Rajbhandari, Samyam and Rasley, Jeff and Ruwase, Olatunji and He, Yuxiong. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. doi:10.1109/sc41405.2020.00024

  77. [123]

    Raji, Inioluwa Deborah and Kumar, I. Elizabeth and Horowitz, Aaron and Selbst, Andrew. The Fallacy of AI Functionality. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). 2022. doi:10.1145/3531146.3533158

  78. [124]

    Rasley, Jeff and Rajbhandari, Samyam and Ruwase, Olatunji and He, Yuxiong. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. 2020. doi:10.1145/3394486.3406703

  79. [125]

    How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). August 2021. doi:10.18653/v1/2021.acl-long.243

  80. [126]

    Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz. KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media. 2020.

Showing first 80 references.