GLM-130B: An Open Bilingual Pre-trained Model
Pith reviewed 2026-05-14 17:35 UTC · model grok-4.3
The pith
GLM-130B, a 130B-parameter bilingual model, outperforms GPT-3 175B on English benchmarks and runs in INT4 on four consumer GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLM-130B is a 130B-parameter bilingual pre-trained model that, after targeted training for stability, delivers higher scores than GPT-3 175B (davinci) across popular English benchmarks and higher scores than ERNIE TITAN 3.0 260B on Chinese benchmarks, while its scaling behavior permits direct INT4 quantization without post-training steps and with negligible loss.
What carries the argument
The design choices and stability strategies in the training pipeline, which prevent loss spikes and divergence at 130B scale, together with the scaling property that supports INT4 quantization with almost no performance loss.
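To make the INT4 part of the argument concrete, here is a minimal sketch of symmetric absmax weight quantization to 4-bit integers. It assumes a per-row scale and the integer range [-7, 7]; the function names and the toy row are hypothetical, and nothing here is taken from the GLM-130B code base.

```python
def quantize_row_int4(row):
    """Symmetric absmax quantization of one weight row to integers in [-7, 7]."""
    absmax = max(abs(w) for w in row)
    if absmax == 0.0:
        # An all-zero row needs no scaling; any positive scale works.
        return [0] * len(row), 1.0
    scale = absmax / 7.0
    # Round to the nearest level and clamp to the assumed 4-bit range.
    q = [max(-7, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Toy row (hypothetical values), used only to show the round trip.
row = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_row_int4(row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

The round-trip error per weight is bounded by half the scale, so a row with a narrow value range loses little precision. A single large outlier inflates the scale for every weight in that row, which is why the distribution of weights matters for whether 4 bits are enough.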
Load-bearing premise
The published benchmark scores reflect genuine capability rather than advantages from the bilingual data mixture or overlap with the closed training sets of the comparison models.
What would settle it
Performance on a fresh suite of held-out English and Chinese tasks that were never part of any public training corpus, where GLM-130B loses its reported edge over GPT-3 175B and ERNIE TITAN 3.0.
Original abstract
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
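The hardware claim in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only, treats the quoted "24G" and "11G" as decimal gigabytes, and ignores quantization scales, activations, and caches; those simplifications are assumptions of this example, not figures from the paper.

```python
PARAMS = 130e9  # parameter count stated in the abstract


def weight_gb(bits_per_param):
    """Weight storage in decimal gigabytes (weights only)."""
    return PARAMS * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(16)
int4_gb = weight_gb(4)

# Aggregate memory of the two GPU setups named in the abstract.
budgets_gb = {
    "4x RTX 3090 (24G)": 4 * 24,
    "8x RTX 2080 Ti (11G)": 8 * 11,
}

fits_int4 = {name: int4_gb < total for name, total in budgets_gb.items()}
fits_fp16 = {name: fp16_gb < total for name, total in budgets_gb.items()}

print(f"FP16 weights: {fp16_gb:.0f} GB, INT4 weights: {int4_gb:.0f} GB")
print("INT4 fits:", fits_int4)
print("FP16 fits:", fits_fp16)
```

Under these assumptions the INT4 weights occupy 65 GB, which fits inside both the 96 GB and the 88 GB aggregate budgets, while the 260 GB of FP16 weights fits in neither.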
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GLM-130B, a 130-billion-parameter bilingual (English and Chinese) pre-trained language model. It describes the training process, including design choices and the efficiency and stability strategies used to address loss spikes and divergence. It reports significant outperformance over GPT-3 175B (davinci) on English benchmarks (an advantage that OPT-175B and BLOOM-176B do not show), consistent superiority over ERNIE TITAN 3.0 260B on Chinese benchmarks, and INT4 quantization without post-training that enables inference on affordable consumer GPUs. Model weights, code, and training logs are open-sourced.
Significance. If the performance claims hold under fair and transparent evaluation protocols, the work is significant for releasing an open 100B-scale model that matches or exceeds closed counterparts like GPT-3, demonstrating practical quantization for accessibility, and documenting stability techniques for large-scale pre-training; these elements can accelerate reproducible research in NLP.
major comments (2)
- [§5 (Evaluation)] §5 (Evaluation) and associated tables: the outperformance claims over GPT-3 davinci rest on benchmark scores whose fairness cannot be verified because exact English data mixture ratios, n-gram decontamination logs, and per-task few-shot prompts are not supplied; without these the attribution of gains to the stability strategies rather than data differences remains insecure.
- [§4 (Training)] §4 (Training): the loss-spike handling and divergence-prevention techniques are presented as central to successful training, yet no ablation studies or quantitative comparisons isolate their contribution to final downstream scores, leaving the causal link to the reported benchmark advantages unestablished.
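The first major comment asks for n-gram decontamination logs. As an illustration of what such a check measures, the following sketch computes the fraction of a benchmark item's word n-grams that also occur in a training document. Whitespace tokenization, the default n=3, and the sample strings are assumptions made for brevity; real pipelines typically use longer n-grams over tokenizer output, and this is not the authors' procedure.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_text, corpus_text, n=3):
    """Fraction of the benchmark's n-grams that also appear in the corpus text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)


# Hypothetical strings, chosen only to exercise the function.
benchmark_item = "the cat sat on the mat"
training_doc = "yesterday the cat sat on a chair"
score = overlap_fraction(benchmark_item, training_doc)
print(score)  # 0.5: two of the four benchmark trigrams occur in the document
```

A decontamination log would record a score like this per benchmark item, together with the threshold above which the item or the training document was dropped.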
minor comments (2)
- [Abstract] Abstract: the reference to a 'unique scaling property' enabling INT4 quantization should be cross-referenced to the precise equation or figure that defines it.
- [§4 (Training)] Ensure all training hyperparameters, data mixture statistics, and statistical significance tests for benchmark differences are consolidated in a single reproducibility table.
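The second minor comment asks for statistical significance tests on benchmark differences. One common option is a paired bootstrap over per-example correctness, sketched below. The function name, resample count, and seed are hypothetical choices, and the sketch assumes both models were scored on the same examples in the same order.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which model A scores strictly higher than model B.

    `correct_a` and `correct_b` hold 0/1 correctness for the same examples.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples


# Degenerate inputs with known answers, used as a sanity check.
always_better = paired_bootstrap([1] * 20, [0] * 20)
never_better = paired_bootstrap([1, 0, 1, 0], [1, 0, 1, 0])
```

A win fraction close to 1.0 suggests the observed gap is robust to which test examples happened to be sampled; a value near 0.5 suggests the two models are hard to separate on that benchmark.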
Simulated Author's Rebuttal
We are grateful for the referee's insightful comments, which help improve the manuscript's rigor. We respond to each major comment below, making revisions where possible to enhance transparency.
-
Referee: [§5 (Evaluation)] §5 (Evaluation) and associated tables: the outperformance claims over GPT-3 davinci rest on benchmark scores whose fairness cannot be verified because exact English data mixture ratios, n-gram decontamination logs, and per-task few-shot prompts are not supplied; without these the attribution of gains to the stability strategies rather than data differences remains insecure.
Authors: We thank the referee for highlighting the need for greater transparency. In the revised manuscript, we will include the exact English data mixture ratios, n-gram decontamination procedures and logs, and the specific per-task few-shot prompts. These additions will permit independent verification of benchmark fairness and help clarify the relative contributions of data and training stability techniques. revision: yes
-
Referee: [§4 (Training)] §4 (Training): the loss-spike handling and divergence-prevention techniques are presented as central to successful training, yet no ablation studies or quantitative comparisons isolate their contribution to final downstream scores, leaving the causal link to the reported benchmark advantages unestablished.
Authors: We agree that ablation studies would provide stronger causal evidence. However, performing them at 130B scale would require multiple full pre-training runs at prohibitive computational cost. We instead document the techniques in detail, release the full training logs, and show their immediate stabilizing effects via loss curves. This supplies practical guidance even without exhaustive ablations. revision: no
- Performing ablation studies at 130B-parameter scale to isolate the downstream impact of loss-spike handling techniques
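Because the rebuttal leans on released training logs and loss curves, a reader can at least locate spikes in those curves independently. The sketch below flags steps whose loss exceeds a multiple of the trailing-window mean. The window size, ratio, and toy curve are assumptions; this is a log-analysis heuristic, not the mitigation technique described in the paper.

```python
def find_loss_spikes(losses, window=5, ratio=1.5):
    """Return step indices where loss exceeds `ratio` times the mean of the previous `window` steps."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = sum(losses[step - window:step]) / window
        if losses[step] > ratio * baseline:
            spikes.append(step)
    return spikes


# Hypothetical loss curve with one obvious spike at step 6.
curve = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 4.0, 1.5, 1.4]
spike_steps = find_loss_spikes(curve)
print(spike_steps)  # [6]
```

Counting flagged steps before and after a stability change is a cheap descriptive comparison. It does not substitute for the ablations the referee requested, since it says nothing about downstream benchmark scores.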
Circularity Check
Empirical pre-training and external benchmarking; no derivation reduces to inputs by construction
The manuscript describes architecture choices, training stability techniques (e.g., loss-spike mitigation), and reports benchmark scores against GPT-3, OPT, BLOOM, and ERNIE. No equations or claims equate a 'prediction' to a fitted parameter, nor does any central result rest on a self-citation chain that itself lacks independent verification. All performance assertions are falsifiable via replication on the released weights and public benchmarks; the bilingual data mixture and decontamination steps are presented as engineering decisions rather than derived quantities.
Axiom & Free-Parameter Ledger
free parameters (2)
- 130B parameter count
- training data mixture ratio
axioms (1)
- domain assumption: Standard transformer attention and feed-forward blocks suffice for 100B-scale language modeling
Forward citations
Cited by 22 Pith papers
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
-
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
PR-MaGIC refines prompts in in-context segmentation via test-time gradient flow from the mask decoder plus top-1 selection, yielding better masks across benchmarks without training.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over str...
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.