arXiv: 2210.02414 · v2 · submitted 2022-10-05 · 💻 cs.CL · cs.AI · cs.LG

GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang

Pith reviewed 2026-05-14 17:35 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: pre-trained language model · bilingual model · large language models · INT4 quantization · open source model · GLM-130B · benchmark comparison

The pith

GLM-130B, a 130B-parameter bilingual model, outperforms GPT-3 175B on English benchmarks and runs in INT4 on four consumer GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GLM-130B as a 130 billion parameter pre-trained model trained on English and Chinese text. It describes the design choices and training strategies developed to overcome loss spikes and divergence during pre-training at this scale. The resulting model surpasses GPT-3 175B on a range of English benchmarks and exceeds the larger ERNIE TITAN 3.0 on Chinese benchmarks. It further exploits a scaling property to reach INT4 quantization with almost no accuracy drop, allowing inference on modest hardware. The weights, code, and logs are released publicly.

Core claim

GLM-130B is a 130B-parameter bilingual pre-trained model that, after targeted training for stability, delivers higher scores than GPT-3 175B (davinci) across popular English benchmarks and higher scores than ERNIE TITAN 3.0 260B on Chinese benchmarks, while its scaling behavior permits direct INT4 quantization without post-training steps and with negligible loss.
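To make "direct INT4 quantization without post-training steps" concrete, here is a minimal sketch of symmetric absmax weight quantization applied to a single row of weights. It is a generic illustration, not the paper's quantizer: the row-wise granularity, the [-7, 7] integer range, and the helper names `quantize_row_int4` and `dequantize_row` are all assumptions made for this example.

```python
def quantize_row_int4(row):
    """Symmetric absmax quantization of one weight row to integers in [-7, 7].

    Hypothetical helper for illustration; granularity and range are assumptions.
    """
    absmax = max(abs(w) for w in row)
    # One floating-point scale per row maps the largest weight to +/-7.
    scale = absmax / 7 if absmax > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights from the 4-bit codes."""
    return [v * scale for v in q]


# Toy weights, not taken from the model.
row = [0.12, -0.40, 0.05, 0.33, -0.07, 0.21]
q, scale = quantize_row_int4(row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_err)
```

With this scheme the per-weight rounding error is bounded by half the scale, so rows whose weights sit in a narrow range lose little precision. No calibration data or retraining is involved, which is the sense of "without post-training" used above.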

What carries the argument

The training pipeline of design choices and stability strategies that prevent loss spikes and divergence at 130B scale, together with the scaling property that supports lossless INT4 quantization.

Load-bearing premise

The published benchmark scores reflect genuine capability rather than advantages from the bilingual data mixture or overlap with the closed training sets of the comparison models.

What would settle it

Performance on a fresh suite of held-out English and Chinese tasks that were never part of any public training corpus, where GLM-130B loses its reported edge over GPT-3 175B and ERNIE TITAN 3.0.

Original abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
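The hardware claim in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only; it ignores activations, the attention cache, quantization scales, and the difference between GB and GiB, so it is a rough lower bound on memory need rather than a deployment guide. The function name `weight_gb` is introduced here for illustration.

```python
PARAMS = 130e9  # parameter count stated in the abstract

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

# Aggregate GPU memory of the two setups named in the abstract (in GB, loosely).
GPU_SETUPS = {
    "4x RTX 3090 (24G)": 4 * 24,
    "8x RTX 2080 Ti (11G)": 8 * 11,
}


def weight_gb(precision):
    """Weight-only memory footprint in GB for the given precision."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    need = weight_gb(precision)
    for name, have in GPU_SETUPS.items():
        verdict = "fits" if need < have else "does not fit"
        print(f"{precision}: {need:.0f} GB of weights vs {have} GB on {name}: {verdict}")
```

Under these assumptions the FP16 weights take about 260 GB and the INT8 weights about 130 GB, both more than either setup offers, while the INT4 weights take about 65 GB, which leaves some headroom on both the 96 GB and the 88 GB configuration.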

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GLM-130B is a useful open release of a 130B bilingual model with practical stability notes and INT4 quantization that runs on consumer GPUs, but the English benchmark wins over GPT-3 need more data details to attribute cleanly to the training methods.

The core value here is the open release itself. They put out weights, code, training logs, and a toolkit for a 130B bilingual English-Chinese model, plus concrete notes on avoiding loss spikes and divergence at scale. That combination is still uncommon at this size and makes the work immediately usable for people who cannot train from scratch.

The INT4 quantization without post-training is a clear engineering win; it keeps performance close while letting the model run on 4x RTX 3090s or similar modest setups, which lowers the barrier for inference experiments. Bilingual pre-training at this scale is also new ground, and they show it helps on Chinese benchmarks against a larger closed model like ERNIE TITAN 3.0. The stability techniques they describe sound like the kind of practical detail that can save other groups time and compute.

On the performance side, the claim that GLM-130B beats GPT-3 davinci on English tasks where OPT-175B and BLOOM-176B do not is the headline result. If the numbers hold after checking prompts and splits, it suggests the data mix and stability work delivered an edge.

The soft spots are mostly around verification. GPT-3's training corpus is closed, so any difference in English data quality, recency, or decontamination could explain part of the gap without the claimed innovations doing all the heavy lifting. The paper would be tighter with explicit tables on data ratios, n-gram overlap logs, and the exact few-shot prompts used for each score. Statistical significance on the benchmark deltas is also missing from the abstract-level view.

This is for groups that want a strong open baseline for bilingual work or for anyone studying large-scale training stability and efficient inference. The engineering sections give enough to replicate or extend the practical parts.

I would send it to peer review. The openness and the usable quantization results are enough to justify referee time, even if the attribution of the English gains needs more supporting detail in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces GLM-130B, a 130 billion parameter bilingual (English and Chinese) pre-trained language model. It describes the training process including design choices, efficiency and stability strategies to address loss spikes and divergence, reports significant outperformance over GPT-3 175B (davinci) on English benchmarks (unlike OPT-175B and BLOOM-176B), consistent superiority over ERNIE TITAN 3.0 260B on Chinese benchmarks, and INT4 quantization without post-training that enables inference on affordable consumer GPUs. Model weights, code, and training logs are open-sourced.

Significance. If the performance claims hold under fair and transparent evaluation protocols, the work is significant for releasing an open 100B-scale model that matches or exceeds closed counterparts like GPT-3, demonstrating practical quantization for accessibility, and documenting stability techniques for large-scale pre-training; these elements can accelerate reproducible research in NLP.

major comments (2)

[§5 (Evaluation)] §5 (Evaluation) and associated tables: the outperformance claims over GPT-3 davinci rest on benchmark scores whose fairness cannot be verified because exact English data mixture ratios, n-gram decontamination logs, and per-task few-shot prompts are not supplied; without these the attribution of gains to the stability strategies rather than data differences remains insecure.
[§4 (Training)] §4 (Training): the loss-spike handling and divergence-prevention techniques are presented as central to successful training, yet no ablation studies or quantitative comparisons isolate their contribution to final downstream scores, leaving the causal link to the reported benchmark advantages unestablished.
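The first major comment asks for n-gram decontamination logs. As a rough illustration of what such a check measures, the sketch below computes the fraction of a benchmark item's word n-grams that also occur in a set of training documents. The whitespace tokenization, the window size, and the function names `ngrams` and `overlap_ratio` are assumptions for this example and do not describe the authors' procedure.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, corpus_docs, n=8):
    """Fraction of the benchmark item's n-grams that appear in the corpus."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(bench & corpus) / len(bench)


# Toy data, invented for this example.
corpus_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaked_item = "quick brown fox jumps over the lazy dog"
clean_item = "a completely different sentence about bilingual language models"

print(overlap_ratio(leaked_item, corpus_docs, n=3))
print(overlap_ratio(clean_item, corpus_docs, n=3))
```

A per-task log of ratios like these, computed against the actual pre-training corpus, is the kind of evidence that would let a reader judge how much benchmark text the model may have seen during training.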

minor comments (2)

[Abstract] Abstract: the reference to a 'unique scaling property' enabling INT4 quantization should be cross-referenced to the precise equation or figure that defines it.
[§4 (Training)] Ensure all training hyperparameters, data mixture statistics, and statistical significance tests for benchmark differences are consolidated in a single reproducibility table.
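The second minor comment asks for statistical significance tests on benchmark differences. One common option is a paired bootstrap over per-example correctness; the sketch below is a minimal version of that idea. The function name `paired_bootstrap`, the resample count, and the 0/1 correctness vectors are all hypothetical and synthetic, and nothing here reproduces scores from the paper.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which model A scores strictly higher.

    correct_a and correct_b are 0/1 lists aligned on the same test examples.
    """
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(correct_a[i] - correct_b[i] for i in idx)
        if delta > 0:
            wins += 1
    return wins / n_resamples


# Synthetic per-example results on 100 shared test items.
model_a = [1] * 70 + [0] * 30
model_b = [1] * 50 + [0] * 50

win_rate = paired_bootstrap(model_a, model_b)
tie_rate = paired_bootstrap(model_a, model_a)
print(win_rate, tie_rate)
```

A win rate close to 1 suggests the gap is unlikely to be a sampling artifact of the test set, while a value near 0.5 suggests the two systems are hard to separate. Reporting such a figure next to each benchmark delta would address the comment directly.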

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful for the referee's insightful comments, which help improve the manuscript's rigor. We respond to each major comment below, making revisions where possible to enhance transparency.

Referee: [§5 (Evaluation)] §5 (Evaluation) and associated tables: the outperformance claims over GPT-3 davinci rest on benchmark scores whose fairness cannot be verified because exact English data mixture ratios, n-gram decontamination logs, and per-task few-shot prompts are not supplied; without these the attribution of gains to the stability strategies rather than data differences remains insecure.

Authors: We thank the referee for highlighting the need for greater transparency. In the revised manuscript, we will include the exact English data mixture ratios, n-gram decontamination procedures and logs, and the specific per-task few-shot prompts. These additions will permit independent verification of benchmark fairness and help clarify the relative contributions of data and training stability techniques. (revision: yes)

Referee: [§4 (Training)] §4 (Training): the loss-spike handling and divergence-prevention techniques are presented as central to successful training, yet no ablation studies or quantitative comparisons isolate their contribution to final downstream scores, leaving the causal link to the reported benchmark advantages unestablished.

Authors: We agree that ablation studies would provide stronger causal evidence. However, performing them at 130B scale would require multiple full pre-training runs at prohibitive computational cost. We instead document the techniques in detail, release the full training logs, and show their immediate stabilizing effects via loss curves. This supplies practical guidance even without exhaustive ablations. (revision: no)

standing simulated objections not resolved

Performing ablation studies at 130B-parameter scale to isolate the downstream impact of loss-spike handling techniques
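Short of full ablations, the released training logs could at least be scanned for loss spikes so that readers can line them up with the documented interventions. The sketch below flags steps whose loss jumps well above a rolling median of the preceding steps. The log format, the window size, the threshold factor, and the function name `find_loss_spikes` are assumptions for illustration, and the loss values are synthetic.

```python
from statistics import median


def find_loss_spikes(losses, window=5, factor=1.5):
    """Return step indices whose loss exceeds factor x the median of the prior window."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if losses[step] > factor * baseline:
            spikes.append(step)
    return spikes


# Synthetic loss curve with one injected spike at step 6.
synthetic_losses = [3.0, 2.9, 2.8, 2.75, 2.7, 2.68, 6.5, 2.66, 2.64, 2.62]
print(find_loss_spikes(synthetic_losses))
```

Using a median rather than a mean keeps a single spike from inflating the baseline for the steps that follow it. A scan like this only locates instabilities; it does not show that any given technique caused the recovery, which is the gap the unresolved objection points to.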

Circularity Check

0 steps flagged

Empirical pre-training and external benchmarking; no derivation reduces to inputs by construction

The manuscript describes architecture choices, training stability techniques (e.g., loss-spike mitigation), and reports benchmark scores against GPT-3, OPT, BLOOM, and ERNIE. No equations or claims equate a 'prediction' to a fitted parameter, nor does any central result rest on a self-citation chain that itself lacks independent verification. All performance assertions are falsifiable via replication on the released weights and public benchmarks; the bilingual data mixture and decontamination steps are presented as engineering decisions rather than derived quantities.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer architecture choices, conventional optimizer settings, and the assumption that the chosen English-Chinese data mixture produces comparable benchmark scores; no new physical or mathematical entities are introduced.

free parameters (2)

130B parameter count
Design choice to reach GPT-3 scale; not derived from data.
training data mixture ratio
Chosen to balance English and Chinese performance; affects downstream scores.

axioms (1)

domain assumption: Standard transformer attention and feed-forward blocks suffice for 100B-scale language modeling
Invoked throughout the training description.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
cs.CL 2026-05 conditional novelty 7.0

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

PR-MaGIC refines prompts in in-context segmentation via test-time gradient flow from the mask decoder plus top-1 selection, yielding better masks across benchmarks without training.
SAGE: A Service Agent Graph-guided Evaluation Benchmark
cs.AI 2026-04 unverdicted novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
cs.LG 2026-05 unverdicted novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
Understanding the Mechanism of Altruism in Large Language Models
econ.GN 2026-04 unverdicted novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
cs.DB 2026-04 unverdicted novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
cs.CL 2025-05 conditional novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
cs.CL 2023-08 unverdicted novelty 6.0

Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
cs.CL 2023-05 unverdicted novelty 6.0

Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
cs.CL 2022-11 unverdicted novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
cs.CL 2026-04 unverdicted novelty 5.0

A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over str...
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
cs.CL 2024-06 unverdicted novelty 3.0

GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
