arxiv: 2310.11453 · v1 · pith:XVLHF5BOnew · submitted 2023-10-17 · 💻 cs.CL

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei

classification 💻 cs.CL

keywords language, bitnet, models, large, scaling, consumption, energy, introduce

Abstract

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
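To make the idea of a 1-bit linear layer concrete, here is a minimal sketch in plain Python. The abstract does not say how BitLinear binarizes weights, so this assumes a common scheme: center the weights, take their sign, and rescale by the mean absolute value. The names `binarize`, `bit_linear`, and `weight_bytes` are hypothetical. The sketch covers only the forward pass and leaves out activation quantization, normalization, and the gradient handling needed to train such weights from scratch.

```python
def binarize(weights):
    """Map a weight matrix to +1/-1 entries plus one scalar scale.

    Assumption (not stated in the abstract): center by the mean, take the
    sign, and use the mean absolute value as the rescaling factor.
    """
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    scale = sum(abs(w) for w in flat) / len(flat)
    signs = [[1 if w - mean > 0 else -1 for w in row] for row in weights]
    return signs, scale


def bit_linear(x, weights):
    """Forward pass of a bias-free linear layer using binarized weights."""
    signs, scale = binarize(weights)
    # With +1/-1 weights, each dot product reduces to additions and subtractions.
    return [scale * sum(s * xi for s, xi in zip(row, x)) for row in signs]


def weight_bytes(n_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


# Toy 2x3 weight matrix and a 3-dimensional input.
W = [[0.4, -0.2, 0.1], [-0.5, 0.3, -0.1]]
x = [1.0, 2.0, 3.0]

signs, scale = binarize(W)
y = bit_linear(x, W)

# Ratio of FP16 to 1-bit weight storage for the same parameter count.
ratio = weight_bytes(1000, 16) / weight_bytes(1000, 1)
```

Under these assumptions, the weight storage shrinks by the ratio of bit widths (16 for FP16 versus 1-bit). This counts weights only and says nothing about activations, optimizer state, or the energy figures reported in the paper.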

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
cs.LG 2026-02 unverdicted novelty 7.0

CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into wei...
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
cs.DC 2025-12 conditional novelty 7.0

Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
cs.CL 2024-02 unverdicted novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss
cs.LG 2026-05 unverdicted novelty 6.0

LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.
Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
cs.CL 2026-05 conditional novelty 6.0

Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
cs.LG 2026-04 unverdicted novelty 6.0

BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
cs.LG 2026-04 unverdicted novelty 6.0

LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
cs.NE 2026-04 unverdicted novelty 6.0

BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
cs.LG 2026-04 unverdicted novelty 6.0

STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
cs.AR 2026-02 unverdicted novelty 6.0

D-Legion proposes a scalable architecture of Legions containing adaptive-precision systolic array cores that accelerates quantized LLM matrix multiplications, delivering up to 8.2x lower latency and 3.8x higher memory...
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models
cs.LG 2025-12 unverdicted novelty 6.0

A post-training 1-bit quantization method for LLMs that fixes error accumulation and anisotropic representation distortion to outperform prior weight-driven and naive output-driven baselines.
A Composite Activation Function for Learning Stable Binary Representations
cs.LG 2026-05 unverdicted novelty 5.0

HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models
cs.CL 2026-04 unverdicted novelty 5.0

SEPTQ simplifies LLM post-training quantization to two steps via static global importance scoring and mask-guided column-wise weight updates, claiming superior results over baselines in low-bit settings.
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
cs.LG 2026-04 unverdicted novelty 5.0

BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
cs.CV 2026-05 unverdicted novelty 4.0

AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI
cs.AI 2026-03 unverdicted novelty 3.0

The paper claims that composing the Dimensional Type System, Program Hypergraph, and b-posit 2026 standard yields depth-independent training memory at ~2x inference, grade-preserving updates, Bayesian distillation for...