
arxiv: 2310.11453 · v1 · pith:XVLHF5BOnew · submitted 2023-10-17 · 💻 cs.CL

BitNet: Scaling 1-bit Transformers for Large Language Models

Pith reviewed 2026-05-24 05:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords 1-bit transformers · large language models · quantization · energy efficiency · scaling laws · BitLinear · memory footprint

The pith

BitNet trains Transformers with 1-bit weights from scratch and reaches performance competitive with FP16 baselines while cutting memory and energy use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models face growing deployment barriers and energy costs because of their size. BitNet addresses this with a 1-bit Transformer that replaces standard linear layers with a specialized BitLinear layer. The model therefore learns 1-bit weights from the start of training rather than being quantized afterward. Experiments on language modeling show performance competitive with 8-bit quantization methods and FP16 baselines at far lower memory and energy cost. The architecture also follows a scaling law akin to that of full-precision Transformers, which suggests it can grow to larger sizes without losing its efficiency edge.

Core claim

BitNet is a 1-bit Transformer architecture that substitutes BitLinear for the standard nn.Linear layer, enabling stable training of models whose weights are restricted to one bit throughout. On language modeling benchmarks this yields performance competitive with state-of-the-art 8-bit quantization methods and FP16 baselines, accompanied by large reductions in memory footprint and energy consumption, and a scaling law comparable to that of full-precision Transformers.

What carries the argument

BitLinear, a drop-in replacement for the linear layer that supports training and inference with 1-bit weights.
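
To make the BitLinear idea concrete, here is a minimal pure-Python sketch of a forward pass with 1-bit weights. It assumes sign binarization after subtracting the weight mean, plus a single per-matrix scale equal to the mean absolute weight. Neither detail is stated on this page, and the names `binarize_weights` and `bitlinear_forward` are illustrative, not the authors' API. Activation quantization and normalization are left out.

```python
# Sketch of a BitLinear-style forward pass (illustrative, not the authors' code).
# Assumptions: weights are centered by their mean before sign(), and the output
# is rescaled by beta = mean(|W|). Activations stay in full precision here.

def binarize_weights(weights):
    """Return (+1/-1 matrix, scale beta) for a list-of-rows weight matrix."""
    flat = [w for row in weights for w in row]
    alpha = sum(flat) / len(flat)                  # mean used for centering
    beta = sum(abs(w) for w in flat) / len(flat)   # mean absolute value
    signs = [[1 if w - alpha > 0 else -1 for w in row] for row in weights]
    return signs, beta


def bitlinear_forward(weights, x):
    """Compute y = beta * (sign(W - alpha) @ x) for one input vector x."""
    signs, beta = binarize_weights(weights)
    # With +1/-1 weights each dot product is a signed sum of the inputs.
    return [beta * sum(s * xi for s, xi in zip(row, x)) for row in signs]


W = [[0.5, -0.25], [-1.0, 0.75]]
print(bitlinear_forward(W, [2.0, 4.0]))  # prints [-1.25, 1.25]
```

Because every weight is +1 or -1, each output needs only additions and subtractions followed by one multiplication by the scale.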

If this is right

  • Models can be deployed on hardware with tighter memory limits than current FP16 or 8-bit systems allow.
  • Training and inference energy budgets shrink substantially for the same model size.
  • The same scaling relationship between compute and performance observed in full-precision models continues to hold.
  • Larger 1-bit models become feasible without proportional increases in hardware requirements.
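
The memory implication can be illustrated with back-of-the-envelope arithmetic. The sketch below counts weight storage only, assuming every parameter is stored at the stated bit width. It ignores embeddings, activations, scale factors, and any higher-precision copies kept during training, so the figures are rough upper bounds on the saving rather than measurements from the paper. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Model sizes taken from the range of the paper's scaling experiments.
for label, n in [("125M", 125e6), ("6.7B", 6.7e9), ("30B", 30e9)]:
    fp16 = weight_memory_gb(n, 16)
    int8 = weight_memory_gb(n, 8)
    one_bit = weight_memory_gb(n, 1)
    print(f"{label}: FP16 {fp16:.2f} GB, 8-bit {int8:.2f} GB, 1-bit {one_bit:.2f} GB")
```

Under these assumptions a 30B-parameter model needs about 60 GB for FP16 weights and about 3.75 GB for 1-bit weights, a 16x reduction.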

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware accelerators specialized for 1-bit arithmetic could further amplify the measured energy savings.
  • The same linear-layer replacement might transfer to other sequence models beyond the Transformer.
  • If the scaling law persists at extreme sizes, 1-bit training could become the default route for the largest models.
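
Whether a scaling law "persists" can be checked by fitting a curve to (model size, loss) pairs and comparing exponents across precisions. The sketch below assumes a pure power law, loss = a * N^(-b), fitted by least squares in log-log space. This form is a simplification and may differ from the one fitted in the paper, and the data points are synthetic values generated for the example, not results from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


# Synthetic losses generated from loss = 10 * N**-0.07 (NOT paper data).
sizes = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9]
losses = [10.0 * n ** -0.07 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Running the same fit on 1-bit and FP16 loss curves and comparing the recovered exponents is one simple way to quantify how similar the two scaling trends are.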

Load-bearing premise

That replacing linear layers with BitLinear keeps large-scale training stable and prevents a fundamental performance gap versus higher-precision models.

What would settle it

A side-by-side run of BitNet and an FP16 Transformer of equal parameter count on a standard language-modeling benchmark where BitNet shows markedly higher perplexity that does not close with further training.
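
The test described above reduces to tracking a perplexity gap across checkpoints. A minimal sketch follows, assuming losses are reported as mean per-token negative log-likelihood in nats, so that perplexity is exp(loss). The checkpoint values are invented for illustration and `gap_is_closing` is a hypothetical helper, not a procedure from the paper.

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)


def gap_is_closing(bitnet_nll, fp16_nll):
    """True if the perplexity gap at the last checkpoint is smaller than at the first."""
    gaps = [perplexity(b) - perplexity(f) for b, f in zip(bitnet_nll, fp16_nll)]
    return gaps[-1] < gaps[0]


# Made-up per-checkpoint losses, for illustration only.
bitnet = [3.40, 3.10, 2.95]
fp16 = [3.20, 3.00, 2.90]
print(gap_is_closing(bitnet, fp16))  # prints True
```

A falsifying result would be the opposite outcome at matched parameter count and training budget: a gap that stays flat or widens as training continues.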

Figures

Figures reproduced from arXiv: 2310.11453 by Fan Yang, Furu Wei, Hongyu Wang, Huaijie Wang, Li Dong, Lingxiao Ma, Ruiping Wang, Shaohan Huang, Shuming Ma, Yi Wu.

Figure 1: BitNet trains 1-bit Transformers from scratch, obtaining competitive results in an energy...
Figure 2: (a) The computation flow of BitLinear. (b) The architecture of BitNet, consisting of the stacks of attentions and FFNs, where matrix multiplication is implemented as BitLinear.
Figure 3: Scaling curves of BitNet and FP16 Transformers.
Figure 4: Zero-shot (Left) and few-shot (Right) performance of BitNet and FP16 Transformer against...
Figure 5: BitNet is more stable than FP16 Transformer with a same learning rate (Left). The training...
Original abstract

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces BitNet, a 1-bit Transformer for large language models that replaces nn.Linear with a new BitLinear layer to train 1-bit weights from scratch. It claims competitive language-modeling performance versus 8-bit quantization and FP16 baselines, substantially lower memory and energy use, and a scaling law similar to full-precision Transformers.

Significance. If the empirical claims and scaling-law result are substantiated with rigorous controls, the work would be significant for efficient LLM deployment, directly addressing memory and energy constraints while preserving trainability from random initialization.

major comments (2)
  1. [Method / Experiments] The central claim that BitLinear enables stable end-to-end training at LLM scale rests on the straight-through estimator preserving usable gradient flow through the binarization step. No gradient-norm statistics, activation/gradient histograms, or ablation on STE scaling factors are supplied for models beyond a few hundred layers, leaving open the possibility of systematic attenuation that would invalidate both performance parity and the reported scaling law.
  2. [Abstract / Experiments] The abstract and scaling-law claim assert competitive performance, yet the provided text supplies neither dataset descriptions, training hyperparameters, error bars, nor ablation evidence comparing BitNet against the FP16 baseline at matched compute. Without these, the load-bearing empirical result cannot be evaluated.
minor comments (1)
  1. [Method] Notation for the binarization operation (sign vs. round) and the precise form of the STE should be stated explicitly with an equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where revisions will be incorporated.

  1. Referee: [Method / Experiments] The central claim that BitLinear enables stable end-to-end training at LLM scale rests on the straight-through estimator preserving usable gradient flow through the binarization step. No gradient-norm statistics, activation/gradient histograms, or ablation on STE scaling factors are supplied for models beyond a few hundred layers, leaving open the possibility of systematic attenuation that would invalidate both performance parity and the reported scaling law.

    Authors: We agree that explicit gradient-flow diagnostics would strengthen the stability claims. Our training runs at the reported scales completed without divergence, providing indirect evidence of usable gradients, but we acknowledge the absence of the requested statistics. In the revised manuscript we will add gradient-norm statistics, activation/gradient histograms, and a summary of internal STE scaling-factor ablations for the largest models trained. revision: partial

  2. Referee: [Abstract / Experiments] The abstract and scaling-law claim assert competitive performance, yet the provided text supplies neither dataset descriptions, training hyperparameters, error bars, nor ablation evidence comparing BitNet against the FP16 baseline at matched compute. Without these, the load-bearing empirical result cannot be evaluated.

    Authors: Dataset descriptions (The Pile) and training hyperparameters appear in Section 4 and the appendix; the FP16 baseline comparisons are also present. However, error bars and explicit matched-compute ablations are not reported. We will add error bars from repeated runs where available and include additional matched-compute ablation results in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: BitNet claims rest on independent empirical training runs and comparisons

full rationale

The paper defines BitLinear explicitly as a drop-in replacement using sign/round with straight-through estimator, then reports measured performance, memory, and scaling behavior from actual training experiments on language modeling. No derivation chain exists that reduces a claimed result to its own inputs by construction; the scaling-law similarity is an observed outcome, not a fitted parameter renamed as prediction. No self-citation is load-bearing for the central claims, and the architecture choices are stated as design decisions rather than theorems imported from prior author work. The result is therefore self-contained against external benchmarks.
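
The rationale above refers to sign binarization trained with a straight-through estimator (STE). As a hedged illustration, the sketch below runs SGD on real-valued latent weights while the forward pass uses only their signs. It assumes the simplest identity form of the STE, which treats the derivative of sign as 1. The paper's exact variant is not given on this page, and `ste_step` is a hypothetical name.

```python
def sign(v):
    """Binarize to +1 / -1 (zero maps to -1 in this sketch)."""
    return 1.0 if v > 0 else -1.0


def ste_step(latent, x, target, lr):
    """One SGD step on latent weights for y = sum(sign(w_i) * x_i).

    Loss is 0.5 * (y - target)**2. The identity STE treats d sign(w) / d w
    as 1, so the gradient for each latent weight is (y - target) * x_i even
    though sign() has zero gradient almost everywhere.
    """
    y = sum(sign(w) * xi for w, xi in zip(latent, x))
    err = y - target
    new_latent = [w - lr * err * xi for w, xi in zip(latent, x)]
    return new_latent, 0.5 * err ** 2


latent = [0.05, -0.30]
x = [1.0, 2.0]
for step in range(3):
    latent, loss = ste_step(latent, x, target=3.0, lr=0.1)
    print(step, [round(w, 3) for w in latent], loss)
```

In this toy run the first update pushes the second latent weight across zero, its sign flips, and the loss drops from 8.0 to 0.0. That sign-flipping behavior is what the gradient diagnostics requested by the referee would monitor at scale.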

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level introduction of BitLinear; full paper would be needed to audit training assumptions or normalization choices.

invented entities (1)
  • BitLinear layer (no independent evidence)
    purpose: Drop-in replacement for nn.Linear to enable from-scratch training of 1-bit weights
    New component presented as the core technical contribution.

pith-pipeline@v0.9.0 · 5673 in / 1093 out tokens · 28531 ms · 2026-05-24T05:39:04.609506+00:00 · methodology


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers

    cs.LG 2026-02 unverdicted novelty 7.0

    CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into wei...

  2. Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

    cs.DC 2025-12 conditional novelty 7.0

    Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.

  3. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    cs.CL 2024-02 unverdicted novelty 7.0

    BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

  4. A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's tran...

  5. WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points

    cs.LG 2026-05 unverdicted novelty 6.0

    WinQ accelerates quantization-aware training up to 4x and improves sub-4-bit accuracy up to 8.8% by weight interpolation resets and noise-regularized gradients that increase Hessian eigenvalue magnitudes around saddle points.

  6. LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

    cs.LG 2026-05 unverdicted novelty 6.0

    LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.

  7. Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

    cs.CL 2026-05 conditional novelty 6.0

    Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.

  8. BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

    cs.LG 2026-04 unverdicted novelty 6.0

    BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.

  9. LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.

  10. GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

    cs.CL 2026-04 unverdicted novelty 6.0

    GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existin...

  11. GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

    cs.CL 2026-04 unverdicted novelty 6.0

    GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...

  12. BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

    cs.NE 2026-04 unverdicted novelty 6.0

    BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...

  13. STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

    cs.LG 2026-04 unverdicted novelty 6.0

    STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.

  14. D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

    cs.AR 2026-02 unverdicted novelty 6.0

    D-Legion proposes a scalable architecture of Legions containing adaptive-precision systolic array cores that accelerates quantized LLM matrix multiplications, delivering up to 8.2x lower latency and 3.8x higher memory...

  15. Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

    cs.LG 2025-12 unverdicted novelty 6.0

    A post-training 1-bit quantization method for LLMs that fixes error accumulation and anisotropic representation distortion to outperform prior weight-driven and naive output-driven baselines.

  16. Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

    cs.CL 2025-08 conditional novelty 6.0

    A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.

  17. A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks

    cs.LG 2025-07 unverdicted novelty 6.0

    Proves polynomial-in-width and exponential-in-depth lower bounds on linear regions for ternary ReLU regression networks, with width-doubling constructions achieving bounds comparable to unrestricted ReLU networks.

  18. Highly Efficient and Effective LLMs with Multi-Boolean Architectures

    stat.ML 2025-05 unverdicted novelty 6.0

    The authors present multi-kernel Boolean architectures for LLMs that support direct fine-tuning in the Boolean domain without latent weights and claim to outperform prior ultra-low-bit methods.

  19. BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

    cs.LG 2025-05 conditional novelty 6.0

    BTC-LLM uses a binary codebook for pattern clustering and a learnable transformation to achieve 0.7-1.11 bit LLM quantization while limiting accuracy loss to a few percent on LLaMA and Qwen models.

  20. Lever: Speculative LLM Inference on Smartphones

    cs.LG 2026-05 unverdicted novelty 5.0

    Lever optimizes the drafting, verification, and execution stages of speculative decoding for flash-backed LLM inference on smartphones, reporting 2.93x average latency reduction over baseline flash-offloaded inference.

  21. A Composite Activation Function for Learning Stable Binary Representations

    cs.LG 2026-05 unverdicted novelty 5.0

    HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.

  22. SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SEPTQ simplifies LLM post-training quantization to two steps via static global importance scoring and mask-guided column-wise weight updates, claiming superior results over baselines in low-bit settings.

  23. BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

    cs.LG 2026-04 unverdicted novelty 5.0

    BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.

  24. Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

    cs.LG 2025-12 unverdicted novelty 5.0

    A post-training quantization technique for 1-bit LLMs that corrects layer-wise error accumulation and anisotropic representation distortion to preserve output behavior more effectively than existing methods.

  25. AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.

  26. Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI

    cs.AI 2026-03 unverdicted novelty 3.0

    The paper claims that composing the Dimensional Type System, Program Hypergraph, and b-posit 2026 standard yields depth-independent training memory at ~2x inference, grade-preserving updates, Bayesian distillation for...

  27. Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

    cs.DC 2025-03 unverdicted novelty 2.0

    Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 25 Pith papers · 12 internal anchors

  1. [1]

    PaLM 2 Technical Report

    [ADF+23] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, and et al. PaLM 2 technical report. CoRR, abs/2305.10403,

  2. [2]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    [BLC13] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432,

  3. [3]

    XNOR-Net++: improved binary neural networks

    [BT19] Adrian Bulat and Georgios Tzimiropoulos. XNOR-Net++: improved binary neural networks. In BMVC 2019,

  4. [4]

    QuIP: 2-bit quantization of large language models with guarantees

    [CCKS23] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. CoRR, abs/2307.13304,

  5. [5]

    PaLM: Scaling Language Modeling with Pathways

    [CND+22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, and et al. PaLM: scaling language modeling with pathways. CoRR, abs/2204.02311,

  6. [6]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    [Dao23] Tri Dao. FlashAttention-2: faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691,

  7. [7]

    Training Compute-Optimal Large Language Models

    [HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laur...

  8. [8]

    Scaling Laws for Autoregressive Generative Modeling

    [HKK+20] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. CoRR, abs/2010.14701,

  9. [9]

    1.1 computing’s energy problem (and what we can do about it)

    [Hor14] Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Conference on Solid-State Circuits Conference, ISSCC 2014, Digest of Technical Papers, San Francisco, CA, USA, February 9-13, 2014, pages 10–14,

  10. [10]

    Scaling Laws for Neural Language Models

    [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361,

  11. [11]

    Fast inference from transformers via speculative decoding

    [LKM23] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,

  12. [12]

    How do adam and training strategies help bnns optimization

    [LSL+21] Zechun Liu, Zhiqiang Shen, Shichao Li, Koen Helwegen, Dong Huang, and Kwang-Ting Cheng. How do adam and training strategies help bnns optimization. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6936–6946. PMLR,

  13. [13]

    [MCH+16] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696,

  14. [14]

    GPT-4 Technical Report

    [Ope23] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,

  15. [15]

    XNOR-Net: imagenet classification using binary convolutional neural networks

    [RORF16] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: imagenet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, Lecture Notes in Computer Science,

  16. [16]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    [SPP+19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053,

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

    [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: open and efficient foundation language models. CoRR, abs/2302.13971,

  18. [18]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    [TMS+23] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, and et al. Llama 2: open foundation and fine-tuned chat models. CoRR, ab...

  19. [19]

    SmoothQuant: accurate and efficient post-training quantization for large language models

    [XLS+23] Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,

  20. [20]

    A Hyperparameters

    Params   # Hidden   # Layers   # Heads   Learning Rate
    125M     768        12         12        2.4e-3
    350M     1024       24         16        1.2e-3
    760M     1536       24         16        1e-3
    1.3B     2048       24         32        8e-4
    2.7B     2560       32         32        6.4e-4
    6.7B     4096       32         32        4.8e-4
    13B      5120       40         40        4e-4
    30B      7168       48         56        4e-4

    Table 5: Model configuration for BitNet in the scaling experiments.

    Hyperparameters Value Training updates 40K Token...
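
The Table 5 configurations can be sanity-checked against their nominal sizes with a common rule of thumb: a standard Transformer has roughly 12 * layers * hidden^2 non-embedding parameters. This rule assumes a 4x feed-forward expansion and standard attention projections, which the page does not confirm for BitNet, so the numbers below are approximations. Embedding parameters are excluded.

```python
def approx_non_embedding_params(hidden, layers):
    """Rule-of-thumb count: about 12 * layers * hidden**2 (assumes 4x FFN width)."""
    return 12 * layers * hidden ** 2


table5 = {  # label: (hidden size, number of layers), copied from Table 5
    "125M": (768, 12), "350M": (1024, 24), "760M": (1536, 24), "1.3B": (2048, 24),
    "2.7B": (2560, 32), "6.7B": (4096, 32), "13B": (5120, 40), "30B": (7168, 48),
}
for label, (hidden, layers) in table5.items():
    estimate = approx_non_embedding_params(hidden, layers)
    print(f"{label}: ~{estimate / 1e9:.2f}B non-embedding parameters")
```

The estimates fall slightly below each nominal size (for example about 29.6B for the 30B row), which is consistent with embeddings accounting for the remainder.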