The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Furu Wei; Hongyu Wang; Jilong Xue; Lei Wang; Li Dong; Lingxiao Ma; Ruiping Wang; Shaohan Huang; Shuming Ma; Wenhui Wang

REVIEW · 2 major objections · 2 minor · cited by 38

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

Ternary-weight LLMs achieve full-precision performance at far lower computational cost

2026-05-17 20:06 UTC pith:FNPD4PXG

Load-bearing objection: Ternary 1.58-bit LLMs match full-precision performance at the scales tested with clear efficiency gains, but the scaling law and equivalence claim still need checks at larger sizes.

arxiv 2402.17764 v1 pith:FNPD4PXG submitted 2024-02-27 cs.CL cs.LG

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

classification cs.CL cs.LG

keywords 1-bit LLMs · ternary LLMs · BitNet b1.58 · efficient inference · scaling laws · model quantization · large language models · energy efficient AI

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BitNet b1.58, a large language model in which every weight is restricted to one of three possible values: negative one, zero, or positive one. This 1.58-bit model reaches the same level of perplexity and accuracy on end tasks as a conventional 16-bit Transformer model of the same size trained on the same amount of data. The ternary design brings major practical advantages by lowering memory use, speeding up inference, and cutting energy consumption. In addition, the work lays out a scaling law that can guide the training of future models to remain both capable and inexpensive to run. It also points toward hardware built specifically to handle these simple arithmetic operations.

Core claim

BitNet b1.58 is a Transformer LLM in which all parameters are ternary values from the set {-1, 0, 1}. When trained with the appropriate procedure, it matches the perplexity and downstream task performance of an FP16 or BF16 model of identical size and trained on the same tokens. The model is substantially more efficient in latency, memory footprint, throughput, and energy use. This result defines a new scaling law for training high-performance yet cost-effective LLMs and opens the possibility of hardware optimized for 1-bit computations.
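The "1.58-bit" label follows from the information content of a three-valued weight: log2(3) is about 1.585 bits. The sketch below works through that arithmetic and an idealized weight-storage comparison against 16-bit weights. It assumes perfect packing and ignores scales, activations, and the KV cache; the function name and the 3-billion parameter count are illustrative choices, not figures taken from the paper.

```python
import math

# Information content of one weight drawn from {-1, 0, +1}.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585


def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Idealized weight storage in GiB (no scales, activations, or KV cache)."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n = 3_000_000_000  # hypothetical parameter count, for illustration only
    fp16 = weight_memory_gib(n, 16)
    ternary = weight_memory_gib(n, BITS_PER_TERNARY_WEIGHT)
    print(f"bits per ternary weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
    print(f"16-bit weights:  {fp16:.2f} GiB")
    print(f"ternary weights: {ternary:.2f} GiB ({fp16 / ternary:.1f}x smaller)")
```

Real deployments usually store ternary weights at 2 bits each or in grouped encodings, so the practical ratio is somewhat lower than this information-theoretic bound.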

What carries the argument

The ternary weight constraint in BitNet b1.58, which limits every model parameter to the values -1, 0, or 1 and thereby enables the observed performance parity with full-precision models at reduced resource cost.
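One widely used way to impose such a constraint is to scale a weight tensor by its mean absolute value and then round each entry to the nearest of -1, 0, +1 (often called absmean quantization). The sketch below illustrates that idea on a flat Python list with a single per-tensor scale. The function name, the `eps` guard, and the sample weights are assumptions made for illustration and are not taken from this page.

```python
def ternarize_absmean(weights, eps=1e-8):
    """Map real-valued weights to {-1, 0, +1} and return them with the scale.

    Illustrative per-tensor scheme: divide by the mean absolute value,
    round to the nearest integer, then clip to the range [-1, 1].
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    gamma = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    scale = gamma + eps  # eps avoids division by zero for an all-zero tensor
    # Python's round() uses round-half-to-even; the clip keeps results ternary.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


if __name__ == "__main__":
    sample = [0.9, -0.05, 0.4, -1.2, 0.0, 0.1]  # made-up weights
    q, s = ternarize_absmean(sample)
    print(q, round(s, 4))
```

Multiplying the ternary values by the returned scale gives a coarse reconstruction of the original weights, which is why the scale is kept alongside them.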

Load-bearing premise

The training procedure and scaling law identified for the ternary setting will continue to yield competitive results as model sizes and training data volumes increase beyond the scales examined in the paper.

What would settle it

A direct comparison of a 70-billion-parameter 1.58-bit model against its full-precision counterpart on a standard language modeling benchmark, checking whether the perplexity gap stays small.
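A comparison of this kind reduces to computing perplexity for both models on the same held-out tokens and tracking the relative gap. The sketch below assumes per-token negative log-likelihoods in nats are already available; the function names and the numbers in the usage section are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_gap(ppl_ternary, ppl_baseline):
    """Fractional perplexity increase of the ternary model over the baseline."""
    return (ppl_ternary - ppl_baseline) / ppl_baseline


if __name__ == "__main__":
    # Made-up losses for two models scored on the same three tokens.
    baseline = perplexity([2.30, 2.25, 2.35])
    ternary = perplexity([2.32, 2.27, 2.37])
    print(f"gap: {relative_gap(ternary, baseline):.2%}")
```

A gap that stays small as model size grows would support the premise; a gap that widens with scale would count against it.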

If this is right

The 1.58-bit model matches full-precision perplexity at fixed model size and token count.
End-task performance remains equivalent under the same conditions.
Inference becomes cheaper in latency, memory, throughput, and energy.
A new scaling law supports continued growth in model capability without proportional cost increases.
Specialized hardware can be designed around native support for ternary weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger models trained under this ternary regime could fit on consumer hardware that currently cannot run full-precision versions.
The approach may generalize to other neural network types, such as vision or multimodal models.
Chip designers might develop accelerators that perform matrix multiplications using only additions and subtractions of the input activations, with each ternary weight selecting add, subtract, or skip.
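The last point can be made concrete: when every weight is -1, 0, or +1, a matrix-vector product needs no multiplications at all. The sketch below is a plain-Python illustration of that arithmetic, not a description of any actual accelerator or kernel; the function name and the example matrix are assumptions.

```python
def ternary_matvec(ternary_rows, x):
    """Compute y = W x for W with entries in {-1, 0, +1} without multiplying."""
    y = []
    for row in ternary_rows:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi  # weight +1: add the input
            elif w == -1:
                acc -= xi  # weight -1: subtract the input
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # weight 0: the input is skipped entirely
        y.append(acc)
    return y


if __name__ == "__main__":
    W = [[1, 0, -1], [-1, -1, 1]]  # made-up ternary matrix
    x = [2.0, 3.0, 5.0]
    print(ternary_matvec(W, x))  # [-3.0, 0.0]
```

Zero weights also contribute sparsity for free, since the corresponding inputs never need to be read into the accumulator.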

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper introduces BitNet b1.58, a 1.58-bit LLM variant in which every weight is ternary {-1, 0, 1}. It claims that this model matches the perplexity and downstream-task performance of a full-precision (FP16/BF16) Transformer of identical size trained on the same token budget, while delivering large gains in latency, memory footprint, throughput, and energy. The work also presents a new scaling law and training recipe for 1.58-bit models and argues that the approach opens a path to specialized hardware.

Significance. If the reported performance parity holds, the result would be highly significant: it would establish a practical, high-performance alternative to dense FP16 training that reduces memory and compute costs by roughly 2-3x while preserving scaling behavior. The explicit scaling law and training recipe constitute a concrete, falsifiable contribution that could guide both future model development and hardware co-design for ternary arithmetic.

major comments (2)

[§4, Experiments] The headline claim of performance parity is supported only by models up to a few billion parameters. Because the central assertion is that the ternary recipe and derived scaling law produce indistinguishable results at any scale, the absence of results at 10B+ parameters leaves open the possibility that the relative gap widens with capacity; this is load-bearing for the broad claim.
[§3, Method] The description of the straight-through estimator and gradient scaling for ternary weights is not accompanied by an explicit statement of the effective learning-rate schedule or the precise clipping thresholds used during training. Without these details the reported scaling law cannot be independently verified or extended.

minor comments (2)

[Table 1, Figure 2] Axis labels and legend entries use inconsistent notation (e.g., “1.58-bit” vs “1.58 bits”); standardize it.
[§2] The related-work paragraph on prior 1-bit and ternary quantization omits several recent works on learned ternary weights; a short additional sentence would improve context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have updated the manuscript to improve clarity and reproducibility while maintaining the integrity of our claims.

Referee: [§4, Experiments] The headline claim of performance parity is supported only by models up to a few billion parameters. Because the central assertion is that the ternary recipe and derived scaling law produce indistinguishable results at any scale, the absence of results at 10B+ parameters leaves open the possibility that the relative gap widens with capacity; this is load-bearing for the broad claim.

Authors: We acknowledge that our empirical results demonstrate performance parity up to 3B parameters, the largest scale feasible under our compute budget. The scaling law was fitted to trends observed from 0.1B to 3B and shows no widening gap within this range. In the revised manuscript we have added an explicit limitations paragraph in Section 4 noting the current scale and stating that the open-sourced training recipe is intended to enable community verification at 10B+ scales. We have also included a supplementary plot confirming that the relative performance gap remains stable across the tested capacities.

Revision: partial

Referee: [§3, Method] The description of the straight-through estimator and gradient scaling for ternary weights is not accompanied by an explicit statement of the effective learning-rate schedule or the precise clipping thresholds used during training. Without these details the reported scaling law cannot be independently verified or extended.

Authors: We thank the referee for highlighting this omission. The revised Section 3.2 now provides the missing details: the straight-through estimator applies a gradient scaling factor of 1.0, the effective learning-rate schedule is a cosine decay with 2000-step linear warmup (peak LR 1e-3 after scaling), and weights are clipped to [-1.0, 1.0] before the ternary quantization step. These hyperparameters are listed in a new table to support independent reproduction and extension of the scaling law.

Revision: yes
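The schedule named in this simulated response (2000-step linear warmup to a peak of 1e-3, then cosine decay, with weights clipped to [-1.0, 1.0]) can be sketched as follows. The total step count, the zero floor for the learning rate, and the function names are assumptions added for illustration; the response does not state them, and the sketch is not the authors' training code.

```python
import math


def lr_at_step(step, total_steps, warmup_steps=2000, peak_lr=1e-3, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    total_steps and min_lr are illustrative assumptions.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_weight(w, low=-1.0, high=1.0):
    """Clip a latent weight to [low, high] before ternary quantization."""
    return max(low, min(high, w))


if __name__ == "__main__":
    total = 10_000  # hypothetical training length
    for s in (0, 1999, 2000, 6000, 10_000):
        print(s, f"{lr_at_step(s, total):.6f}")
```

With these defaults the rate peaks at the end of warmup, falls to half the peak midway through the decay phase, and reaches the floor at the final step.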

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical training runs and comparisons

The paper's central claims (performance parity with FP16/BF16 Transformers at equal size and tokens, plus a new scaling law) are supported by explicit training experiments and benchmark evaluations on models up to a few billion parameters. No derivation chain is presented that reduces by construction to a self-defined quantity, a fitted parameter renamed as prediction, or a self-citation whose content is unverified. The scaling law is described as emerging from the observed training recipe rather than being presupposed inside the equations. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that ternary weights can be trained to match FP16 performance; no new mathematical axioms or invented particles are introduced.

pith-pipeline@v0.9.0 · 5491 in / 1022 out tokens · 33293 ms · 2026-05-17T20:06:27.706475+00:00 · methodology

Original abstract

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FTerViT: Fully Ternary Vision Transformer
cs.CV 2026-05 conditional novelty 7.0

FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.
VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
cs.AR 2026-05 unverdicted novelty 7.0

VitaLLM demonstrates a 16nm silicon prototype accelerator achieving 72.46 tokens/s decode for 3B ternary LLMs in 0.214 mm² area with reduced KV cache traffic via predictive sparse attention.
Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2
cs.CL 2026-04 unverdicted novelty 7.0

DF-SSM distills Mamba-2 to 1-bit scaffold plus int8 low-rank correction for 9.7x compression and 21.4x faster inference, plus analysis showing three distinct processing phases across layers.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
cs.CL 2026-04 unverdicted novelty 7.0

STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cos...
The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
cs.LG 2026-04 unverdicted novelty 7.0

In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...
NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
cs.LG 2026-04 unverdicted novelty 7.0

NativeTernary encodes ternary weights at exactly 2 bits each with 460x lower overhead than GGUF for BitNet-style models.
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
cs.DC 2025-12 conditional novelty 7.0

Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
cs.LG 2025-10 unverdicted novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
BitNet Text Embeddings
cs.CL 2026-06 unverdicted novelty 6.0

BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full...
Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models
cs.LG 2026-06 unverdicted novelty 6.0

Ternary Mamba-2 1.3B models reach 48.1% zero-shot accuracy via QAT from pretrained checkpoints in 102M tokens, close to Bi-Mamba, with 3.61x compression.
BIDENT: Heterogeneous Operator-level Mapping for Efficient Edge Inference
cs.AR 2026-06 unverdicted novelty 6.0

BIDENT is an operator-level scheduling system that models heterogeneous PU assignment as a shortest-path problem on an execution graph and reports speedups up to 1.60x for intra-model parallelism and 3.42x geometric m...
SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference
cs.AR 2026-05 unverdicted novelty 6.0

SPARQLe is a hardware-software co-design that splits quantized activations into dense low bits and sparse high bits to run inference on narrower datapaths while claiming to preserve full-precision accuracy.
Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Locale-conditioned rotating few-shot prompting eliminates demonstration regurgitation in 1.7B SLMs for PII substitution while producing more natural text than rule-based methods, though downstream NER training benefit...
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 6.0

Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models
cs.CL 2026-05 conditional novelty 6.0

Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.
Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Litespark-Inference delivers custom SIMD kernels for ternary LLMs achieving up to 95.81x throughput versus PyTorch on CPUs by using integer addition/subtraction instead of floating-point math.
VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling
cs.AR 2026-04 conditional novelty 6.0

VitaLLM delivers 70.7 tokens/s decoding in a 0.223 mm² TSMC 16 nm chip at 66 mW with a figure-of-merit of 17.4 TOPS/mm²/W by combining TINT cores, BoothFlex attention, leading-one prediction, and dependency-aware scheduling.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
cs.LG 2026-04 unverdicted novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
cs.LG 2026-04 conditional novelty 6.0

FairyFuse enables multiplication-free ternary LLM inference on CPUs via fused AVX-512 kernels, achieving 29.6x kernel speedup and 32.4 tokens/s on Xeon with near-lossless quality.
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG 2026-04 unverdicted novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
cs.LG 2026-04 unverdicted novelty 6.0

STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
cs.CV 2025-09 conditional novelty 6.0

RobuQ delivers the first stable DiT image generation at W1.58A2 average bits via Hadamard-based robust activation quantization and layer-wise mixed-precision activations.
Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
cs.CL 2025-08 conditional novelty 6.0

A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.
A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks
cs.LG 2025-07 unverdicted novelty 6.0

Proves polynomial-in-width and exponential-in-depth lower bounds on linear regions for ternary ReLU regression networks, with width-doubling constructions achieving bounds comparable to unrestricted ReLU networks.
SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks
cs.NE 2025-06 unverdicted novelty 6.0

SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.
Highly Efficient and Effective LLMs with Multi-Boolean Architectures
stat.ML 2025-05 unverdicted novelty 6.0

The authors present multi-kernel Boolean architectures for LLMs that support direct fine-tuning in the Boolean domain without latent weights and claim to outperform prior ultra-low-bit methods.
CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs
cs.CL 2026-06 unverdicted novelty 5.0

CAT-Q performs post-training ternary quantization of 1.7B-235B LLMs with 512 samples via learnable modulation and softened ternarization, outperforming BitNet v1/v2 models trained on 100B tokens.
Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training
cs.LG 2026-05 unverdicted novelty 5.0

Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.
GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery
cs.CV 2026-05 unverdicted novelty 5.0

GenHAR generalizes cross-domain human activity recognition by 9.97% accuracy and 6.4x lower FLOPs via tokenized sensor data, frequency channel correlations, selective masking, and efficient attention, with deployment ...
A Composite Activation Function for Learning Stable Binary Representations
cs.LG 2026-05 unverdicted novelty 5.0

HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 5.0

Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.
Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
cs.LG 2026-05 unverdicted novelty 5.0

A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
Quantization robustness from dense representations of sparse functions in high-capacity kernel associative memory
cs.NE 2026-04 unverdicted novelty 5.0

KLR Hopfield networks remain robust under low-precision quantization but are sensitive to pruning due to a sparse-function dense-representation principle implemented via dense bimodal parameterization.
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
cs.PF 2025-08 unverdicted novelty 5.0

ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference wh...
GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF1024 with a Lucas-Exact Integer Identity
cs.AR 2026-06 unverdicted novelty 4.0

GoldenFloat introduces a phi-derived rule for setting exponent and fraction widths across floating-point formats from 4 to 1024 bits, backed by open RTL generator, Lucas-exact accumulator, and FPGA implementation.
Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 4.0

A WHT rotation plus per-coordinate activation-energy rescaling before auto-round quantization lowers WikiText-2 perplexity 15-58% versus vanilla auto-round at W2A16 on models from 135M to 1.5B parameters.
Quantization robustness from dense representations of sparse functions in high-capacity kernel associative memory
cs.NE 2026-04 unverdicted novelty 4.0

KLR Hopfield networks exhibit robustness to quantization but sensitivity to pruning, interpreted as arising from dense bimodal parameterization of sparse input mappings.
Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
cs.DC 2025-03 unverdicted novelty 2.0

Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 35 Pith papers · 9 internal anchors

[1] PIQA: Reasoning about Physical Commonsense in Natural Language

[BZB+19] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. CoRR, abs/1911.11641,

[2] QuIP: 2-bit quantization of large language models with guarantees

[CCKS23] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. CoRR, abs/2307.13304,

[3] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044,

[4] 1.1 computing’s energy problem (and what we can do about it)

[Hor14] Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Conference on Solid-State Circuits Conference, ISSCC 2014, Digest of Technical Papers, San Francisco, CA, USA, February 9-13, 2014, pages 10–14,

[5] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

[LTT+23] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978,

[6] Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

[MCKS18] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. CoRR, abs/1809.02789,

[7] The LAMBADA dataset: Word prediction requiring a broad discourse context

[PKL+16] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, ...

[8] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

[RSR+19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683,

[9] GLU Variants Improve Transformer

[Sha20] Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202,

[10] QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

[TBMR] Jonathan Tow, Marco Bellagente, Dakota Mahan, and Carlos Riquelme. Stablelm 3b 4e1t.

[TCS+24] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#: Even better LLM quantization with hadamard incoherence and lattice codebooks. CoRR, abs/2402.04396,

[11] LLaMA: Open and Efficient Foundation Language Models

[TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: open and efficient foundation language models. CoRR, abs/2302.13971,

[12] Llama 2: Open Foundation and Fine-Tuned Chat Models

[TMS+23] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, and et al. Llama 2: open foundation and fine-tuned chat models. CoRR, ab...

[13] Crowdsourcing Multiple Choice Science Questions

[WLG17] Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors, Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017, Copenhagen, Denmark, September 7, 2017, pages 94–106. Association for Computational Linguistics,

[14] BitNet: Scaling 1-bit Transformers for Large Language Models

[WMD+23] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. CoRR, abs/2310.11453,

[15] SmoothQuant: accurate and efficient post-training quantization for large language models

[XLS+23] Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,
