The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang · 2024 · cs.CL · arXiv 2402.17764

26 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
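The "1.58" in the name follows from the ternary value set in the abstract: a weight restricted to three values carries log2(3) ≈ 1.58 bits of information. The Python sketch below illustrates that arithmetic, together with one simple way to map real-valued weights onto {-1, 0, 1}. The scaling rule (divide by the mean absolute value, round, then clip), the helper name `ternarize`, and the `eps` parameter are assumptions made for illustration; the abstract on this page does not specify the quantization function.

```python
import math

# Information content of a weight that can take one of three values.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternarize(weights, eps=1e-8):
    """Hypothetical helper: map real-valued weights to {-1, 0, 1}.

    Assumption (not stated on this page): scale by the mean absolute
    value, round to the nearest integer, then clip to [-1, 1].
    Returns the ternary weights and the scale that was used.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


if __name__ == "__main__":
    print(f"{BITS_PER_TERNARY_WEIGHT:.2f} bits per ternary weight")
    ternary, scale = ternarize([0.9, -0.05, -1.3, 0.4])
    print(ternary, scale)
```

Small weights collapse to 0 and large ones saturate at ±1, so only the sign pattern and one scale per weight group need to be stored.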

citation-role summary

background: 2 · dataset: 1

citation-polarity summary

background: 1 · unclear: 1 · use dataset: 1

representative citing papers

FTerViT: Fully Ternary Vision Transformer

cs.CV · 2026-05-20 · conditional · novelty 7.0

FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.

VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices

cs.AR · 2026-05-01 · unverdicted · novelty 7.0

VitaLLM demonstrates a 16nm silicon prototype accelerator achieving 72.46 tokens/s decode for 3B ternary LLMs in 0.214 mm² area with reduced KV cache traffic via predictive sparse attention.

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.

The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.

NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

NativeTernary encodes ternary weights at exactly 2 bits each with 460x lower overhead than GGUF for BitNet-style models.

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

cs.DC · 2025-12-06 · conditional · novelty 7.0

Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

cs.LG · 2025-10-27 · unverdicted · novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Locale-conditioned rotating few-shot prompting eliminates demonstration regurgitation in 1.7B SLMs for PII substitution while producing more natural text than rule-based methods, though downstream NER training benefits more from synthetic variety than naturalness.

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

cs.CL · 2026-05-07 · conditional · novelty 6.0

Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

cs.AR · 2026-04-30 · conditional · novelty 6.0

VitaLLM delivers 70.7 tokens/s decoding in a 0.223 mm² TSMC 16 nm chip at 66 mW with a figure-of-merit of 17.4 TOPS/mm²/W by combining TINT cores, BoothFlex attention, leading-one prediction, and dependency-aware scheduling.

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

cs.LG · 2026-04-22 · conditional · novelty 6.0

FairyFuse enables multiplication-free ternary LLM inference on CPUs via fused AVX-512 kernels, achieving 29.6x kernel speedup and 32.4 tokens/s on Xeon with near-lossless quality.

Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

cs.CV · 2025-09-28 · conditional · novelty 6.0

RobuQ delivers the first stable DiT image generation at W1.58A2 average bits via Hadamard-based robust activation quantization and layer-wise mixed-precision activations.

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

cs.CL · 2025-08-09 · conditional · novelty 6.0

A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.

A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks

cs.LG · 2025-07-21 · unverdicted · novelty 6.0

Proves polynomial-in-width and exponential-in-depth lower bounds on linear regions for ternary ReLU regression networks, with width-doubling constructions achieving bounds comparable to unrestricted ReLU networks.

SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks

cs.NE · 2025-06-04 · unverdicted · novelty 6.0

SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.

Highly Efficient and Effective LLMs with Multi-Boolean Architectures

stat.ML · 2025-05-28 · unverdicted · novelty 6.0

The authors present multi-kernel Boolean architectures for LLMs that support direct fine-tuning in the Boolean domain without latent weights and claim to outperform prior ultra-low-bit methods.

GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

GenHAR generalizes cross-domain human activity recognition by 9.97% accuracy and 6.4x lower FLOPs via tokenized sensor data, frequency channel correlations, selective masking, and efficient attention, with deployment detecting 2.15 billion activities.

A Composite Activation Function for Learning Stable Binary Representations

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

cs.LG · 2026-05-04 · unverdicted · novelty 5.0

A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

cs.PF · 2025-08-22 · unverdicted · novelty 5.0

ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.

citing papers explorer

Showing 26 of 26 citing papers.

FTerViT: Fully Ternary Vision Transformer
cs.CV · 2026-05-20 · conditional · none · ref 27 · internal anchor
FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.

VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
cs.AR · 2026-05-01 · unverdicted · none · ref 1 · internal anchor
VitaLLM demonstrates a 16nm silicon prototype accelerator achieving 72.46 tokens/s decode for 3B ternary LLMs in 0.214 mm² area with reduced KV cache traffic via predictive sparse attention.

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
cs.CL · 2026-04-21 · unverdicted · none · ref 44 · internal anchor
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.

The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
cs.LG · 2026-04-11 · unverdicted · none · ref 16 · internal anchor
In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.

NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
cs.LG · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
NativeTernary encodes ternary weights at exactly 2 bits each with 460x lower overhead than GGUF for BitNet-style models.

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
cs.DC · 2025-12-06 · conditional · none · ref 29 · internal anchor
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
cs.LG · 2025-10-27 · unverdicted · none · ref 14 · internal anchor
One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
cs.CL · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
Locale-conditioned rotating few-shot prompting eliminates demonstration regurgitation in 1.7B SLMs for PII substitution while producing more natural text than rule-based methods, though downstream NER training benefits more from synthetic variety than naturalness.

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
cs.CL · 2026-05-07 · conditional · none · ref 7 · internal anchor
Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling
cs.AR · 2026-04-30 · conditional · none · ref 2 · internal anchor
VitaLLM delivers 70.7 tokens/s decoding in a 0.223 mm² TSMC 16 nm chip at 66 mW with a figure-of-merit of 17.4 TOPS/mm²/W by combining TINT cores, BoothFlex attention, leading-one prediction, and dependency-aware scheduling.

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
cs.LG · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
cs.LG · 2026-04-22 · conditional · none · ref 23 · internal anchor
FairyFuse enables multiplication-free ternary LLM inference on CPUs via fused AVX-512 kernels, achieving 29.6x kernel speedup and 32.4 tokens/s on Xeon with near-lossless quality.

Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG · 2026-04-15 · unverdicted · none · ref 30 · internal anchor
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
cs.LG · 2026-04-08 · unverdicted · none · ref 16 · internal anchor
STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
cs.CV · 2025-09-28 · conditional · none · ref 47 · internal anchor
RobuQ delivers the first stable DiT image generation at W1.58A2 average bits via Hadamard-based robust activation quantization and layer-wise mixed-precision activations.

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
cs.CL · 2025-08-09 · conditional · none · ref 23 · internal anchor
A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.

A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks
cs.LG · 2025-07-21 · unverdicted · none · ref 7 · internal anchor
Proves polynomial-in-width and exponential-in-depth lower bounds on linear regions for ternary ReLU regression networks, with width-doubling constructions achieving bounds comparable to unrestricted ReLU networks.

SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks
cs.NE · 2025-06-04 · unverdicted · none · ref 4 · internal anchor
SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.

Highly Efficient and Effective LLMs with Multi-Boolean Architectures
stat.ML · 2025-05-28 · unverdicted · none · ref 7 · internal anchor
The authors present multi-kernel Boolean architectures for LLMs that support direct fine-tuning in the Boolean domain without latent weights and claim to outperform prior ultra-low-bit methods.

GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery
cs.CV · 2026-05-21 · unverdicted · none · ref 45 · internal anchor
GenHAR generalizes cross-domain human activity recognition by 9.97% accuracy and 6.4x lower FLOPs via tokenized sensor data, frequency channel correlations, selective masking, and efficient attention, with deployment detecting 2.15 billion activities.

A Composite Activation Function for Learning Stable Binary Representations
cs.LG · 2026-05-12 · unverdicted · none · ref 44 · internal anchor
HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL · 2026-05-09 · unverdicted · none · ref 27 · 2 links · internal anchor
Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
cs.LG · 2026-05-04 · unverdicted · none · ref 26 · internal anchor
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
cs.PF · 2025-08-22 · unverdicted · none · ref 37 · internal anchor
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.

Quantization robustness from dense representations of sparse functions in high-capacity kernel associative memory
cs.NE · 2026-04-22 · unverdicted · none · ref 4 · 2 links · internal anchor
KLR Hopfield networks exhibit robustness to quantization but sensitivity to pruning, interpreted as arising from dense bimodal parameterization of sparse input mappings.

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
cs.DC · 2025-03-11 · unverdicted · none · ref 172 · internal anchor
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
