hub
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
26 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
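The "1.58 bits" in the name can be read as the information content of a single three-valued weight, log2(3) ≈ 1.58. The sketch below checks that arithmetic and shows one consequence of restricting weights to {-1, 0, 1}: a dot product reduces to additions and subtractions. It is an illustration only, not the paper's implementation; `BITS_PER_TERNARY_WEIGHT` and `ternary_dot` are hypothetical names introduced here.

```python
import math

# Information content of one weight drawn from three values {-1, 0, 1}.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternary_dot(weights, activations):
    """Dot product where every weight is -1, 0, or 1 (hypothetical helper).

    No multiplication is needed: +1 adds the activation, -1 subtracts it,
    and 0 skips it.
    """
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
        elif w != 0:
            raise ValueError(f"weight {w!r} is not ternary")
    return total


if __name__ == "__main__":
    print(round(BITS_PER_TERNARY_WEIGHT, 2))  # 1.58
    print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, 1.5, 3.0]))  # 2.0
```

The pure-Python loop is written for readability; the CPU and accelerator work in the citing papers relies on packed storage and vectorized kernels rather than per-element branching.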
citing papers explorer
- FTerViT: Fully Ternary Vision Transformer
  FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.
- VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
  VitaLLM demonstrates a 16nm silicon prototype accelerator achieving 72.46 tokens/s decode for 3B ternary LLMs in 0.214 mm² area with reduced KV cache traffic via predictive sparse attention.
- STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
  STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
- The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
  In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.
- NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
  NativeTernary encodes ternary weights at exactly 2 bits each with 460x lower overhead than GGUF for BitNet-style models.
- Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
  Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
- Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
  One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
- Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
  Locale-conditioned rotating few-shot prompting eliminates demonstration regurgitation in 1.7B SLMs for PII substitution while producing more natural text than rule-based methods, though downstream NER training benefits more from synthetic variety than naturalness.
- Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
  Custom SIMD kernels for ternary LLMs deliver 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction versus standard PyTorch on Apple Silicon and similar CPUs.
- VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling
  VitaLLM delivers 70.7 tokens/s decoding in a 0.223 mm² TSMC 16 nm chip at 66 mW with a figure-of-merit of 17.4 TOPS/mm²/W by combining TINT cores, BoothFlex attention, leading-one prediction, and dependency-aware scheduling.
- MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
  MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.
- FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
  FairyFuse enables multiplication-free ternary LLM inference on CPUs via fused AVX-512 kernels, achieving 29.6x kernel speedup and 32.4 tokens/s on Xeon with near-lossless quality.
- Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
  DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
- STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
  STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
- RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
  RobuQ delivers the first stable DiT image generation at W1.58A2 average bits via Hadamard-based robust activation quantization and layer-wise mixed-precision activations.
- Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
  A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.
- A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks
  Proves polynomial-in-width and exponential-in-depth lower bounds on linear regions for ternary ReLU regression networks, with width-doubling constructions achieving bounds comparable to unrestricted ReLU networks.
- SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks
  SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.
- Highly Efficient and Effective LLMs with Multi-Boolean Architectures
  The authors present multi-kernel Boolean architectures for LLMs that support direct fine-tuning in the Boolean domain without latent weights and claim to outperform prior ultra-low-bit methods.
- GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery
  GenHAR generalizes cross-domain human activity recognition by 9.97% accuracy and 6.4x lower FLOPs via tokenized sensor data, frequency channel correlations, selective masking, and efficient attention, with deployment detecting 2.15 billion activities.
- A Composite Activation Function for Learning Stable Binary Representations
  HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
  Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.
- Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
  A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
- ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
  ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
- Quantization robustness from dense representations of sparse functions in high-capacity kernel associative memory
  KLR Hopfield networks exhibit robustness to quantization but sensitivity to pruning, interpreted as arising from dense bimodal parameterization of sparse input mappings.
- Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
  Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.