The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

33 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
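The "1.58" in the title follows from the ternary weight set: a value drawn from three equally likely symbols carries log2(3) ≈ 1.585 bits. The sketch below, in Python, illustrates that count and shows why a matrix-vector product with weights in {-1, 0, 1} needs only additions and subtractions. The function names `bits_per_ternary_weight` and `ternary_matvec` are hypothetical and chosen for illustration; this is not the BitNet b1.58 implementation, and it says nothing about how the weights are trained or quantized.

```python
import math


def bits_per_ternary_weight() -> float:
    """Information content of one weight drawn uniformly from {-1, 0, 1}."""
    return math.log2(3)


def ternary_matvec(weights, x):
    """Multiply a ternary matrix by a vector using only adds and subtracts.

    `weights` is a list of rows whose entries must be -1, 0, or 1.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            elif w != 0:
                raise ValueError("weights must be in {-1, 0, 1}")
            # 0 weight: the activation is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
print(round(bits_per_ternary_weight(), 3))  # 1.585
print(ternary_matvec(W, x))                 # [2.0, 0.0]
```

Zero-valued weights drop their inputs altogether, so sparsity in the weight matrix translates directly into skipped work in this sketch.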

citation-role summary

background: 2 · dataset: 1

years

2026: 24 · 2025: 9

representative citing papers

FTerViT: Fully Ternary Vision Transformer

cs.CV · 2026-05-20 · conditional · novelty 7.0

FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.

BIDENT: Heterogeneous Operator-level Mapping for Efficient Edge Inference

cs.AR · 2026-06-03 · unverdicted · novelty 6.0

BIDENT is an operator-level scheduling system that models heterogeneous PU assignment as a shortest-path problem on an execution graph and reports speedups up to 1.60x for intra-model parallelism and 3.42x geometric mean for multi-model workloads on an Intel Core Ultra SoC.

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.
