Highly Efficient and Effective LLMs with Multi-Boolean Architectures

Ba-Hien Tran; Van Minh Nguyen

arxiv: 2505.22811 · v5 · submitted 2025-05-28 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-19 12:38 UTC · model grok-4.3

keywords: large language models · Boolean parameters · model binarization · efficient fine-tuning · low-bit quantization · model compression · representational capacity

The pith

Multi-kernel Boolean parameters enable direct fine-tuning of LLMs in the Boolean domain without latent weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes representing large language models with multi-kernel Boolean parameters. This setup allows fine-tuning to occur directly in the Boolean domain, removing any need to maintain latent full-precision weights during the process. The result is claimed to increase what the model can represent while cutting computational demands in both fine-tuning and inference. Experiments on multiple LLMs indicate better outcomes than existing ultra-low-bit quantization or binarization approaches. If correct, the method would make adapting and running these models feasible with far less memory and compute overhead.

Core claim

The central claim is that LLMs can be represented with multi-kernel Boolean parameters to support direct finetuning entirely in the Boolean domain for the first time, which removes the need for latent weights. This change is presented as increasing representational capacity while lowering complexity for both the fine-tuning stage and later inference. Tests across diverse LLMs show the approach outperforming recent ultra low-bit quantization and binarization techniques.

What carries the argument

Multi-kernel Boolean parameters: sets of Boolean kernels used to encode weights so that optimization can proceed directly in the Boolean space without intermediate full-precision values.
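A dependency-free Python sketch of how such kernels might be extracted, under stated assumptions. It follows the SVID description quoted in the Figure 2 caption and in the restated Proposition D.3 among the extracted references (a Boolean matrix sign(W) plus two scaling vectors taken from the leading singular pair of |W|). The residual-based successive extraction loop, the power-iteration solver, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def leading_singular_pair(A, iters=100):
    """Power iteration for the top singular triple (sigma, u, v) of matrix A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v

def svid(W):
    """Approximate W[i][j] by s_out[i] * B[i][j] * s_in[j], with B = sign(W) in {+1, -1}."""
    B = [[1 if w >= 0 else -1 for w in row] for row in W]
    abs_w = [[abs(w) for w in row] for row in W]
    sigma, u, v = leading_singular_pair(abs_w)
    root = math.sqrt(sigma)
    s_out = [root * x for x in u]
    s_in = [root * x for x in v]
    return B, s_in, s_out

def kernel_term(kernel, i, j):
    """Contribution of one Boolean kernel to entry (i, j)."""
    B, s_in, s_out = kernel
    return s_out[i] * B[i][j] * s_in[j]

def extract_kernels(W, num_kernels):
    """Assumed scheme: apply SVID repeatedly to whatever residual remains."""
    rows, cols = len(W), len(W[0])
    residual = [row[:] for row in W]
    kernels = []
    for _ in range(num_kernels):
        kernel = svid(residual)
        kernels.append(kernel)
        residual = [[residual[i][j] - kernel_term(kernel, i, j) for j in range(cols)]
                    for i in range(rows)]
    return kernels

def approx_error(W, kernels):
    """Frobenius norm of W minus the sum of all kernel terms."""
    total = 0.0
    for i in range(len(W)):
        for j in range(len(W[0])):
            approx = sum(kernel_term(k, i, j) for k in kernels)
            total += (W[i][j] - approx) ** 2
    return math.sqrt(total)
```

In this sketch each extra kernel removes the leading rank-1 component of the residual magnitudes, so `approx_error(W, extract_kernels(W, 3))` should come out smaller than the one-kernel error for a generic matrix `W`.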

If this is right

Memory requirements drop because parameters remain Boolean rather than needing space for latent full-precision copies.
Fine-tuning and inference run faster by staying in Boolean operations and avoiding repeated domain conversions.
Larger models become easier to adapt on hardware with limited precision support or memory capacity.
Performance holds up better than post-training binarization methods that typically lose significant accuracy.
The same direct-domain idea could extend efficiency gains to other model compression settings.
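The memory point can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the setting named in the Figure 11 caption (OPT-6.7B, 3 Boolean kernels at 1 bit per weight versus 16-bit latent weights). The nominal 6.7e9 parameter count and the function name are assumptions, and scaling vectors, optimizer states, activations, and any non-binarized layers are deliberately ignored, so the numbers are only indicative.

```python
def weight_memory_gb(num_params, bits_per_weight, copies=1):
    """Storage for weight tensors only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight * copies / 8 / 1e9

NUM_PARAMS = 6.7e9  # nominal size of OPT-6.7B; an assumption, not a measured count

# Three Boolean kernels, each storing 1 bit per weight.
boolean_gb = weight_memory_gb(NUM_PARAMS, bits_per_weight=1, copies=3)
# One 16-bit latent copy of the weights, as in latent-weight methods.
latent_gb = weight_memory_gb(NUM_PARAMS, bits_per_weight=16)

print(f"3 Boolean kernels:     {boolean_gb:.2f} GB")
print(f"16-bit latent weights: {latent_gb:.2f} GB")
```

Under these assumptions the weights alone take roughly 2.5 GB in the Boolean representation against 13.4 GB for a 16-bit latent copy.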

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware accelerators built around Boolean logic could see direct performance lifts if models adopt this format.
Similar direct training in discrete spaces might apply to ternary or other low-precision weight schemes.
Smaller fully Boolean models could simplify deployment in resource-constrained or offline environments.

Load-bearing premise

Multi-kernel Boolean parameters can preserve or exceed the representational power of latent full-precision weights without adding failure modes or hidden costs that cancel out the efficiency benefits.

What would settle it

Compare accuracy and total operation count on a standard benchmark like GLUE when fine-tuning the same LLM architecture with this Boolean method versus a latent-weight binarization baseline; a clear accuracy drop or higher effective cost would challenge the claims.

Figures

Figures reproduced from arXiv: 2505.22811 by Ba-Hien Tran, Van Minh Nguyen.

**Figure 1.** Finetuning OPT models (Zhang et al., 2022) using our 3 Boolean kernels, compared to OPTQ (Frantar et al., 2023), which quantizes the models to 3 bits, and the FP16 baseline on the C4 dataset. Large language models (Brown et al., 2020; Touvron et al., 2023a; Liu et al., 2024a) have demonstrated unprecedented capabilities, largely due to the continuous growth in both model and dataset sizes. A …

**Figure 2.** Illustration of SVID. LLMs (Brown et al., 2020) are mostly based on the Transformer architecture (Vaswani et al., 2017), in which linear layers are the core elements. Inspired by Xu et al. (2024), we employ sign-value-independent decomposition (SVID) such that an FP input matrix W ∈ R^{m×n} of linear layers is decomposed into one Boolean matrix W_bool ≜ sign(W) and two FP vectors s_in and s_out. Precisely, let …

**Figure 3.** The computation of a linear layer approximated using multi kernels of Boolean. We have shown that SVID provides a good approximation of the original weights, its expressivity can be still limited to capture well the original FP parameters of complicated models, which were trained on large-scale datasets over extended periods of time. To overcome this limitation, we propose the use of a multi-Boolean ker…

**Figure 4.** Illustration of successive extractions of Boolean kernels from a given …

**Figure 5.** Normalized L1 norm difference between the approximated weights at initialization and …

**Figure 6.** The progression of training losses, number of flips, and perplexity of the resulting models

**Figure 7.** Comparisons between our method and latent-weight approaches. (Adjacent plot from paper Section 6.5, Kernel Allocation and Comparison to BitNet b1.58: axes Transformer Block vs. #Kernels; series FC 2, FC 1, Out proj, Q proj, V proj, K proj.)

**Figure 9.** Allocated kernels for OPT-125M. We next evaluate our kernel allocation method on the OPT-125M model. It supports bit allocation at any granularity, including fractional averages, providing practitioners with a flexible model selection tool under deployment constraints

**Figure 10.** OPT-125M performance w.r.t. bit budget. In addition, our framework’s flexibility enables direct comparison with BitNet-b1.58 (Ma et al., 2024), which employs ternary weights. With a 1.58-bit budget, our model achieves reasonable results, whereas BitNet-b1.58 reaches a C4 perplexity of 10199.89 due to finetuning instability, consistent with Xu et al. (2024). We also compare against ShiftAddLLM (You et al.…

**Figure 11.** Estimated memory for finetuning for weights and optimizer states. We emphasize the efficiency of our method during finetuning by comparing MoS (Jo et al., 2024) with our approach using 3 Boolean kernels on the OPT-6.7B model. Because we optimize directly in the Boolean domain, each weight requires only 1 bit, whereas MoS relies on 16-bit latent weights. Moreover, we finetune only the last Boolean …

**Figure 12.** The training convergence of Lis, and Llogits, measured by Forward KL, and the final results with respect to the choice of Dlogits

**Figure 13.** Study on the effect of using knowledge distillation on …

**Figure 14.** Histogram of output-scaling values for the first linear layer of …

**Figure 15.** The training convergences of MBOK using 3 kernels with OPT models.

**Figure 16.** Study on the effect of using our successive …
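The Figure 9 caption mentions bit allocation at any granularity, including fractional averages. A minimal sketch of how a fractional average can arise from integer per-layer kernel counts follows. The layer shapes are those of one OPT-125M Transformer block (hidden size 768, feed-forward size 3072), while the allocation itself and the function name are hypothetical and not taken from the paper.

```python
# Weight counts for the linear layers of one OPT-125M block.
LAYER_PARAMS = {
    "q_proj": 768 * 768,
    "k_proj": 768 * 768,
    "v_proj": 768 * 768,
    "out_proj": 768 * 768,
    "fc1": 768 * 3072,
    "fc2": 3072 * 768,
}

def average_bits_per_weight(kernels_per_layer, layer_params=LAYER_PARAMS):
    """Each Boolean kernel costs 1 bit per weight; scaling vectors are ignored."""
    total_bits = sum(kernels_per_layer[name] * count
                     for name, count in layer_params.items())
    total_weights = sum(layer_params.values())
    return total_bits / total_weights

# Hypothetical allocation, chosen only to illustrate a fractional average.
allocation = {"q_proj": 1, "k_proj": 1, "v_proj": 2,
              "out_proj": 2, "fc1": 2, "fc2": 1}

print(average_bits_per_weight(allocation))  # 1.5
```

Because the layers differ in size, mixing one-kernel and two-kernel layers gives a weighted average that need not be an integer, which is one way a budget such as 1.58 bits could be approached.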

Original abstract

Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning LLMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper claims direct Boolean-domain fine-tuning of LLMs via multi-kernel parameters without any latent full-precision weights, but the training mechanics need close checking to confirm the efficiency gains.

The main point to know is that this work proposes representing LLMs with multi-kernel Boolean parameters so that fine-tuning can happen entirely in the Boolean domain, skipping the latent full-precision weights that most training-aware binarization methods keep around. If the update rule actually works that way, it would cut memory and compute during adaptation in a way that post-training binarization and standard low-bit methods do not.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multi-Boolean architecture for LLMs that represents weights via multi-kernel Boolean parameters. It claims this is the first method to enable direct finetuning entirely in the Boolean domain without any latent full-precision weights, thereby increasing representational capacity while reducing complexity in both finetuning and inference. Experiments are said to demonstrate outperformance over recent ultra-low-bit quantization and binarization baselines across diverse LLMs.

Significance. If the central claim holds—that end-to-end Boolean-domain finetuning is achieved without auxiliary full-precision structures—the work would offer a meaningful advance in efficient LLM training and deployment. The absence of any quantitative results, model sizes, datasets, or ablation details in the provided abstract, however, leaves the performance and efficiency assertions unsupported at present.

major comments (2)

[Abstract] The central performance claim ('outperforms recent ultra low-bit quantization and binarization techniques') is stated without any numerical results, baselines, or error bars. This leaves the primary empirical assertion load-bearing yet unsupported in the visible text.
[Abstract / Training Procedure] The claim of 'direct finetuning LLMs in the Boolean domain, eliminating the need for latent weights' is the load-bearing technical assertion. Standard binarization pipelines rely on straight-through estimators that maintain full-precision copies for gradient flow; the manuscript must explicitly demonstrate that the multi-kernel update rule operates exclusively on Boolean values and Boolean-compatible gradients with no hidden auxiliary full-precision state.

minor comments (1)

[Abstract] Clarify the precise definition of 'multi-kernel Boolean parameters' and how they differ from standard binary or ternary weight representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will implement.

Referee: [Abstract] The central performance claim ('outperforms recent ultra low-bit quantization and binarization techniques') is stated without any numerical results, baselines, or error bars. This leaves the primary empirical assertion load-bearing yet unsupported in the visible text.

Authors: We agree that the abstract would benefit from concrete numerical support. In the revised manuscript we will update the abstract to report key quantitative results, including specific performance deltas versus the cited ultra-low-bit baselines, model sizes, and error bars drawn from our experiments. This will make the empirical claims directly verifiable from the abstract. revision: yes

Referee: [Abstract / Training Procedure] The claim of 'direct finetuning LLMs in the Boolean domain, eliminating the need for latent weights' is the load-bearing technical assertion. Standard binarization pipelines rely on straight-through estimators that maintain full-precision copies for gradient flow; the manuscript must explicitly demonstrate that the multi-kernel update rule operates exclusively on Boolean values and Boolean-compatible gradients with no hidden auxiliary full-precision state.

Authors: We appreciate the referee's request for explicit verification. The multi-kernel Boolean update rule is formulated to act only on Boolean parameters using kernel-derived, Boolean-compatible gradient signals that do not invoke or store any full-precision latent state. To eliminate any ambiguity we will add a dedicated subsection containing the precise mathematical definition of the update, a proof sketch that no auxiliary full-precision tensors are required, and pseudocode of the training loop. revision: yes

Circularity Check

0 steps flagged

No circularity: novel multi-kernel Boolean framework presented as independent architectural proposal

The paper introduces a new representation using multi-kernel Boolean parameters and claims direct Boolean-domain finetuning without latent weights. No equations, derivations, or self-citations are shown that reduce the claimed efficiency gains or representational capacity to fitted inputs or prior self-referential results by construction. The central claim is an architectural change whose validity rests on empirical experiments rather than any definitional loop or renamed known result. This is the common case of a self-contained proposal with no load-bearing reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the introduction of multi-kernel Boolean parameters as a new representational primitive whose capacity and trainability are asserted without reference to external benchmarks or proofs in the abstract.

invented entities (1)

multi-kernel Boolean parameters · no independent evidence
purpose: To represent LLM weights in the Boolean domain while increasing representational capacity
Core new construct introduced to enable direct Boolean fine-tuning; no independent evidence or falsifiable prediction outside the method itself is stated in the abstract.

pith-pipeline@v0.9.0 · 5636 in / 1180 out tokens · 40803 ms · 2026-05-19T12:38:07.245849+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning LLMs in the Boolean domain, eliminating the need for latent weights.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The update rule for the accumulator is then defined as: $M^{(l),t+1}[i,j] \leftarrow \beta_t\, M^{(l),t}[i,j] + \eta\, Q^{(l),t}[i,j]$
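A minimal sketch of the quoted accumulator update, applied elementwise. Only the line `m = beta * m + eta * q` comes from the passage above. The flip criterion (flip a weight when its accumulator has the same sign as the weight, then reset that accumulator) and every name below are assumptions made for illustration, since the quoted passage does not say when a Boolean weight is flipped.

```python
def accumulate_and_flip(weights, accum, signal, beta, eta):
    """One elementwise step: weights are +1/-1, accum and signal are real-valued."""
    new_weights, new_accum, flips = [], [], 0
    for w, m, q in zip(weights, accum, signal):
        m = beta * m + eta * q   # quoted accumulator update
        if m * w > 0:            # assumed flip criterion (hypothetical)
            w, m = -w, 0.0       # flip the Boolean weight, reset its accumulator
            flips += 1
        new_weights.append(w)
        new_accum.append(m)
    return new_weights, new_accum, flips

weights, accum, flips = accumulate_and_flip(
    weights=[1, -1, 1], accum=[0.0, 0.0, 0.0],
    signal=[0.5, 0.5, -0.2], beta=1.0, eta=1.0,
)
print(weights, accum, flips)
```

In this sketch the weights never leave {+1, -1}; the only real-valued quantity carried between steps is the accumulator, which here plays the role of an optimizer state.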

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

[1]

URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/edbcb7583fd8921dad78adecfe06a99b-Paper-Conference.pdf. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child...

work page 2023
[2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/0df38cd13520747e1e64e5b123a78ef8-Paper-Conference.pdf. Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, and Dacheng Tao. DB-LLM: Accurate Dual-Binarization for Efficient LLMs. In Findings of the Association for Computat...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1300 2023
[3]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

URL https://openreview.net/forum?id=dXiGWqBoxaD. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Effi- cient Finetuning of Quantized LLMs. In Advances in Neural Information Process- ing Systems , volume 36, pp. 10088–10115. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 1feb8787143...

work page 2023
[4]

URL https://openreview.net/forum?id=6XUSDvBFkV. C. Eckart and G. Young. The Approximation of One Matrix by Another of Lower Rank. Psychome- trika, 1936. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations,...

work page internal anchor Pith review Pith/arXiv arXiv 1936
[5]

Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W

doi: https://doi.org/10.1017/S0025557200230271. Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-Sparse Quantization. In Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pp. 23901...

work page doi:10.1017/s0025557200230271 2024
[6]

DeepSeek-V3 Technical Report

URL https://openreview.net/forum?id=ZU8OdDLTts. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation- aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems , volume 6, pp. 87–100, 2024. URL http...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.5. URL https: //aclanthology.org/2023.acl-long.5/. Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. InFindings of ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.acl-long.5 2023
[8]

LLaMA: Open and Efficient Foundation Language Models

URL https://openreview.net/forum?id=8Wuvhh0LYW. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. InProceedings of the 40th International Conference on Machine Learning, volume 202 of...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

BitNet: Scaling 1-bit Transformers for Large Language Models

URL https://proceedings.neurips.cc/paper_files/paper/2017/ file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit Transformers for Large Language Models. arXiv preprint arXiv:2310.11453, 2023. Lei Wang, Lingxiao Ma, Shij...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[10]

doi: 10.18653/v1/2023.acl-long.605

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.605. URL https://aclanthology.org/2023.acl-long.605/. Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. OneBit: Towards Extremely Low-bit Large Language Models. In The Thirty- eighth Annual Conference on Neural Information Processi...

work page doi:10.18653/v1/2023.acl-long.605 2023
[11]

OPT: Open Pre-trained Transformer Language Models

URL https://proceedings.neurips.cc/paper_files/paper/2024/ file/2c30a37c75f062e0bf79297c73db8c6c-Paper-Conference.pdf. Zhihang Yuan, Yuzhang Shang, and Zhen Dong. PB-LLM: Partially Binarized Large Language Models. In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=BifeBRhikU. Rowan Zellers, Ari ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/p19-1472 2024
[12]

∀x, y ∈ N: p(xy) = xnor(p(x), p(y))

work page
[13]

∀a, b ∈ L: e(xnor(a, b)) = e(a) e(b)

work page
[14]

∀x, y ∈ N: x = y ⇔ |x| = |y| and p(x) = p(y). In particular, property Proposition A.9(2) implies that by the embedding map e(·), we have: ({TRUE , FALSE }, xor) ∼= ({±1}, −×), (18) ({TRUE , FALSE }, xnor) ∼= ({±1}, ×), (19) where ∼= and × stand for isomorphic relation, and the real multiplication, resp. A consequence is that by e(·), a computing sequence ...

work page
[15]

a ∈ L, x ∈ N: xnor(a, x) = e(a)x

work page
[16]

x, y ∈ N: xnor(x, y) = xy

work page
[17]

x ∈ {L, N}, y, z ∈ N: xnor(x, y + z) = xnor(x, y) + xnor(x, z)

work page
[18]

x ∈ {L, N}, y, λ ∈ N: xnor(x, λy) = λxnor(x, y)

work page
[19]

x ∈ {L, N}, y ∈ N: xor(x, y) = −xnor(x, y). Proof. The proof follows definitions A.5 and A.8. • Following Definition A.1 we have ∀t ∈ M, xnor(TRUE , t) = t, xnor(FALSE , t) = ¬t, and xnor(0, t) = 0. Put v = xnor(a, x). We have |v| = |x| and p(v) = xnor(a, p(x)). Hence, a = 0 ⇒ p(v) = 0 ⇒ v = 0; a = TRUE ⇒ p(v) = p(x) ⇒ v = x; a = FALSE ⇒ p(v) = ¬ p(x) ⇒ v...

work page 2026
[20]

δf (x → y) = xnor(δ(x → y), f ′(x))

work page
[21]

(g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)). Proof. The proof is by definition:

work page
[22]

If y = x, then the result is trivial

∀x, y ∈ B, there are two cases. If y = x, then the result is trivial. Otherwise, i.e., y = ¬x, by definition we have: f ′(x) = xnor(δ(x → ¬x), δf(x → ¬x)) ⇔ δf (x → ¬x) = xnor(δ(x → ¬x), f ′(x)). Hence the result. 20 Published as a conference paper at ICLR 2026

work page 2026
[23]

Hence, by definition, (¬f)′(x) = xnor(δ(x → ¬x), δ(¬f(x → ¬x))) = xnor(δ(x → ¬x), ¬δf (x → ¬x)) = ¬xnor(δ(x → ¬x), δf(x → ¬x)) = ¬f ′(x)

∀x, y ∈ B, it is easy to verify by truth table that δ(¬f(x → y)) = ¬δf (x → y). Hence, by definition, (¬f)′(x) = xnor(δ(x → ¬x), δ(¬f(x → ¬x))) = xnor(δ(x → ¬x), ¬δf (x → ¬x)) = ¬xnor(δ(x → ¬x), δf(x → ¬x)) = ¬f ′(x)

work page
[24]

Proposition A.16

Using definition, property (i), and associativity of xnor, ∀x ∈ B we have: (g ◦ f)′(x) = xnor(δ(x → ¬x), δg(f(x) → f(¬x))) = xnor(δ(x → ¬x), xnor(δf (x → ¬x), g′(f(x)))) = xnor(g′(f(x)), xnor(δ(x → ¬x), δf(x → ¬x))) = xnor(g′(f(x)), f ′(x)). Proposition A.16. (Nguyen, 2023; Nguyen et al., 2024) For f ∈ F (B, N), the following properties hold:

work page 2023
[25]

x, y ∈ B: δf (x → y) = xnor(δ(x → y), f ′(x))

work page
[26]

α ∈ N: (αf)′(x) = αf ′(x)

work page
[27]

g ∈ F (B, N): (f + g)′(x) = f ′(x) + g′(x). Proof. The proof is as follows:

work page
[28]

Firstly, the result is trivial if y = x

For x, y ∈ B. Firstly, the result is trivial if y = x. For y ̸= x, i.e., y = ¬x, by definition: f ′(x) = xnor(δ(x → ¬x), δf(x → ¬x)). Hence, |δf (x → ¬x)| = |f ′(x)| since |δ(x → ¬x)| = 1, and p(f ′(x)) = xnor(δ(x → ¬x), p(δf (x → ¬x))) ⇔ p(δf (x → ¬x)) = xnor(δ(x → ¬x), p(f ′(x))), where p(·) is the logic projector Eq. 17. Thus, δf (x → ¬x) = xnor(δ(x → ...

work page
[29]

Hence, by definition, (αf)′(x) = xnor(δ(x → ¬x), δ(αf(x → ¬x))) = xnor(δ(x → ¬x), αδf(x → ¬x)) = α xnor(δ(x → ¬x), δf(x → ¬x)), due to Proposition A.10(4) = αf ′(x)

Firstly ∀x, y ∈ B, we have δ(αf(x → y)) = αf(y) − αf(x) = αδf (x → y). Hence, by definition, (αf)′(x) = xnor(δ(x → ¬x), δ(αf(x → ¬x))) = xnor(δ(x → ¬x), αδf(x → ¬x)) = α xnor(δ(x → ¬x), δf(x → ¬x)), due to Proposition A.10(4) = αf ′(x)

work page
[30]

For f, g ∈ F (B, N), (f + g)′(x) = xnor(δ(x → ¬x), δ(f + g)(x → ¬x)) = xnor(δ(x → ¬x), δf(x → ¬x) + δg(x → ¬x)) (∗) = xnor(δ(x → ¬x), δf(x → ¬x)) + xnor(δ(x → ¬x), δg(x → ¬x)), = f ′(x) + g′(x), where (∗) is due to Proposition A.10(3). 21 Published as a conference paper at ICLR 2026 For f ∈ F (Z, N), its derivative, also known in terms of finite differenc...

work page 2026
[31]

For B f → B g → D: (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)), ∀x ∈ B

work page
[32]

For B f → Z g → D, x ∈ B, if |f ′(x)| ≤ 1 and g′(f(x)) = g′(f(x) − 1), then: (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)). Proof. The proof is as follows

work page
[33]

For B f → B g → N, by using Proposition A.16(1), the proof is similar to that of Proposition A.15(3)

The case of B f → B g → B is obtained from Proposition A.15(3). For B f → B g → N, by using Proposition A.16(1), the proof is similar to that of Proposition A.15(3)

work page
[34]

"" 9 G_X = Z.mm(1-2 *W) 10 11

By definition, we have (g ◦ f)′(x) = xnor(δ(x → ¬x), δg(f(x) → f(¬x))). (20) Using property (1) of Proposition A.16, we have: f(¬x) = f(x) + δf (x → ¬x) = f(x) + xnor(δ(x → ¬x), f ′(x)). (21) Applying Eq. 21 back to Eq. 20, the result is trivial if f ′(x) = 0 . The remaining case is |f ′(x)| = 1 for which we have xnor(δ(x → ¬ x), f ′(x)) = ±1. First, for ...

work page 2026
[35]

projection-weighted CCA

(50) Therefore max y,∥y∥2=1 ∥|W|y∥2 ≥ max x,∥x∥2=1 ∥Wx∥2 (51) ⇔ σ1(|W|) ≥ σ1(W). (52) Thus, the lemma is proved. Proposition D.3 (Restated from Xu et al. (2024)) . For W ∈ Rm×n, write W = eUeΣeV ⊤ its SVD. Let a = √˜σ1eU[:,1], and b = √˜σ1eV[:,1]. Similarly, denote |W| = UΣV⊤ its SVD; sin and sout are given as: sin = √σ1V[:,1], and sout = √σ1U[:,1]. We de...

work page 2024
[36]

for 1-bit matrix multiplications. Using FP16 activations with INT1 weights, we measure the latency of linear layers in LLaMA-7B (Table 12) and LLaMA-13B (Table 13) under an inference batch size of 1, evaluating our method MBOK with two kernels. Our results show that MBOK achieves up to 1https://github.com/microsoft/BitBLAS 41 Published as a conference pap...

work page 2026
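Several of the extracted entries above ([13], [14], [19], [21]) are appendix identities rather than bibliographic references, and they can be checked exhaustively in a few lines. The sketch below assumes the embedding e(TRUE) = +1, e(FALSE) = -1 implied by the isomorphism in entry [14], and represents the variation δ(x → y) in embedded form as +1, -1, or 0. The function names are illustrative.

```python
from itertools import product

def e(a):
    """Embed a Boolean into {+1, -1}."""
    return 1 if a else -1

def xnor(a, b):
    return a == b

def xor(a, b):
    return a != b

# Entry [13]: e(xnor(a, b)) = e(a) * e(b). Entry [19]: xor = -xnor under e.
for a, b in product([True, False], repeat=2):
    assert e(xnor(a, b)) == e(a) * e(b)
    assert e(xor(a, b)) == -e(a) * e(b)

def variation(x, y):
    """delta(x -> y): +1 for FALSE->TRUE, -1 for TRUE->FALSE, 0 if unchanged."""
    return (e(y) - e(x)) // 2

def derivative(f, x):
    """f'(x) = xnor(delta(x -> not x), delta f(x -> not x)), computed as a product."""
    return variation(x, not x) * variation(f(x), f(not x))

# All four Boolean functions from B to B.
FUNCTIONS = [lambda x: x, lambda x: not x, lambda x: True, lambda x: False]

def chain_rule_holds():
    """Entry [21]: (g o f)'(x) = xnor(g'(f(x)), f'(x)) for every f, g, and x."""
    for f, g in product(FUNCTIONS, repeat=2):
        for x in (True, False):
            lhs = derivative(lambda t: g(f(t)), x)
            rhs = derivative(g, f(x)) * derivative(f, x)
            if lhs != rhs:
                return False
    return True

print(chain_rule_holds())
```

The check relies on the embedding turning xnor into ordinary multiplication, which is why the Boolean chain rule can be verified with integer products.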

[1] [1]

URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/edbcb7583fd8921dad78adecfe06a99b-Paper-Conference.pdf. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child...

work page 2023

[2] [2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/0df38cd13520747e1e64e5b123a78ef8-Paper-Conference.pdf. Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, and Dacheng Tao. DB-LLM: Accurate Dual-Binarization for Efficient LLMs. In Findings of the Association for Computat...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1300 2023

[3] [3]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

URL https://openreview.net/forum?id=dXiGWqBoxaD. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Effi- cient Finetuning of Quantized LLMs. In Advances in Neural Information Process- ing Systems , volume 36, pp. 10088–10115. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 1feb8787143...

work page 2023

[4] [4]

URL https://openreview.net/forum?id=6XUSDvBFkV. C. Eckart and G. Young. The Approximation of One Matrix by Another of Lower Rank. Psychome- trika, 1936. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations,...

work page internal anchor Pith review Pith/arXiv arXiv 1936

[5] [5]

Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W

doi: https://doi.org/10.1017/S0025557200230271. Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-Sparse Quantization. In Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pp. 23901...

work page doi:10.1017/s0025557200230271 2024

[6] [6]

DeepSeek-V3 Technical Report

URL https://openreview.net/forum?id=ZU8OdDLTts. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation- aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems , volume 6, pp. 87–100, 2024. URL http...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.5. URL https: //aclanthology.org/2023.acl-long.5/. Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. InFindings of ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.acl-long.5 2023

[8] [8]

LLaMA: Open and Efficient Foundation Language Models

URL https://openreview.net/forum?id=8Wuvhh0LYW. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. InProceedings of the 40th International Conference on Machine Learning, volume 202 of...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

BitNet: Scaling 1-bit Transformers for Large Language Models

URL https://proceedings.neurips.cc/paper_files/paper/2017/ file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit Transformers for Large Language Models. arXiv preprint arXiv:2310.11453, 2023. Lei Wang, Lingxiao Ma, Shij...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[10] [10]

doi: 10.18653/v1/2023.acl-long.605

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.605. URL https://aclanthology.org/2023.acl-long.605/. Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. OneBit: Towards Extremely Low-bit Large Language Models. In The Thirty- eighth Annual Conference on Neural Information Processi...

doi:10.18653/v1/2023.acl-long.605 2023

[11]

OPT: Open Pre-trained Transformer Language Models

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/2c30a37c75f062e0bf79297c73db8c6c-Paper-Conference.pdf. Zhihang Yuan, Yuzhang Shang, and Zhen Dong. PB-LLM: Partially Binarized Large Language Models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=BifeBRhikU. Rowan Zellers, Ari ...

arXiv doi:10.18653/v1/p19-1472 2024

[12]

∀x, y ∈ N: p(xy) = xnor(p(x), p(y))

[13]

∀a, b ∈ L: e(xnor(a, b)) = e(a) e(b)

[14]

∀x, y ∈ N: x = y ⇔ |x| = |y| and p(x) = p(y). In particular, Proposition A.9(2) implies that, under the embedding map e(·), we have: ({TRUE, FALSE}, xor) ≅ ({±1}, −×) (18) and ({TRUE, FALSE}, xnor) ≅ ({±1}, ×) (19), where ≅ denotes the isomorphism relation and × denotes real multiplication. A consequence is that by e(·), a computing sequence ...
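
The isomorphisms in Eqs. 18 and 19 can be checked exhaustively on the four input pairs. The sketch below is illustrative only: it assumes the convention e(TRUE) = +1 and e(FALSE) = −1, and the function names `e`, `xnor`, `xor`, and `check_isomorphism` are hypothetical helpers rather than code from the paper.

```python
from functools import reduce
from itertools import product


def e(a: bool) -> int:
    """Embedding map (assumed convention): TRUE -> +1, FALSE -> -1."""
    return 1 if a else -1


def xnor(a: bool, b: bool) -> bool:
    """Logical xnor: TRUE exactly when both inputs agree."""
    return a == b


def xor(a: bool, b: bool) -> bool:
    """Logical xor: TRUE exactly when the inputs differ."""
    return a != b


def check_isomorphism() -> bool:
    """Check e(xnor(a, b)) = e(a)e(b) and e(xor(a, b)) = -e(a)e(b) for all pairs."""
    return all(
        e(xnor(a, b)) == e(a) * e(b) and e(xor(a, b)) == -(e(a) * e(b))
        for a, b in product([True, False], repeat=2)
    )


# A chain of xnor operations maps to a chain of real multiplications.
bits = [True, False, False]
chain_logic = e(reduce(xnor, bits))
chain_real = reduce(lambda u, v: u * v, [e(b) for b in bits])
print(check_isomorphism(), chain_logic == chain_real)
```

Because xnor is associative, folding it over a list of Boolean values and then embedding gives the same number as embedding first and multiplying, which is the sense in which a logic computation can be mirrored in the ±1 domain.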

[15]

a ∈ L, x ∈ N: xnor(a, x) = e(a)x

[16]

x, y ∈ N: xnor(x, y) = xy

[17]

x ∈ {L, N}, y, z ∈ N: xnor(x, y + z) = xnor(x, y) + xnor(x, z)

[18]

x ∈ {L, N}, y, λ ∈ N: xnor(x, λy) = λxnor(x, y)

[19]

x ∈ {L, N}, y ∈ N: xor(x, y) = −xnor(x, y). Proof. The proof follows Definitions A.5 and A.8. • Following Definition A.1, we have ∀t ∈ M: xnor(TRUE, t) = t, xnor(FALSE, t) = ¬t, and xnor(0, t) = 0. Put v = xnor(a, x). We have |v| = |x| and p(v) = xnor(a, p(x)). Hence, a = 0 ⇒ p(v) = 0 ⇒ v = 0; a = TRUE ⇒ p(v) = p(x) ⇒ v = x; a = FALSE ⇒ p(v) = ¬ p(x) ⇒ v...
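
The mixed-type properties listed above (xnor of a logic value with a number, distributivity over addition, scalar homogeneity, and xor as the negation of xnor) can be illustrated numerically. This is a minimal sketch under stated assumptions: only TRUE and FALSE are handled (the logic value 0 is left out), results are kept in embedded numeric form, and `e`, `xnor`, and `xor` are hypothetical helper names.

```python
def e(v):
    """Embed a value numerically: TRUE -> +1, FALSE -> -1; numbers pass through."""
    if isinstance(v, bool):  # test bool first, since bool is a subclass of int
        return 1 if v else -1
    return v


def xnor(x, y):
    """Mixed-type xnor in embedded form: the product of the embedded operands."""
    return e(x) * e(y)


def xor(x, y):
    """Mixed-type xor, taken as the negation of xnor."""
    return -xnor(x, y)


# Hypothetical sample values: one logic value and three numbers.
a, x, y, lam = False, 3, 5, 4

print(xnor(a, x) == e(a) * x)                    # xnor(a, x) = e(a) x
print(xnor(x, y) == x * y)                       # xnor(x, y) = x y
print(xnor(a, x + y) == xnor(a, x) + xnor(a, y)) # distributes over addition
print(xnor(a, lam * y) == lam * xnor(a, y))      # scalar homogeneity
print(xor(a, y) == -xnor(a, y))                  # xor = -xnor
```

Treating xnor as a product in the embedded domain makes the distributivity and homogeneity properties direct consequences of ordinary arithmetic, which is how they are used in the later proofs.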

[20]

δf (x → y) = xnor(δ(x → y), f ′(x))

[21]

(g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)). Proof. The proof is by definition:

[22]

∀x, y ∈ B, there are two cases. If y = x, then the result is trivial. Otherwise, i.e., y = ¬x, by definition we have: f ′(x) = xnor(δ(x → ¬x), δf(x → ¬x)) ⇔ δf (x → ¬x) = xnor(δ(x → ¬x), f ′(x)). Hence the result. Published as a conference paper at ICLR 2026

[23]

∀x, y ∈ B, it is easy to verify by truth table that δ(¬f(x → y)) = ¬δf (x → y). Hence, by definition, (¬f)′(x) = xnor(δ(x → ¬x), δ(¬f(x → ¬x))) = xnor(δ(x → ¬x), ¬δf (x → ¬x)) = ¬xnor(δ(x → ¬x), δf(x → ¬x)) = ¬f ′(x)

[24]

Proposition A.16

Using definition, property (i), and associativity of xnor, ∀x ∈ B we have: (g ◦ f)′(x) = xnor(δ(x → ¬x), δg(f(x) → f(¬x))) = xnor(δ(x → ¬x), xnor(δf (x → ¬x), g′(f(x)))) = xnor(g′(f(x)), xnor(δ(x → ¬x), δf(x → ¬x))) = xnor(g′(f(x)), f ′(x)). Proposition A.16. (Nguyen, 2023; Nguyen et al., 2024) For f ∈ F (B, N), the following properties hold:

[25]

x, y ∈ B: δf (x → y) = xnor(δ(x → y), f ′(x))

[26]

α ∈ N: (αf)′(x) = αf ′(x)

[27]

g ∈ F (B, N): (f + g)′(x) = f ′(x) + g′(x). Proof. The proof is as follows:

[28]

Let x, y ∈ B. First, the result is trivial if y = x. For y ≠ x, i.e., y = ¬x, by definition: f ′(x) = xnor(δ(x → ¬x), δf(x → ¬x)). Hence, |δf (x → ¬x)| = |f ′(x)| since |δ(x → ¬x)| = 1, and p(f ′(x)) = xnor(δ(x → ¬x), p(δf (x → ¬x))) ⇔ p(δf (x → ¬x)) = xnor(δ(x → ¬x), p(f ′(x))), where p(·) is the logic projector (Eq. 17). Thus, δf (x → ¬x) = xnor(δ(x → ...

[29]

First, ∀x, y ∈ B, we have δ(αf(x → y)) = αf(y) − αf(x) = αδf (x → y). Hence, by definition, (αf)′(x) = xnor(δ(x → ¬x), δ(αf(x → ¬x))) = xnor(δ(x → ¬x), αδf(x → ¬x)) = α xnor(δ(x → ¬x), δf(x → ¬x)) = αf ′(x), where the third equality is due to Proposition A.10(4).
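
The properties of the variation f′ for f ∈ F(B, N) can be checked on a small example. The sketch below assumes that δ(x → y) is TRUE when y > x, FALSE when y < x, and 0 when y = x (a definition not restated in this section), embeds it as +1, −1, and 0, and uses δf(x → y) = f(y) − f(x) as in the proof above. The functions `f` and `g` and all helper names are hypothetical.

```python
def delta(x: bool, y: bool) -> int:
    """Embedded variation of x -> y (assumed): +1 if y > x, -1 if y < x, 0 if equal."""
    return int(y) - int(x)


def delta_f(f, x: bool, y: bool):
    """Variation of a numeric-valued f from x to y: f(y) - f(x)."""
    return f(y) - f(x)


def variation(f, x: bool):
    """f'(x) = xnor(delta(x -> not x), delta_f(x -> not x)), with xnor as a product."""
    return delta(x, not x) * delta_f(f, x, not x)


def f(b: bool) -> int:
    """Hypothetical example function from B to N."""
    return 3 if b else -2


def g(b: bool) -> int:
    """Second hypothetical example function from B to N."""
    return 1 if b else 4


alpha = 7
for x in (False, True):
    for y in (False, True):
        # Property (1): delta_f(x -> y) = xnor(delta(x -> y), f'(x))
        assert delta_f(f, x, y) == delta(x, y) * variation(f, x)
    # Property (2): (alpha f)'(x) = alpha f'(x)
    assert variation(lambda b: alpha * f(b), x) == alpha * variation(f, x)
    # Property (3): (f + g)'(x) = f'(x) + g'(x)
    assert variation(lambda b: f(b) + g(b), x) == variation(f, x) + variation(g, x)

print(variation(f, False), variation(g, False))
```

For this choice of `f`, flipping the input from FALSE to TRUE raises the output by 5 and flipping from TRUE to FALSE lowers it by 5, so the variation is 5 at both points; the sign of δ(x → ¬x) compensates for the direction of the flip.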

[30]

For f, g ∈ F (B, N), (f + g)′(x) = xnor(δ(x → ¬x), δ(f + g)(x → ¬x)) = xnor(δ(x → ¬x), δf(x → ¬x) + δg(x → ¬x)) (∗) = xnor(δ(x → ¬x), δf(x → ¬x)) + xnor(δ(x → ¬x), δg(x → ¬x)) = f ′(x) + g′(x), where (∗) is due to Proposition A.10(3). For f ∈ F (Z, N), its derivative, also known in terms of finite differenc...

[31]

For f: B → B and g: B → D: (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)), ∀x ∈ B

[32]

For f: B → Z and g: Z → D, and x ∈ B, if |f ′(x)| ≤ 1 and g′(f(x)) = g′(f(x) − 1), then: (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)). Proof. The proof is as follows:

[33]

The case f: B → B, g: B → B is obtained from Proposition A.15(3). For f: B → B, g: B → N, by using Proposition A.16(1), the proof is similar to that of Proposition A.15(3).
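
For the purely Boolean case, the chain rule (g ◦ f)′(x) = xnor(g′(f(x)), f′(x)) can be verified by brute force over all unary Boolean functions. The sketch below assumes the same embedded variation as before (δ(x → y) as +1, −1, or 0, with xnor realized as a product) and takes δf(x → y) = δ(f(x) → f(y)) for Boolean-valued f; these conventions and all names are illustrative assumptions.

```python
from itertools import product


def delta(x: bool, y: bool) -> int:
    """Embedded variation of x -> y (assumed): +1 if y > x, -1 if y < x, 0 if equal."""
    return int(y) - int(x)


def variation(f, x: bool) -> int:
    """f'(x) for Boolean-valued f, with xnor realized as a product of embeddings."""
    return delta(x, not x) * delta(f(x), f(not x))


# The four functions from B to B: identity, negation, and the two constants.
UNARY = [
    lambda b: b,
    lambda b: not b,
    lambda b: True,
    lambda b: False,
]


def chain_rule_holds() -> bool:
    """Check (g o f)'(x) = g'(f(x)) * f'(x) for every pair (f, g) and every x."""
    for f, g in product(UNARY, repeat=2):
        for x in (False, True):
            lhs = variation(lambda t: g(f(t)), x)
            rhs = variation(g, f(x)) * variation(f, x)
            if lhs != rhs:
                return False
    return True


print(chain_rule_holds())
```

The check covers all 16 pairs of unary Boolean functions at both inputs, so it exercises both the case f′(x) = 0 (where f is locally constant) and the case where f flips its output.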

[34]

G_X = Z.mm(1 - 2 * W)

By definition, we have (g ◦ f)′(x) = xnor(δ(x → ¬x), δg(f(x) → f(¬x))). (20) Using property (1) of Proposition A.16, we have: f(¬x) = f(x) + δf (x → ¬x) = f(x) + xnor(δ(x → ¬x), f ′(x)). (21) Applying Eq. 21 back to Eq. 20, the result is trivial if f ′(x) = 0. The remaining case is |f ′(x)| = 1, for which we have xnor(δ(x → ¬x), f ′(x)) = ±1. First, for ...

[35]

projection-weighted CCA

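(50) Therefore $\max_{y,\,\|y\|_2=1} \||W|y\|_2 \ge \max_{x,\,\|x\|_2=1} \|Wx\|_2$ (51) $\Leftrightarrow \sigma_1(|W|) \ge \sigma_1(W)$. (52) Thus, the lemma is proved. Proposition D.3 (Restated from Xu et al. (2024)). For $W \in \mathbb{R}^{m \times n}$, write $W = \tilde{U}\tilde{\Sigma}\tilde{V}^\top$ for its SVD. Let $a = \sqrt{\tilde{\sigma}_1}\,\tilde{U}_{[:,1]}$ and $b = \sqrt{\tilde{\sigma}_1}\,\tilde{V}_{[:,1]}$. Similarly, denote by $|W| = U\Sigma V^\top$ its SVD; $s_{in}$ and $s_{out}$ are given as: $s_{in} = \sqrt{\sigma_1}\,V_{[:,1]}$ and $s_{out} = \sqrt{\sigma_1}\,U_{[:,1]}$. We de...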

[36]

for 1-bit matrix multiplications. Using FP16 activations with INT1 weights, we measure the latency of linear layers in LLaMA-7B (Table 12) and LLaMA-13B (Table 13) under an inference batch size of 1, evaluating our method MBOK with two kernels. Our results show that MBOK achieves up to ... [Footnote 1: https://github.com/microsoft/BitBLAS]
