arxiv: 2505.22811 · v5 · submitted 2025-05-28 · 📊 stat.ML · cs.LG

Highly Efficient and Effective LLMs with Multi-Boolean Architectures

Pith reviewed 2026-05-19 12:38 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords large language models · Boolean parameters · model binarization · efficient fine-tuning · low-bit quantization · model compression · representational capacity

The pith

Multi-kernel Boolean parameters enable direct fine-tuning of LLMs in the Boolean domain without latent weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes representing large language models with multi-kernel Boolean parameters. This setup allows fine-tuning to occur directly in the Boolean domain, removing any need to maintain latent full-precision weights during the process. The result is claimed to increase what the model can represent while cutting computational demands in both fine-tuning and inference. Experiments on multiple LLMs indicate better outcomes than existing ultra-low-bit quantization or binarization approaches. If correct, the method would make adapting and running these models feasible with far less memory and compute overhead.

Core claim

The central claim is that LLMs can be represented with multi-kernel Boolean parameters to support direct finetuning entirely in the Boolean domain for the first time, which removes the need for latent weights. This change is presented as increasing representational capacity while lowering complexity for both the fine-tuning stage and later inference. Tests across diverse LLMs show the approach outperforming recent ultra low-bit quantization and binarization techniques.

What carries the argument

Multi-kernel Boolean parameters: sets of Boolean kernels used to encode weights so that optimization can proceed directly in the Boolean space without intermediate full-precision values.

If this is right

  • Memory requirements drop because parameters remain Boolean rather than needing space for latent full-precision copies.
  • Fine-tuning and inference run faster by staying in Boolean operations and avoiding repeated domain conversions.
  • Larger models become easier to adapt on hardware with limited precision support or memory capacity.
  • Performance holds up better than post-training binarization methods that typically lose significant accuracy.
  • The same direct-domain idea could extend efficiency gains to other model compression settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hardware accelerators built around Boolean logic could see direct performance lifts if models adopt this format.
  • Similar direct training in discrete spaces might apply to ternary or other low-precision weight schemes.
  • Smaller fully Boolean models could simplify deployment in resource-constrained or offline environments.

Load-bearing premise

Multi-kernel Boolean parameters can preserve or exceed the representational power of latent full-precision weights without adding failure modes or hidden costs that cancel out the efficiency benefits.

What would settle it

Compare accuracy and total operation count on a standard benchmark like GLUE when fine-tuning the same LLM architecture with this Boolean method versus a latent-weight binarization baseline; a clear accuracy drop or higher effective cost would challenge the claims.

Figures

Figures reproduced from arXiv: 2505.22811 by Ba-Hien Tran, Van Minh Nguyen.

Figure 1
Figure 1: Finetuning OPT models (Zhang et al., 2022) using our 3 Boolean kernels, compared to OPTQ (Frantar et al., 2023), which quantizes the models to 3 bits, and the FP16 baseline on the C4 dataset. Large language models (Brown et al., 2020; Touvron et al., 2023a; Liu et al., 2024a) have demonstrated unprecedented capabilities, largely due to the continuous growth in both model and dataset sizes. A …
Figure 2
Figure 2: Illustration of SVID. LLMs (Brown et al., 2020) are mostly based on the Transformer architecture (Vaswani et al., 2017), in which linear layers are the core elements. Inspired by Xu et al. (2024), we employ sign-value-independent decomposition (SVID) such that an FP input matrix W ∈ R^{m×n} of linear layers is decomposed into one Boolean matrix W_bool ≜ sign(W) and two FP vectors s_in and s_out. Precisely, let …
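
The SVID step described in this caption can be sketched in NumPy. This is a minimal sketch, not the authors' code: it assumes the rank-1 scaling restated further down this page (s_in = √σ₁ V[:,1] and s_out = √σ₁ U[:,1] from the SVD of |W|), encodes Boolean values as ±1, and uses the hypothetical names `svid` and `reconstruct`.

```python
import numpy as np

def svid(W):
    """Sketch of SVID: W is approximated by sign(W) * outer(s_out, s_in)."""
    # Boolean matrix encoded as +1 / -1 (zeros are mapped to +1 by assumption).
    W_bool = np.where(W >= 0, 1.0, -1.0)
    # Rank-1 approximation of |W| from its leading singular triplet.
    U, S, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    s_out = np.sqrt(S[0]) * U[:, 0]   # length m
    s_in = np.sqrt(S[0]) * Vt[0, :]   # length n
    # SVD signs are arbitrary; flipping both vectors keeps the outer product unchanged.
    if s_out.sum() < 0:
        s_out, s_in = -s_out, -s_in
    return W_bool, s_in, s_out

def reconstruct(W_bool, s_in, s_out):
    return W_bool * np.outer(s_out, s_in)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
W_bool, s_in, s_out = svid(W)
W_hat = reconstruct(W_bool, s_in, s_out)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative Frobenius error with one kernel: {rel_err:.3f}")
```

Because W − W_hat equals sign(W) times the rank-1 residual of |W|, the relative error of this sketch is always below 1, but a single kernel usually leaves a visible gap.
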
Figure 3
Figure 3: The computation of a linear layer approximated using multi kernels of Boolean. We have shown that SVID provides a good approximation of the original weights, its expressivity can be still limited to capture well the original FP parameters of complicated models, which were trained on large-scale datasets over extended periods of time. To overcome this limitation, we propose the use of a multi-Boolean ker…
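
One way to read this caption is that a linear layer becomes a sum of Boolean kernels, each with its own pair of scaling vectors. The sketch below assumes exactly that form, which the truncated caption does not confirm; `multi_kernel_linear` is a hypothetical name and the kernels are random placeholders rather than extracted weights.

```python
import numpy as np

def multi_kernel_linear(x, kernels):
    """y = sum_k s_out_k * (W_bool_k @ (s_in_k * x)).

    kernels: list of (W_bool, s_in, s_out) with W_bool in {-1, +1}^(m x n),
    s_in of length n and s_out of length m (assumed layout).
    """
    m = kernels[0][0].shape[0]
    y = np.zeros(m)
    for W_bool, s_in, s_out in kernels:
        # Only the +/-1 matrix product touches all m*n entries; the scalings are O(m + n).
        y += s_out * (W_bool @ (s_in * x))
    return y

rng = np.random.default_rng(0)
m, n, K = 5, 7, 3
kernels = [
    (np.where(rng.standard_normal((m, n)) >= 0, 1.0, -1.0), rng.random(n), rng.random(m))
    for _ in range(K)
]
x = rng.standard_normal(n)
y = multi_kernel_linear(x, kernels)

# Dense equivalent, for checking only: the summed scaled kernels as one FP matrix.
W_dense = sum(W_bool * np.outer(s_out, s_in) for W_bool, s_in, s_out in kernels)
print(np.allclose(y, W_dense @ x))
```

The factored form never materializes a full-precision weight matrix, which is the property the efficiency claims depend on.
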
Figure 4
Figure 4: Illustration of successive extractions of Boolean kernels from a given …
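
The caption is cut off, so the exact extraction rule is not visible on this page. A plausible reading, assumed here and not confirmed by the source, is a greedy residual scheme: apply SVID to the weight matrix, subtract the approximation, and repeat on what remains. The names `svid` and `extract_kernels` are hypothetical.

```python
import numpy as np

def svid(W):
    """One Boolean kernel plus rank-1 scaling of |W| (Boolean encoded as +/-1)."""
    W_bool = np.where(W >= 0, 1.0, -1.0)
    U, S, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    return W_bool, np.sqrt(S[0]) * Vt[0, :], np.sqrt(S[0]) * U[:, 0]

def extract_kernels(W, num_kernels):
    """Assumed greedy scheme: each kernel approximates the current residual."""
    residual = W.copy()
    kernels, errors = [], []
    for _ in range(num_kernels):
        W_bool, s_in, s_out = svid(residual)
        residual = residual - W_bool * np.outer(s_out, s_in)
        kernels.append((W_bool, s_in, s_out))
        # Relative Frobenius error of the sum of kernels extracted so far.
        errors.append(np.linalg.norm(residual) / np.linalg.norm(W))
    return kernels, errors

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 24))
kernels, errors = extract_kernels(W, num_kernels=3)
print([round(e, 3) for e in errors])
```

Under this assumption each step removes the leading singular component of the residual's magnitude, so the error shrinks with every added kernel.
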
Figure 5
Figure 5: Normalized L1 norm difference between the approximated weights at initialization and …
Figure 6
Figure 6: The progression of training losses, number of flips, and perplexity of the resulting models …
Figure 7
Figure 7: Comparisons between our method and latent-weight approaches. (Spilled from the adjacent page content: section heading 6.5, Kernel Allocation and Comparison to BitNet b1.58; plot axis labels Transformer Block and #Kernels, with series FC 2, FC 1, Out proj, Q proj, V proj, K proj.)
Figure 9
Figure 9: Allocated kernels for OPT-125M. We next evaluate our kernel allocation method on the OPT-125M model. It supports bit allocation at any granularity, including fractional averages, providing practitioners with a flexible model selection tool under deployment constraints.
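
A fractional average bit budget falls out naturally when different layers receive different numbers of kernels. The arithmetic below is a sketch under two assumptions: each Boolean kernel costs one bit per weight (scaling vectors are ignored), and the per-layer kernel counts are invented for illustration rather than taken from the paper. The layer shapes follow the public OPT-125M configuration (hidden size 768, feed-forward size 3072).

```python
# Weights per linear layer in one OPT-125M Transformer block.
layer_weights = {
    "Q proj": 768 * 768,
    "K proj": 768 * 768,
    "V proj": 768 * 768,
    "Out proj": 768 * 768,
    "FC 1": 768 * 3072,
    "FC 2": 3072 * 768,
}

# Hypothetical allocation: number of Boolean kernels assigned to each layer.
kernels_per_layer = {
    "Q proj": 2, "K proj": 2, "V proj": 3, "Out proj": 3, "FC 1": 2, "FC 2": 3,
}

def average_bits(layer_weights, kernels_per_layer):
    """Weighted average of bits per weight, assuming 1 bit per weight per kernel."""
    total_bits = sum(layer_weights[name] * kernels_per_layer[name] for name in layer_weights)
    return total_bits / sum(layer_weights.values())

avg = average_bits(layer_weights, kernels_per_layer)
print(avg)  # 2.5 for this hypothetical allocation
```

Integer kernel counts per layer therefore still allow non-integer budgets at the model level.
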
Figure 10
Figure 10: OPT-125M performance w.r.t. bit budget. In addition, our framework’s flexibility enables direct comparison with BitNet-b1.58 (Ma et al., 2024), which employs ternary weights. With a 1.58-bit budget, our model achieves reasonable results, whereas BitNet-b1.58 reaches a C4 perplexity of 10199.89 due to finetuning instability, consistent with Xu et al. (2024). We also compare against ShiftAddLLM (You et al.…
Figure 11
Figure 11: Estimated memory for finetuning for weights and optimizer states. We emphasize the efficiency of our method during finetuning by comparing MoS (Jo et al., 2024) with our approach using 3 Boolean kernels on the OPT-6.7B model. Because we optimize directly in the Boolean domain, each weight requires only 1 bit, whereas MoS relies on 16-bit latent weights. Moreover, we finetune only the last Boolean …
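
The weight-memory part of this comparison is simple arithmetic. The sketch below assumes a nominal 6.7 billion parameters, counts only weight storage, and ignores scaling vectors, embeddings, activations, and optimizer states, whose sizes are not given in the visible caption. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

num_params = 6.7e9  # nominal parameter count of OPT-6.7B (assumption)

latent_gb = weight_memory_gb(num_params, 16)      # 16-bit latent weights
boolean_gb = weight_memory_gb(num_params, 3 * 1)  # 3 Boolean kernels, 1 bit each

print(f"16-bit latent weights: {latent_gb:.2f} GB")
print(f"3 Boolean kernels:     {boolean_gb:.2f} GB")
print(f"ratio:                 {latent_gb / boolean_gb:.2f}x")
```

Under these assumptions the gap is 16/3, a bit over 5x, before any savings from finetuning only a subset of kernels are counted.
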
Figure 12
Figure 12: The training convergence of L_is and L_logits, measured by Forward KL, and the final results with respect to the choice of D_logits …
Figure 13
Figure 13: Study on the effect of using knowledge distillation on …
Figure 14
Figure 14: Histogram of output-scaling values for the first linear layer of …
Figure 15
Figure 15: The training convergences of MBOK using 3 kernels with OPT models.
Figure 16
Figure 16: Study on the effect of using our successive …

Original abstract

Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning of LLMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multi-Boolean architecture for LLMs that represents weights via multi-kernel Boolean parameters. It claims this is the first method to enable direct finetuning entirely in the Boolean domain without any latent full-precision weights, thereby increasing representational capacity while reducing complexity in both finetuning and inference. Experiments are said to demonstrate outperformance over recent ultra-low-bit quantization and binarization baselines across diverse LLMs.

Significance. If the central claim holds—that end-to-end Boolean-domain finetuning is achieved without auxiliary full-precision structures—the work would offer a meaningful advance in efficient LLM training and deployment. The absence of any quantitative results, model sizes, datasets, or ablation details in the provided abstract, however, leaves the performance and efficiency assertions unsupported at present.

major comments (2)
  1. [Abstract] The central performance claim ('outperforms recent ultra low-bit quantization and binarization techniques') is stated without any numerical results, baselines, or error bars. This leaves the primary empirical assertion load-bearing yet unsupported in the visible text.
  2. [Abstract / Training Procedure] The claim of 'direct finetuning LLMs in the Boolean domain, eliminating the need for latent weights' is the load-bearing technical assertion. Standard binarization pipelines rely on straight-through estimators that maintain full-precision copies for gradient flow; the manuscript must explicitly demonstrate that the multi-kernel update rule operates exclusively on Boolean values and Boolean-compatible gradients with no hidden auxiliary full-precision state.
minor comments (1)
  1. [Abstract] Clarify the precise definition of 'multi-kernel Boolean parameters' and how they differ from standard binary or ternary weight representations.
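
For readers unfamiliar with the latent-weight pipelines that the second major comment refers to, the toy sketch below shows a generic straight-through-estimator (STE) update on a small regression problem. It illustrates the baseline family only, not the paper's Boolean update rule, which is not specified on this page; the problem setup, learning rate, and clipping are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
w_true = np.where(rng.standard_normal(n) >= 0, 1.0, -1.0)  # target binary weights
X = rng.standard_normal((256, n))
y = X @ w_true

# The full-precision latent copy is the auxiliary state kept during training.
w_latent = 0.01 * rng.standard_normal(n)
lr = 0.05

for _ in range(200):
    w_bin = np.where(w_latent >= 0, 1.0, -1.0)  # forward pass uses binarized weights
    err = X @ w_bin - y
    grad_bin = X.T @ err / len(X)               # gradient w.r.t. the binarized weights
    w_latent -= lr * grad_bin                   # STE: apply that gradient to the latent copy
    w_latent = np.clip(w_latent, -1.0, 1.0)     # common clipping heuristic (assumption)

w_final = np.where(w_latent >= 0, 1.0, -1.0)
print("bytes of latent state:", w_latent.nbytes, "vs bits deployed:", n)
```

The point of the contrast is the storage line at the end: the deployed model needs one bit per weight, yet training keeps a floating-point value per weight. A method that truly stays in the Boolean domain would need to show that no such tensor exists anywhere in its update.
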

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will implement.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim ('outperforms recent ultra low-bit quantization and binarization techniques') is stated without any numerical results, baselines, or error bars. This leaves the primary empirical assertion load-bearing yet unsupported in the visible text.

    Authors: We agree that the abstract would benefit from concrete numerical support. In the revised manuscript we will update the abstract to report key quantitative results, including specific performance deltas versus the cited ultra-low-bit baselines, model sizes, and error bars drawn from our experiments. This will make the empirical claims directly verifiable from the abstract.
    Revision: yes

  2. Referee: [Abstract / Training Procedure] The claim of 'direct finetuning LLMs in the Boolean domain, eliminating the need for latent weights' is the load-bearing technical assertion. Standard binarization pipelines rely on straight-through estimators that maintain full-precision copies for gradient flow; the manuscript must explicitly demonstrate that the multi-kernel update rule operates exclusively on Boolean values and Boolean-compatible gradients with no hidden auxiliary full-precision state.

    Authors: We appreciate the referee's request for explicit verification. The multi-kernel Boolean update rule is formulated to act only on Boolean parameters using kernel-derived, Boolean-compatible gradient signals that do not invoke or store any full-precision latent state. To eliminate any ambiguity we will add a dedicated subsection containing the precise mathematical definition of the update, a proof sketch that no auxiliary full-precision tensors are required, and pseudocode of the training loop.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: novel multi-kernel Boolean framework presented as independent architectural proposal

Full rationale

The paper introduces a new representation using multi-kernel Boolean parameters and claims direct Boolean-domain finetuning without latent weights. No equations, derivations, or self-citations are shown that reduce the claimed efficiency gains or representational capacity to fitted inputs or prior self-referential results by construction. The central claim is an architectural change whose validity rests on empirical experiments rather than any definitional loop or renamed known result. This is the common case of a self-contained proposal with no load-bearing reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the introduction of multi-kernel Boolean parameters as a new representational primitive whose capacity and trainability are asserted without reference to external benchmarks or proofs in the abstract.

invented entities (1)
  • multi-kernel Boolean parameters (no independent evidence)
    purpose: To represent LLM weights in the Boolean domain while increasing representational capacity
    Core new construct introduced to enable direct Boolean fine-tuning; no independent evidence or falsifiable prediction outside the method itself is stated in the abstract.

pith-pipeline@v0.9.0 · 5636 in / 1180 out tokens · 40803 ms · 2026-05-19T12:38:07.245849+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

  1. [1]

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/edbcb7583fd8921dad78adecfe06a99b-Paper-Conference.pdf. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child...

  2. [2]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/0df38cd13520747e1e64e5b123a78ef8-Paper-Conference.pdf. Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, and Dacheng Tao. DB-LLM: Accurate Dual-Binarization for Efficient LLMs. In Findings of the Association for Computat...

  3. [3]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

    URL https://openreview.net/forum?id=dXiGWqBoxaD. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pp. 10088–10115. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb8787143...

  4. [4]

    URL https://openreview.net/forum?id=6XUSDvBFkV. C. Eckart and G. Young. The Approximation of One Matrix by Another of Lower Rank. Psychometrika, 1936. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations,...

  5. [5]

    Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W

    doi: https://doi.org/10.1017/S0025557200230271. Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-Sparse Quantization. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 23901...

  6. [6]

    DeepSeek-V3 Technical Report

    URL https://openreview.net/forum?id=ZU8OdDLTts. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems, volume 6, pp. 87–100, 2024. URL http...

  7. [7]

    The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.5. URL https://aclanthology.org/2023.acl-long.5/. Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. In Findings of ...

  8. [8]

    LLaMA: Open and Efficient Foundation Language Models

    URL https://openreview.net/forum?id=8Wuvhh0LYW. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of...

  9. [9]

    BitNet: Scaling 1-bit Transformers for Large Language Models

    URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit Transformers for Large Language Models. arXiv preprint arXiv:2310.11453, 2023. Lei Wang, Lingxiao Ma, Shij...

  10. [10]

    doi: 10.18653/v1/2023.acl-long.605

    Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.605. URL https://aclanthology.org/2023.acl-long.605/. Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. OneBit: Towards Extremely Low-bit Large Language Models. In The Thirty-eighth Annual Conference on Neural Information Processi...

  11. [11]

    OPT: Open Pre-trained Transformer Language Models

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/2c30a37c75f062e0bf79297c73db8c6c-Paper-Conference.pdf. Zhihang Yuan, Yuzhang Shang, and Zhen Dong. PB-LLM: Partially Binarized Large Language Models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=BifeBRhikU. Rowan Zellers, Ari ...

  12. [12]

    ∀x, y ∈ N: p(xy) = xnor(p(x), p(y))

  13. [13]

    ∀a, b ∈ L: e(xnor(a, b)) = e(a) e(b)

  14. [14]

    ∀x, y ∈ N: x = y ⇔ |x| = |y| and p(x) = p(y). In particular, property Proposition A.9(2) implies that by the embedding map e(·), we have: ({TRUE, FALSE}, xor) ≅ ({±1}, −×) (18) and ({TRUE, FALSE}, xnor) ≅ ({±1}, ×) (19), where ≅ and × stand for the isomorphic relation and the real multiplication, respectively. A consequence is that by e(·), a computing sequence ...

  15. [15]

    a ∈ L, x ∈ N: xnor(a, x) = e(a)x

  16. [16]

    x, y ∈ N: xnor(x, y) = xy

  17. [17]

    x ∈ {L, N}, y, z ∈ N: xnor(x, y + z) = xnor(x, y) + xnor(x, z)

  18. [18]

    x ∈ {L, N}, y, λ ∈ N: xnor(x, λy) = λxnor(x, y)

  19. [19]

    x ∈ {L, N}, y ∈ N: xor(x, y) = −xnor(x, y). Proof. The proof follows definitions A.5 and A.8. • Following Definition A.1 we have ∀t ∈ M, xnor(TRUE , t) = t, xnor(FALSE , t) = ¬t, and xnor(0, t) = 0. Put v = xnor(a, x). We have |v| = |x| and p(v) = xnor(a, p(x)). Hence, a = 0 ⇒ p(v) = 0 ⇒ v = 0; a = TRUE ⇒ p(v) = p(x) ⇒ v = x; a = FALSE ⇒ p(v) = ¬ p(x) ⇒ v...

  20. [20]

    δf (x → y) = xnor(δ(x → y), f ′(x))

  21. [21]

    (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)). Proof. The proof is by definition:

  22. [22]

    If y = x, then the result is trivial

    ∀x, y ∈ B, there are two cases. If y = x, then the result is trivial. Otherwise, i.e., y = ¬x, by definition we have: f ′(x) = xnor(δ(x → ¬x), δf(x → ¬x)) ⇔ δf (x → ¬x) = xnor(δ(x → ¬x), f ′(x)). Hence the result. 20 Published as a conference paper at ICLR 2026

  23. [23]

    Hence, by definition, (¬f)′(x) = xnor(δ(x → ¬x), δ(¬f(x → ¬x))) = xnor(δ(x → ¬x), ¬δf (x → ¬x)) = ¬xnor(δ(x → ¬x), δf(x → ¬x)) = ¬f ′(x)

    ∀x, y ∈ B, it is easy to verify by truth table that δ(¬f(x → y)) = ¬δf (x → y). Hence, by definition, (¬f)′(x) = xnor(δ(x → ¬x), δ(¬f(x → ¬x))) = xnor(δ(x → ¬x), ¬δf (x → ¬x)) = ¬xnor(δ(x → ¬x), δf(x → ¬x)) = ¬f ′(x)

  24. [24]

    Proposition A.16

    Using definition, property (i), and associativity of xnor, ∀x ∈ B we have: (g ◦ f)′(x) = xnor(δ(x → ¬x), δg(f(x) → f(¬x))) = xnor(δ(x → ¬x), xnor(δf (x → ¬x), g′(f(x)))) = xnor(g′(f(x)), xnor(δ(x → ¬x), δf(x → ¬x))) = xnor(g′(f(x)), f ′(x)). Proposition A.16. (Nguyen, 2023; Nguyen et al., 2024) For f ∈ F (B, N), the following properties hold:

  25. [25]

    x, y ∈ B: δf (x → y) = xnor(δ(x → y), f ′(x))

  26. [26]

    α ∈ N: (αf)′(x) = αf ′(x)

  27. [27]

    g ∈ F (B, N): (f + g)′(x) = f ′(x) + g′(x). Proof. The proof is as follows:

  28. [28]

    Firstly, the result is trivial if y = x

    For x, y ∈ B. Firstly, the result is trivial if y = x. For y ̸= x, i.e., y = ¬x, by definition: f ′(x) = xnor(δ(x → ¬x), δf(x → ¬x)). Hence, |δf (x → ¬x)| = |f ′(x)| since |δ(x → ¬x)| = 1, and p(f ′(x)) = xnor(δ(x → ¬x), p(δf (x → ¬x))) ⇔ p(δf (x → ¬x)) = xnor(δ(x → ¬x), p(f ′(x))), where p(·) is the logic projector Eq. 17. Thus, δf (x → ¬x) = xnor(δ(x → ...

  29. [29]

    Hence, by definition, (αf)′(x) = xnor(δ(x → ¬x), δ(αf(x → ¬x))) = xnor(δ(x → ¬x), αδf(x → ¬x)) = α xnor(δ(x → ¬x), δf(x → ¬x)), due to Proposition A.10(4) = αf ′(x)

    Firstly ∀x, y ∈ B, we have δ(αf(x → y)) = αf(y) − αf(x) = αδf (x → y). Hence, by definition, (αf)′(x) = xnor(δ(x → ¬x), δ(αf(x → ¬x))) = xnor(δ(x → ¬x), αδf(x → ¬x)) = α xnor(δ(x → ¬x), δf(x → ¬x)), due to Proposition A.10(4) = αf ′(x)

  30. [30]

    For f, g ∈ F(B, N), (f + g)′(x) = xnor(δ(x → ¬x), δ(f + g)(x → ¬x)) = xnor(δ(x → ¬x), δf(x → ¬x) + δg(x → ¬x)) (∗) = xnor(δ(x → ¬x), δf(x → ¬x)) + xnor(δ(x → ¬x), δg(x → ¬x)) = f′(x) + g′(x), where (∗) is due to Proposition A.10(3). For f ∈ F(Z, N), its derivative, also known in terms of finite differenc...

  31. [31]

    For B f → B g → D: (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)), ∀x ∈ B

  32. [32]

    For B f → Z g → D, x ∈ B, if |f ′(x)| ≤ 1 and g′(f(x)) = g′(f(x) − 1), then: (g ◦ f)′(x) = xnor(g′(f(x)), f ′(x)). Proof. The proof is as follows

  33. [33]

    For B f → B g → N, by using Proposition A.16(1), the proof is similar to that of Proposition A.15(3)

    The case of B f → B g → B is obtained from Proposition A.15(3). For B f → B g → N, by using Proposition A.16(1), the proof is similar to that of Proposition A.15(3)

  34. [34]

    G_X = Z.mm(1 - 2 * W)

    By definition, we have (g ◦ f)′(x) = xnor(δ(x → ¬x), δg(f(x) → f(¬x))). (20) Using property (1) of Proposition A.16, we have: f(¬x) = f(x) + δf (x → ¬x) = f(x) + xnor(δ(x → ¬x), f ′(x)). (21) Applying Eq. 21 back to Eq. 20, the result is trivial if f ′(x) = 0 . The remaining case is |f ′(x)| = 1 for which we have xnor(δ(x → ¬ x), f ′(x)) = ±1. First, for ...

  35. [35]

    projection-weighted CCA

    (50) Therefore max_{y, ∥y∥₂=1} ∥|W|y∥₂ ≥ max_{x, ∥x∥₂=1} ∥Wx∥₂ (51) ⇔ σ₁(|W|) ≥ σ₁(W). (52) Thus, the lemma is proved. Proposition D.3 (Restated from Xu et al. (2024)). For W ∈ R^{m×n}, write W = Ũ Σ̃ Ṽ^⊤ for its SVD. Let a = √σ̃₁ Ũ[:,1] and b = √σ̃₁ Ṽ[:,1]. Similarly, denote by |W| = U Σ V^⊤ its SVD; s_in and s_out are given as: s_in = √σ₁ V[:,1] and s_out = √σ₁ U[:,1]. We de...

  36. [36]

    for 1-bit matrix multiplications. Using FP16 activations with INT1 weights, we measure the latency of linear layers in LLaMA-7B (Table 12) and LLaMA-13B (Table 13) under an inference batch size of 1, evaluating our method MBOK with two kernels. Our results show that MBOK achieves up to 1https://github.com/microsoft/BitBLAS 41 Published as a conference pap...
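
Several of the extracted fragments above (entries 12 to 19) rest on the embedding e(TRUE) = +1, e(FALSE) = −1, under which xnor corresponds to real multiplication and xor to its negation. The short check below verifies those two identities exhaustively and adds the standard consequence that a ±1 dot product can be computed with xnor and a count. The function names are hypothetical and the code is a sanity check, not part of the paper.

```python
from itertools import product

def e(a: bool) -> int:
    """Embedding of Boolean values into {+1, -1}."""
    return 1 if a else -1

def xnor(a: bool, b: bool) -> bool:
    return a == b

def xor(a: bool, b: bool) -> bool:
    return a != b

# e(xnor(a, b)) = e(a) * e(b) and e(xor(a, b)) = -e(a) * e(b) for all inputs.
all_hold = all(
    e(xnor(a, b)) == e(a) * e(b) and e(xor(a, b)) == -e(a) * e(b)
    for a, b in product([True, False], repeat=2)
)

def boolean_dot(u, v):
    """Dot product of two +/-1 vectors given as Boolean lists: 2 * matches - n."""
    matches = sum(xnor(a, b) for a, b in zip(u, v))
    return 2 * matches - len(u)

u = [True, False, True, True, False]
v = [True, True, False, True, False]
reference = sum(e(a) * e(b) for a, b in zip(u, v))
print(all_hold, boolean_dot(u, v), reference)
```

This is the mechanism that lets 1-bit matrix multiplication kernels replace multiply-accumulate with bitwise operations and population counts.
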