pith. machine review for the scientific record.

arxiv: 2605.05994 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural network compression · matrix factorization · binary matrices · diagonal scaling · weight approximation · model acceleration · DiBA

The pith

DiBA approximates neural network weight matrices as a product of three diagonal and two binary matrices, slashing multiplications and recovering accuracy through diagonal retuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiBA as a compact factorization for dense weight matrices that appear in linear layers, 1x1 convolutions, attention, and embeddings. It writes A as D1 B1 D2 B2 D3 so that matrix-vector multiplication collapses to three element-wise scalings and two binary mixing steps, cutting floating-point multiplies from mn down to m+k+n. An alternating DiBA-Greedy solver updates the diagonals in closed form and tests one-bit flips on the binaries exactly. When the factors replace original layers and only the diagonals are retuned on task data, the compressed models outperform the dense baselines on two separate benchmarks.
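To make the arithmetic concrete, here is a minimal NumPy sketch of a matrix-vector product under this factorization, with shapes following the abstract (B1 is m x k, B2 is k x n); the function name is ours, not the paper's. Note that NumPy's @ does not exploit the 0/1 structure, so only a specialized kernel would realize the binary mixing steps as pure additions.

```python
import numpy as np

def diba_matvec(d1, B1, d2, B2, d3, x):
    """Compute y = D1 B1 D2 B2 D3 x.

    d1, d2, d3 hold the diagonals of D1 (length m), D2 (length k), and
    D3 (length n); B1 (m x k) and B2 (k x n) are 0/1 matrices. Conceptually
    the only floating-point multiplies are the three scalings, n + k + m
    in total, versus m * n for a dense matrix-vector product."""
    z = d3 * x     # n multiplies: element-wise scaling by D3
    z = B2 @ z     # binary mixing: each output is a sum of selected entries
    z = d2 * z     # k multiplies: scaling by D2
    z = B1 @ z     # binary mixing again
    return d1 * z  # m multiplies: scaling by D1
```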

Core claim

DiBA approximates A in R^{m x n} by D1 B1 D2 B2 D3 where the B matrices are 0/1 and the D matrices are diagonal. The DiBA-Greedy procedure produces factors that deliver consistent SNR gains on 40 real pretrained weight matrices as the storage ratio increases. After layer replacement, the DiBARD variant (binary matrices frozen, diagonals retuned) lifts DistilBERT masked-token accuracy from 0.4447 to 0.5210 and Audio Spectrogram Transformer accuracy on Speech Commands from 0.7684 to 0.9781.

What carries the argument

The DiBA factorization Â = D1 B1 D2 B2 D3, with binary B1 and B2 and diagonal D1, D2, D3, which converts dense multiplication into element-wise scalings plus binary mixing.

If this is right

  • Matrix-vector products require only m + k + n floating-point multiplications instead of mn.
  • Consistent SNR gains appear on 40 weight matrices extracted from public pretrained models as the theoretical storage ratio increases.
  • DiBARD layer replacement improves downstream accuracy without any discrete search over the binary factors during adaptation.
  • The intermediate dimension k directly trades storage cost against approximation quality (see the storage-accounting sketch after this list).
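
The storage side of that trade-off is easy to sketch. The paper's exact storage formula is not restated on this page (minor comment 1 below flags the same gap in the manuscript), so the accounting here is an assumption: float diagonals, one bit per binary entry.

```python
def diba_storage_ratio(m, n, k, bits_per_float=32):
    """Illustrative theoretical storage of DiBA relative to a dense m x n
    float matrix: (m + k + n) floats for the diagonals plus (m*k + k*n)
    bits for the binary matrices. The paper's exact accounting may differ."""
    dense_bits = m * n * bits_per_float
    diba_bits = (m + k + n) * bits_per_float + (m * k + k * n)
    return diba_bits / dense_bits

# Example: a 768 x 768 projection compressed with k = 256 under this accounting.
print(f"{diba_storage_ratio(768, 768, 256):.3f}")  # -> 0.024
```

Under this accounting, k moves the ratio roughly linearly, matching its role as the sole explicit knob in the ledger below.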

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed binary mixing structure opens a route to specialized hardware that replaces most multiplies with cheaper additions.
  • Retuning only the diagonals after freezing the binaries gives a low-cost adaptation path for already-compressed models on new tasks.
  • The same factorization pattern could be tested on other dense operations such as larger convolutions or recurrent weights.

Load-bearing premise

The binary matrices located by DiBA-Greedy still let downstream accuracy recover when only the diagonal factors are later retuned on new task data.

What would settle it

Apply DiBARD replacement to a new pretrained model: if retuned accuracy on the target task falls below the original dense model's accuracy, the load-bearing premise fails.

Figures

Figures reproduced from arXiv: 2605.05994 by Nobutaka Ono.

Figure 1: DiBA provides a structured approximation to a dense matrix using three real diagonal matrices and two binary matrices.
Figure 2: SNR versus realized DiBA storage ratio for the 40 selected matrices in Experiment 1.
Figure 3: Held-out accuracy versus target-component theoretical storage ratio for Experiments 2-1 and 2-2.
Figure 4: Per-layer reconstruction SNR for AST attention projections.
Original abstract

In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates $A\in\mathbb{R}^{m\times n}$ by $\widehat A=D_1B_1D_2B_2D_3$, where $D_1,D_2,D_3$ are diagonal matrices and $B_1,B_2$ are $0/1$ binary matrices. The intermediate dimension $k$ controls the trade-off between theoretical storage and approximation accuracy. For matrix-vector products, DiBA decomposes dense multiplication into three element-wise scaling operations and two binary mixing operations, reducing the floating-point multiplication count from $mn$ to $m+k+n$. For optimization, we introduce DiBA-Greedy, an alternating solver that combines closed-form least-squares updates for the diagonal factors with exact one-bit improvement tests for the binary factors. We also introduce DiBARD (DiBA with Retuning only Diagonal factors), which replaces dense-matrix layers by DiBA factors, freezes the binary matrices, and retunes only the diagonal entries on downstream data. This preserves compact binary mixing without discrete search during adaptation. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. After DiBA replacement in two component-replacement studies, DiBARD improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781 without reoptimizing the binary factors.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DiBA, a factorization A ≈ D1 B1 D2 B2 D3 (D_i diagonal, B_j binary 0/1) for compressing dense weight matrices in neural networks. It introduces the DiBA-Greedy alternating solver (closed-form least-squares for diagonals, exact one-bit tests for binaries) and the DiBARD procedure (freeze binaries after replacement, retune only diagonals on downstream data). Experiments report consistent SNR gains on 40 extracted weight matrices as the storage ratio increases, plus accuracy lifts after DiBARD replacement in DistilBERT (masked-token accuracy 0.4447 → 0.5210) and an Audio Spectrogram Transformer (0.7684 → 0.9781), while reducing matrix-vector multiplications from mn to m+k+n.

Significance. If the learned binary factors demonstrably outperform random binaries of identical dimensions when only diagonals are retuned, DiBA could provide a lightweight compression scheme that preserves inference efficiency and allows simple post-replacement adaptation without discrete optimization. The closed-form diagonal updates and exact binary improvement tests are attractive for reproducibility and speed.

major comments (3)
  1. [Experiments / DiBARD results] (abstract and §4): the accuracy-recovery claim (e.g., DistilBERT 0.4447 → 0.5210) is load-bearing for practical utility, yet no ablation compares DiBA-Greedy binary matrices against random 0/1 matrices of the same k and dimensions when only the three diagonal factors are subsequently retuned on the same downstream data. Without this control, it is impossible to isolate whether the alternating solver contributes beyond the effect of diagonal retuning alone.
  2. [§4.1] (SNR evaluation on 40 matrices): the paper states 'consistent SNR improvements' but provides neither error bars nor multiple random seeds, and does not describe the selection criteria or distribution of the 40 matrices (model, layer type, size). This weakens the generality of the reported trend versus storage ratio.
  3. [§3] (DiBA-Greedy algorithm): the alternating procedure is presented without any analysis of convergence rate, sensitivity to initialization, or comparison of final binary factors to other discrete optimization baselines (e.g., greedy column selection or SDP relaxations), even though these factors are frozen in the downstream DiBARD claim.
minor comments (2)
  1. [Method] Notation: the intermediate dimension k is introduced in the abstract but its precise role in the storage ratio formula is not restated in the method section, making it harder to reproduce the reported compression ratios.
  2. [Abstract] The two concrete accuracy numbers in the abstract are given to four decimal places without indicating whether they are single-run or averaged; adding this detail would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of experimental controls, statistical reporting, and algorithmic analysis that we will address to improve the manuscript. We respond to each major comment below, indicating planned revisions.

Point-by-point responses
  1. Referee: [Experiments / DiBARD results] (abstract and §4): the accuracy-recovery claim (e.g., DistilBERT 0.4447 → 0.5210) is load-bearing for practical utility, yet no ablation compares DiBA-Greedy binary matrices against random 0/1 matrices of the same k and dimensions when only the three diagonal factors are subsequently retuned on the same downstream data. Without this control, it is impossible to isolate whether the alternating solver contributes beyond the effect of diagonal retuning alone.

    Authors: We agree that this control experiment would strengthen the claim by isolating the contribution of the learned binary factors. The DiBA-Greedy solver is intended to produce binary matrices that enable better approximation quality than unstructured choices when paired with diagonal retuning, but the current manuscript does not include the random-binary baseline. We will add this ablation to the revised §4, generating random 0/1 matrices of identical dimensions and k, then retuning only the three diagonal factors on the same downstream data for both the DistilBERT and Audio Spectrogram Transformer tasks, and report the resulting accuracies alongside the DiBA-Greedy results (an illustrative layer-level sketch of this control appears after the responses below). revision: yes

  2. Referee: [§4.1] (SNR evaluation on 40 matrices): the paper states 'consistent SNR improvements' but provides neither error bars nor multiple random seeds, and does not describe the selection criteria or distribution of the 40 matrices (model, layer type, size). This weakens the generality of the reported trend versus storage ratio.

    Authors: We will revise §4.1 to provide a clear description of the 40 matrices, including their source models (DistilBERT, BERT variants, and Audio Spectrogram Transformer), layer types (attention projections, feed-forward layers, embeddings), and size distribution. The SNR values were obtained from single runs of DiBA-Greedy per matrix. While the solver is deterministic once initialized, we acknowledge that different random initializations for the binary factors can yield minor variations. We will add a note on this and, where computationally feasible, report results from a small number of additional initializations on representative matrices to indicate variability; full error bars across all 40 will be included if new runs are performed. revision: partial

  3. Referee: [§3] (DiBA-Greedy algorithm): the alternating procedure is presented without any analysis of convergence rate, sensitivity to initialization, or comparison of final binary factors to other discrete optimization baselines (e.g., greedy column selection or SDP relaxations), even though these factors are frozen in the downstream DiBARD claim.

    Authors: The alternating procedure never increases the approximation error at any step: diagonal updates are globally optimal least-squares solutions, and each binary update performs an exhaustive search over all possible single-bit flips, accepting the change that most improves the objective (or none). Because the set of binary matrices is finite, the procedure converges in a finite number of iterations to a local optimum. We will add a short analysis paragraph in §3 describing these monotonicity and convergence properties, the default random initialization for the binary factors, and observed sensitivity (typically low for the tested k values). For baseline comparisons, we will include a limited empirical comparison on a subset of the 40 matrices against random binary initialization and a simple per-column greedy selection heuristic, showing that the joint alternating optimization yields higher SNR; a full SDP comparison is outside the scope of the current work but is noted as a future direction (a minimal sketch of the monotone alternating updates also follows the responses below). revision: partial
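
On major comment 1, the requested control is easy to picture as a layer that freezes the binary factors and leaves only the diagonals trainable, so that swapping the fitted binaries for random 0/1 matrices of the same shapes isolates the solver's contribution. The PyTorch sketch below is ours, not the paper's implementation: DiBARDLinear is a hypothetical name, and the diagonals are initialized to ones where a real experiment would load the DiBA-Greedy output.

```python
import torch

class DiBARDLinear(torch.nn.Module):
    """Illustrative DiBARD-style layer: binary factors frozen as buffers,
    diagonal factors trainable. Passing random 0/1 matrices of the same
    shapes yields the random-binary control requested by the referee."""

    def __init__(self, B1, B2):
        super().__init__()
        m, k = B1.shape
        _, n = B2.shape
        self.register_buffer("B1", B1.float())  # frozen: no discrete search during adaptation
        self.register_buffer("B2", B2.float())
        self.d1 = torch.nn.Parameter(torch.ones(m))  # retuned on downstream data
        self.d2 = torch.nn.Parameter(torch.ones(k))
        self.d3 = torch.nn.Parameter(torch.ones(n))

    def forward(self, x):
        # x: (..., n) -> (..., m), computing D1 B1 D2 B2 D3 x batch-wise.
        z = x * self.d3
        z = z @ self.B2.T
        z = z * self.d2
        z = z @ self.B1.T
        return z * self.d1
```

Any standard optimizer over the module's parameters then touches only d1, d2, d3, since buffers carry no gradients.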
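The monotonicity argument in response 3 can also be made concrete. The sketch below is simplified, not the authors' implementation: it updates only D1 and B1 while the trailing product M = D2 B2 D3 is held fixed (a full DiBA-Greedy would cycle through all five factors), and the helper names are ours. Each flip test evaluates the exact change in squared error, so the error never increases, and the finite binary set forces termination at a local optimum.

```python
import numpy as np

def update_left_diagonal(A, C):
    """Closed-form least-squares d minimizing ||A - diag(d) C||_F^2, solved
    row by row: d_i = <A_i, C_i> / <C_i, C_i> (0 when row i of C is zero)."""
    num = np.einsum("ij,ij->i", A, C)
    den = np.einsum("ij,ij->i", C, C)
    return np.where(den > 0, num / np.maximum(den, 1e-30), 0.0)

def one_bit_pass(A, d1, B1, M):
    """Exact single-bit improvement tests for B1 in ||A - diag(d1) B1 M||_F^2.
    Flipping B1[i, j] changes row i of the approximation by v = s * d1[i] * M[j]
    (s = +1 turning the bit on, -1 turning it off), so the error change is
    exactly -2 <R_i, v> + <v, v> for residual row R_i; accept only decreases."""
    R = A - d1[:, None] * (B1 @ M)
    improved = False
    for i in range(B1.shape[0]):
        for j in range(B1.shape[1]):
            s = -1.0 if B1[i, j] else 1.0
            v = s * d1[i] * M[j]
            if -2.0 * (R[i] @ v) + v @ v < -1e-12:
                B1[i, j] = 1.0 - B1[i, j]
                R[i] -= v  # keep the residual consistent after the flip
                improved = True
    return improved

# Alternate until no single-bit flip helps: a monotone objective over a finite
# binary set terminates at a local optimum, as argued above.
rng = np.random.default_rng(0)
m, k, n = 32, 16, 48
A = rng.standard_normal((m, n))
B1 = rng.integers(0, 2, size=(m, k)).astype(float)
M = rng.standard_normal((k, n))  # stands in for D2 B2 D3, held fixed here
d1 = update_left_diagonal(A, B1 @ M)
while one_bit_pass(A, d1, B1, M):
    d1 = update_left_diagonal(A, B1 @ M)
```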

Circularity Check

0 steps flagged

No significant circularity in DiBA derivation or claims.

full rationale

The paper defines a novel factorization A ≈ D1 B1 D2 B2 D3 and presents DiBA-Greedy as an alternating procedure with closed-form least-squares for the diagonal factors and exhaustive one-bit search for the binary factors. SNR improvements are reported on the same 40 weight matrices used for fitting, which is standard reporting of approximation error rather than a renamed prediction. Downstream accuracy gains under DiBARD are measured after freezing the binary matrices and retuning only diagonals on separate task data (WikiText, Speech Commands), which is independent of the original weight-fitting objective. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation; the central claims rest on the explicit alternating solver and empirical replacement experiments rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the empirical claim that binary matrices of intermediate width k can capture enough structure in real neural weights for the subsequent diagonal retuning to recover accuracy; k is the sole explicit free parameter controlling the storage-accuracy trade-off.

free parameters (1)
  • intermediate dimension k
    Controls the width of the binary matrices and thus the storage ratio; chosen per layer to meet a target compression budget.
axioms (2)
  • standard math Least-squares solutions for diagonal factors given fixed binary matrices are optimal for the Frobenius-norm objective.
    Invoked in the alternating solver description.
  • domain assumption Single-bit flips can be tested exactly to improve the binary factors, avoiding exhaustive search over the full space of binary matrices.
    Core of the one-bit improvement step in DiBA-Greedy.

pith-pipeline@v0.9.0 · 5627 in / 1634 out tokens · 28784 ms · 2026-05-08T14:14:11.036352+00:00 · methodology

