pith. machine review for the scientific record.

arxiv: 2605.05994 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural network compression · matrix factorization · binary matrices · diagonal scaling · weight approximation · model acceleration · DiBA

The pith

DiBA approximates neural network weight matrices as a product of three diagonal and two binary matrices, slashing multiplications and recovering accuracy through diagonal retuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiBA as a compact factorization for dense weight matrices that appear in linear layers, 1x1 convolutions, attention, and embeddings. It writes A as D1 B1 D2 B2 D3 so that matrix-vector multiplication collapses to three element-wise scalings and two binary mixing steps, cutting floating-point multiplies from mn down to m+k+n. An alternating DiBA-Greedy solver updates the diagonals in closed form and tests one-bit flips on the binaries exactly. When the factors replace original layers and only the diagonals are retuned on task data, the compressed models outperform the dense baselines on two separate benchmarks.
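To make the arithmetic concrete, here is a minimal NumPy sketch of a matrix-vector product under this factorization, with shapes following the abstract (B1 is m x k, B2 is k x n); the function name is ours, not the paper's. Note that NumPy's @ does not exploit the 0/1 structure, so only a specialized kernel would realize the binary mixing steps as pure additions.

```python
import numpy as np

def diba_matvec(d1, B1, d2, B2, d3, x):
    """Compute y = D1 B1 D2 B2 D3 x.

    d1, d2, d3 hold the diagonals of D1 (length m), D2 (length k), and
    D3 (length n); B1 (m x k) and B2 (k x n) are 0/1 matrices. Conceptually
    the only floating-point multiplies are the three scalings, n + k + m
    in total, versus m * n for a dense matrix-vector product."""
    z = d3 * x     # n multiplies: element-wise scaling by D3
    z = B2 @ z     # binary mixing: each output is a sum of selected entries
    z = d2 * z     # k multiplies: scaling by D2
    z = B1 @ z     # binary mixing again
    return d1 * z  # m multiplies: scaling by D1
```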

Core claim

DiBA approximates A in R^{m x n} by D1 B1 D2 B2 D3 where the B matrices are 0/1 and the D matrices are diagonal. The DiBA-Greedy procedure produces factors that deliver consistent SNR gains on 40 real pretrained weight matrices as the storage ratio increases. After layer replacement, the DiBARD variant (binary matrices frozen, diagonals retuned) lifts DistilBERT masked-token accuracy from 0.4447 to 0.5210 and Audio Spectrogram Transformer accuracy on Speech Commands from 0.7684 to 0.9781.

What carries the argument

The DiBA factorization Â = D1 B1 D2 B2 D3, with binary B1 and B2 and diagonal D1, D2, D3, which converts dense multiplication into element-wise scalings plus binary mixing.

If this is right

  • Matrix-vector products require only m + k + n floating-point multiplications instead of mn.
  • Consistent SNR gains appear on 40 weight matrices extracted from public pretrained models as the theoretical storage ratio increases.
  • DiBARD layer replacement improves downstream accuracy without any discrete search over the binary factors during adaptation.
  • The intermediate dimension k directly trades storage cost against approximation quality (see the storage-accounting sketch after this list).
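
The storage side of that trade-off is easy to sketch. The paper's exact storage formula is not restated on this page (minor comment 1 below flags the same gap in the manuscript), so the accounting here is an assumption: float diagonals, one bit per binary entry.

```python
def diba_storage_ratio(m, n, k, bits_per_float=32):
    """Illustrative theoretical storage of DiBA relative to a dense m x n
    float matrix: (m + k + n) floats for the diagonals plus (m*k + k*n)
    bits for the binary matrices. The paper's exact accounting may differ."""
    dense_bits = m * n * bits_per_float
    diba_bits = (m + k + n) * bits_per_float + (m * k + k * n)
    return diba_bits / dense_bits

# Example: a 768 x 768 projection compressed with k = 256 under this accounting.
print(f"{diba_storage_ratio(768, 768, 256):.3f}")  # -> 0.024
```

Under this accounting, k moves the ratio roughly linearly, matching its role as the sole explicit knob in the ledger below.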

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed binary mixing structure opens a route to specialized hardware that replaces most multiplies with cheaper additions.
  • Retuning only the diagonals after freezing the binaries gives a low-cost adaptation path for already-compressed models on new tasks.
  • The same factorization pattern could be tested on other dense operations such as larger convolutions or recurrent weights.

Load-bearing premise

The binary matrices located by DiBA-Greedy still let downstream accuracy recover when only the diagonal factors are later retuned on new task data.

What would settle it

Apply DiBARD replacement to a new pretrained model: if retuned accuracy on the target task falls below the original dense model's accuracy, the load-bearing premise fails.

Figures

Figures reproduced from arXiv: 2605.05994 by Nobutaka Ono.

Figure 1: DiBA provides a structured approximation to a dense matrix using three real diagonal matrices and two binary matrices.
Figure 2: SNR versus realized DiBA storage ratio for the 40 selected matrices in Experiment 1.
Figure 3: Held-out accuracy versus target-component theoretical storage ratio for Experiments 2-1 and 2-2.
Figure 4: Per-layer reconstruction SNR for AST attention projections.
Original abstract

In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates $A\in\mathbb{R}^{m\times n}$ by $\widehat A=D_1B_1D_2B_2D_3$, where $D_1,D_2,D_3$ are diagonal matrices and $B_1,B_2$ are $0/1$ binary matrices. The intermediate dimension $k$ controls the trade-off between theoretical storage and approximation accuracy. For matrix-vector products, DiBA decomposes dense multiplication into three element-wise scaling operations and two binary mixing operations, reducing the floating-point multiplication count from $mn$ to $m+k+n$. For optimization, we introduce DiBA-Greedy, an alternating solver that combines closed-form least-squares updates for the diagonal factors with exact one-bit improvement tests for the binary factors. We also introduce DiBARD (DiBA with Retuning only Diagonal factors), which replaces dense-matrix layers by DiBA factors, freezes the binary matrices, and retunes only the diagonal entries on downstream data. This preserves compact binary mixing without discrete search during adaptation. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. After DiBA replacement in two component-replacement studies, DiBARD improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781 without reoptimizing the binary factors.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DiBA, a factorization A ≈ D1 B1 D2 B2 D3 (D_i diagonal, B_j binary 0/1) for compressing dense weight matrices in neural networks. It introduces the DiBA-Greedy alternating solver (closed-form least-squares for diagonals, exact one-bit tests for binaries) and the DiBARD procedure (freeze binaries after replacement, retune only diagonals on downstream data). Experiments report consistent SNR gains on 40 extracted weight matrices as the storage ratio increases, plus accuracy lifts after DiBARD replacement in DistilBERT (masked-token accuracy 0.4447 → 0.5210) and an Audio Spectrogram Transformer (0.7684 → 0.9781), while reducing matrix-vector multiplications from mn to m+k+n.

Significance. If the learned binary factors demonstrably outperform random binaries of identical dimensions when only diagonals are retuned, DiBA could provide a lightweight compression scheme that preserves inference efficiency and allows simple post-replacement adaptation without discrete optimization. The closed-form diagonal updates and exact binary improvement tests are attractive for reproducibility and speed.

major comments (3)
  1. [Experiments / DiBARD results] (abstract and §4): the accuracy-recovery claim (e.g., DistilBERT 0.4447 → 0.5210) is load-bearing for practical utility, yet no ablation compares DiBA-Greedy binary matrices against random 0/1 matrices of the same k and dimensions when only the three diagonal factors are subsequently retuned on the same downstream data. Without this control, it is impossible to isolate whether the alternating solver contributes beyond the effect of diagonal retuning alone.
  2. [§4.1] (SNR evaluation on 40 matrices): the paper states 'consistent SNR improvements' but provides neither error bars nor multiple random seeds, and does not describe the selection criteria or distribution of the 40 matrices (model, layer type, size). This weakens the generality of the reported trend versus storage ratio.
  3. [§3] (DiBA-Greedy algorithm): the alternating procedure is presented without any analysis of convergence rate, sensitivity to initialization, or comparison of final binary factors to other discrete optimization baselines (e.g., greedy column selection or SDP relaxations), even though these factors are frozen in the downstream DiBARD claim.
minor comments (2)
  1. [Method] Notation: the intermediate dimension k is introduced in the abstract but its precise role in the storage ratio formula is not restated in the method section, making it harder to reproduce the reported compression ratios.
  2. [Abstract] The two concrete accuracy numbers in the abstract are given to four decimal places without indicating whether they are single-run or averaged; adding this detail would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of experimental controls, statistical reporting, and algorithmic analysis that we will address to improve the manuscript. We respond to each major comment below, indicating planned revisions.

Point-by-point responses
  1. Referee: [Experiments / DiBARD results] (abstract and §4): the accuracy-recovery claim (e.g., DistilBERT 0.4447 → 0.5210) is load-bearing for practical utility, yet no ablation compares DiBA-Greedy binary matrices against random 0/1 matrices of the same k and dimensions when only the three diagonal factors are subsequently retuned on the same downstream data. Without this control, it is impossible to isolate whether the alternating solver contributes beyond the effect of diagonal retuning alone.

    Authors: We agree that this control experiment would strengthen the claim by isolating the contribution of the learned binary factors. The DiBA-Greedy solver is intended to produce binary matrices that enable better approximation quality than unstructured choices when paired with diagonal retuning, but the current manuscript does not include the random-binary baseline. We will add this ablation to the revised §4, generating random 0/1 matrices of identical dimensions and k, then retuning only the three diagonal factors on the same downstream data for both the DistilBERT and Audio Spectrogram Transformer tasks, and report the resulting accuracies alongside the DiBA-Greedy results (an illustrative layer-level sketch of this control appears after the responses below). revision: yes

  2. Referee: [§4.1] (SNR evaluation on 40 matrices): the paper states 'consistent SNR improvements' but provides neither error bars nor multiple random seeds, and does not describe the selection criteria or distribution of the 40 matrices (model, layer type, size). This weakens the generality of the reported trend versus storage ratio.

    Authors: We will revise §4.1 to provide a clear description of the 40 matrices, including their source models (DistilBERT, BERT variants, and Audio Spectrogram Transformer), layer types (attention projections, feed-forward layers, embeddings), and size distribution. The SNR values were obtained from single runs of DiBA-Greedy per matrix. While the solver is deterministic once initialized, we acknowledge that different random initializations for the binary factors can yield minor variations. We will add a note on this and, where computationally feasible, report results from a small number of additional initializations on representative matrices to indicate variability; full error bars across all 40 will be included if new runs are performed. revision: partial

  3. Referee: [§3] (DiBA-Greedy algorithm): the alternating procedure is presented without any analysis of convergence rate, sensitivity to initialization, or comparison of final binary factors to other discrete optimization baselines (e.g., greedy column selection or SDP relaxations), even though these factors are frozen in the downstream DiBARD claim.

    Authors: The alternating procedure never increases the approximation error at any step: diagonal updates are globally optimal least-squares solutions, and each binary update performs an exhaustive search over all possible single-bit flips, accepting the change that most improves the objective (or none). Because the set of binary matrices is finite, the procedure converges in a finite number of iterations to a local optimum. We will add a short analysis paragraph in §3 describing these monotonicity and convergence properties, the default random initialization for the binary factors, and observed sensitivity (typically low for the tested k values). For baseline comparisons, we will include a limited empirical comparison on a subset of the 40 matrices against random binary initialization and a simple per-column greedy selection heuristic, showing that the joint alternating optimization yields higher SNR; a full SDP comparison is outside the scope of the current work but is noted as a future direction (a minimal sketch of the monotone alternating updates also follows the responses below). revision: partial
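
On major comment 1, the requested control is easy to picture as a layer that freezes the binary factors and leaves only the diagonals trainable, so that swapping the fitted binaries for random 0/1 matrices of the same shapes isolates the solver's contribution. The PyTorch sketch below is ours, not the paper's implementation: DiBARDLinear is a hypothetical name, and the diagonals are initialized to ones where a real experiment would load the DiBA-Greedy output.

```python
import torch

class DiBARDLinear(torch.nn.Module):
    """Illustrative DiBARD-style layer: binary factors frozen as buffers,
    diagonal factors trainable. Passing random 0/1 matrices of the same
    shapes yields the random-binary control requested by the referee."""

    def __init__(self, B1, B2):
        super().__init__()
        m, k = B1.shape
        _, n = B2.shape
        self.register_buffer("B1", B1.float())  # frozen: no discrete search during adaptation
        self.register_buffer("B2", B2.float())
        self.d1 = torch.nn.Parameter(torch.ones(m))  # retuned on downstream data
        self.d2 = torch.nn.Parameter(torch.ones(k))
        self.d3 = torch.nn.Parameter(torch.ones(n))

    def forward(self, x):
        # x: (..., n) -> (..., m), computing D1 B1 D2 B2 D3 x batch-wise.
        z = x * self.d3
        z = z @ self.B2.T
        z = z * self.d2
        z = z @ self.B1.T
        return z * self.d1
```

Any standard optimizer over the module's parameters then touches only d1, d2, d3, since buffers carry no gradients.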
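The monotonicity argument in response 3 can also be made concrete. The sketch below is simplified, not the authors' implementation: it updates only D1 and B1 while the trailing product M = D2 B2 D3 is held fixed (a full DiBA-Greedy would cycle through all five factors), and the helper names are ours. Each flip test evaluates the exact change in squared error, so the error never increases, and the finite binary set forces termination at a local optimum.

```python
import numpy as np

def update_left_diagonal(A, C):
    """Closed-form least-squares d minimizing ||A - diag(d) C||_F^2, solved
    row by row: d_i = <A_i, C_i> / <C_i, C_i> (0 when row i of C is zero)."""
    num = np.einsum("ij,ij->i", A, C)
    den = np.einsum("ij,ij->i", C, C)
    return np.where(den > 0, num / np.maximum(den, 1e-30), 0.0)

def one_bit_pass(A, d1, B1, M):
    """Exact single-bit improvement tests for B1 in ||A - diag(d1) B1 M||_F^2.
    Flipping B1[i, j] changes row i of the approximation by v = s * d1[i] * M[j]
    (s = +1 turning the bit on, -1 turning it off), so the error change is
    exactly -2 <R_i, v> + <v, v> for residual row R_i; accept only decreases."""
    R = A - d1[:, None] * (B1 @ M)
    improved = False
    for i in range(B1.shape[0]):
        for j in range(B1.shape[1]):
            s = -1.0 if B1[i, j] else 1.0
            v = s * d1[i] * M[j]
            if -2.0 * (R[i] @ v) + v @ v < -1e-12:
                B1[i, j] = 1.0 - B1[i, j]
                R[i] -= v  # keep the residual consistent after the flip
                improved = True
    return improved

# Alternate until no single-bit flip helps: a monotone objective over a finite
# binary set terminates at a local optimum, as argued above.
rng = np.random.default_rng(0)
m, k, n = 32, 16, 48
A = rng.standard_normal((m, n))
B1 = rng.integers(0, 2, size=(m, k)).astype(float)
M = rng.standard_normal((k, n))  # stands in for D2 B2 D3, held fixed here
d1 = update_left_diagonal(A, B1 @ M)
while one_bit_pass(A, d1, B1, M):
    d1 = update_left_diagonal(A, B1 @ M)
```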

Circularity Check

0 steps flagged

No significant circularity in DiBA derivation or claims.

full rationale

The paper defines a novel factorization A ≈ D1 B1 D2 B2 D3 and presents DiBA-Greedy as an alternating procedure with closed-form least-squares for the diagonal factors and exhaustive one-bit search for the binary factors. SNR improvements are reported on the same 40 weight matrices used for fitting, which is standard reporting of approximation error rather than a renamed prediction. Downstream accuracy gains under DiBARD are measured after freezing the binary matrices and retuning only diagonals on separate task data (WikiText, Speech Commands), which is independent of the original weight-fitting objective. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation; the central claims rest on the explicit alternating solver and empirical replacement experiments rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the empirical claim that binary matrices of intermediate width k can capture enough structure in real neural weights for the subsequent diagonal retuning to recover accuracy; k is the sole explicit free parameter controlling the storage-accuracy trade-off.

free parameters (1)
  • intermediate dimension k
    Controls the width of the binary matrices and thus the storage ratio; chosen per layer to meet a target compression budget.
axioms (2)
  • standard math Least-squares solutions for diagonal factors given fixed binary matrices are optimal for the Frobenius-norm objective.
    Invoked in the alternating solver description.
  • domain assumption Single-bit flips can be tested exactly to improve the binary factors, avoiding exhaustive search over the full space of binary matrices.
    Core of the one-bit improvement step in DiBA-Greedy.

pith-pipeline@v0.9.0 · 5627 in / 1634 out tokens · 28784 ms · 2026-05-08T14:14:11.036352+00:00 · methodology

