SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

Baochang Zhang; Boyu Liu; Canyu Chen; Haoyu Huang; Linlin Yang; Xuhui Liu; Yanjing Li; Yuguang Yang; Zhongqian Fu

arXiv: 2605.10989 · v2 · pith:V6FO5VMUnew · submitted 2026-05-09 · cs.LG · cs.AI

Pith reviewed 2026-05-19 17:45 UTC · model grok-4.3

keywords: binary neural networks · surrogate gradient · gradient mismatch · model quantization · deep learning optimization · auxiliary backpropagation

The pith

Binary neural networks train more accurately when a learnable surrogate gradient uses an auxiliary full-precision branch to reduce mismatch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the gradient mismatch that arises when training binary neural networks because the straight-through estimator clips gradients in a fixed way and loses information. SURGE adds a parallel full-precision auxiliary branch inside each binarized layer so that backpropagation can estimate the parts of the gradient that the basic estimator misses. An adaptive scaler then keeps the two paths balanced by norm so training stays stable. If the approach works, binary networks should reach higher accuracy on classification, detection, and language tasks while retaining their memory and speed advantages.
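
To make the mismatch concrete, here is a minimal scalar sketch of the clipped straight-through estimator that the paragraph above refers to. The function names `binarize` and `ste_grad` and the clipping range of 1.0 are illustrative assumptions, not code from the paper.

```python
def binarize(x: float) -> float:
    """Forward pass: the sign function, mapping a real value to -1 or +1."""
    return 1.0 if x >= 0.0 else -1.0


def ste_grad(x: float, upstream: float, clip: float = 1.0) -> float:
    """Backward pass of a clipped straight-through estimator.

    The true derivative of sign is zero almost everywhere, so the estimator
    treats the forward pass as the identity inside [-clip, clip] and
    returns zero outside that range.
    """
    return upstream if abs(x) <= clip else 0.0


inputs = [-2.0, -0.5, 0.3, 1.7]
outputs = [binarize(x) for x in inputs]
grads = [ste_grad(x, upstream=1.0) for x in inputs]
```

The inputs at -2.0 and 1.7 receive zero gradient whatever the upstream signal is. This is the fixed-range information loss that SURGE aims to compensate.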

Core claim

SURGE is a learnable gradient compensation framework grounded in auxiliary backpropagation. Its Dual-Path Gradient Compensator constructs a parallel full-precision branch for every binarized layer and decouples the gradient flow through output decomposition, thereby supplying bias-reduced estimates beyond the first-order straight-through approximation. Its Adaptive Gradient Scaler applies norm-based scaling to balance the branches and maintain stability.

What carries the argument

Dual-Path Gradient Compensator (DPGC), a module that runs a parallel full-precision auxiliary branch alongside each binarized layer and uses output decomposition in backpropagation to estimate additional gradient components.
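
A scalar sketch of how a second, full-precision path could contribute gradient where the clipped estimator is silent. This is an illustration under stated assumptions, not the paper's equations: the linear auxiliary branch, the weight `w_aux`, the mixing factor `eta`, and the rule "forward value from the main branch, backward gradient summed over both paths" are all hypothetical choices made for this example.

```python
def sign(x: float) -> float:
    return 1.0 if x >= 0.0 else -1.0


def dual_path_forward(x: float, w_main: float) -> float:
    """Forward value comes from the binarized main branch only (assumed)."""
    return sign(w_main) * sign(x)


def dual_path_input_grad(x, w_main, w_aux, upstream, eta=0.1, clip=1.0):
    """Gradient with respect to the input x, summed over both paths.

    Main path: clipped STE through sign(x), scaled by the binarized weight.
    Auxiliary path: exact derivative of the assumed linear branch w_aux * x.
    """
    main = upstream * sign(w_main) * (1.0 if abs(x) <= clip else 0.0)
    aux = upstream * w_aux
    return main + eta * aux


# Input inside the clipping range: both paths contribute.
g_inside = dual_path_input_grad(0.5, w_main=0.8, w_aux=0.4, upstream=1.0)
# Input outside the clipping range: only the auxiliary path contributes.
g_outside = dual_path_input_grad(2.0, w_main=0.8, w_aux=0.4, upstream=1.0)
```

In this toy setting the out-of-range input still receives a small non-zero gradient, which matches the qualitative behavior described in the Figure 1 caption. Whether the real DPGC combines the paths this way is not established by the text here.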

If this is right

SURGE records higher accuracy than prior state-of-the-art methods on image classification, object detection, and language understanding benchmarks.
The auxiliary branch reduces the information loss that fixed-range clipping introduces in conventional straight-through estimators.
Norm-based scaling keeps the combined gradient stable across layers and epochs.
Binary networks become practical for a wider set of resource-limited deployment scenarios because accuracy gaps shrink.
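
The norm-based balancing in the third point can be sketched as follows. The rule used here, shrinking the auxiliary gradient so that its L2 norm is at most a fraction `ratio` of the main gradient's norm, is an assumed stand-in; the paper's optimal scale factor is not reproduced in this text. `balance_gradients`, `ratio`, and `eps` are hypothetical names.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def balance_gradients(g_main, g_aux, ratio=0.5, eps=1e-12):
    """Combine two gradient vectors with a norm-based scale on the auxiliary one.

    The scale shrinks the auxiliary gradient whenever its norm exceeds
    ratio * ||g_main||, and leaves it untouched otherwise.
    """
    target = ratio * l2_norm(g_main)
    scale = min(1.0, target / (l2_norm(g_aux) + eps))
    combined = [m + scale * a for m, a in zip(g_main, g_aux)]
    return combined, scale


g_main = [3.0, 4.0]     # norm 5
g_aux = [30.0, 40.0]    # norm 50, ten times larger than the main gradient
combined, scale = balance_gradients(g_main, g_aux)
```

By the triangle inequality the combined norm is at most `(1 + ratio)` times the main norm under this rule, so a badly scaled auxiliary branch cannot dominate the update.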

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same auxiliary-branch idea could be tested on other non-differentiable operations such as learned quantization or spiking neural networks.
If the auxiliary path can be distilled back into the main network after training, inference cost would remain unchanged.
The framework might be combined with existing binary-network search methods to jointly optimize architecture and gradient compensation.

Load-bearing premise

The full-precision auxiliary branch supplies gradient estimates that are meaningfully less biased than the straight-through estimator and does not create new mismatches or training instabilities.

What would settle it

Training the same binary network architecture on CIFAR-10 or ImageNet once with standard STE and once with SURGE; if final accuracy is statistically indistinguishable or lower with SURGE, or if training diverges, the benefit of the auxiliary branch is not realized.
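
One way to operationalize "statistically indistinguishable" is Welch's t statistic over repeated runs with different seeds. The sketch below uses only the standard library. The accuracy lists are made-up placeholders that exercise the function; they are not results from the paper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder top-1 accuracies (percent) from five hypothetical seeds each.
ste_runs = [60.0, 61.0, 59.0, 60.5, 59.5]
surge_runs = [62.0, 63.0, 61.0, 62.5, 61.5]
t_stat, dof = welch_t(surge_runs, ste_runs)
```

The statistic is then compared against the critical value of the t distribution at `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and also returns a p-value.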

Figures

Figures reproduced from arXiv: 2605.10989 by Baochang Zhang, Boyu Liu, Canyu Chen, Haoyu Huang, Linlin Yang, Xuhui Liu, Yanjing Li, Yuguang Yang, Zhongqian Fu.

**Figure 1.** (a-b) Activation gradient patterns without/with SURGE (left/right); (c) Gradient distribution comparison; (d) Cumulative probability of gradients. STE provides a first-order approximation for the sign function’s gradient and clips out-of-range activation gradients, while SURGE compensates them with a Dual-Path Gradient Compensator (a-b). SURGE also right-shifts gradient distributions of activations (c-d), …

**Figure 2.** Overall architecture of SURGE. (a) Integration into common backbones (left: convolution block; right: transformer block). (b) Component details. DPGC constructs a parallel full-precision parameterized branch (auxiliary branch, shown with red arrows for forward pass and blue arrows for backpropagation) for each binarized layer (main branch, represented by black arrows in forward pass and green arrows for ba…

**Figure 3.** Ablation study on parameter scaling strategies. (a) is fixed scaling with constant factors across training iterations. (b) is adaptive scaling via parameter η that dynamically adjusts the compensation strength (Eq. 7). … driven design (Theorem 5.3) successfully balances gradient compensation and training stability. Ablation on Gradient Compensation Scope of DPGC. We ablate the gradient compensation scope on …

Original abstract

The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from the gradient mismatch problem and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE outperforms state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SURGE adds a dual-path full-precision compensator and norm-based scaler to surrogate gradients for BNNs, with reported gains across tasks, but the claimed bias reduction may not be cleanly isolated from extra compute.

The core idea is a Dual-Path Gradient Compensator that runs a parallel full-precision branch alongside each binarized layer and decomposes the output during backprop to estimate terms beyond the usual first-order STE approximation, plus an Adaptive Gradient Scaler that uses norm-based balancing with an optimal scale factor. The paper tests this on image classification, object detection, and language tasks and states it outperforms prior methods.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SURGE, a learnable surrogate gradient adaptation framework for Binary Neural Networks. It introduces the Dual-Path Gradient Compensator (DPGC), which adds a parallel full-precision auxiliary branch per binarized layer and uses output decomposition in backpropagation to estimate gradient components beyond the first-order Straight-Through Estimator (STE) approximation, thereby reducing bias. An Adaptive Gradient Scaler (AGS) based on an optimal scale factor is added to balance inter-branch gradient norms for training stability. Experiments across image classification, object detection, and language understanding tasks are reported to show that SURGE outperforms prior state-of-the-art BNN methods.

Significance. If the DPGC mechanism can be shown to deliver independent higher-order gradient information without new mismatches or instabilities, the approach would offer a principled, learnable alternative to hand-crafted STE variants. Demonstrating consistent gains on classification, detection, and language tasks would strengthen the case for broader adoption in efficient network training.

major comments (1)

[Abstract / DPGC description] The load-bearing claim is that DPGC's full-precision auxiliary branch, combined with output decomposition, yields bias-reduced estimates beyond STE without introducing mismatches (see skeptic note on shared parameters/activations). The abstract states that the auxiliary branch 'estimates components beyond STE's first-order approximation' and that gradients are 'decoupled via output decomposition,' but provides no explicit equations or pseudocode showing the decomposition (e.g., whether the auxiliary path uses independent weights or re-uses binarized activations). If the paths share parameters or activations, the claimed independence does not hold and observed gains could stem from extra compute or AGS scaling alone. This directly affects attribution of the reported superiority on all three task families.

minor comments (2)

[Abstract] The abstract mentions 'theoretical grounding' but does not preview the key derivation or assumptions; adding a one-sentence outline would improve readability.
[AGS description] Notation for the 'optimal scale factor' in AGS is introduced without an equation reference; a short definition or pointer to the relevant equation would clarify the norm-based scaling.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The major comment raises a valid point about the clarity of the DPGC mechanism in the abstract. We address this directly below and will revise the manuscript accordingly to strengthen the presentation of our contributions.

Referee: [Abstract / DPGC description] The load-bearing claim is that DPGC's full-precision auxiliary branch, combined with output decomposition, yields bias-reduced estimates beyond STE without introducing mismatches (see skeptic note on shared parameters/activations). The abstract states that the auxiliary branch 'estimates components beyond STE's first-order approximation' and that gradients are 'decoupled via output decomposition,' but provides no explicit equations or pseudocode showing the decomposition (e.g., whether the auxiliary path uses independent weights or re-uses binarized activations). If the paths share parameters or activations, the claimed independence does not hold and observed gains could stem from extra compute or AGS scaling alone. This directly affects attribution of the reported superiority on all three task families.

Authors: We agree that the abstract is concise and does not include the supporting equations or pseudocode, which can leave the independence of the paths ambiguous. In the full manuscript (Section 3.2 and Equations 3-5), the auxiliary branch is implemented with completely independent full-precision weights and computes its own activations; it does not reuse binarized weights or activations from the primary path. Output decomposition is performed by subtracting the binarized forward output from the full-precision auxiliary output before backpropagation, allowing the auxiliary path to supply higher-order gradient components that the STE approximation omits. This structure is designed to avoid introducing new mismatches. We acknowledge that explicit pseudocode would make the mechanism clearer and will add it to the revised manuscript (likely as a new Algorithm box in Section 3). We will also insert a short clarifying sentence in the abstract referencing the independent parameters and decomposition. These changes should allow readers to attribute performance gains more confidently to the bias reduction rather than extra compute or AGS alone.

Revision: yes
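
The independence described in this response can be illustrated on a toy layer. In the sketch below the main branch uses binarized latent weights `w`, the auxiliary branch uses separate full-precision weights `v`, and the residual is the auxiliary output minus the binarized output, following the wording of the response. All names, the dot-product layer, and the finite-difference check are assumptions for illustration; the manuscript's equations are not available here.

```python
def sign(x: float) -> float:
    return 1.0 if x >= 0.0 else -1.0


def main_output(w, x):
    """Binarized branch: dot product of sign(w) and sign(x)."""
    return sum(sign(wi) * sign(xi) for wi, xi in zip(w, x))


def aux_output(v, x):
    """Auxiliary branch: full-precision dot product with its own weights v."""
    return sum(vi * xi for vi, xi in zip(v, x))


def residual(w, v, x):
    """Output decomposition term: auxiliary output minus binarized output."""
    return aux_output(v, x) - main_output(w, x)


def numeric_grad_v(w, v, x, h=1e-6):
    """Finite-difference gradient of the residual with respect to v."""
    base = residual(w, v, x)
    grads = []
    for i in range(len(v)):
        bumped = v[:i] + [v[i] + h] + v[i + 1:]
        grads.append((residual(w, bumped, x) - base) / h)
    return grads


x = [0.5, -1.5, 2.0]
w = [0.3, -0.7, 0.1]
v = [0.2, 0.1, -0.4]

r = residual(w, v, x)
grad_v = numeric_grad_v(w, v, x)
# Flip every main-branch weight; the gradient for v should not change.
grad_v_flipped = numeric_grad_v([-wi for wi in w], v, x)
```

With fully separate parameters, the gradient reaching `v` depends only on the input `x`, so changing the main-branch weights leaves it untouched. If the two paths shared weights or activations, this check would fail, which is the scenario the referee raises.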

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent mechanisms

The paper proposes SURGE as a learnable gradient compensation framework using DPGC (parallel full-precision auxiliary branch with output decomposition) and AGS (norm-based adaptive scaling). These are presented as new components addressing STE mismatch, without any quoted equations that define a prediction in terms of its own fitted inputs or reduce the central result to a self-citation chain. No self-definitional steps, fitted inputs renamed as predictions, or ansatz smuggling via prior self-work are exhibited in the provided text. The claims rest on the design of auxiliary paths and scaling rather than re-expressing existing quantities. This is the expected self-contained case for a methods paper introducing architectural additions.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The framework rests on the design of an auxiliary full-precision branch and a norm-based scaling rule whose optimality is asserted without external benchmarks in the provided text.

free parameters (1)

optimal scale factor
Used in AGS to dynamically balance inter-branch gradient contributions via norm-based scaling.

invented entities (2)

Dual-Path Gradient Compensator (DPGC) · no independent evidence
purpose: Constructs a parallel full-precision auxiliary branch for each binarized layer to enable bias-reduced gradient estimation
New component introduced to decouple gradient flow via output decomposition

Adaptive Gradient Scaler (AGS) · no independent evidence
purpose: Dynamically balances inter-branch gradient contributions
New scaling mechanism based on an optimal scale factor
