pith · machine review for the scientific record

arxiv: 1308.3432 · v1 · submitted 2013-08-15 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 05:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords stochastic neurons · gradient estimation · conditional computation · backpropagation · sparse gating · REINFORCE · straight-through estimator

The pith

Binary stochastic neurons can be trained by decomposing them into a stochastic part and a smooth differentiable approximation of their expected effect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to estimate gradients with respect to the inputs of stochastic or non-smooth neurons so that backpropagation can be used in deep networks. It compares four families of solutions, including a minimum-variance unbiased estimator and a new decomposition that splits each binary stochastic neuron into a hard stochastic decision plus a smooth part that matches the expected output to first order. The approach is explored in the setting of conditional computation, where sparse stochastic units act as gaters that can turn off large blocks of downstream computation in many different combinations. A reader would care because successful gradient estimation here would allow networks to learn when to skip most of their own work, cutting the cost of very large models.
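
To make the first family concrete, the following is a minimal NumPy sketch of a REINFORCE-style (score-function) gradient estimator for a single Bernoulli unit. The quadratic loss, learning rate, and running baseline are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative downstream loss: we want the hard gate h to match a target of 1.
def loss(h):
    return (h - 1.0) ** 2

a = -1.0          # pre-sigmoid input of the stochastic neuron (assumed scalar)
lr = 0.1          # illustrative learning rate
baseline = 0.0    # running baseline for variance reduction (REINFORCE-style)

for step in range(2000):
    p = sigmoid(a)
    h = float(rng.random() < p)          # hard stochastic binary decision
    L = loss(h)
    # Score-function (REINFORCE) identity for a Bernoulli unit:
    #   d E[L] / d a  =  E[ (h - p) * L ]
    # A baseline that does not depend on h keeps the estimate unbiased
    # while reducing its variance.
    grad_a = (h - p) * (L - baseline)
    a -= lr * grad_a
    baseline = 0.9 * baseline + 0.1 * L  # running-average baseline (heuristic)

print(f"learned firing probability: {sigmoid(a):.3f}")  # should approach 1
```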

Core claim

The paper shows that the operation of a binary stochastic neuron can be decomposed into a stochastic binary part and a smooth differentiable part whose output approximates the expected effect of the pure stochastic neuron to first order. This decomposition supplies a usable gradient signal for the stochastic unit. The same idea is applied to sparse gaters in a small-scale conditional-computation model so that the network can learn, via backpropagation, which large chunks of computation to disable on any given input.

What carries the argument

Decomposition of a binary stochastic neuron into a stochastic binary decision and a smooth differentiable part that matches its expected effect to first order.
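
A minimal sketch of how the decomposition yields a usable gradient, under assumed toy values for the gated block output, target, and learning rate: the forward pass uses the hard stochastic decision (an actual 0 or 1, which is what lets downstream computation be skipped), while the backward pass routes the gradient through the smooth part sigmoid(a), which matches the gate's expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy gated computation: y = h * u, where h is a hard {0,1} stochastic gate
# and u stands in for the output of an (expensive) block; in a real
# conditional-computation system the block would only be evaluated when needed.
u = 2.0           # assumed block output (held fixed for clarity)
target = 2.0      # illustrative target for the gated output
a = -2.0          # gate pre-activation; sigmoid(a) is the firing probability
lr = 0.1

for step in range(3000):
    p = sigmoid(a)                     # smooth part: matches E[h] exactly
    h = float(rng.random() < p)        # stochastic binary part (actual 0 or 1)
    y = h * u                          # forward pass uses the hard decision
    dL_dy = 2.0 * (y - target)         # d/dy of (y - target)^2
    dL_dh = dL_dy * u
    # Decomposition step: backpropagate through the smooth part sigmoid(a)
    # instead of the non-differentiable sample h.
    grad_a = dL_dh * p * (1.0 - p)     # sigmoid'(a) = p * (1 - p)
    a -= lr * grad_a

print(f"gate open probability after training: {sigmoid(a):.3f}")  # near 1
```

The heuristic straight-through estimator described in the abstract would instead copy dL_dh directly as the gradient with respect to a, dropping the sigmoid'(a) factor; the decomposition view is what gives that shortcut its first-order justification.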

If this is right

  • Sparse stochastic gating units become trainable by backpropagation and can turn off large chunks of network computation.
  • Conditional computation becomes feasible in deep networks, opening a route to large reductions in average computational cost.
  • The decomposition supplies a theoretical justification for the heuristic straight-through estimator.
  • The estimator can be used together with variance-reduction techniques such as REINFORCE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same first-order decomposition idea could be tested on other non-differentiable operations such as discrete sampling or hard thresholds.
  • Hardware implementations might literally skip the gated-off computations once the network has learned which gates to close.
  • Higher-order corrections to the approximation might be needed when the stochastic units are used in deeper or more sensitive parts of the network.

Load-bearing premise

The first-order approximation of the expected effect of the stochastic binary neuron stays accurate enough during training to give useful gradients, especially when the units act as sparse gaters that control large downstream computations.
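
A short expansion makes the premise concrete. The scalar notation below (a single gate h with firing probability p = σ(a) and downstream loss L) is illustrative rather than taken from the paper:

```latex
% Bernoulli gate h \in \{0,1\} with firing probability p = \sigma(a);
% expand the downstream loss L around the smooth surrogate p:
\begin{aligned}
\mathbb{E}[L(h)]
  &= \mathbb{E}\!\left[L(p) + L'(p)\,(h - p) + \tfrac{1}{2}\,L''(p)\,(h - p)^2 + \cdots\right] \\
  &= L(p) + \tfrac{1}{2}\,L''(p)\,p\,(1 - p) + \cdots
\end{aligned}
```

since E[h − p] = 0 and Var(h) = p(1 − p). Substituting the smooth part for the hard sample therefore matches the expected loss to first order, and the leading neglected term is the downstream curvature times the gate variance, the quantity the simulated rebuttal below proposes to bound.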

What would settle it

A network trained with the new estimator on a conditional-computation task fails to learn effective sparse gating patterns or shows no advantage over networks trained with other estimators.

read the original abstract

Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochatic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochatic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can be potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the problem of estimating gradients through stochastic binary neurons and hard non-linearities. It compares four families of solutions: the minimum-variance unbiased REINFORCE estimator, a novel decomposition of the binary stochastic neuron into a stochastic binary component plus a smooth differentiable component whose output approximates the expected effect of the hard neuron to first order, additive or multiplicative noise injection into an otherwise differentiable graph, and the straight-through estimator that copies the downstream gradient directly. These estimators are motivated and tested in a small-scale conditional-computation setting in which sparse stochastic units act as distributed gaters that can combinatorially disable large blocks of downstream computation.

Significance. If the first-order decomposition remains sufficiently accurate under the sparsity levels required for conditional computation, the work would provide a practical route to training networks that exploit hard stochastic gating for computational efficiency. The derivations of each estimator from standard stochastic-gradient identities are clear and self-contained; the explicit framing of the conditional-computation use case is also a strength. However, the absence of error bounds on the approximation, variance analysis, or large-scale validation limits the strength of the central claim that these estimators enable reliable credit assignment when gates control substantial downstream computation.

major comments (2)
  1. [§3] §3 (Decomposition approach): The claim that the smooth component approximates the expected effect of the pure stochastic binary neuron to first order is presented without an accompanying error bound or analysis of the neglected higher-order terms. When the gating probability p is driven toward zero (the regime emphasized for sparsity), the curvature of the downstream loss with respect to the gate can make the first-order truncation inaccurate; no regime-specific analysis or numerical check of this truncation error is supplied.
  2. [Experimental section] Experimental evaluation: The reported experiments are confined to small-scale synthetic tasks. No quantitative assessment of estimator variance, bias under sparsity, or wall-clock savings on a conditional-computation benchmark large enough for the gaters to control non-trivial blocks of computation is provided, leaving the practical utility of the estimators for the stated goal unverified.
minor comments (2)
  1. [Abstract] Abstract: repeated typo 'stochatic' for 'stochastic'; the sentence 'The resulting sparsity can be potentially be exploited' contains a duplicated 'be'.
  2. [§3] Notation: the distinction between the stochastic binary output and the smooth differentiable surrogate is introduced without an explicit equation label, making later references to 'the approximation' harder to trace.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and indicate the changes planned for the revised version.

read point-by-point responses
  1. Referee: [§3] §3 (Decomposition approach): The claim that the smooth component approximates the expected effect of the pure stochastic binary neuron to first order is presented without an accompanying error bound or analysis of the neglected higher-order terms. When the gating probability p is driven toward zero (the regime emphasized for sparsity), the curvature of the downstream loss with respect to the gate can make the first-order truncation inaccurate; no regime-specific analysis or numerical check of this truncation error is supplied.

    Authors: We agree that a formal error bound or explicit analysis of higher-order terms is absent from the original derivation. The decomposition is constructed so that the smooth component exactly matches the first-order term in the expansion of the expected loss; the neglected terms involve the second derivative of the downstream loss multiplied by the variance of the stochastic gate. In the revised manuscript we will add a short subsection deriving the leading second-order error term and stating the regime (small p and moderate curvature) in which it remains negligible. We will also include a numerical check of the truncation error on the synthetic tasks already used in the paper; a minimal sketch of such a check appears after these responses. revision: yes

  2. Referee: [Experimental section] Experimental evaluation: The reported experiments are confined to small-scale synthetic tasks. No quantitative assessment of estimator variance, bias under sparsity, or wall-clock savings on a conditional-computation benchmark large enough for the gaters to control non-trivial blocks of computation is provided, leaving the practical utility of the estimators for the stated goal unverified.

    Authors: The experiments were intentionally limited to small-scale synthetic tasks to permit controlled measurement of estimator behavior under varying sparsity. We acknowledge that this leaves open questions about variance, bias, and wall-clock impact at larger scales. In revision we will expand the experimental section with additional plots quantifying variance and bias of each estimator as functions of gate sparsity on the existing tasks. A full large-scale conditional-computation benchmark with measurable wall-clock savings lies beyond the computational resources available for this study; we will therefore add a discussion paragraph outlining how the estimators could be scaled and what savings would be expected once sparsity is exploited in production implementations. revision: partial
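
A minimal sketch of the kind of numerical check promised in response 1, under an assumed quadratic downstream loss (chosen so the second-order term accounts for the entire gap between the exact expected loss and the smooth surrogate):

```python
import numpy as np

# Numerical check of the first-order truncation error for a single Bernoulli
# gate: compare the exact expected loss E[L(h)] with the smooth surrogate L(p).
# The quadratic downstream loss used here is an illustrative assumption.
def L(h):
    return (2.0 * h - 1.5) ** 2   # smooth downstream loss of the gate value

for p in (0.01, 0.1, 0.5, 0.9):
    exact = p * L(1.0) + (1.0 - p) * L(0.0)    # exact expectation over h ~ Bern(p)
    surrogate = L(p)                            # loss evaluated at the smooth part
    # Leading error term predicted by the expansion: 0.5 * L''(p) * p * (1 - p);
    # for this quadratic loss, L''(p) = 8 everywhere.
    predicted_gap = 0.5 * 8.0 * p * (1.0 - p)
    print(f"p={p:4.2f}  exact={exact:6.3f}  surrogate={surrogate:6.3f}  "
          f"gap={exact - surrogate:6.3f}  predicted={predicted_gap:6.3f}")
```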

Circularity Check

0 steps flagged

Derivations rely on standard REINFORCE and stochastic-gradient identities; no load-bearing step reduces to a self-defined input or self-citation chain.

full rationale

The paper derives four families of gradient estimators for stochastic neurons. The minimum-variance unbiased estimator is identified as a special case of the existing REINFORCE algorithm. The newly introduced decomposition into a stochastic binary component plus a first-order smooth approximation is presented as an explicit construction whose derivative is obtained by direct differentiation of the expectation; it does not presuppose the target result or fit any internal parameter to the downstream loss. The straight-through and noise-injection estimators are likewise obtained from standard identities without circular redefinition. No uniqueness theorem, ansatz, or empirical pattern is imported via self-citation in a load-bearing way. The central claims therefore remain independent of the paper's own fitted quantities or prior self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the standard assumptions of stochastic gradient descent and the existence of a differentiable surrogate whose first-order behavior matches the expectation of the hard stochastic neuron. No new free parameters or invented entities are introduced beyond the usual temperature or noise scale that can be set by hand.

axioms (2)
  • domain assumption The gradient of the loss with respect to the stochastic output can be estimated by one of the four families without introducing bias that prevents convergence.
    Invoked when claiming the estimators are useful for training.
  • domain assumption Sparsity from stochastic gates can be exploited to reduce computation without destroying representational power.
    Central motivation for the conditional-computation setting.

pith-pipeline@v0.9.0 · 5595 in / 1373 out tokens · 28549 ms · 2026-05-11T05:26:37.319350+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  2. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 conditional novelty 8.0

    INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

  3. Training Non-Differentiable Networks via Optimal Transport

    cs.LG 2026-05 unverdicted novelty 8.0

    PolyStep optimizes non-differentiable networks via forward-only polytope evaluations and optimal-transport barycentric updates, reaching 93.4% accuracy on hard-LIF spiking networks while outperforming gradient-free baselines.

  4. Zero-Shot Quantization via Weight-Space Arithmetic

    cs.CV 2026-04 unverdicted novelty 8.0

    A quantization vector derived from a donor model via weight-space arithmetic can be added to a receiver model to improve post-PTQ Top-1 accuracy by up to 60 points in 3-bit settings without receiver-side QAT or data.

  5. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    cs.LG 2017-01 accept novelty 8.0

    A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

  6. Categorical Reparameterization with Gumbel-Softmax

    stat.ML 2016-11 unverdicted novelty 8.0

    Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.

  7. Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

    cs.LG 2026-05 unverdicted novelty 7.0

    HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.

  8. All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.

  9. Quantum Parity Representations: Learnable Basis Discovery, Encoders, and Shadow Deployment

    quant-ph 2026-05 unverdicted novelty 7.0

    Hybrid quantum training discovers parity bases that improve accuracy 24-42% on binary tasks and recover performance on text benchmarks, with all inference remaining classical.

  10. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  11. DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces the first heterogeneous multi-source mmWave point cloud HAR dataset and DAP-Net architecture with Doppler reparameterization and text alignment for cross-source robustness.

  12. Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

    cs.LG 2026-05 unverdicted novelty 7.0

    A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...

  13. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  14. Approximation-Free Differentiable Oblique Decision Trees

    cs.LG 2026-05 unverdicted novelty 7.0

    DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.

  15. PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN

    cs.RO 2026-05 unverdicted novelty 7.0

    PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rate...

  16. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  17. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  18. Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients

    cs.LG 2026-05 unverdicted novelty 7.0

    NM-PPG optimizes non-myopic acquisition policies for costly features by enabling pathwise gradients via continuous relaxation and straight-through rollouts in POMDPs, outperforming SOTA baselines.

  19. Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    MP-IB uses an 8x information asymmetry via FP16 trait heads and INT4 state heads to disentangle speaker identity from agitation in voice biomarkers, outperforming larger models on edge devices with low latency and sup...

  20. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...

  21. GETA-3DGS: Automatic Joint Structured Pruning and Quantization for 3D Gaussian Splatting

    cs.LG 2026-05 unverdicted novelty 7.0

    GETA-3DGS is the first automatic joint structured pruning and quantization framework for 3D Gaussian Splatting, achieving roughly 5x storage reduction on standard datasets without per-scene thresholds.

  22. Model Compression with Exact Budget Constraints via Riemannian Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.

  23. GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility

    cs.LG 2026-04 unverdicted novelty 7.0

    GradMAP enables fast offline training of fully decentralized neural policies for grid-edge flexibility by embedding a differentiable three-phase AC power-flow model and applying proximal surrogates in action space.

  24. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  25. Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

    cs.IR 2026-04 unverdicted novelty 7.0

    AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.

  26. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  27. Relaxation-Informed Training of Neural Network Surrogate Models

    math.OC 2026-04 conditional novelty 7.0

    Regularizers that penalize big-M constants, unstable neurons, and per-sample LP relaxation gaps during neural network training reduce MILP solve times by up to four orders of magnitude while preserving surrogate accuracy.

  28. AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

    cs.AI 2026-04 unverdicted novelty 7.0

    AAC is an admissible-by-architecture differentiable compressor for ALT landmarks that achieves near-optimal coverage on road networks with zero admissibility violations and faster median queries than FPS-ALT.

  29. MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MetaCloak-JPEG uses a DiffJPEG layer with straight-through estimator inside a JPEG-aware EOT and curriculum meta-learning loop to produce l-inf bounded perturbations that retain 91.3% effectiveness after real JPEG com...

  30. Ultra-low-light computer vision using trained photon correlations

    cs.CV 2026-04 unverdicted novelty 7.0

    Trained correlated-photon illumination paired with a Transformer backend improves object classification accuracy by up to 15 percentage points in photon-starved noisy imaging.

  31. HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    HiEdit uses hierarchical RL to dynamically pick knowledge-relevant layers for editing LLMs, improving performance over baselines while perturbing only half the layers per edit.

  32. Training single-electron and single-photon stochastic physical neural networks

    quant-ph 2026-04 unverdicted novelty 7.0

    Single-electron and single-photon stochastic physical neural networks achieve over 97% MNIST test accuracy when trained with empirical outputs in the backward pass using few trials per layer.

  33. On the Decompositionality of Neural Networks

    cs.LO 2026-04 unverdicted novelty 7.0

    Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.

  34. DesigNet: Learning to Draw Vector Graphics as Designers Do

    cs.CV 2026-04 unverdicted novelty 7.0

    DesigNet generates editable SVG outlines using a Transformer-VAE with differentiable modules that enforce C0/G1/C1 continuity and horizontal/vertical alignment.

  35. Minimal Information Control Invariance via Vector Quantization

    eess.SY 2026-04 unverdicted novelty 7.0

    A vector-quantized autoencoder learns minimal control codebooks for forward invariance in sampled-data control, achieving 157x reduction over grid baselines on a 12D quadrotor model.

  36. Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

    q-bio.QM 2026-04 unverdicted novelty 7.0

    CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...

  37. EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

    cs.CL 2026-03 unverdicted novelty 7.0

    EMA traces achieve near-supervised performance on structure-dependent tasks like grammatical role assignment but produce high language modeling perplexity because they apply lossy, data-independent compression that ca...

  38. DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

    cs.CV 2026-03 unverdicted novelty 7.0

    DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional Ima...

  39. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  40. High Fidelity Neural Audio Compression

    eess.AS 2022-10 accept novelty 7.0

    EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...

  41. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  42. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  43. BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

    cs.AI 2026-05 conditional novelty 6.0

    BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.

  44. ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

    cs.CV 2026-05 unverdicted novelty 6.0

    ArcVQ-VAE constrains VQ-VAE codebook vectors inside a time-dependent ball and adds angular margin loss to increase separability and codebook utilization.

  45. The Geno-Synthetic Algorithm: Type-Factored Coevolutionary Optimization for Heterogeneous Genotypes and Assembled Phenotypes

    cs.NE 2026-05 unverdicted novelty 6.0

    GSA introduces a type-factored coevolutionary framework that evolves heterogeneous gene families separately with native operators and assembles them into phenotypes, enabling optimization over complex-valued and embed...

  46. EMO: Frustratingly Easy Progressive Training of Extendable MoE

    cs.LG 2026-05 unverdicted novelty 6.0

    EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.

  47. U-HNO: A U-shaped Hybrid Neural Operator with Sparse-Point Adaptive Routing for Non-stationary PDE Dynamics

    cs.LG 2026-05 unverdicted novelty 6.0

    U-HNO uses adaptive per-point routing in a U-shaped hybrid architecture to achieve state-of-the-art accuracy on PDE benchmarks with sharp localized features.

  48. COSMIC: Concurrent Optimization of Structure, Material, and Integrated Control for robotic systems

    cs.RO 2026-05 unverdicted novelty 6.0

    COSMIC co-optimizes robot structure, materials, and control simultaneously via differentiable simulation and constrained gradients, yielding locomotion strategies that outperform sequential baselines.

  49. Conditional Memory Enhanced Item Representation for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

  50. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  51. Channel Geometry Preserving Generative Models for CSI Feedback in MU-MIMO

    eess.SP 2026-05 unverdicted novelty 6.0

    Flow-matching generative CSI decoders outperform MMSE baselines in MU-MIMO downlink sum-rate by preserving posterior channel geometry needed for user orthogonality.

  52. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.

  53. Toward Better Geometric Representations for Molecule Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% sta...

  54. Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

    cs.CV 2026-05 unverdicted novelty 6.0

    wAR-Tok adds a Wasserstein-gradient-flow prior-matching term to tokenizer training so that discrete tokens become easier for autoregressive priors to model, cutting AR loss and raising generation FID on CIFAR-10 and I...

  55. DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.

  56. Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.

  57. Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

    cs.CV 2026-05 unverdicted novelty 6.0

    Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital a...

  58. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  59. On Model-Based Clustering With Entropic Optimal Transport

    stat.ME 2026-05 unverdicted novelty 6.0

    Entropic optimal transport yields a clustering loss with the same global optimum as log-likelihood but a better-behaved optimization surface, outperforming standard EM in experiments.

  60. Model Merging: Foundations and Algorithms

    cs.LG 2026-05 unverdicted novelty 6.0

    New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 86 Pith papers

  1. [1]

    Bengio, Y. (2013). Deep learning of representations: Looking forward. Technical Report arXiv:1305.0445, Universite de Montreal

  2. [2]

    Bengio, Y., Courville, A., and Vincent, P. (2013). Unsupervised feature learning and deep learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI)

  3. [3]

    Dayan, P. (1990). Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Connectionist Summer School, San Mateo, CA

  4. [4]

    El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In NIPS 8. MIT Press

  5. [5]

    Fiete, I. R. and Seung, H. S. (2006). Gradient learning in spiking neural networks by dynamic perturbations of conductances. Physical Review Letters, 97(4)

  6. [6]

    Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS

  7. [7]

    Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In ICML'2013

  8. [8]

    Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures

  9. [9]

    Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science

  10. [10]

    Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580

  11. [11]

    Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep convolutional neural networks. In NIPS'2012

  12. [12]

    Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012)

  13. [13]

    Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML'10

  14. [14]

    Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536

  15. [15]

    Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. In International Journal of Approximate Reasoning

  16. [16]

    Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341

  17. [17]

    Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008

  18. [18]

    Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI'2001, pages 538–545

  19. [19]

    Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256