pith · machine review for the scientific record

arxiv: 1308.3432 · v1 · submitted 2013-08-15 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 05:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords stochastic neurons · gradient estimation · conditional computation · backpropagation · sparse gating · REINFORCE · straight-through estimator

The pith

Binary stochastic neurons can be trained by decomposing them into a stochastic part and a smooth differentiable approximation of their expected effect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to estimate gradients with respect to the inputs of stochastic or non-smooth neurons so that backpropagation can be used in deep networks. It compares four families of solutions, including a minimum-variance unbiased estimator and a new decomposition that splits each binary stochastic neuron into a hard stochastic decision plus a smooth part that matches the expected output to first order. The approach is explored in the setting of conditional computation, where sparse stochastic units act as gaters that can turn off large blocks of downstream computation in many different combinations. A reader would care because successful gradient estimation here would allow networks to learn when to skip most of their own work, cutting the cost of very large models.
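
To make the first family concrete, the following is a minimal NumPy sketch of a REINFORCE-style (score-function) gradient estimator for a single Bernoulli unit. The quadratic loss, learning rate, and running baseline are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative downstream loss: we want the hard gate h to match a target of 1.
def loss(h):
    return (h - 1.0) ** 2

a = -1.0          # pre-sigmoid input of the stochastic neuron (assumed scalar)
lr = 0.1          # illustrative learning rate
baseline = 0.0    # running baseline for variance reduction (REINFORCE-style)

for step in range(2000):
    p = sigmoid(a)
    h = float(rng.random() < p)          # hard stochastic binary decision
    L = loss(h)
    # Score-function (REINFORCE) identity for a Bernoulli unit:
    #   d E[L] / d a  =  E[ (h - p) * L ]
    # A baseline that does not depend on h keeps the estimate unbiased
    # while reducing its variance.
    grad_a = (h - p) * (L - baseline)
    a -= lr * grad_a
    baseline = 0.9 * baseline + 0.1 * L  # running-average baseline (heuristic)

print(f"learned firing probability: {sigmoid(a):.3f}")  # should approach 1
```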

Core claim

The paper shows that the operation of a binary stochastic neuron can be decomposed into a stochastic binary part and a smooth differentiable part whose output approximates the expected effect of the pure stochastic neuron to first order. This decomposition supplies a usable gradient signal for the stochastic unit. The same idea is applied to sparse gaters in a small-scale conditional-computation model so that the network can learn, via backpropagation, which large chunks of computation to disable on any given input.

What carries the argument

Decomposition of a binary stochastic neuron into a stochastic binary decision and a smooth differentiable part that matches its expected effect to first order.
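
A minimal sketch of how the decomposition yields a usable gradient, under assumed toy values for the gated block output, target, and learning rate: the forward pass uses the hard stochastic decision (an actual 0 or 1, which is what lets downstream computation be skipped), while the backward pass routes the gradient through the smooth part sigmoid(a), which matches the gate's expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy gated computation: y = h * u, where h is a hard {0,1} stochastic gate
# and u stands in for the output of an (expensive) block; in a real
# conditional-computation system the block would only be evaluated when needed.
u = 2.0           # assumed block output (held fixed for clarity)
target = 2.0      # illustrative target for the gated output
a = -2.0          # gate pre-activation; sigmoid(a) is the firing probability
lr = 0.1

for step in range(3000):
    p = sigmoid(a)                     # smooth part: matches E[h] exactly
    h = float(rng.random() < p)        # stochastic binary part (actual 0 or 1)
    y = h * u                          # forward pass uses the hard decision
    dL_dy = 2.0 * (y - target)         # d/dy of (y - target)^2
    dL_dh = dL_dy * u
    # Decomposition step: backpropagate through the smooth part sigmoid(a)
    # instead of the non-differentiable sample h.
    grad_a = dL_dh * p * (1.0 - p)     # sigmoid'(a) = p * (1 - p)
    a -= lr * grad_a

print(f"gate open probability after training: {sigmoid(a):.3f}")  # near 1
```

The heuristic straight-through estimator described in the abstract would instead copy dL_dh directly as the gradient with respect to a, dropping the sigmoid'(a) factor; the decomposition view is what gives that shortcut its first-order justification.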

If this is right

  • Sparse stochastic gating units become trainable by backpropagation and can turn off large chunks of network computation.
  • Conditional computation becomes feasible in deep networks, opening a route to large reductions in average computational cost.
  • The decomposition supplies a theoretical justification for the heuristic straight-through estimator.
  • The estimator can be used together with variance-reduction techniques such as REINFORCE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same first-order decomposition idea could be tested on other non-differentiable operations such as discrete sampling or hard thresholds.
  • Hardware implementations might literally skip the gated-off computations once the network has learned which gates to close.
  • Higher-order corrections to the approximation might be needed when the stochastic units are used in deeper or more sensitive parts of the network.

Load-bearing premise

The first-order approximation of the expected effect of the stochastic binary neuron stays accurate enough during training to give useful gradients, especially when the units act as sparse gaters that control large downstream computations.
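
A short expansion makes the premise concrete. The scalar notation below (a single gate h with firing probability p = σ(a) and downstream loss L) is illustrative rather than taken from the paper:

```latex
% Bernoulli gate h \in \{0,1\} with firing probability p = \sigma(a);
% expand the downstream loss L around the smooth surrogate p:
\begin{aligned}
\mathbb{E}[L(h)]
  &= \mathbb{E}\!\left[L(p) + L'(p)\,(h - p) + \tfrac{1}{2}\,L''(p)\,(h - p)^2 + \cdots\right] \\
  &= L(p) + \tfrac{1}{2}\,L''(p)\,p\,(1 - p) + \cdots
\end{aligned}
```

since E[h − p] = 0 and Var(h) = p(1 − p). Substituting the smooth part for the hard sample therefore matches the expected loss to first order, and the leading neglected term is the downstream curvature times the gate variance, the quantity the simulated rebuttal below proposes to bound.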

What would settle it

A network trained with the new estimator on a conditional-computation task fails to learn effective sparse gating patterns or shows no advantage over networks trained with other estimators.

read the original abstract

Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochatic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochatic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can be potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the problem of estimating gradients through stochastic binary neurons and hard non-linearities. It compares four families of solutions: the minimum-variance unbiased REINFORCE estimator, a novel decomposition of the binary stochastic neuron into a stochastic binary component plus a smooth differentiable component whose output approximates the expected effect of the hard neuron to first order, additive or multiplicative noise injection into an otherwise differentiable graph, and the straight-through estimator that copies the downstream gradient directly. These estimators are motivated and tested in a small-scale conditional-computation setting in which sparse stochastic units act as distributed gaters that can combinatorially disable large blocks of downstream computation.

Significance. If the first-order decomposition remains sufficiently accurate under the sparsity levels required for conditional computation, the work would provide a practical route to training networks that exploit hard stochastic gating for computational efficiency. The derivations of each estimator from standard stochastic-gradient identities are clear and self-contained; the explicit framing of the conditional-computation use case is also a strength. However, the absence of error bounds on the approximation, variance analysis, or large-scale validation limits the strength of the central claim that these estimators enable reliable credit assignment when gates control substantial downstream computation.

major comments (2)
  1. [§3] §3 (Decomposition approach): The claim that the smooth component approximates the expected effect of the pure stochastic binary neuron to first order is presented without an accompanying error bound or analysis of the neglected higher-order terms. When the gating probability p is driven toward zero (the regime emphasized for sparsity), the curvature of the downstream loss with respect to the gate can make the first-order truncation inaccurate; no regime-specific analysis or numerical check of this truncation error is supplied.
  2. [Experimental section] Experimental evaluation: The reported experiments are confined to small-scale synthetic tasks. No quantitative assessment of estimator variance, bias under sparsity, or wall-clock savings on a conditional-computation benchmark large enough for the gaters to control non-trivial blocks of computation is provided, leaving the practical utility of the estimators for the stated goal unverified.
minor comments (2)
  1. [Abstract] Abstract: repeated typo 'stochatic' for 'stochastic'; the sentence 'The resulting sparsity can be potentially be exploited' contains a duplicated 'be'.
  2. [§3] Notation: the distinction between the stochastic binary output and the smooth differentiable surrogate is introduced without an explicit equation label, making later references to 'the approximation' harder to trace.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and indicate the changes planned for the revised version.

read point-by-point responses
  1. Referee: [§3] §3 (Decomposition approach): The claim that the smooth component approximates the expected effect of the pure stochastic binary neuron to first order is presented without an accompanying error bound or analysis of the neglected higher-order terms. When the gating probability p is driven toward zero (the regime emphasized for sparsity), the curvature of the downstream loss with respect to the gate can make the first-order truncation inaccurate; no regime-specific analysis or numerical check of this truncation error is supplied.

    Authors: We agree that a formal error bound or explicit analysis of higher-order terms is absent from the original derivation. The decomposition is constructed so that the smooth component exactly matches the first-order term in the expansion of the expected loss; the neglected terms involve the second derivative of the downstream loss multiplied by the variance of the stochastic gate. In the revised manuscript we will add a short subsection deriving the leading second-order error term and stating the regime (small p and moderate curvature) in which it remains negligible. We will also include a numerical check of the truncation error on the synthetic tasks already used in the paper; a minimal sketch of such a check appears after these responses. revision: yes

  2. Referee: [Experimental section] Experimental evaluation: The reported experiments are confined to small-scale synthetic tasks. No quantitative assessment of estimator variance, bias under sparsity, or wall-clock savings on a conditional-computation benchmark large enough for the gaters to control non-trivial blocks of computation is provided, leaving the practical utility of the estimators for the stated goal unverified.

    Authors: The experiments were intentionally limited to small-scale synthetic tasks to permit controlled measurement of estimator behavior under varying sparsity. We acknowledge that this leaves open questions about variance, bias, and wall-clock impact at larger scales. In revision we will expand the experimental section with additional plots quantifying variance and bias of each estimator as functions of gate sparsity on the existing tasks. A full large-scale conditional-computation benchmark with measurable wall-clock savings lies beyond the computational resources available for this study; we will therefore add a discussion paragraph outlining how the estimators could be scaled and what savings would be expected once sparsity is exploited in production implementations. revision: partial
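
A minimal sketch of the kind of numerical check promised in response 1, under an assumed quadratic downstream loss (chosen so the second-order term accounts for the entire gap between the exact expected loss and the smooth surrogate):

```python
import numpy as np

# Numerical check of the first-order truncation error for a single Bernoulli
# gate: compare the exact expected loss E[L(h)] with the smooth surrogate L(p).
# The quadratic downstream loss used here is an illustrative assumption.
def L(h):
    return (2.0 * h - 1.5) ** 2   # smooth downstream loss of the gate value

for p in (0.01, 0.1, 0.5, 0.9):
    exact = p * L(1.0) + (1.0 - p) * L(0.0)    # exact expectation over h ~ Bern(p)
    surrogate = L(p)                            # loss evaluated at the smooth part
    # Leading error term predicted by the expansion: 0.5 * L''(p) * p * (1 - p);
    # for this quadratic loss, L''(p) = 8 everywhere.
    predicted_gap = 0.5 * 8.0 * p * (1.0 - p)
    print(f"p={p:4.2f}  exact={exact:6.3f}  surrogate={surrogate:6.3f}  "
          f"gap={exact - surrogate:6.3f}  predicted={predicted_gap:6.3f}")
```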

Circularity Check

0 steps flagged

Derivations rely on standard REINFORCE and stochastic-gradient identities; no load-bearing step reduces to a self-defined input or self-citation chain.

full rationale

The paper derives four families of gradient estimators for stochastic neurons. The minimum-variance unbiased estimator is identified as a special case of the existing REINFORCE algorithm. The newly introduced decomposition into a stochastic binary component plus a first-order smooth approximation is presented as an explicit construction whose derivative is obtained by direct differentiation of the expectation; it does not presuppose the target result or fit any internal parameter to the downstream loss. The straight-through and noise-injection estimators are likewise obtained from standard identities without circular redefinition. No uniqueness theorem, ansatz, or empirical pattern is imported via self-citation in a load-bearing way. The central claims therefore remain independent of the paper's own fitted quantities or prior self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the standard assumptions of stochastic gradient descent and the existence of a differentiable surrogate whose first-order behavior matches the expectation of the hard stochastic neuron. No new free parameters or invented entities are introduced beyond the usual temperature or noise scale that can be set by hand.

axioms (2)
  • domain assumption The gradient of the loss with respect to the stochastic output can be estimated by one of the four families without introducing bias that prevents convergence.
    Invoked when claiming the estimators are useful for training.
  • domain assumption Sparsity from stochastic gates can be exploited to reduce computation without destroying representational power.
    Central motivation for the conditional-computation setting.

pith-pipeline@v0.9.0 · 5595 in / 1373 out tokens · 28549 ms · 2026-05-11T05:26:37.319350+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  2. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 conditional novelty 8.0

    INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

  3. Training Non-Differentiable Networks via Optimal Transport

    cs.LG 2026-05 unverdicted novelty 8.0

    PolyStep optimizes non-differentiable networks via forward-only polytope evaluations and optimal-transport barycentric updates, reaching 93.4% accuracy on hard-LIF spiking networks while outperforming gradient-free baselines.

  4. Zero-Shot Quantization via Weight-Space Arithmetic

    cs.CV 2026-04 unverdicted novelty 8.0

    A quantization vector derived from a donor model via weight-space arithmetic can be added to a receiver model to improve post-PTQ Top-1 accuracy by up to 60 points in 3-bit settings without receiver-side QAT or data.

  5. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    cs.LG 2017-01 accept novelty 8.0

    A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

  6. Categorical Reparameterization with Gumbel-Softmax

    stat.ML 2016-11 unverdicted novelty 8.0

    Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.

  7. Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

    cs.LG 2026-05 unverdicted novelty 7.0

    HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.

  8. All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.

  9. Quantum Parity Representations: Learnable Basis Discovery, Encoders, and Shadow Deployment

    quant-ph 2026-05 unverdicted novelty 7.0

    Hybrid quantum training discovers parity bases that improve accuracy 24-42% on binary tasks and recover performance on text benchmarks, with all inference remaining classical.

  10. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  11. DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces the first heterogeneous multi-source mmWave point cloud HAR dataset and DAP-Net architecture with Doppler reparameterization and text alignment for cross-source robustness.

  12. Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

    cs.LG 2026-05 unverdicted novelty 7.0

    A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...

  13. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  14. Approximation-Free Differentiable Oblique Decision Trees

    cs.LG 2026-05 unverdicted novelty 7.0

    DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.

  15. PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN

    cs.RO 2026-05 unverdicted novelty 7.0

    PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rate...

  16. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  17. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  18. Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients

    cs.LG 2026-05 unverdicted novelty 7.0

    NM-PPG optimizes non-myopic acquisition policies for costly features by enabling pathwise gradients via continuous relaxation and straight-through rollouts in POMDPs, outperforming SOTA baselines.

  19. Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    MP-IB uses an 8x information asymmetry via FP16 trait heads and INT4 state heads to disentangle speaker identity from agitation in voice biomarkers, outperforming larger models on edge devices with low latency and sup...

  20. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...

  21. GETA-3DGS: Automatic Joint Structured Pruning and Quantization for 3D Gaussian Splatting

    cs.LG 2026-05 unverdicted novelty 7.0

    GETA-3DGS is the first automatic joint structured pruning and quantization framework for 3D Gaussian Splatting, achieving roughly 5x storage reduction on standard datasets without per-scene thresholds.

  22. Model Compression with Exact Budget Constraints via Riemannian Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.

  23. GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility

    cs.LG 2026-04 unverdicted novelty 7.0

    GradMAP enables fast offline training of fully decentralized neural policies for grid-edge flexibility by embedding a differentiable three-phase AC power-flow model and applying proximal surrogates in action space.

  24. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  25. Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

    cs.IR 2026-04 unverdicted novelty 7.0

    AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.

  26. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  27. Relaxation-Informed Training of Neural Network Surrogate Models

    math.OC 2026-04 conditional novelty 7.0

    Regularizers that penalize big-M constants, unstable neurons, and per-sample LP relaxation gaps during neural network training reduce MILP solve times by up to four orders of magnitude while preserving surrogate accuracy.

  28. AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

    cs.AI 2026-04 unverdicted novelty 7.0

    AAC is an admissible-by-architecture differentiable compressor for ALT landmarks that achieves near-optimal coverage on road networks with zero admissibility violations and faster median queries than FPS-ALT.

  29. MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MetaCloak-JPEG uses a DiffJPEG layer with straight-through estimator inside a JPEG-aware EOT and curriculum meta-learning loop to produce l-inf bounded perturbations that retain 91.3% effectiveness after real JPEG com...

  30. Ultra-low-light computer vision using trained photon correlations

    cs.CV 2026-04 unverdicted novelty 7.0

    Trained correlated-photon illumination paired with a Transformer backend improves object classification accuracy by up to 15 percentage points in photon-starved noisy imaging.

  31. HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    HiEdit uses hierarchical RL to dynamically pick knowledge-relevant layers for editing LLMs, improving performance over baselines while perturbing only half the layers per edit.

  32. Training single-electron and single-photon stochastic physical neural networks

    quant-ph 2026-04 unverdicted novelty 7.0

    Single-electron and single-photon stochastic physical neural networks achieve over 97% MNIST test accuracy when trained with empirical outputs in the backward pass using few trials per layer.

  33. On the Decompositionality of Neural Networks

    cs.LO 2026-04 unverdicted novelty 7.0

    Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.

  34. DesigNet: Learning to Draw Vector Graphics as Designers Do

    cs.CV 2026-04 unverdicted novelty 7.0

    DesigNet generates editable SVG outlines using a Transformer-VAE with differentiable modules that enforce C0/G1/C1 continuity and horizontal/vertical alignment.

  35. Minimal Information Control Invariance via Vector Quantization

    eess.SY 2026-04 unverdicted novelty 7.0

    A vector-quantized autoencoder learns minimal control codebooks for forward invariance in sampled-data control, achieving 157x reduction over grid baselines on a 12D quadrotor model.

  36. Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

    q-bio.QM 2026-04 unverdicted novelty 7.0

    CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...

  37. EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

    cs.CL 2026-03 unverdicted novelty 7.0

    EMA traces achieve near-supervised performance on structure-dependent tasks like grammatical role assignment but produce high language modeling perplexity because they apply lossy, data-independent compression that ca...

  38. DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

    cs.CV 2026-03 unverdicted novelty 7.0

    DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional Ima...

  39. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  40. High Fidelity Neural Audio Compression

    eess.AS 2022-10 accept novelty 7.0

    EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...

  41. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  42. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  43. BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

    cs.AI 2026-05 conditional novelty 6.0

    BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.

  44. ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

    cs.CV 2026-05 unverdicted novelty 6.0

    ArcVQ-VAE constrains VQ-VAE codebook vectors inside a time-dependent ball and adds angular margin loss to increase separability and codebook utilization.

  45. The Geno-Synthetic Algorithm: Type-Factored Coevolutionary Optimization for Heterogeneous Genotypes and Assembled Phenotypes

    cs.NE 2026-05 unverdicted novelty 6.0

    GSA introduces a type-factored coevolutionary framework that evolves heterogeneous gene families separately with native operators and assembles them into phenotypes, enabling optimization over complex-valued and embed...

  46. EMO: Frustratingly Easy Progressive Training of Extendable MoE

    cs.LG 2026-05 unverdicted novelty 6.0

    EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.

  47. U-HNO: A U-shaped Hybrid Neural Operator with Sparse-Point Adaptive Routing for Non-stationary PDE Dynamics

    cs.LG 2026-05 unverdicted novelty 6.0

    U-HNO uses adaptive per-point routing in a U-shaped hybrid architecture to achieve state-of-the-art accuracy on PDE benchmarks with sharp localized features.

  48. COSMIC: Concurrent Optimization of Structure, Material, and Integrated Control for robotic systems

    cs.RO 2026-05 unverdicted novelty 6.0

    COSMIC co-optimizes robot structure, materials, and control simultaneously via differentiable simulation and constrained gradients, yielding locomotion strategies that outperform sequential baselines.

  49. Conditional Memory Enhanced Item Representation for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

  50. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  51. Channel Geometry Preserving Generative Models for CSI Feedback in MU-MIMO

    eess.SP 2026-05 unverdicted novelty 6.0

    Flow-matching generative CSI decoders outperform MMSE baselines in MU-MIMO downlink sum-rate by preserving posterior channel geometry needed for user orthogonality.

  52. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.

  53. Toward Better Geometric Representations for Molecule Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% sta...

  54. Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

    cs.CV 2026-05 unverdicted novelty 6.0

    wAR-Tok adds a Wasserstein-gradient-flow prior-matching term to tokenizer training so that discrete tokens become easier for autoregressive priors to model, cutting AR loss and raising generation FID on CIFAR-10 and I...

  55. DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.

  56. Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.

  57. Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

    cs.CV 2026-05 unverdicted novelty 6.0

    Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital a...

  58. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  59. On Model-Based Clustering With Entropic Optimal Transport

    stat.ME 2026-05 unverdicted novelty 6.0

    Entropic optimal transport yields a clustering loss with the same global optimum as log-likelihood but a better-behaved optimization surface, outperforming standard EM in experiments.

  60. Model Merging: Foundations and Algorithms

    cs.LG 2026-05 unverdicted novelty 6.0

    New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 86 Pith papers

  1. [1]

    Bengio, Y. (2013). Deep learning of representations: Looking forward. Technical Report arXiv:1305.0445, Universite de Montreal

  2. [2]

    Bengio, Y., Courville, A., and Vincent, P. (2013). Unsupervised feature learning and deep learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI)

  3. [3]

    Dayan, P. (1990). Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Connectionist Summer School, San Mateo, CA

  4. [4]

    El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In NIPS 8. MIT Press

  5. [5]

    Fiete, I. R. and Seung, H. S. (2006). Gradient learning in spiking neural networks by dynamic perturbations of conductances. Physical Review Letters, 97(4)

  6. [6]

    Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS

  7. [7]

    Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In ICML'2013

  8. [8]

    Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures

  9. [9]

    Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science

  10. [10]

    Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580

  11. [11]

    Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep convolutional neural networks. In NIPS'2012

  12. [12]

    Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012)

  13. [13]

    Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML'10

  14. [14]

    Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536

  15. [15]

    Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. In International Journal of Approximate Reasoning

  16. [16]

    Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341

  17. [17]

    Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008

  18. [18]

    Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI'2001, pages 538–545

  19. [19]

    Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256