Deep Residual Learning for Image Recognition

Jian Sun; Kaiming He; Shaoqing Ren; Xiangyu Zhang

arxiv: 1512.03385 · v1 · submitted 2015-12-10 · 💻 cs.CV

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords: residual learning, deep neural networks, image recognition, ImageNet, shortcut connections, object detection, neural network optimization

The pith

Residual networks reformulate layers to learn differences from their inputs via identity shortcuts, making much deeper networks trainable and more accurate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that very deep neural networks become trainable when stacked layers are recast as learning a residual function rather than a full unreferenced mapping from input to output. Identity shortcut connections let the input bypass those layers and be added directly to the residual output, which the paper reports makes optimization easier. Experiments demonstrate that this change lets networks scale to 152 layers on ImageNet while improving accuracy over shallower models, and the same deeper representations boost performance on detection tasks. The framework won multiple 2015 competition tracks by delivering lower error rates with manageable complexity. A sympathetic reader sees this as evidence that depth itself can be leveraged once the optimization barrier is lowered.

Core claim

We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set.

What carries the argument

A residual learning framework: each block of stacked layers learns a residual function F(x) = H(x) - x, so the desired mapping H(x) is realized as F(x) + x through an identity shortcut.
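The F(x) + x construction can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: fully connected layers stand in for convolutions, biases and batch normalization are omitted, and the helper names (relu, linear, residual_block, zeros) are hypothetical.

```python
from typing import List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def linear(weights: List[Vector], v: Vector) -> Vector:
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: List[Vector], w2: List[Vector]) -> Vector:
    """Compute relu(F(x) + x), where F(x) = W2 relu(W1 x).

    Assumption: dense layers replace the paper's convolutions, and
    biases and batch normalization are left out to keep the sketch short.
    """
    f = linear(w2, relu(linear(w1, x)))          # residual branch F(x)
    summed = [fi + xi for fi, xi in zip(f, x)]   # identity shortcut: F(x) + x
    return relu(summed)


def zeros(n: int) -> List[Vector]:
    """An n-by-n matrix of zeros."""
    return [[0.0] * n for _ in range(n)]


x = [1.0, 2.0, 3.0]
# With all-zero residual weights, F(x) = 0 and the block passes x through.
out = residual_block(x, zeros(3), zeros(3))
print(out)  # [1.0, 2.0, 3.0]
```

The zero-weight case shows the design choice in miniature: the block represents the identity mapping by driving F toward zero, rather than by fitting an identity through a stack of nonlinear layers.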

If this is right

Residual nets of up to 152 layers have lower complexity than VGG nets yet reach higher accuracy on ImageNet classification.
An ensemble reaches 3.57% top-5 error on the ImageNet test set and won the 2015 ILSVRC classification task.
Solely through the deeper representations, a 28% relative improvement is obtained on the COCO object detection dataset.
The same residual nets secured first place on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
Analysis on CIFAR-10 extends the approach to networks of over 100 and over 1000 layers (110 and 1202 layers in the paper's experiments).
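The depth figures in the list above can be checked with a little arithmetic. The sketch below assumes the 6n + 2 layer count from the paper's CIFAR-10 section (it is not stated on this page) and takes the 19-layer VGG variant as the reference for the "8x deeper" remark; the function name is hypothetical.

```python
def cifar_resnet_depth(n: int) -> int:
    """Weighted layers in the paper's CIFAR-10 design: 6n + 2.

    Three stages of n two-layer residual blocks (6n layers), plus the
    first convolution and the final fully connected layer.
    """
    return 6 * n + 2


for n in (3, 9, 18, 200):
    print(n, cifar_resnet_depth(n))
# 3 20
# 9 56
# 18 110
# 200 1202

# Depth ratio behind "8x deeper than VGG nets", using VGG-19 as reference.
print(152 / 19)  # 8.0
```

On this reading, the abstract's "100 and 1000 layers" are round figures for the 110-layer and 1202-layer models.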

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The identity-shortcut pattern could be tested in sequence models or reinforcement learning to see whether similar depth scaling occurs outside vision.
If residual blocks continue to ease optimization at extreme scales, the practical limit on network depth may shift from training dynamics to hardware and data constraints.
A theoretical account of why the identity mapping reduces the effective Lipschitz constant or improves gradient variance would strengthen the empirical observations.

Load-bearing premise

That learning residual functions with identity shortcuts is substantially easier to optimize than learning the original unreferenced mappings.
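
One common intuition for why this premise is plausible, offered as an editorial sketch rather than as the paper's own argument: differentiate a single block, ignoring the nonlinearity applied after the addition. The loss symbol \mathcal{L} and the identity matrix I are notation introduced here for illustration.

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x} = \frac{\partial F}{\partial x} + I,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \left(\frac{\partial F}{\partial x} + I\right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial y}.
```

The identity term means the block's Jacobian stays close to I whenever F is small, and if the desired mapping H(x) is the identity, the target for the residual is simply F(x) = H(x) - x = 0. This does not prove that optimization is easier; it only shows what the shortcut changes.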

What would settle it

Training a 152-layer plain network without residual shortcuts on ImageNet and finding that it reaches comparable or lower error than the residual version would falsify the central optimization claim.

Original abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ResNets show that identity shortcut connections let you train networks over 100 layers deep with better accuracy than shallower models, backed by direct comparisons on ImageNet and CIFAR.

The core advance here is the residual block: instead of forcing each layer to learn a full mapping, the network learns the difference from the input via a shortcut. This is not in the VGG or earlier deep net papers they cite, and the experiments make a clear case that it works. On CIFAR-10 they train plain 20-, 56-, and 110-layer nets and show training error rising with depth, while the residual versions keep lowering error as depth grows. On ImageNet the 152-layer ResNet beats VGG and prior models, and the ensemble hits 3.57% top-5 error to win ILSVRC 2015. They also report a 28% relative gain on COCO detection from the deeper features alone. The training curves, ablation tables, and consistent protocol across depths give the results real weight; the competition outcome adds external confirmation that the numbers are not overfit to one run.
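
For readers unfamiliar with the metric behind the 3.57% figure, top-5 error counts a prediction as wrong only when the true label is missing from the five highest-scoring classes. The sketch below is a generic illustration with made-up toy scores; the function name and data are hypothetical, and it is not the ILSVRC evaluation code.

```python
from typing import Sequence


def top_k_error(scores: Sequence[Sequence[float]],
                labels: Sequence[int],
                k: int = 5) -> float:
    """Fraction of samples whose true label is not among the k highest scores."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy data: 2 samples, 6 classes (values made up for illustration).
scores = [
    [0.1, 0.3, 0.2, 0.05, 0.25, 0.1],   # class 3 has the lowest score
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
]
labels = [3, 0]
print(top_k_error(scores, labels, k=5))  # 0.5
```

In the toy data the first sample's true class ranks sixth of six, so it is a top-5 miss, while the second sample's true class ranks first.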

Referee Report

0 major / 4 minor

Summary. The manuscript introduces a residual learning framework that reformulates network layers to learn residual functions with identity shortcuts rather than unreferenced mappings, thereby easing the training of substantially deeper networks. It supplies comprehensive empirical evidence from CIFAR-10 (training curves and accuracy for 20/56/110-layer plain vs. residual nets, plus analysis up to 1000 layers) and ImageNet (ResNet-152 vs. VGG and shallower ResNets) showing that residual networks are easier to optimize and gain accuracy from increased depth; an ensemble achieves 3.57% top-5 error on ImageNet test, winning ILSVRC 2015 classification, with further gains on COCO detection attributed to deeper representations.

Significance. If the empirical results hold, the work is highly significant for computer vision and deep learning: it provides a practical, simple architectural solution to the degradation problem in deep nets, enabling 100+ layer models that outperform shallower counterparts while maintaining lower complexity than VGG. Credit is due for the detailed ablation studies, training error curves with consistent protocols (including batch normalization), direct depth-controlled comparisons, and external validation via competition-winning performance on ImageNet and COCO benchmarks; the residual block with identity shortcut has proven foundational.

minor comments (4)

[Abstract] Abstract: the statement 'analysis on CIFAR-10 with 100 and 1000 layers' should be cross-checked against the exact depths reported in §4.2 and Table 1 for consistency (e.g., 56/110/1202 layers are emphasized in the main experiments).
[§3.1] §3.1, Eq. (1): the residual formulation H(x) = F(x) + x is clear, but a brief note on how the shortcut is implemented when dimensions change (projection vs. zero-padding) would improve readability for readers implementing the blocks.
[§4.3] Figure 3 and §4.3: the ImageNet training curves and accuracy tables would benefit from explicit parameter counts or FLOPs in the same table as the error rates to make the 'lower complexity' claim immediately verifiable.
[§5] §5: the COCO detection improvement is attributed to depth, but a short ablation isolating depth from other factors (e.g., feature pyramid) would strengthen the causal claim.
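
The second comment above asks how the shortcut is handled when dimensions change. The two options it names can be sketched as follows, on plain channel vectors: zero-padding adds no parameters, while a projection applies a learned matrix W_s. This is a simplified illustration; the function names and weights are hypothetical, and spatial downsampling is ignored.

```python
from typing import List

Vector = List[float]


def zero_pad_shortcut(x: Vector, out_dim: int) -> Vector:
    """Parameter-free shortcut: keep x and append zeros for the extra dims."""
    if out_dim < len(x):
        raise ValueError("out_dim must be at least len(x)")
    return x + [0.0] * (out_dim - len(x))


def projection_shortcut(x: Vector, w_s: List[Vector]) -> Vector:
    """Learned shortcut W_s x; w_s has out_dim rows of len(x) entries."""
    return [sum(w * a for w, a in zip(row, x)) for row in w_s]


def add_shortcut(f_x: Vector, shortcut: Vector) -> Vector:
    """Element-wise F(x) + shortcut(x); both must have the same length."""
    if len(f_x) != len(shortcut):
        raise ValueError("dimension mismatch between branch and shortcut")
    return [a + b for a, b in zip(f_x, shortcut)]


x = [1.0, 2.0]                      # 2 input channels
f_x = [0.5, 0.5, 0.5, 0.5]          # residual branch output, 4 channels

padded = zero_pad_shortcut(x, 4)    # [1.0, 2.0, 0.0, 0.0]
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # hypothetical weights
projected = projection_shortcut(x, w_s)  # [1.0, 2.0, 3.0, 0.0]

print(add_shortcut(f_x, padded))     # [1.5, 2.5, 0.5, 0.5]
print(add_shortcut(f_x, projected))  # [1.5, 2.5, 3.5, 0.5]
```

Zero-padding keeps the shortcut a pure identity on the existing channels, whereas the projection trades that purity for extra learned parameters.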

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation to accept the manuscript. The summary accurately captures the core contribution of reformulating layers as residual functions with identity shortcuts, the empirical results on CIFAR-10 and ImageNet, and the competition outcomes.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces residual learning by reformulating layers to learn residual functions F(x) = H(x) - x rather than direct mappings H(x), then validates this via direct empirical comparisons of training curves and accuracy on CIFAR-10 (20/56/110-layer nets) and ImageNet (up to 152-layer ResNets vs. VGG). These results are obtained from fixed benchmarks under controlled training protocols (batch norm, same optimizer settings) and do not involve any fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations. The derivation chain consists of an architectural definition followed by reproducible experiments; no step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on standard neural network training assumptions and introduces the residual block as its main new component.

axioms (1)

[standard math] Standard stochastic gradient descent with appropriate initialization and batch normalization can optimize deep networks when gradients are well-behaved.
Invoked implicitly in all training experiments and analysis of optimization difficulty.

invented entities (1)

Residual block with identity shortcut [no independent evidence]
purpose: To allow layers to learn residual functions F(x) rather than direct mappings H(x).
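Core architectural contribution, introduced to address the degradation problem: training error that rises as plain networks get deeper. The paper argues this difficulty is unlikely to be caused by vanishing gradients, since the plain networks are trained with batch normalization.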

pith-pipeline@v0.9.0 · 5530 in / 1057 out tokens · 45582 ms · 2026-05-11T02:57:14.123676+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WaveNet: A Generative Model for Raw Audio
cs.SD 2016-09 accept novelty 9.0

WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.
When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks
cs.LG 2026-05 unverdicted novelty 8.0

In the proportional high-dimensional regime, stronger backdoor training triggers improve clean accuracy and make attack success non-monotonic for regularized GLMs on Gaussian mixtures, with closed-form proofs for squa...
Density estimation using Real NVP
cs.LG 2016-05 accept novelty 8.0

Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts
cs.CV 2026-05 unverdicted novelty 7.0

Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow cat...
When Bits Break Recourse: Counterfactual-Faithful Quantization
cs.LG 2026-05 unverdicted novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
cond-mat.str-el 2026-05 conditional novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
cs.LG 2026-05 unverdicted novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets
cs.LG 2026-05 unverdicted novelty 7.0

In generalized contrastive learning with imbalanced classes, optimal representations collapse to class means whose angular geometry is determined by class proportions via convex optimization, and extreme imbalance cau...
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
cs.LG 2026-05 unverdicted novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
Replica Theory of Spherical Boltzmann Machine Ensembles
cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

Replica calculations fully solve spherical Boltzmann machine ensembles and identify regimes where ensemble learning outperforms standard training, particularly for nearly finite-dimensional data.
Grokking of Diffusion Models: Case Study on Modular Addition
cs.LG 2026-04 unverdicted novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
cs.CR 2026-04 unverdicted novelty 7.0

Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
cs.LG 2026-04 unverdicted novelty 7.0

Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
cs.CV 2026-04 conditional novelty 7.0

Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
Deep learning-based phase-field modelling of brittle fracture in anisotropic media
physics.comp-ph 2026-03 unverdicted novelty 7.0

A variational physics-informed neural network solves higher-order anisotropic phase-field fracture models by minimizing total energy with B-spline enriched trial functions.
Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging
cs.CV 2026-03 unverdicted novelty 7.0

Paired flash-non-flash imaging improves contactless fingerprint spoof detection by highlighting material and structure differences between genuine and fake prints.
Polarized Target Nuclear Magnetic Resonance Measurements with Deep Neural Networks
physics.ins-det 2026-03 unverdicted novelty 7.0

Deep neural networks reduce fitting uncertainties in CW-NMR polarization measurements for dynamically polarized targets.
Contour Refinement using Discrete Diffusion in Low Data Regime
cs.CV 2026-02 unverdicted novelty 7.0

A CNN-based discrete diffusion method refines sparse contours from segmentation masks using simplified denoising steps and minimal post-processing, outperforming baselines on small medical and environmental datasets w...
B-FIRE: Binning-Free Diffusion Implicit Neural Representation for Hyper-Accelerated Motion-Resolved MRI
cs.CV 2026-01 unverdicted novelty 7.0

B-FIRE uses a diffusion-optimized CNN-INR to reconstruct instantaneous 3D abdominal anatomy from binning-free, hyper-accelerated non-Cartesian k-space data in motion-resolved MRI.
NASTaR: NovaSAR Automated Ship Target Recognition Dataset
cs.CV 2025-12 accept novelty 7.0

NASTaR is a new dataset of 3415 AIS-labeled ship patches from NovaSAR S-band SAR imagery with 23 classes, inshore/offshore splits, and wake annotations, validated via benchmark deep learning models.
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
cs.CV 2025-10 conditional novelty 7.0

The FedSurg challenge benchmarks federated learning on appendectomy videos and finds only 26% F1 on unseen centers even with centralized data, plus extra penalties from decentralization, with spatiotemporal models per...
Atomistic Machine Learning with Irreducible Cartesian Natural Tensors
cond-mat.mtrl-sci 2025-10 unverdicted novelty 7.0

CarNet develops irreducible Cartesian natural tensors and an equivariant model that matches leading spherical-tensor performance for ML interatomic potentials and high-rank tensor predictions like elastic constants.
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
cs.LG 2025-09 unverdicted novelty 7.0

DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x thr...
Prospects for Deep-Learning-Based Mass Reconstruction of Ultra-High-Energy Cosmic Rays using Simulated Air-Shower Profiles
astro-ph.HE 2025-08 conditional novelty 7.0

A CNN predicts ln A from longitudinal shower profiles with bias under 0.4, resolution 1-1.5, and proton-iron merit factor 2.19, outperforming simpler ML models on shape parameters and remaining robust to hadronic mode...
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
cs.CV 2025-07 conditional novelty 7.0

SCOOTER supplies best-practice guidelines, open tools, and a 3K-image benchmark with 34K+ human ratings showing that six tested unrestricted attacks produce images humans can detect as fake.
V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?
cs.CV 2024-08 unverdicted novelty 7.0

V-RoAst applies zero-shot VLMs (Gemini-1.5-flash, GPT-4o-mini) to iRAP road safety attribute classification on a new ThaiRAP image dataset and compares them to CNN baselines, finding better generalization to unseen cl...
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
cs.LG 2024-02 unverdicted novelty 7.0

HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
cs.RO 2024-02 conditional novelty 7.0

UMI enables zero-shot deployment of robot manipulation policies trained solely on portable human demonstrations captured with custom handheld grippers, supporting dynamic bimanual tasks across novel environments and objects.
Stateful Detection of Black-Box Adversarial Attacks
cs.CR 2019-07 unverdicted novelty 7.0

The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.
Transfer Learning from Audio-Visual Grounding to Speech Recognition
cs.CL 2019-07 unverdicted novelty 7.0

Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.
Predicting Retrosynthetic Reaction using Self-Corrected Transformer Neural Networks
physics.chem-ph 2019-07 unverdicted novelty 7.0

SCROP Transformer model with neural syntax corrector reaches 59% accuracy on retrosynthesis benchmarks, outperforming prior deep learning methods by over 21 points and template-based methods by over 6 points, with 1.7...
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
cs.CV 2017-04 accept novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
cs.CL 2016-11 accept novelty 7.0

MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
Wide Residual Networks
cs.CV 2016-05 accept novelty 7.0

Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.
Training Deep Nets with Sublinear Memory Cost
cs.LG 2016-04 accept novelty 7.0

An algorithm trains n-layer networks with O(sqrt(n)) memory via selective recomputation of activations, at the cost of one extra forward pass.
Equation of State at High Baryon Densities from a Thermodynamically Informed Neural Network
hep-ph 2026-05 unverdicted novelty 6.0

A thermodynamically consistent neural-network equation of state for QCD matter at finite temperature and conserved charges that matches known low-density results and extrapolates to high baryon densities for use in re...
AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models
cs.LG 2026-05 unverdicted novelty 6.0

AURORA is a representation learning framework that uses contextual orthogonalization and relational alignment to create disentangled, geometrically interpretable latent spaces in healthcare foundation models.
Extracting redshifts from 2D slitless spectroscopic images using deep learning for the CSST galaxy survey
astro-ph.IM 2026-05 unverdicted novelty 6.0

A Bayesian CNN maps 2D slitless spectral images to redshift estimates with NMAD precision 0.0104 for SNR_GI >=1 and better for brighter sources, while remaining robust to wavelength calibration errors via spatial augm...
Enhanced Ionization Charge Identification in the Short-Baseline Neutrino Program Neutrino Detectors with Deep Neural Networks
physics.ins-det 2026-05 conditional novelty 6.0

A DNN-based region of interest detection method for SBN neutrino detectors outperforms traditional wire-by-wire thresholding in identification accuracy and reconstruction quality while being more robust to performance...
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
cs.DC 2026-05 unverdicted novelty 6.0

Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
cs.DC 2026-05 unverdicted novelty 6.0

Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
cs.RO 2026-05 unverdicted novelty 6.0

StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
Event Fields: Learning Latent Event Structure for Waveform Foundation Models
cs.LG 2026-05 unverdicted novelty 6.0

Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on phy...
It Just Takes Two: Scaling Amortized Inference to Large Sets
cs.LG 2026-05 unverdicted novelty 6.0

A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...
ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
cs.CV 2026-05 accept novelty 6.0

A new dataset of hand-drawn circles from 66 writers and 8 pens yields competition results of 64.8% top-1 accuracy for open-set writer identification and 92.7% for pen classification.
ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
cs.CV 2026-05 accept novelty 6.0

CircleID introduces a controlled dataset of 46,155 circles from 66 writers and 8 pens, with competition results showing top accuracies of 64.8% for open-set writer identification and 92.7% for pen classification.
Detecting Adversarial Data via Provable Adversarial Noise Amplification
cs.LG 2026-05 unverdicted novelty 6.0

A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching
cs.CV 2026-04 unverdicted novelty 6.0

ShapeY is a benchmark dataset and nearest-neighbor protocol that measures shape-based recognition in vision models, revealing that even state-of-the-art networks fail to generalize consistently across 3D viewpoints an...
Fine-Tuning Regimes Define Distinct Continual Learning Problems
cs.LG 2026-04 unverdicted novelty 6.0

The relative rankings of continual learning methods are not preserved across different fine-tuning regimes defined by trainable parameter depth.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
cs.LG 2026-04 unverdicted novelty 6.0

GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
Materialistic RIR: Material Conditioned Realistic RIR Generation
cs.CV 2026-04 unverdicted novelty 6.0

A two-module neural model disentangles spatial layout from material properties to generate controllable and more realistic room impulse responses, reporting gains of up to 16% on acoustic metrics and 70% on material m...
DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
cs.CV 2026-04 unverdicted novelty 6.0

DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
cs.NI 2026-04 unverdicted novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...
Deepfake Detection Generalization with Diffusion Noise
cs.CV 2026-04 unverdicted novelty 6.0

ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
The illusory simplicity of the feedforward pass: evidence for the dynamical nature of stimulus encoding along the primate ventral stream
q-bio.NC 2026-04 unverdicted novelty 6.0

Primate ventral stream encodes visual stimuli through evolving neural dynamics that carry category information beyond any fixed spatial pattern during the initial feedforward pass.
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
cs.RO 2026-04 unverdicted novelty 6.0

Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
cs.LG 2026-04 unverdicted novelty 6.0

ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
Zero-shot World Models Are Developmentally Efficient Learners
cs.AI 2026-04 unverdicted novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
astro-ph.IM 2026-04 unverdicted novelty 6.0

Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
cs.RO 2026-04 unverdicted novelty 6.0

EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 144 Pith papers

[1]

Bengio, P

Y . Bengio, P. Simard, and P. Frasconi. Learning long-term dependen- cies with gradient descent is difﬁcult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994

work page 1994
[2]

C. M. Bishop. Neural networks for pattern recognition . Oxford university press, 1995

work page 1995
[3]

W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. Siam, 2000

work page 2000
[4]

Chatﬁeld, V

K. Chatﬁeld, V . Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011

work page 2011
[5]

Everingham, L

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zis- serman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010

work page 2010
[6]

Gidaris and N

S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware cnn model. In ICCV, 2015

work page 2015
[7]

Girshick

R. Girshick. Fast R-CNN. In ICCV, 2015

work page 2015
[8]

Girshick, J

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hier- archies for accurate object detection and semantic segmentation. In CVPR, 2014

work page 2014
[9]

Glorot and Y

X. Glorot and Y . Bengio. Understanding the difﬁculty of training deep feedforward neural networks. In AISTATS, 2010

work page 2010
[10]

I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y . Bengio. Maxout networks.arXiv:1302.4389, 2013

work page Pith review arXiv 2013
[11]

K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015

work page 2015
[12]

K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014

work page 2014
[13]

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015

work page 2015
[14]

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012

work page Pith review arXiv 2012
[15]

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

work page 1997
[16]

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

work page 2015
[17]

H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011

work page 2011
[18]

H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012

work page 2012
[19]

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014

work page Pith review arXiv 2014
[20]

A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009

work page 2009
[21]

A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012

work page 2012
[22]

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989

work page 1989
[23]

Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998

work page 1998
[24]

C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014

work page arXiv 2014
[25]

M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013

work page Pith review arXiv 2013
[26]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014

work page 2014
[27]

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

work page 2015
[28]

G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014

work page 2014
[29]

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010

work page 2010
[30]

F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007

work page 2007
[31]

T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012

work page 2012
[32]

S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015

work page 2015
[33]

S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015

work page arXiv 2015
[34]

B. D. Ripley. Pattern recognition and neural networks. Cambridge university press, 1996

work page 1996
[35]

A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015

work page 2015
[36]

ImageNet Large Scale Visual Recognition Challenge

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014

work page Pith review arXiv 2014
[37]

A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013

work page Pith review arXiv 2013
[38]

N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998

work page 1998
[39]

N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207–226. Springer, 1998

work page 1998
[40]

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014

work page 2014
[41]

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015

work page 2015
[42]

R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015

work page Pith review arXiv 2015
[43]

R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015

work page arXiv 2015
[44]

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015

work page 2015
[45]

R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990

work page 1990
[46]

R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006

work page 2006
[47]

T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods–backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013

work page 2013
[48]

A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008

work page 2008
[49]

W. Venables and B. Ripley. Modern applied statistics with S-PLUS. 1999

work page 1999
[50]

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014

A. Object Detection Baselines

In this section we introduce our detection method based on the baseline Faster R-CNN [32] system. The models are initialized by the ImageNet classification models, and then fine-tuned on the object detection data. We have...

work page 2014
[51]

07+12”). For the PASCAL VOC 2012 test set, we use the 10k trainval+test images in VOC 2007 and 16k trainval images in VOC 2012 for training (“07++12

and a Fast R-CNN detection network [7]. RoI pooling [7] is performed before conv5_1. On this RoI-pooled feature, all layers of conv5_x and up are adopted for each region, playing the roles of VGG-16’s fc layers. The final classification layer is replaced by two sibling layers (classification and box regression [7]). For the usage of BN layers, after pre-...

work page 2007
[52]

that is category-agnostic, our RPN for localization is designed in a per-class form. This RPN ends with two sibling 1×1 convolutional layers for binary classification (cls) and box regression (reg), as in [32]. The cls and reg layers are both in a per-class form, in contrast to [32]. Specifically, the cls layer has a 1000-d output, and each dimension is...

work page 2015
