Deep Residual Learning for Image Recognition
Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3
The pith
Residual networks reformulate layers to learn differences from their inputs via identity shortcuts, making much deeper networks feasible to train and more accurate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set.
What carries the argument
A residual learning framework that recasts each stack of layers to learn a residual function F(x), so the desired mapping H(x) is computed as F(x) + x through an identity shortcut.
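To make the block structure concrete, here is a minimal PyTorch sketch of the paper's basic two-layer block; the class name, fixed channel count, and exact BN placement are illustrative assumptions rather than the authors' released code. If F learns the zero function, the block reduces to an identity, which is the paper's intuition for why extra depth should not hurt.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Sketch of a basic block: H(x) = F(x) + x, with F two 3x3 convs.

    Channel count and spatial size are unchanged, so the shortcut is a
    pure identity; the skip path adds no parameters.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))           # second half of F(x)
        return self.relu(out + x)                 # F(x) + x, then nonlinearity
```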
If this is right
- Residual nets up to 152 layers achieve lower complexity and higher accuracy than prior VGG-style models on ImageNet classification.
- An ensemble reaches 3.57% top-5 error on the ImageNet test set and won the 2015 ILSVRC classification task.
- Solely through the deeper representations, a 28% relative improvement is obtained on the COCO object detection dataset.
- The same residual nets secured first place on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
- Analysis on CIFAR-10 extends the approach to networks of 100 and 1000 layers.
Where Pith is reading between the lines
- The identity-shortcut pattern could be tested in sequence models or reinforcement learning to see whether similar depth scaling occurs outside vision.
- If residual blocks continue to ease optimization at extreme scales, the practical limit on network depth may shift from training dynamics to hardware and data constraints.
- A theoretical account of why the identity mapping reduces the effective Lipschitz constant or improves gradient variance would strengthen the empirical observations.
Load-bearing premise
That learning residual functions with identity shortcuts is substantially easier to optimize than learning the original unreferenced mappings.
What would settle it
Training a 152-layer plain network without residual shortcuts on ImageNet and finding that it reaches comparable or lower error than the residual version would falsify the central optimization claim.
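Short of rerunning that experiment, a cheap at-initialization probe illustrates the mechanism in question: comparing input-gradient magnitudes through deep plain versus residual stacks. Everything in this toy sketch (fully-connected layers, sizes, seed) is an assumption for illustration; it typically shows the plain stack's gradient collapsing with depth while the residual stack's stays well-scaled, which supports but cannot settle the claim:

```python
import torch
import torch.nn as nn

def make_stack(depth: int, width: int, residual: bool) -> nn.Module:
    """Toy stack of Linear+ReLU layers, optionally with identity shortcuts."""
    class Stack(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Sequential(nn.Linear(width, width), nn.ReLU())
                for _ in range(depth)
            )

        def forward(self, x):
            for f in self.layers:
                x = f(x) + x if residual else f(x)  # y = F(x) + x vs. y = F(x)
            return x

    return Stack()

torch.manual_seed(0)
x = torch.randn(16, 64, requires_grad=True)
for residual in (False, True):
    net = make_stack(depth=50, width=64, residual=residual)
    loss = net(x).pow(2).mean()
    (grad,) = torch.autograd.grad(loss, x)
    print(f"residual={residual}  input-gradient norm: {grad.norm():.3e}")
```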
Original abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a residual learning framework that reformulates network layers to learn residual functions with identity shortcuts rather than unreferenced mappings, thereby easing the training of substantially deeper networks. It supplies comprehensive empirical evidence from CIFAR-10 (training curves and accuracy for 20/56/110-layer plain vs. residual nets, plus analysis up to 1000 layers) and ImageNet (ResNet-152 vs. VGG and shallower ResNets) showing that residual networks are easier to optimize and gain accuracy from increased depth; an ensemble achieves 3.57% top-5 error on ImageNet test, winning ILSVRC 2015 classification, with further gains on COCO detection attributed to deeper representations.
Significance. If the empirical results hold, the work is highly significant for computer vision and deep learning: it provides a practical, simple architectural solution to the degradation problem in deep nets, enabling 100+ layer models that outperform shallower counterparts while maintaining lower complexity than VGG. Credit is due for the detailed ablation studies, training error curves with consistent protocols (including batch normalization), direct depth-controlled comparisons, and external validation via competition-winning performance on ImageNet and COCO benchmarks; the residual block with identity shortcut has proven foundational.
minor comments (4)
- [Abstract] Abstract: the statement 'analysis on CIFAR-10 with 100 and 1000 layers' should be cross-checked against the exact depths reported in §4.2 and Table 1 for consistency (e.g., 56/110/1202 layers are emphasized in the main experiments).
- [§3.1] §3.1, Eq. (1): the residual formulation H(x) = F(x) + x is clear, but a brief note on how the shortcut is implemented when dimensions change (projection vs. zero-padding) would improve readability for readers implementing the blocks; a sketch of both options follows this list.
- [§4.3] Figure 3 and §4.3: the ImageNet training curves and accuracy tables would benefit from explicit parameter counts or FLOPs in the same table as the error rates to make the 'lower complexity' claim immediately verifiable.
- [§5] §5: the COCO detection improvement is attributed to depth, but a short ablation isolating depth from other factors (e.g., feature pyramid) would strengthen the causal claim.
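On the §3.1 comment, a hedged sketch of the two dimension-matching options the paper describes: option A, parameter-free strided subsampling with zero-padded channels, and option B, a learned 1x1 strided projection. The function names and the untrained convolution are illustrative only:

```python
import torch
import torch.nn.functional as F

def projection_shortcut(x: torch.Tensor, out_channels: int, stride: int) -> torch.Tensor:
    """Option B: a 1x1 strided convolution matches the new shape. Weights are
    random (untrained) here purely to illustrate the shape transformation."""
    conv = torch.nn.Conv2d(x.shape[1], out_channels, kernel_size=1,
                           stride=stride, bias=False)
    return conv(x)

def zero_pad_shortcut(x: torch.Tensor, out_channels: int, stride: int) -> torch.Tensor:
    """Option A: subsample spatially, then pad the extra channels with zeros
    (no learned parameters)."""
    x = x[:, :, ::stride, ::stride]            # spatial subsampling
    extra = out_channels - x.shape[1]
    return F.pad(x, (0, 0, 0, 0, 0, extra))    # zero-pad the channel dimension

x = torch.randn(2, 64, 56, 56)
print(projection_shortcut(x, 128, 2).shape)  # torch.Size([2, 128, 28, 28])
print(zero_pad_shortcut(x, 128, 2).shape)    # torch.Size([2, 128, 28, 28])
```

Either variant lets F(x) + shortcut(x) type-check when a stage doubles the channels and halves the spatial resolution; the paper reports that projections give a small accuracy edge over zero-padding at the cost of extra parameters.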
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation to accept the manuscript. The summary accurately captures the core contribution of reformulating layers as residual functions with identity shortcuts, the empirical results on CIFAR-10 and ImageNet, and the competition outcomes.
Circularity Check
No significant circularity detected
Full rationale
The paper introduces residual learning by reformulating layers to learn residual functions F(x) = H(x) - x rather than direct mappings H(x), then validates this via direct empirical comparisons of training curves and accuracy on CIFAR-10 (20/56/110-layer nets) and ImageNet (up to 152-layer ResNets vs. VGG). These results are obtained from fixed benchmarks under controlled training protocols (batch norm, same optimizer settings) and do not involve any fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations. The derivation chain consists of an architectural definition followed by reproducible experiments; no step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard stochastic gradient descent with appropriate initialization and batch normalization can optimize deep networks when gradients are well-behaved (a minimal setup sketch follows).
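A minimal sketch of what this axiom amounts to in practice, pairing He initialization [13] and batch normalization [16] with the SGD settings the paper reports for ImageNet (initial learning rate 0.1, momentum 0.9, weight decay 1e-4). The placeholder model and the fixed step schedule are assumptions; the paper instead divides the rate by 10 when error plateaus:

```python
import torch
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    # He initialization for conv layers, as in [13]; BN affine params start at identity.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)

# Placeholder stem standing in for a full residual network.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
model.apply(init_weights)

# Optimizer settings reported in the paper for ImageNet training.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Assumed fixed schedule; the paper decays on error plateaus instead.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```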
invented entities (1)
- Residual block with identity shortcut (no independent evidence)
Forward citations
Cited by 60 Pith papers
- WaveNet: A Generative Model for Raw Audio
  WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.
- Density estimation using Real NVP
  Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
- Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
  PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
- Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
  Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
- Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets
  In generalized contrastive learning with imbalanced classes, optimal representations collapse to class means whose angular geometry is determined by class proportions via convex optimization, and extreme imbalance cau...
- Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
  Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
- Replica Theory of Spherical Boltzmann Machine Ensembles
  Replica calculations fully solve spherical Boltzmann machine ensembles and identify regimes where ensemble learning outperforms standard training, particularly for nearly finite-dimensional data.
- Grokking of Diffusion Models: Case Study on Modular Addition
  Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
- Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
  Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
  Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
  Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
- Deep learning-based phase-field modelling of brittle fracture in anisotropic media
  A variational physics-informed neural network solves higher-order anisotropic phase-field fracture models by minimizing total energy with B-spline enriched trial functions.
- Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging
  Paired flash-non-flash imaging improves contactless fingerprint spoof detection by highlighting material and structure differences between genuine and fake prints.
- Polarized Target Nuclear Magnetic Resonance Measurements with Deep Neural Networks
  Deep neural networks reduce fitting uncertainties in CW-NMR polarization measurements for dynamically polarized targets.
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
  HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
- Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
  UMI enables zero-shot deployment of robot manipulation policies trained solely on portable human demonstrations captured with custom handheld grippers, supporting dynamic bimanual tasks across novel environments and objects.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
  MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
  MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
- Wide Residual Networks
  Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.
- Training Deep Nets with Sublinear Memory Cost
  An algorithm trains n-layer networks with O(sqrt(n)) memory via selective recomputation of activations, at the cost of one extra forward pass.
- MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
  Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
- StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
  StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
- Event Fields: Learning Latent Event Structure for Waveform Foundation Models
  Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on phy...
- It Just Takes Two: Scaling Amortized Inference to Large Sets
  A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...
- ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
  A new dataset of hand-drawn circles from 66 writers and 8 pens yields competition results of 64.8% top-1 accuracy for open-set writer identification and 92.7% for pen classification.
- Detecting Adversarial Data via Provable Adversarial Noise Amplification
  A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
- ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching
  ShapeY is a benchmark dataset and nearest-neighbor protocol that measures shape-based recognition in vision models, revealing that even state-of-the-art networks fail to generalize consistently across 3D viewpoints an...
- Fine-Tuning Regimes Define Distinct Continual Learning Problems
  The relative rankings of continual learning methods are not preserved across different fine-tuning regimes defined by trainable parameter depth.
- Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
  GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
- Materialistic RIR: Material Conditioned Realistic RIR Generation
  A two-module neural model disentangles spatial layout from material properties to generate controllable and more realistic room impulse responses, reporting gains of up to 16% on acoustic metrics and 70% on material m...
- DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
  DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...
- Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
  Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...
- Deepfake Detection Generalization with Diffusion Noise
  ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
- The illusory simplicity of the feedforward pass: evidence for the dynamical nature of stimulus encoding along the primate ventral stream
  Primate ventral stream encodes visual stimuli through evolving neural dynamics that carry category information beyond any fixed spatial pattern during the initial feedforward pass.
- Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
  Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
- ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
  ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
  Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
- EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
  EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...
- Multispectral representation of Distributed Acoustic Sensing data: a framework for physically interpretable feature extraction and visualization
  A multispectral decomposition of DAS data into band-limited energy images enables clearer visualization, unsupervised clustering, and 97.3% accurate CNN detection of whale vocalizations.
- AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
  AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...
- Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
  Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
- LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
  LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...
- Physics-Informed Transformer for Real-Time High-Fidelity Topology Optimization
  A transformer model with self-attention and auxiliary physics losses learns a direct non-iterative mapping from loads and fields to manufacturable optimized topologies.
- PhDLspec: physical-prior embedded deep learning method for spectroscopic determination of stellar labels in high-dimensional parameter space
  PhDLspec combines differential spectra from physical stellar models with a transformer to derive approximately 30 stellar parameters from low-resolution spectra hundreds of times faster than traditional calculations.
- What Does Flow Matching Bring To TD Learning?
  Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
- MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
  MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- VideoGPT: Video Generation using VQ-VAE and Transformers
  VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- Rethinking Atrous Convolution for Semantic Image Segmentation
  DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF p...
- SGDR: Stochastic Gradient Descent with Warm Restarts
  SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
- ArcGate: Adaptive Arctangent Gated Activation
  ArcGate is an adaptive activation with seven learnable parameters that outperforms ReLU and other fixed activations on remote sensing benchmarks, reaching 99.67% accuracy on PatternNet and showing strong noise resilience.
- WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records
  WISTERIA learns robust clinical representations from noisy EHR labels by enforcing consistency across multiple weak supervision views plus ontology regularization.
- Medical Model Synthesis Architectures: A Case Study
  MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.
- mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
  Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
- AI-Generated Images: What Humans and Machines See When They Look at the Same Image
  Researchers train AI detectors on a large photorealistic fake image dataset, apply 16 XAI methods, and use human survey feedback to assess alignment between machine explanations and human perception of AI-generated images.
- Flow matching for Sentinel-2 super-resolution: implementation, application, and implications
  Flow matching achieves single-step pixel accuracy and 20-step perceptual quality for Sentinel-2 super-resolution, outperforming diffusion and Real-ESRGAN while enabling large-scale 2.5 m land-cover products.
- Pre-localization of Massive Black Hole Binaries in the Millihertz Band
  A neural spline flow pipeline performs amortized inference on millihertz MBHB signals, delivering ~20 deg² pre-merger sky localizations in ~1 minute while matching PTMCMC sky modes and parameter uncertainties.
- Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
  XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
Reference graph
Works this paper leans on
- [1]
- [2] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
- [3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. SIAM, 2000.
- [4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
- [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
- [6] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In ICCV, 2015.
- [7]
- [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- [10]
- [11]
- [12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
- [14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
- [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [17]
- [18]
- [19]
- [20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech report, 2009.
- [21] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [22]
- [23]
- [24]
- [25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
- [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
- [29] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- [30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
- [31]
- [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [33]
- [34] B. D. Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996.
- [35]
- [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
- [37] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
- [38] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
- [39] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207–226. Springer, 1998.
- [40] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
- [43]
- [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [45]
- [46]
- [47] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods – backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
- [48] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
- [49] W. Venables and B. Ripley. Modern applied statistics with S-PLUS, 1999.
- [50] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.