Deep Residual Learning for Image Recognition
Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3
The pith
Residual networks reformulate layers to learn differences from their inputs via identity shortcuts, making much deeper networks feasible to train and more accurate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set.
What carries the argument
A residual learning framework that recasts each stack of layers to learn a residual function F(x), so the desired mapping H(x) is computed as F(x) + x through an identity shortcut.
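To make the block structure concrete, here is a minimal PyTorch sketch of the paper's basic two-layer block; the class name, fixed channel count, and exact BN placement are illustrative assumptions rather than the authors' released code. If F learns the zero function, the block reduces to an identity, which is the paper's intuition for why extra depth should not hurt.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Sketch of a basic block: H(x) = F(x) + x, with F two 3x3 convs.

    Channel count and spatial size are unchanged, so the shortcut is a
    pure identity; the skip path adds no parameters.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))           # second half of F(x)
        return self.relu(out + x)                 # F(x) + x, then nonlinearity
```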
If this is right
- Residual nets up to 152 layers achieve lower complexity and higher accuracy than prior VGG-style models on ImageNet classification.
- An ensemble reaches 3.57% top-5 error on the ImageNet test set and won the 2015 ILSVRC classification task.
- Solely through the deeper representations, a 28% relative improvement is obtained on the COCO object detection dataset.
- The same residual nets secured first place on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
- Analysis on CIFAR-10 extends the approach to networks of 100 and 1000 layers.
Where Pith is reading between the lines
- The identity-shortcut pattern could be tested in sequence models or reinforcement learning to see whether similar depth scaling occurs outside vision.
- If residual blocks continue to ease optimization at extreme scales, the practical limit on network depth may shift from training dynamics to hardware and data constraints.
- A theoretical account of why the identity mapping reduces the effective Lipschitz constant or improves gradient variance would strengthen the empirical observations.
Load-bearing premise
That learning residual functions with identity shortcuts is substantially easier to optimize than learning the original unreferenced mappings.
What would settle it
Training a 152-layer plain network without residual shortcuts on ImageNet and finding that it reaches comparable or lower error than the residual version would falsify the central optimization claim.
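Short of rerunning that experiment, a cheap at-initialization probe illustrates the mechanism in question: comparing input-gradient magnitudes through deep plain versus residual stacks. Everything in this toy sketch (fully-connected layers, sizes, seed) is an assumption for illustration; it typically shows the plain stack's gradient collapsing with depth while the residual stack's stays well-scaled, which supports but cannot settle the claim:

```python
import torch
import torch.nn as nn

def make_stack(depth: int, width: int, residual: bool) -> nn.Module:
    """Toy stack of Linear+ReLU layers, optionally with identity shortcuts."""
    class Stack(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Sequential(nn.Linear(width, width), nn.ReLU())
                for _ in range(depth)
            )

        def forward(self, x):
            for f in self.layers:
                x = f(x) + x if residual else f(x)  # y = F(x) + x vs. y = F(x)
            return x

    return Stack()

torch.manual_seed(0)
x = torch.randn(16, 64, requires_grad=True)
for residual in (False, True):
    net = make_stack(depth=50, width=64, residual=residual)
    loss = net(x).pow(2).mean()
    (grad,) = torch.autograd.grad(loss, x)
    print(f"residual={residual}  input-gradient norm: {grad.norm():.3e}")
```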
Original abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a residual learning framework that reformulates network layers to learn residual functions with identity shortcuts rather than unreferenced mappings, thereby easing the training of substantially deeper networks. It supplies comprehensive empirical evidence from CIFAR-10 (training curves and accuracy for 20/56/110-layer plain vs. residual nets, plus analysis up to 1000 layers) and ImageNet (ResNet-152 vs. VGG and shallower ResNets) showing that residual networks are easier to optimize and gain accuracy from increased depth; an ensemble achieves 3.57% top-5 error on ImageNet test, winning ILSVRC 2015 classification, with further gains on COCO detection attributed to deeper representations.
Significance. If the empirical results hold, the work is highly significant for computer vision and deep learning: it provides a practical, simple architectural solution to the degradation problem in deep nets, enabling 100+ layer models that outperform shallower counterparts while maintaining lower complexity than VGG. Credit is due for the detailed ablation studies, training error curves with consistent protocols (including batch normalization), direct depth-controlled comparisons, and external validation via competition-winning performance on ImageNet and COCO benchmarks; the residual block with identity shortcut has proven foundational.
minor comments (4)
- [Abstract] Abstract: the statement 'analysis on CIFAR-10 with 100 and 1000 layers' should be cross-checked against the exact depths reported in §4.2 and Table 1 for consistency (e.g., 56/110/1202 layers are emphasized in the main experiments).
- [§3.1] §3.1, Eq. (1): the residual formulation H(x) = F(x) + x is clear, but a brief note on how the shortcut is implemented when dimensions change (projection vs. zero-padding) would improve readability for readers implementing the blocks; a sketch of both options follows this list.
- [§4.3] Figure 3 and §4.3: the ImageNet training curves and accuracy tables would benefit from explicit parameter counts or FLOPs in the same table as the error rates to make the 'lower complexity' claim immediately verifiable.
- [§5] §5: the COCO detection improvement is attributed to depth, but a short ablation isolating depth from other factors (e.g., feature pyramid) would strengthen the causal claim.
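On the §3.1 comment, a hedged sketch of the two dimension-matching options the paper describes: option A, parameter-free strided subsampling with zero-padded channels, and option B, a learned 1x1 strided projection. The function names and the untrained convolution are illustrative only:

```python
import torch
import torch.nn.functional as F

def projection_shortcut(x: torch.Tensor, out_channels: int, stride: int) -> torch.Tensor:
    """Option B: a 1x1 strided convolution matches the new shape. Weights are
    random (untrained) here purely to illustrate the shape transformation."""
    conv = torch.nn.Conv2d(x.shape[1], out_channels, kernel_size=1,
                           stride=stride, bias=False)
    return conv(x)

def zero_pad_shortcut(x: torch.Tensor, out_channels: int, stride: int) -> torch.Tensor:
    """Option A: subsample spatially, then pad the extra channels with zeros
    (no learned parameters)."""
    x = x[:, :, ::stride, ::stride]            # spatial subsampling
    extra = out_channels - x.shape[1]
    return F.pad(x, (0, 0, 0, 0, 0, extra))    # zero-pad the channel dimension

x = torch.randn(2, 64, 56, 56)
print(projection_shortcut(x, 128, 2).shape)  # torch.Size([2, 128, 28, 28])
print(zero_pad_shortcut(x, 128, 2).shape)    # torch.Size([2, 128, 28, 28])
```

Either variant lets F(x) + shortcut(x) type-check when a stage doubles the channels and halves the spatial resolution; the paper reports that projections give a small accuracy edge over zero-padding at the cost of extra parameters.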
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation to accept the manuscript. The summary accurately captures the core contribution of reformulating layers as residual functions with identity shortcuts, the empirical results on CIFAR-10 and ImageNet, and the competition outcomes.
Circularity Check
No significant circularity detected
Full rationale
The paper introduces residual learning by reformulating layers to learn residual functions F(x) = H(x) - x rather than direct mappings H(x), then validates this via direct empirical comparisons of training curves and accuracy on CIFAR-10 (20/56/110-layer nets) and ImageNet (up to 152-layer ResNets vs. VGG). These results are obtained from fixed benchmarks under controlled training protocols (batch norm, same optimizer settings) and do not involve any fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations. The derivation chain consists of an architectural definition followed by reproducible experiments; no step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard stochastic gradient descent with appropriate initialization and batch normalization can optimize deep networks when gradients are well-behaved (a minimal setup sketch follows).
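A minimal sketch of what this axiom amounts to in practice, pairing He initialization [13] and batch normalization [16] with the SGD settings the paper reports for ImageNet (initial learning rate 0.1, momentum 0.9, weight decay 1e-4). The placeholder model and the fixed step schedule are assumptions; the paper instead divides the rate by 10 when error plateaus:

```python
import torch
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    # He initialization for conv layers, as in [13]; BN affine params start at identity.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)

# Placeholder stem standing in for a full residual network.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
model.apply(init_weights)

# Optimizer settings reported in the paper for ImageNet training.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Assumed fixed schedule; the paper decays on error plateaus instead.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```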
invented entities (1)
- Residual block with identity shortcut (no independent evidence)
Forward citations
Cited by 60 Pith papers
- WaveNet: A Generative Model for Raw Audio
  WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.
- Density estimation using Real NVP
  Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
- Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
  PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
- Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
  Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
- Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets
  In generalized contrastive learning with imbalanced classes, optimal representations collapse to class means whose angular geometry is determined by class proportions via convex optimization, and extreme imbalance cau...
- Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
  Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
- Replica Theory of Spherical Boltzmann Machine Ensembles
  Replica calculations fully solve spherical Boltzmann machine ensembles and identify regimes where ensemble learning outperforms standard training, particularly for nearly finite-dimensional data.
- Grokking of Diffusion Models: Case Study on Modular Addition
  Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
- Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
  Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
  Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
  Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
- Deep learning-based phase-field modelling of brittle fracture in anisotropic media
  A variational physics-informed neural network solves higher-order anisotropic phase-field fracture models by minimizing total energy with B-spline enriched trial functions.
- Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging
  Paired flash-non-flash imaging improves contactless fingerprint spoof detection by highlighting material and structure differences between genuine and fake prints.
- Polarized Target Nuclear Magnetic Resonance Measurements with Deep Neural Networks
  Deep neural networks reduce fitting uncertainties in CW-NMR polarization measurements for dynamically polarized targets.
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
  HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
- Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
  UMI enables zero-shot deployment of robot manipulation policies trained solely on portable human demonstrations captured with custom handheld grippers, supporting dynamic bimanual tasks across novel environments and objects.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
  MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
  MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
- Wide Residual Networks
  Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.
- Training Deep Nets with Sublinear Memory Cost
  An algorithm trains n-layer networks with O(sqrt(n)) memory via selective recomputation of activations, at the cost of one extra forward pass.
- MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
  Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
- StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
  StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
- Event Fields: Learning Latent Event Structure for Waveform Foundation Models
  Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on phy...
- It Just Takes Two: Scaling Amortized Inference to Large Sets
  A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...
- ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
  A new dataset of hand-drawn circles from 66 writers and 8 pens yields competition results of 64.8% top-1 accuracy for open-set writer identification and 92.7% for pen classification.
- Detecting Adversarial Data via Provable Adversarial Noise Amplification
  A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
- ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching
  ShapeY is a benchmark dataset and nearest-neighbor protocol that measures shape-based recognition in vision models, revealing that even state-of-the-art networks fail to generalize consistently across 3D viewpoints an...
- Fine-Tuning Regimes Define Distinct Continual Learning Problems
  The relative rankings of continual learning methods are not preserved across different fine-tuning regimes defined by trainable parameter depth.
- Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
  GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
- Materialistic RIR: Material Conditioned Realistic RIR Generation
  A two-module neural model disentangles spatial layout from material properties to generate controllable and more realistic room impulse responses, reporting gains of up to 16% on acoustic metrics and 70% on material m...
- DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
  DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...
- Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
  Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...
- Deepfake Detection Generalization with Diffusion Noise
  ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
- The illusory simplicity of the feedforward pass: evidence for the dynamical nature of stimulus encoding along the primate ventral stream
  Primate ventral stream encodes visual stimuli through evolving neural dynamics that carry category information beyond any fixed spatial pattern during the initial feedforward pass.
- Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
  Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
- ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
  ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
  Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
- EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
  EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...
- Multispectral representation of Distributed Acoustic Sensing data: a framework for physically interpretable feature extraction and visualization
  A multispectral decomposition of DAS data into band-limited energy images enables clearer visualization, unsupervised clustering, and 97.3% accurate CNN detection of whale vocalizations.
- AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
  AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...
- Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
  Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
- LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
  LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...
- Physics-Informed Transformer for Real-Time High-Fidelity Topology Optimization
  A transformer model with self-attention and auxiliary physics losses learns a direct non-iterative mapping from loads and fields to manufacturable optimized topologies.
- PhDLspec: physical-prior embedded deep learning method for spectroscopic determination of stellar labels in high-dimensional parameter space
  PhDLspec combines differential spectra from physical stellar models with a transformer to derive approximately 30 stellar parameters from low-resolution spectra hundreds of times faster than traditional calculations.
- What Does Flow Matching Bring To TD Learning?
  Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
- MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
  MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- VideoGPT: Video Generation using VQ-VAE and Transformers
  VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- Rethinking Atrous Convolution for Semantic Image Segmentation
  DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF p...
- SGDR: Stochastic Gradient Descent with Warm Restarts
  SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
- ArcGate: Adaptive Arctangent Gated Activation
  ArcGate is an adaptive activation with seven learnable parameters that outperforms ReLU and other fixed activations on remote sensing benchmarks, reaching 99.67% accuracy on PatternNet and showing strong noise resilience.
- WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records
  WISTERIA learns robust clinical representations from noisy EHR labels by enforcing consistency across multiple weak supervision views plus ontology regularization.
- Medical Model Synthesis Architectures: A Case Study
  MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.
- mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
  Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
- AI-Generated Images: What Humans and Machines See When They Look at the Same Image
  Researchers train AI detectors on a large photorealistic fake image dataset, apply 16 XAI methods, and use human survey feedback to assess alignment between machine explanations and human perception of AI-generated images.
- Flow matching for Sentinel-2 super-resolution: implementation, application, and implications
  Flow matching achieves single-step pixel accuracy and 20-step perceptual quality for Sentinel-2 super-resolution, outperforming diffusion and Real-ESRGAN while enabling large-scale 2.5 m land-cover products.
- Pre-localization of Massive Black Hole Binaries in the Millihertz Band
  A neural spline flow pipeline performs amortized inference on millihertz MBHB signals, delivering ~20 deg² pre-merger sky localizations in ~1 minute while matching PTMCMC sky modes and parameter uncertainties.
- Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
  XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
Reference graph
Works this paper leans on
- [1]
- [2] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
- [3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. SIAM, 2000.
- [4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
- [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
- [6] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In ICCV, 2015.
- [7]
- [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- [10]
- [11]
- [12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
- [14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
- [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [17]
- [18]
- [19]
- [20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech report, 2009.
- [21] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [22]
- [23]
- [24]
- [25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
- [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
- [29] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- [30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
- [31]
- [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [33]
- [34] B. D. Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996.
- [35]
- [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
- [37] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
- [38] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
- [39] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207–226. Springer, 1998.
- [40] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
- [43]
- [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [45]
- [46]
- [47] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods – backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
- [48] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
- [49] W. Venables and B. Ripley. Modern applied statistics with S-PLUS, 1999.
- [50] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.