ImageNet Large Scale Visual Recognition Challenge
23 Pith papers cite this work.
abstract
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
citing papers explorer
-
Deep Residual Learning for Image Recognition
Residual networks reformulate layers to learn residual functions, enabling effective training of up to 152-layer models that achieve 3.57% error on ImageNet and win ILSVRC 2015.
-
Session-based Recommendations with Recurrent Neural Networks
RNNs with ranking loss outperform item-to-item baselines for session-based recommendations on two datasets.
-
Self-Organized Conformal Prediction: Reducing Regional Coverage Gaps with Unsupervised Group Discovery
SOCP uses self-organizing maps for unsupervised group discovery to enable local calibration in conformal prediction, reducing regional coverage gaps on benchmarks with small set-size increases while preserving validity guarantees.
-
Layerwise Progressive Freezing: A Training Scaffold for Depth-Scalable Binary Networks
StoMPP progressively binarizes BNN layers layerwise from input to output via stochastic masks, delivering depth-scalable accuracy gains in a fully STE-free regime by controlling activation-induced gradient blockades.
-
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into weights, retaining 83.27% Top-1 accuracy on DeiT-Huge after 50% pruning.
-
VCBench: Benchmarking LLMs in Venture Capital
VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
-
DiffGradCAM: A Class Activation Map Using the Full Model Decision to Solve Unaddressed Adversarial Attacks
DiffGradCAM and DiffGradCAM++ use logit differences for contrastive class activation maps that resist passive fooling while matching GradCAM outputs in clean cases, tested with a new SHAM benchmark on multi-class tasks.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Deep Learning Scaling is Predictable, Empirically
Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
-
MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning
MJEPA applies a single shared ViT encoder and one predictive objective within and across audio-visual modalities, reporting >6.8 mAP gains on AudioSet-20K and competitive video results with 10x less data.
-
Score-Control for Hallucination Reduction in Diffusion Models
VSM modulates the score Jacobian using variance guidance to reduce hallucinations in diffusion models by up to 25% on synthetic and real datasets while preserving fidelity and diversity.
-
Motion-Compensated Weight Compression
MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.
-
Causal Attribution via Activation Patching
CAAP produces patch attributions in ViTs by direct activation patching on intermediate layers to measure causal contribution to the target class score.
-
FedOptima: Optimizing Resource Utilization in Federated Learning
FedOptima reduces both straggler and dependency idle times in federated learning via layer offloading, asynchronous aggregation, auxiliary networks, and server scheduling, delivering up to 21.8x faster training.
-
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation
LEAP is an adaptive layer-skipping curriculum for ViT feature distillation that reports accuracy gains on ImageNet and retrieval tasks plus training compute savings.
-
SORA: Free Second-Order Attacks in Fast Adversarial Training
SORA is an adaptive step-size adversarial training algorithm that formalizes epsilon overfitting, introduces the PertAlign metric to predict catastrophic overfitting, and dynamically adjusts perturbations to achieve state-of-the-art robustness and clean accuracy with fixed hyperparameters.
-
DetailCLIP: Injecting Image Details into CLIP's Feature Space
A patch-based fusion method extends CLIP to high-resolution images by retaining multi-scale details for improved class-prompted retrieval.
-
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
Teacher-guided routing supplies pseudo-supervision from a dense model's intermediate features to stabilize expert selection in sparse vision MoE models.
-
PH-GCN: Person Re-identification with Part-based Hierarchical Graph Convolutional Network
PH-GCN constructs a hierarchical graph of person parts and performs local/global feature learning via message passing in an end-to-end network for person re-identification.
-
Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization
DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.
-
Discrete Meanflow Training Curriculum
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.