Highway Networks

Srivastava, R · 2015 · cs.LG · arXiv 1505.00387

Mixed citation behavior. The most common role is background (60% of classified citations).

29 Pith papers citing it

abstract

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
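
The gating idea described in the abstract can be illustrated with a small sketch. It assumes the formulation commonly given for a highway layer, y = T(x) * H(x) + (1 - T(x)) * x, where H is an ordinary nonlinear transform and T is a sigmoid "transform gate". The abstract itself does not state this equation, and the names `highway_layer` and `init_layer`, the tanh nonlinearity, and the deliberately exaggerated gate bias of -20 are illustrative assumptions, not the authors' settings.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer (assumed form): y = T(x) * H(x) + (1 - T(x)) * x."""
    h = [math.tanh(z) for z in affine(w_h, b_h, x)]  # candidate transform H(x)
    t = [sigmoid(z) for z in affine(w_t, b_t, x)]    # transform gate T(x) in (0, 1)
    # Each unit mixes its transformed value with its unchanged input.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def init_layer(dim, gate_bias, rng):
    """Hypothetical initializer: small random weights, constant gate bias."""
    w_h = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    w_t = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return w_h, [0.0] * dim, w_t, [gate_bias] * dim


rng = random.Random(0)
dim, depth = 4, 50

# A strongly negative gate bias keeps every gate almost closed, so each
# layer initially behaves like the identity (the "carry" path dominates).
layers = [init_layer(dim, gate_bias=-20.0, rng=rng) for _ in range(depth)]

x = [0.5, -1.0, 2.0, 0.0]
y = x
for params in layers:
    y = highway_layer(y, *params)

print("input :", x)
print("output:", y)
```

With the gates nearly closed, the input passes through all 50 stacked layers almost unchanged, which is one way to read the "information highways" description; during training the gates would learn where to let the transform path take over.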

citation-role summary

background 4 · method 1

citation-polarity summary

background 3 · support 1 · use method 1

representative citing papers

Deep Residual Learning for Image Recognition

cs.CV · 2015-12-10 · accept · novelty 8.0

Residual networks reformulate layers to learn residual functions, enabling effective training of up to 152-layer models that achieve 3.57% error on ImageNet and win ILSVRC 2015.

XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines even without skip connections.

Deep Delta Learning

cs.LG · 2026-01-01 · unverdicted · novelty 7.0

Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.

Graph neural networks for residential location choice: connection to classical logit models

stat.ML · 2025-07-28 · unverdicted · novelty 7.0

GNN-DCMs apply graph neural networks to discrete choice modeling, recovering nested logit and spatially correlated logit via message passing on utilities and demonstrating better predictive performance for residential location choices in Chicago.

Neural Network Architecture Search with Differentiable Cartesian Genetic Programming for Regression

cs.NE · 2019-07-03 · unverdicted · novelty 7.0

dCGPANN encodes neural nets so evolutionary operators can rewire, prune, adapt activations and add skips while gradient descent tunes parameters, yielding smaller networks with lower regression error in fixed time.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Transformers with Selective Access to Early Representations

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

Searching for Activation Functions

cs.NE · 2017-10-16 · conditional · novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

Wide Residual Networks

cs.CV · 2016-05-23 · accept · novelty 7.0

Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

cs.CV · 2026-05-20 · conditional · novelty 6.0

DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

From DES to KiDS: Domain adaptation for cross-survey detection of low-surface-brightness galaxies

astro-ph.GA · 2026-05-13 · unverdicted · novelty 6.0

Domain adaptation with an ensemble of CNN and transformer models trained on DES detects 20,180 LSBGs and 434 UDGs in KiDS DR5, with structural parameters and environmental trends consistent with known samples.

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

cs.RO · 2026-03-05 · conditional · novelty 6.0

SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts

cs.CL · 2019-06-28 · conditional · novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.

Set Prediction for Next-Day Active Fire Forecasting

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

WISP reformulates next-day active fire forecasting as point-set prediction and reports 38.2% AP, 53.4% FRP-weighted coverage, and 54.1% localization within 5 km on a global held-out test set.

Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

cs.CL · 2026-04-12 · unverdicted · novelty 6.0

A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

cs.CL · 2025-05-10 · conditional · novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

cs.AI · 2023-08-01 · unverdicted · novelty 6.0

MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

cs.LG · 2021-04-27 · accept · novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

cs.CL · 2016-09-26 · accept · novelty 6.0

GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.

Attention Residuals

cs.CL · 2026-03-16 · unverdicted · novelty 5.0

Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.

Context-Aware Multipath Networks

cs.CV · 2019-07-26 · unverdicted · novelty 4.0

CAMNet uses data-dependent routing across parallel tensors in a multi-path network to outperform equivalent single-path, multi-path, and deeper networks on classification and pixel-labeling tasks for individual, sequential, and combined datasets.

Iterative temporal differencing with random synaptic feedback weights support error backpropagation for deep learning

cs.NE · 2019-07-15 · unverdicted · novelty 4.0

Iterative temporal differencing with fixed random synaptic feedback can replace the activation function derivative in error backpropagation.

Attending to Emotional Narratives

cs.LG · 2019-07-08 · unverdicted · novelty 4.0

Transformer and Memory Fusion Network attention mechanisms generalize to multimodal time-series emotion recognition on emotional autobiographical narratives, achieving performance comparable to human raters in some cases.

citing papers explorer

Showing 29 of 29 citing papers.

Deep Residual Learning for Image Recognition · cs.CV · 2015-12-10 · accept · none · ref 42
Residual networks reformulate layers to learn residual functions, enabling effective training of up to 152-layer models that achieve 3.57% error on ImageNet and win ILSVRC 2015.
XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation · cs.CV · 2026-03-28 · unverdicted · none · ref 28 · internal anchor
XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines even without skip connections.
Deep Delta Learning · cs.LG · 2026-01-01 · unverdicted · none · ref 12 · internal anchor
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
Graph neural networks for residential location choice: connection to classical logit models · stat.ML · 2025-07-28 · unverdicted · none · ref 44 · internal anchor
GNN-DCMs apply graph neural networks to discrete choice modeling, recovering nested logit and spatially correlated logit via message passing on utilities and demonstrating better predictive performance for residential location choices in Chicago.
Neural Network Architecture Search with Differentiable Cartesian Genetic Programming for Regression · cs.NE · 2019-07-03 · unverdicted · none · ref 31 · internal anchor
dCGPANN encodes neural nets so evolutionary operators can rewire, prune, adapt activations and add skips while gradient descent tunes parameters, yielding smaller networks with lower regression error in fixed time.
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity · stat.ML · 2026-05-08 · unverdicted · none · ref 77
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
Transformers with Selective Access to Early Representations · cs.LG · 2026-05-05 · unverdicted · none · ref 16 · 2 links
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Searching for Activation Functions · cs.NE · 2017-10-16 · conditional · none · ref 17
Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
Wide Residual Networks · cs.CV · 2016-05-23 · accept · none · ref 28
Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.
Rethinking Cross-Layer Information Routing in Diffusion Transformers · cs.CV · 2026-05-20 · conditional · none · ref 51 · internal anchor
DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
From DES to KiDS: Domain adaptation for cross-survey detection of low-surface-brightness galaxies · astro-ph.GA · 2026-05-13 · unverdicted · none · ref 268 · internal anchor
Domain adaptation with an ensemble of CNN and transformer models trained on DES detects 20,180 LSBGs and 434 UDGs in KiDS DR5, with structural parameters and environmental trends consistent with known samples.
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation · cs.RO · 2026-03-05 · conditional · none · ref 31 · internal anchor
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm · cs.LG · 2026-02-08 · unverdicted · none · ref 20 · internal anchor
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts · cs.CL · 2019-06-28 · conditional · none · ref 28 · internal anchor
Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
Set Prediction for Next-Day Active Fire Forecasting · cs.LG · 2026-05-11 · unverdicted · none · ref 30
WISP reformulates next-day active fire forecasting as point-set prediction and reports 38.2% AP, 53.4% FRP-weighted coverage, and 54.1% localization within 5 km on a global held-out test set.
Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V · cs.CL · 2026-04-12 · unverdicted · none · ref 10
A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free · cs.CL · 2025-05-10 · conditional · none · ref 24
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework · cs.AI · 2023-08-01 · unverdicted · none · ref 142
MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges · cs.LG · 2021-04-27 · accept · none · ref 85
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation · cs.CL · 2016-09-26 · accept · none · ref 40
GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.
Attention Residuals · cs.CL · 2026-03-16 · unverdicted · none · ref 46 · internal anchor
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
Context-Aware Multipath Networks · cs.CV · 2019-07-26 · unverdicted · none · ref 33 · internal anchor
CAMNet uses data-dependent routing across parallel tensors in a multi-path network to outperform equivalent single-path, multi-path, and deeper networks on classification and pixel-labeling tasks for individual, sequential, and combined datasets.
Iterative temporal differencing with random synaptic feedback weights support error backpropagation for deep learning · cs.NE · 2019-07-15 · unverdicted · none · ref 8 · internal anchor
Iterative temporal differencing with fixed random synaptic feedback can replace the activation function derivative in error backpropagation.
Attending to Emotional Narratives · cs.LG · 2019-07-08 · unverdicted · none · ref 36 · internal anchor
Transformer and Memory Fusion Network attention mechanisms generalize to multimodal time-series emotion recognition on emotional autobiographical narratives, achieving performance comparable to human raters in some cases.
Multi-Gate Residuals · cs.LG · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.
Genetic Network Architecture Search · cs.NE · 2019-07-05 · unverdicted · none · ref 26 · internal anchor
Genetic algorithm searches convolution cell architectures with weight sharing via SGD, reporting 96% accuracy on CIFAR10 and 80.1% on CIFAR100.
A Transfer Learning Evaluation of Deep Neural Networks for Image Classification · cs.CV · 2026-05-12 · unverdicted · none · ref 27
Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
Machine Reading Comprehension: a Literature Review · cs.CL · 2019-06-30 · unverdicted · none · ref 49 · internal anchor
A 2019 survey of machine reading comprehension corpora and methods.
Simply Stabilizing the Loop via Fully Looped Transformer · cs.LG · 2026-05-11 · unreviewed · ref 25 · internal anchor