Neural GPUs Learn Algorithms

URL https://arxiv · 2015 · cs.LG · arXiv 1511.08228

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

open full Pith review browse 9 citing papers arXiv PDF

abstract

Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal NTMs have a weakness that is caused by their sequential nature: they are not parallel and are are hard to train due to their large depth when unfolded. We present a neural network architecture to address this problem: the Neural GPU. It is based on a type of convolutional gated recurrent unit and, like the NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly parallel which makes it easier to train and efficient to run. An essential property of algorithms is their ability to handle inputs of arbitrary size. We show that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances. We verified it on a number of tasks including long addition and long multiplication of numbers represented in binary. We train the Neural GPU on numbers with upto 20 bits and observe no errors whatsoever while testing it, even on much longer numbers. To achieve these results we introduce a technique for training deep recurrent networks: parameter sharing relaxation. We also found a small amount of dropout and gradient noise to have a large positive effect on learning and generalization.

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

cs.LG · 2022-01-06 · unverdicted · novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Language Models as Knowledge Bases?

cs.CL · 2019-09-03 · accept · novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

Concrete Problems in AI Safety

cs.AI · 2016-06-21 · accept · novelty 7.0

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

On the Spatiotemporal Dynamics of Generalization in Neural Networks

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

Deriving a neural cellular automaton from locality, symmetry, and stability postulates produces 100% accurate addition generalization from 16-digit to 1-million-digit inputs.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

cs.LG · 2021-04-27 · accept · novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

Universal Transformers

cs.CL · 2018-07-10 · unverdicted · novelty 6.0

Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

Convex Compositional Reasoning Models

cs.LG · 2026-05-22 · unverdicted · novelty 5.0 · 2 refs

CCEM parameterizes compositional energy factors with input-convex neural networks and optimizes over a convex relaxation to enable deterministic scaling from small to large combinatorial reasoning instances.

citing papers explorer

Showing 9 of 9 citing papers.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets cs.LG · 2022-01-06 · unverdicted · none · ref 7
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
PRIMETIME : Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 11 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention cs.CL · 2024-04-10 · conditional · none · ref 16 · internal anchor
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Language Models as Knowledge Bases? cs.CL · 2019-09-03 · accept · none · ref 236 · internal anchor
BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.
Concrete Problems in AI Safety cs.AI · 2016-06-21 · accept · none · ref 80
The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.
On the Spatiotemporal Dynamics of Generalization in Neural Networks cs.LG · 2026-02-02 · unverdicted · none · ref 23 · internal anchor
Deriving a neural cellular automaton from locality, symmetry, and stability postulates produces 100% accurate addition generalization from 16-digit to 1-million-digit inputs.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges cs.LG · 2021-04-27 · accept · none · ref 41
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
Universal Transformers cs.CL · 2018-07-10 · unverdicted · none · ref 15
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
Convex Compositional Reasoning Models cs.LG · 2026-05-22 · unverdicted · none · ref 9 · 2 links · internal anchor
CCEM parameterizes compositional energy factors with input-convex neural networks and optimizes over a convex relaxation to enable deterministic scaling from small to large combinatorial reasoning instances.

Neural GPUs Learn Algorithms

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer