Neural GPUs Learn Algorithms

Ilya Sutskever; Łukasz Kaiser

arxiv: 1511.08228 · v3 · pith:YM7SLPASnew · submitted 2015-11-25 · 💻 cs.LG · cs.NE

keywords: neural, long, numbers, train, algorithms, been, instances, large

Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal, NTMs have a weakness caused by their sequential nature: they are not parallel and are hard to train due to their large depth when unfolded. We present a neural network architecture to address this problem: the Neural GPU. It is based on a type of convolutional gated recurrent unit and, like the NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly parallel, which makes it easier to train and efficient to run. An essential property of algorithms is their ability to handle inputs of arbitrary size. We show that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances. We verified it on a number of tasks, including long addition and long multiplication of numbers represented in binary. We train the Neural GPU on numbers with up to 20 bits and observe no errors whatsoever while testing it, even on much longer numbers. To achieve these results we introduce a technique for training deep recurrent networks: parameter sharing relaxation. We also found a small amount of dropout and gradient noise to have a large positive effect on learning and generalization.
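
The abstract names the building block only as "a type of convolutional gated recurrent unit". As a rough illustration of what such a unit can look like, the sketch below applies the usual GRU gating equations with the matrix products swapped for convolutions. It is not the authors' implementation: the 1D tape (the state could equally be a higher-dimensional tensor), the zero padding, the step count equal to the input length, and the names conv1d, cgru_step, random_params and run are all assumptions made for this example.

```python
import math
import random


def conv1d(state, kernel, bias):
    """Same-size 1D convolution with zero padding (illustrative helper).

    state:  n positions, each a list of m channel values
    kernel: k taps, each an m x m matrix (input channel -> output channel)
    bias:   m values, one per output channel
    """
    n, m, k = len(state), len(bias), len(kernel)
    half = k // 2
    out = []
    for i in range(n):
        acc = list(bias)
        for t in range(k):
            j = i + t - half
            if 0 <= j < n:  # positions outside the tape count as zero
                for a in range(m):
                    for b in range(m):
                        acc[b] += state[j][a] * kernel[t][a][b]
        out.append(acc)
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cgru_step(state, params):
    """One gated update: s' = u * s + (1 - u) * tanh(conv(r * s))."""
    u = [[sigmoid(x) for x in row] for row in conv1d(state, *params["update"])]
    r = [[sigmoid(x) for x in row] for row in conv1d(state, *params["reset"])]
    gated = [[ri * si for ri, si in zip(rr, sr)] for rr, sr in zip(r, state)]
    cand = [[math.tanh(x) for x in row]
            for row in conv1d(gated, *params["candidate"])]
    return [[ui * si + (1.0 - ui) * ci for ui, si, ci in zip(ur, sr, cr)]
            for ur, sr, cr in zip(u, state, cand)]


def random_params(m, k, rng):
    """Small random kernels and zero biases for the three convolutions."""
    def bank():
        kernel = [[[rng.uniform(-0.5, 0.5) for _ in range(m)]
                   for _ in range(m)] for _ in range(k)]
        return kernel, [0.0] * m
    return {"update": bank(), "reset": bank(), "candidate": bank()}


def run(state, params, steps=None):
    """Apply the same unit repeatedly; defaults to one step per position."""
    steps = len(state) if steps is None else steps
    for _ in range(steps):
        state = cgru_step(state, params)
    return state
```

Because the kernels do not depend on the tape length, the same parameters can be applied to a 5-position input and to a 40-position input, which is the property that lets a model trained on short instances be run on longer ones. Every position is also updated independently within a step, so a step can be computed in parallel across the tape.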

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
cs.LG 2022-01 unverdicted novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

PRIMETIME: Limits of LLMs in Temporal Primitives
cs.NE 2025-04 unverdicted novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
cs.CL 2024-04 conditional novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Language Models as Knowledge Bases?
cs.CL 2019-09 accept novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

Concrete Problems in AI Safety
cs.AI 2016-06 accept novelty 7.0

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

Convex Compositional Reasoning Models
cs.LG 2026-05 unverdicted novelty 6.0

CCEM parameterizes compositional factors with input-convex neural networks and optimizes the summed energy over a convex relaxation, allowing models trained on small instances to transfer to larger ones.

On the Spatiotemporal Dynamics of Generalization in Neural Networks
cs.LG 2026-02 unverdicted novelty 6.0

Deriving a neural cellular automaton from locality, symmetry, and stability postulates produces 100% accurate addition generalization from 16-digit to 1-million-digit inputs.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

Universal Transformers
cs.CL 2018-07 unverdicted novelty 6.0

Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.