arxiv: 1410.5401 · v2 · submitted 2014-10-20 · 💻 cs.NE

Neural Turing Machines

Alex Graves , Greg Wayne , Ivo Danihelka

Pith reviewed 2026-05-13 07:29 UTC · model grok-4.3

classification 💻 cs.NE

keywords: neural turing machines, external memory, attention mechanisms, differentiable models, algorithm learning, neural networks, memory augmented networks

The pith

Neural networks gain an external memory bank they control through soft attention, creating end-to-end differentiable systems that learn algorithms from examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper attaches a neural network controller to a large external memory matrix and lets the network read and write through differentiable attention mechanisms. Because every operation remains continuous, the whole architecture can be trained with gradient descent on input-output pairs alone. The resulting system learns to execute simple algorithmic tasks such as copying sequences, sorting numbers, and retrieving items by learned associations. This setup keeps the memory interactions smooth enough for back-propagation to adjust both the network weights and the attention patterns simultaneously.

Core claim

Neural Turing Machines combine a neural network controller with an external memory resource accessed by attentional read and write operations; the entire system is differentiable end-to-end and therefore trainable by gradient descent, allowing it to infer simple algorithms such as copying, sorting, and associative recall directly from example input-output pairs.

What carries the argument

Differentiable attentional read and write heads that interact with an external memory matrix.
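
A minimal sketch of what such heads compute, assuming the weighted-sum read and the erase-then-add write usually associated with the NTM. The function names and the toy memory are illustrative and are not taken from the paper's code.

```python
def read(memory, w):
    """Soft read: weighted sum of memory rows, r[j] = sum_i w[i] * memory[i][j]."""
    width = len(memory[0])
    return [sum(w[i] * memory[i][j] for i in range(len(memory))) for j in range(width)]


def write(memory, w, erase, add):
    """Soft write: each row is partly erased, then added to, in proportion to w[i]."""
    return [
        [row[j] * (1.0 - w[i] * erase[j]) + w[i] * add[j] for j in range(len(row))]
        for i, row in enumerate(memory)
    ]


# Toy memory with 3 rows (locations) of width 2.
memory = [[0.0, 0.0], [1.0, 2.0], [0.0, 0.0]]
w = [0.0, 1.0, 0.0]  # attention focused entirely on row 1

r = read(memory, w)  # recovers row 1
new_memory = write(memory, w, erase=[1.0, 1.0], add=[5.0, 6.0])  # overwrites row 1
```

With a one-hot weighting the operations behave like ordinary indexed access. With a diffuse weighting every row is touched a little, which is what keeps the whole step continuous in `w`.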

Load-bearing premise

The soft attention operations used for reading and writing stay stable and trainable by gradient descent without causing vanishing gradients or optimization collapse on longer sequences.

What would settle it

Training runs that fail to converge on copying or sorting tasks once sequence length exceeds a modest threshold, with attention weights either collapsing or producing exploding gradients, would show the approach does not deliver stable algorithmic learning.

Original abstract

We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The NTM paper introduces a differentiable external memory accessed by attention, letting a neural controller learn simple algorithms like copying and sorting on synthetic tasks.

The main point is that they've coupled a neural controller to an external memory tape using soft attention for reads and writes, making the system fully differentiable so it can be trained with backprop to perform algorithmic tasks. They define content-based addressing with cosine similarity plus a location shift via convolution, which lets the model handle sequences without fixed internal state limits. The experiments show it learning to copy, repeat-copy, sort, and do associative recall from input-output pairs, with some extrapolation to longer test sequences than seen in training, and it beats plain LSTMs on those benchmarks. That combination of external memory and end-to-end training is genuinely new relative to prior RNN work.

The soft spots are mostly in the evidence base. All tasks stay synthetic and short, with no error bars, few runs, or strong ablations on memory size and head count. The gradient stability concern from the stress test lands: the paper gives no analysis of attention weight spread or Jacobian behavior over many steps, so it's unclear how well the addressing holds up on much longer inputs. Training details are also thin, and hyperparameter sensitivity is acknowledged but not quantified.

This is for researchers working on memory-augmented networks or trying to get models to execute procedures rather than just classify patterns. A reading group on differentiable programming or sequence models would get value from the architecture and the concrete task results. The thinking is clear and it properly situates itself against Turing machines and RNN literature.

I would recommend sending it to peer review; the core idea is worth referee time even with the need for more validation on stability and scale.
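
The two addressing steps named in the letter can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `content_address` and `shift` are hypothetical names, the shift distribution is passed as a plain dictionary, and further stages of the paper's addressing pipeline (interpolation with the previous weighting and a final sharpening step) are omitted.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)  # small constant avoids division by zero


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def content_address(memory, key, beta):
    """Softmax over key-to-row cosine similarities, scaled by the strength beta."""
    return softmax([beta * cosine(key, row) for row in memory])


def shift(w, s):
    """Circular convolution of weighting w with a shift distribution s.

    s maps an integer offset to its probability, e.g. {-1: 0.0, 0: 0.0, 1: 1.0}.
    """
    n = len(w)
    return [sum(w[(i - off) % n] * p for off, p in s.items()) for i in range(n)]


memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_content = content_address(memory, key=[0.0, 1.0], beta=20.0)  # peaks at row 1
w_shifted = shift(w_content, {-1: 0.0, 0: 0.0, 1: 1.0})         # peak moves to row 2
```

Content addressing finds a location by what it holds; the shift then moves the focus relative to that location, which is what lets a head step through memory in order.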

Referee Report

3 major / 2 minor

Summary. The paper introduces Neural Turing Machines (NTMs), neural networks augmented with an external differentiable memory matrix accessed via content-based and location-based attention heads. The controller is a neural network (feedforward or LSTM) that emits read/write weights; the full system is trained end-to-end by gradient descent. Preliminary experiments show the model can learn to copy, repeat-copy, sort, and perform associative recall on short synthetic sequences from input-output examples alone.

Significance. If the results hold under more rigorous evaluation, the work is significant because it supplies the first fully differentiable, end-to-end trainable analogue of a Turing machine with external memory. This opens a route to learning algorithmic procedures rather than merely pattern-matching, and the architecture has influenced subsequent memory-augmented networks. The paper also demonstrates that soft attention can implement both content and location addressing without hand-crafted rules.

major comments (3)

[§4] §4 (Experiments) and associated figures: the abstract and text describe successful learning on copy, sort, and recall tasks but supply no numerical error rates, training curves, baseline comparisons (e.g., LSTM or RNN), or hyper-parameter details. Without these, the claim that NTMs “infer simple algorithms” cannot be quantitatively evaluated.
[§3.2–3.3] §3.2–3.3 (Addressing mechanisms): the content and location addressing weights are produced by softmax; no analysis or ablation is given for the product of successive softmax Jacobians over many timesteps. This directly bears on the skeptic’s concern that gradients may vanish for sequences longer than the training lengths shown (~20 tokens).
[§4.1] §4.1 (Copy and repeat-copy tasks): success is reported only on short fixed-length sequences; no test of generalization to lengths substantially beyond the training distribution is presented, which is load-bearing for the claim that the model learns a general copying algorithm rather than a finite-state pattern.
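
The softmax Jacobian concern in the second major comment can be made concrete with a small numerical sketch. It uses the standard identity dp_i/dz_j = p_i (delta_ij - p_j) and looks at a single softmax in isolation; it is not a model of the full NTM gradient path, which also runs through the memory contents and the controller. The helper names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """J[i][j] = d p_i / d z_j = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)] for i in range(n)]


def max_abs_entry(matrix):
    return max(abs(x) for row in matrix for x in row)


diffuse = softmax_jacobian([0.0, 0.0, 0.0, 0.0])   # uniform attention
peaked = softmax_jacobian([12.0, 0.0, 0.0, 0.0])   # near one-hot attention

print(max_abs_entry(diffuse))  # 0.1875, i.e. 0.25 * (1 - 0.25)
print(max_abs_entry(peaked))   # orders of magnitude smaller: the softmax has saturated
```

Every entry of a softmax Jacobian has magnitude at most 0.25, and the entries shrink toward zero as the weighting approaches one-hot. A product of many such factors can therefore decay quickly, which is the behavior the comment asks the authors to analyze.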

minor comments (2)

[§3] Notation for the memory matrix M_t and the read vector r_t is introduced without an equation number at its first occurrence; adding an equation label would improve readability.
[§2] The paper cites only a handful of prior memory-augmented networks; a short related-work paragraph situating the NTM against contemporaneous differentiable-memory proposals would help readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your thoughtful review of our paper on Neural Turing Machines. We have carefully considered each of your major comments and have made revisions to the manuscript to address them where possible. Our point-by-point responses are provided below.

Referee: [§4] §4 (Experiments) and associated figures: the abstract and text describe successful learning on copy, sort, and recall tasks but supply no numerical error rates, training curves, baseline comparisons (e.g., LSTM or RNN), or hyper-parameter details. Without these, the claim that NTMs “infer simple algorithms” cannot be quantitatively evaluated.

Authors: We agree that additional quantitative details would strengthen the presentation. In the revised manuscript we have added training curves for each task (showing convergence to near-zero error), reported explicit final error rates in the text, included LSTM and RNN baseline comparisons demonstrating superior performance by the NTM on algorithmic tasks, and moved all hyper-parameter settings to a new appendix.
revision: yes

Referee: [§3.2–3.3] §3.2–3.3 (Addressing mechanisms): the content and location addressing weights are produced by softmax; no analysis or ablation is given for the product of successive softmax Jacobians over many timesteps. This directly bears on the skeptic’s concern that gradients may vanish for sequences longer than the training lengths shown (~20 tokens).

Authors: We acknowledge the value of analyzing gradient propagation through the successive softmax operations. Our empirical results show stable training without apparent vanishing for the lengths used; the combination of content-based and location-based addressing (with its convolutional shift) empirically preserves gradient flow. In revision we have added a short discussion in §3.3 on this point and the role of the shift operation, though a full Jacobian ablation remains future work.
revision: partial

Referee: [§4.1] §4.1 (Copy and repeat-copy tasks): success is reported only on short fixed-length sequences; no test of generalization to lengths substantially beyond the training distribution is presented, which is load-bearing for the claim that the model learns a general copying algorithm rather than a finite-state pattern.

Authors: The original experiments already included tests on sequences longer than the training distribution to support the algorithmic claim. To make this explicit we have expanded §4.1 with new results on variable-length inputs up to twice the training length, confirming that error rates remain low and the model continues to execute the copying procedure correctly.
revision: yes
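
A length-generalization check of the kind discussed in this exchange could be set up along these lines. Everything here is an assumption made for illustration: the episode layout (sequence, delimiter flag on an extra channel, then blank steps), the lengths 10, 20, and 40, and the `oracle_copier` stand-in, which takes the place of a trained model so the harness can run on its own.

```python
import random


def make_copy_example(length, width, rng):
    """Return (inputs, targets) for one copy-task episode.

    Inputs are `length` random bit vectors, then a delimiter flag on an extra
    channel, then `length` blank steps during which a model would emit its copy.
    """
    seq = [[rng.randint(0, 1) for _ in range(width)] for _ in range(length)]
    delimiter = [0] * width + [1]
    blanks = [[0] * (width + 1) for _ in range(length)]
    inputs = [bits + [0] for bits in seq] + [delimiter] + blanks
    return inputs, seq


def bit_errors(predicted, targets):
    """Count mismatched bits between two equally shaped bit sequences."""
    return sum(p != t for prow, trow in zip(predicted, targets) for p, t in zip(prow, trow))


def oracle_copier(inputs, width):
    """Stand-in for a trained model: returns the bits seen before the delimiter."""
    out = []
    for step in inputs:
        if step[width] == 1:
            break
        out.append(step[:width])
    return out


rng = random.Random(0)
for length in (10, 20, 40):  # 40 is twice an assumed training maximum of 20
    inputs, targets = make_copy_example(length, width=8, rng=rng)
    errors = bit_errors(oracle_copier(inputs, 8), targets)
    print(length, len(inputs), errors)
```

Swapping `oracle_copier` for a real model and reporting errors per length, including lengths never seen in training, is the evidence that separates a learned copying procedure from a memorized finite pattern.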

Circularity Check

0 steps flagged

No significant circularity in architectural proposal and empirical validation

The paper defines a novel Neural Turing Machine architecture by specifying controller, memory, and differentiable attentional read/write mechanisms, then validates it through experiments on synthetic tasks such as copying and associative recall. No derivation step reduces a claimed result to a fitted parameter or self-referential definition by construction. No load-bearing self-citations are used to establish uniqueness theorems or to smuggle in ansatzes. The central claims rest on explicit model equations and reported training outcomes rather than any circular reduction, making the work self-contained as an empirical architecture proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that memory read/write operations can be made fully differentiable and that gradient descent will successfully train the controller and attention mechanisms on the target tasks.

free parameters (1)

memory size and number of heads
Architectural hyperparameters that determine the external memory dimensions and attention capacity; chosen per task.

axioms (1)

domain assumption: All memory access operations are differentiable
Required for end-to-end gradient descent but not proven in the abstract.
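
For the read operation, this assumption can at least be spot-checked numerically. The sketch below assumes a weighted-sum read, for which the derivative of output component j with respect to weight i is simply the memory entry at row i, column j. The function names are illustrative, and a finite-difference check is a sanity test, not a proof.

```python
def read(memory, w):
    """Soft read: r[j] = sum_i w[i] * memory[i][j]."""
    width = len(memory[0])
    return [sum(w[i] * memory[i][j] for i in range(len(memory))) for j in range(width)]


def finite_difference(memory, w, i, j, eps=1e-6):
    """Central-difference estimate of d r_j / d w_i for the soft read."""
    up = list(w)
    up[i] += eps
    down = list(w)
    down[i] -= eps
    return (read(memory, up)[j] - read(memory, down)[j]) / (2 * eps)


memory = [[0.5, -1.0], [2.0, 3.0], [4.0, 0.25]]
w = [0.2, 0.5, 0.3]

# The analytic derivative d r_0 / d w_1 is memory[1][0].
estimate = finite_difference(memory, w, i=1, j=0)
```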

invented entities (1)

external memory bank with attention-based access (no independent evidence)
purpose: To provide storage beyond the neural controller's internal state
New component introduced by the paper; no independent evidence outside the reported experiments.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI 2019-06 accept novelty 9.0
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Gradient-Based Program Synthesis with Neurally Interpreted Languages
cs.LG 2026-04 unverdicted novelty 8.0
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
cs.LG 2026-03 unverdicted novelty 8.0
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 ti...

RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
cs.LG 2022-01 unverdicted novelty 8.0
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Show Your Work: Scratchpads for Intermediate Computation with Language Models
cs.LG 2021-11 unverdicted novelty 8.0
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

REALM: Retrieval-Augmented Language Model Pre-Training
cs.CL 2020-02 accept novelty 8.0
REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

Categorical Reparameterization with Gumbel-Softmax
stat.ML 2016-11 unverdicted novelty 8.0
Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.

Adaptive Computation Time for Recurrent Neural Networks
cs.NE 2016-03 accept novelty 8.0
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

Does Engram Do Memory Retrieval in Autoregressive Image Generation?
cs.CV 2026-05 accept novelty 7.0
Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0
Vicarious conditioning is proposed as a new intrinsic reward in RL that implements attention, retention, reproduction, and reinforcement via memory methods to enable low-shot learning from others without their policie...

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0
Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs...

Neural Information Causality
quant-ph 2026-05 unverdicted novelty 7.0
Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
stat.ML 2026-05 unverdicted novelty 7.0
Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

Screening Is Enough
cs.LG 2026-04 unverdicted novelty 7.0
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Concrete Problems in AI Safety
cs.AI 2016-06 accept novelty 7.0
The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
cs.LG 2026-05 unverdicted novelty 6.0
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
cs.LG 2026-05 unverdicted novelty 6.0
LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.

FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
cs.LG 2026-05 unverdicted novelty 6.0
FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.

Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities
cs.LG 2026-05 unverdicted novelty 6.0
Frozen text-pretrained transformer weights transfer across modalities through a thin interface, achieving SOTA on a robotic task and parity on decision-making with far fewer trainable parameters.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
cs.LG 2026-04 conditional novelty 6.0
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
cs.CL 2026-04 unverdicted novelty 6.0
ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning
cs.RO 2026-03 unverdicted novelty 6.0
BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
cs.CL 2025-07 unverdicted novelty 6.0
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

Titans: Learning to Memorize at Test Time
cs.LG 2024-12 unverdicted novelty 6.0
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

Universal Transformers
cs.CL 2018-07 unverdicted novelty 6.0
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
cs.LG 2026-05 unverdicted novelty 5.0
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...

Graph Memory Transformer (GMT)
cs.LG 2026-04 unverdicted novelty 5.0
Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding a 82M-parameter decoder-only LM that trains stably but trails a 103M de...

Neural Computers
cs.LG 2026-04 unverdicted novelty 5.0
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...

Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
cs.LG 2026-04 unverdicted novelty 4.0
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.

S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective, and Energy-Frugal Reasoning
cs.NE 2026-05 unverdicted novelty 3.0
S-AI-Recursive operationalizes reasoning as a closed-loop hormonal iteration with Clarifine and Confusionin to reach stable equilibrium, achieving competitive benchmark performance with under 10 million parameters via...

A PyTorch Library of Turing-Complete Neural Networks
cs.LG 2026-05 unverdicted novelty 3.0
A PyTorch package constructs neural networks that exactly simulate given Turing machines using transformer and recurrent architectures derived from prior theoretical results.
