pith. machine review for the scientific record.

arXiv: 2008.02217 · v3 · submitted 2020-07-16 · 💻 cs.NE · cs.CL · cs.LG · stat.ML

Recognition: 3 Lean theorem links

Hopfield Networks is All You Need

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 05:42 UTC · model grok-4.3

classification 💻 cs.NE · cs.CL · cs.LG · stat.ML
keywords Hopfield networks · attention mechanism · transformers · associative memory · multiple instance learning · deep learning layers · pattern retrieval

The pith

A modern Hopfield network with continuous states has an update rule identical to the attention mechanism in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a continuous-state Hopfield network whose new update rule stores exponentially many patterns (in the dimension of the associative space) and retrieves them in one update with exponentially small error. This update computes a weighted average of stored patterns using softmax similarities, which is the same computation as the attention mechanism in transformers. The equivalence lets the authors reinterpret transformer heads as performing global averaging in early layers and metastable partial averaging in deeper layers. Hopfield layers built from this rule can be dropped into existing networks to add explicit storage of raw inputs or prototypes. Experiments show these layers raise accuracy on multiple instance learning tasks, immune repertoire classification, and drug design datasets.
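A minimal NumPy sketch of these retrieval dynamics at two inverse temperatures; the dimension, pattern count, noise level, and β values are illustrative assumptions, not settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 6))       # 6 stored patterns as columns in a 64-dim space
    X /= np.linalg.norm(X, axis=0)     # unit-norm patterns

    def hopfield_update(xi, X, beta):
        """One step of the continuous Hopfield update: a softmax-weighted
        average of the stored patterns (the columns of X)."""
        p = np.exp(beta * (X.T @ xi))
        p /= p.sum()
        return X @ p

    xi = X[:, 0] + 0.1 * rng.normal(size=64)   # noisy cue near pattern 0
    for beta in (0.1, 32.0):
        z = xi.copy()
        for _ in range(50):
            z = hopfield_update(z, X, beta)
        # low beta: z settles near the global average of all patterns;
        # high beta: z snaps to the single stored pattern nearest the cue
        print(beta, np.linalg.norm(z - X.mean(axis=1)), np.linalg.norm(z - X[:, 0]))

Intermediate β with a cluster of similar patterns would exhibit the third regime the paper describes: a metastable average over that subset.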

Core claim

The central claim is that the fixed-point update for the modern continuous Hopfield network is mathematically equivalent to the scaled dot-product attention operation, so transformer attention heads can be understood as retrieving from an associative memory whose energy minima include a global average, subset averages, and single-pattern states.

What carries the argument

The continuous-state Hopfield update rule, which forms a softmax-weighted average of stored patterns and serves as both a memory retrieval step and an attention layer.
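Stacking this update over a batch of query states makes the equivalence checkable in a few lines: with β set to 1/√d, the batched Hopfield update and scaled dot-product attention coincide exactly. The toy shapes below are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_queries, n_stored = 16, 3, 10
    Q = rng.normal(size=(n_queries, d))   # state (query) vectors as rows
    K = rng.normal(size=(n_stored, d))    # stored patterns acting as keys
    V = K                                 # plain Hopfield reading: values are the patterns

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    beta = 1.0 / np.sqrt(d)               # beta = 1/sqrt(d_k) recovers transformer scaling

    hopfield_out = softmax(beta * Q @ K.T) @ V           # batched Hopfield update
    attention_out = softmax(Q @ K.T / np.sqrt(d)) @ V    # scaled dot-product attention

    assert np.allclose(hopfield_out, attention_out)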

Load-bearing premise

The continuous-state dynamics stay stable and trainable when the Hopfield update is used as a layer inside large gradient-based networks without creating new optimization problems or losing exponential storage capacity.

What would settle it

Training a transformer in which every attention block is replaced by the Hopfield update and observing either divergence during training or a large drop in benchmark accuracy compared with the original attention version.
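A minimal PyTorch sketch of what that replacement could look like; HopfieldAttention is a hypothetical module written to illustrate the test, not the authors' released implementation (which lives at https://github.com/ml-jku/hopfield-layers).

    import torch
    import torch.nn as nn

    class HopfieldAttention(nn.Module):
        """Hypothetical drop-in for a single-head attention block:
        one Hopfield update per forward pass, optionally iterated."""

        def __init__(self, d_model: int, n_steps: int = 1):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.beta = d_model ** -0.5
            self.n_steps = n_steps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            q, k, v = self.q(x), self.k(x), self.v(x)
            for _ in range(self.n_steps):
                # softmax-weighted average over the stored (key) patterns
                q = torch.softmax(self.beta * q @ k.transpose(-2, -1), dim=-1) @ v
            return q

With n_steps = 1 the forward pass is exactly single-head scaled dot-product attention, which is the equivalence itself; training with n_steps > 1, or with a learned or mismatched beta, is where divergence or an accuracy drop would become informative.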

Original abstract

We introduce a modern Hopfield network with continuous states and a corresponding update rule. The new Hopfield network can store exponentially (with the dimension of the associative space) many patterns, retrieves the pattern with one update, and has exponentially small retrieval errors. It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models. These heads perform in the first layers preferably global averaging and in higher layers partial averaging via metastable states. The new modern Hopfield network can be integrated into deep learning architectures as layers to allow the storage of and access to raw input data, intermediate results, or learned prototypes. These Hopfield layers enable new ways of deep learning, beyond fully-connected, convolutional, or recurrent networks, and provide pooling, memory, association, and attention mechanisms. We demonstrate the broad applicability of the Hopfield layers across various domains. Hopfield layers improved state-of-the-art on three out of four considered multiple instance learning problems as well as on immune repertoire classification with several hundreds of thousands of instances. On the UCI benchmark collections of small classification tasks, where deep learning methods typically struggle, Hopfield layers yielded a new state-of-the-art when compared to different machine learning methods. Finally, Hopfield layers achieved state-of-the-art on two drug design datasets. The implementation is available at: https://github.com/ml-jku/hopfield-layers

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a continuous-state modern Hopfield network whose update rule is equivalent to the attention mechanism in transformers. It claims exponential storage capacity (with associative dimension), one-step retrieval with exponentially small errors, and three types of energy minima (global fixed point, metastable subset averages, and single-pattern fixed points). The equivalence is used to characterize transformer attention heads as performing global averaging in early layers and partial averaging via metastable states in higher layers. Hopfield layers are proposed for integration into deep networks to provide memory, association, and attention, with experiments showing SOTA results on multiple instance learning, immune repertoire classification, UCI benchmarks, and drug design tasks.

Significance. If the equivalence holds and the isolated-network guarantees transfer to end-to-end gradient-trained models, the work bridges classical associative memory with modern attention, offering both an interpretive lens for transformer heads and a new architectural primitive. The public implementation and empirical gains on tasks with hundreds of thousands of instances (including SOTA on three of four MIL problems and immune repertoire classification) are concrete strengths that support practical utility.

major comments (3)
  1. §3 (Update Rule Derivation): the equivalence of the continuous-state Hopfield update (derived from the log-sum-exp energy) to the transformer attention formula is asserted as enabling head characterization, but the manuscript provides no explicit derivation steps or proof sketch showing whether the mapping is an algebraic identity or is independently derived from the dynamics.
  2. Theoretical Capacity Section: the claims of exponential storage capacity, single-update retrieval, and exponentially small errors are established for fixed patterns, yet no bounds or analysis demonstrate that pattern separation or retrieval-error scaling is preserved when the stored patterns become trainable parameters jointly optimized by gradient descent inside deep networks.
  3. Experimental Evaluation: task accuracies are reported, but the sections contain no measurements of effective capacity, retrieval error after training, or stability of the three energy-minima types when Hopfield layers are inserted into large-scale gradient-trained models, leaving the transfer of isolated-network guarantees unverified.
minor comments (2)
  1. Abstract: the phrase 'exponentially small retrieval errors' lacks a brief statement of the scaling (e.g., dependence on dimension or number of patterns) that would clarify the claim for readers; a hedged sketch of the intended scaling follows this list.
  2. Implementation Details: while the GitHub link is given, the manuscript could add a short table or paragraph listing the key hyperparameters (e.g., beta scaling, number of stored patterns) used for the Hopfield layers across the reported experiments.
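As a concrete reading of minor comment 1, one hedged paraphrase of the intended scaling, reconstructed from the abstract's claims rather than quoted from the paper: with inverse temperature β and a separation gap Δᵢ between pattern xᵢ and its nearest competitor, the one-step retrieval error should decay exponentially in βΔᵢ.

    % Hedged paraphrase, not the paper's exact theorem statement.
    % Separation of stored pattern x_i from its competitors:
    \Delta_i = \min_{j \neq i} \left( x_i^\top x_i - x_i^\top x_j \right)
    % One-step retrieval error, exponentially small in the gap:
    \left\| f(\xi) - x_i \right\| = O\!\left( e^{-\beta \Delta_i} \right)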

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive summary and constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: §3 (Update Rule Derivation): the equivalence of the continuous-state Hopfield update (from the log-sum-exp energy) to the transformer attention formula is asserted as enabling head characterization, but the manuscript provides no explicit derivation steps or proof sketch showing whether the mapping is by algebraic identity or independently derived from the dynamics.

    Authors: We agree that an explicit derivation improves rigor. The equivalence follows from algebraic identity: starting from the gradient of the log-sum-exp energy, a sequence of substitutions and rearrangements directly yields the scaled dot-product attention formula. In the revised manuscript we will insert a self-contained proof sketch in Section 3 that details every algebraic step and the required variable mappings. revision: yes

  2. Referee: Theoretical Capacity Section: the claims of exponential storage capacity, single-update retrieval, and exponentially small errors are established for fixed patterns, yet no bounds or analysis demonstrate that pattern separation or retrieval error scaling is preserved when the stored patterns become trainable parameters jointly optimized by gradient descent inside deep networks.

    Authors: The stated capacity, retrieval, and error results are derived for fixed patterns in the isolated network; the manuscript does not assert that identical scaling laws hold verbatim once patterns become trainable parameters inside an end-to-end gradient-trained model. We will revise the theoretical section and the discussion to state this scope limitation explicitly and to clarify that practical utility in deep networks is supported by the reported empirical results rather than by an extended theoretical guarantee. revision: yes

  3. Referee: Experimental Evaluation: task accuracies are reported, but the sections contain no measurements of effective capacity, retrieval error after training, or stability of the three energy minima types when Hopfield layers are inserted into large-scale gradient-trained models, leaving the transfer of isolated-network guarantees unverified.

    Authors: We acknowledge that direct post-training diagnostics would strengthen the empirical case. In the revision we will add a short supplementary analysis on at least two representative tasks, reporting measured retrieval error and qualitative observations on the energy minima reached after training. A full-scale verification across every experiment is computationally demanding and will be noted as future work. revision: partial
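A minimal sketch of the kind of post-training diagnostic promised in response 3, assuming access to a trained layer's stored patterns; the function name, interface, and interpretation notes are hypothetical, written for illustration.

    import numpy as np

    def retrieval_diagnostics(xi, K, beta):
        """Probe a trained Hopfield layer at state xi. K holds the stored
        patterns as rows; names and interface are illustrative only."""
        logits = beta * K @ xi
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # softmax over stored patterns
        xi_new = K.T @ p                  # one Hopfield update step
        return {
            "step_norm": float(np.linalg.norm(xi_new - xi)),   # ~0 near a fixed point
            "max_weight": float(p.max()),                      # ~1: single-pattern minimum
            "entropy": float(-(p * np.log(p + 1e-12)).sum()),  # high: global average; mid: metastable
        }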

Circularity Check

0 steps flagged

No significant circularity in the derivation of the modern Hopfield update or its equivalence to attention

Full rationale

The paper defines a continuous-state Hopfield energy function, derives the corresponding one-step update rule from it, proves exponential storage capacity and retrieval properties for the isolated network, and shows by direct algebraic comparison that the update matches the transformer attention formula. The characterization of attention heads as global or metastable averaging follows from this equivalence and the energy minima analysis. No step reduces a claimed result to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled via prior work; the central claims rest on explicit energy-based derivations and mathematical identities that are independent of the conclusions drawn. Empirical demonstrations on MIL, UCI, and drug-design tasks further support applicability without relying on unverified extensions of the isolated-network guarantees.
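For reference, a compact rendering of the algebraic identity the audit describes, reconstructed from the abstract and the standard transformer convention; the weight-matrix names and the choice β = 1/√d_k follow that convention rather than being quoted from the paper.

    % Energy over continuous state \xi with stored patterns X = (x_1, ..., x_N):
    E(\xi) = -\tfrac{1}{\beta} \log \sum_{i=1}^{N} e^{\beta x_i^\top \xi}
             + \tfrac{1}{2}\, \xi^\top \xi + \text{const}
    % Its fixed-point update is a softmax-weighted average of the patterns:
    \xi^{\text{new}} = X \,\mathrm{softmax}\!\left( \beta X^\top \xi \right)
    % Stacking states as rows, Q = R W_Q, K = Y W_K, V = Y W_V, and taking
    % \beta = 1/\sqrt{d_k} yields scaled dot-product attention:
    Z = \mathrm{softmax}\!\left( \tfrac{1}{\sqrt{d_k}}\, Q K^\top \right) V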

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate free parameters or axioms; the central claims rest on an energy function and update derivation whose details are not visible from the abstract alone.

pith-pipeline@v0.9.0 · 5671 in / 1074 out tokens · 62050 ms · 2026-05-17T05:42:06.521473+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation Jlog_as_cosh · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · echoes

    We generalize this energy function to continuous-valued patterns while keeping the properties of the modern Hopfield networks like the exponential storage capacity and the extremely fast convergence

  • IndisputableMonolith.Foundation.LawOfExistence existence_economically_inevitable · echoes

    It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discrete Stochastic Localization for Non-autoregressive Generation

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.

  2. Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

    stat.ML 2026-05 unverdicted novelty 7.0

    Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

  3. HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    HeLa-Mem is a graph-based memory architecture for LLM agents that applies Hebbian learning to episodic associations and distills hubs into semantic knowledge, yielding better results on long-context benchmarks with fe...

  4. SRMU: Relevance-Gated Updates for Streaming Hyperdimensional Memories

    cs.AI 2026-04 unverdicted novelty 7.0

    SRMU is a relevance-gated update rule with temporal decay for VSA streaming associative memories that filters redundant and stale information, yielding 12.6% higher memory similarity and 53.5% lower cumulative memory ...

  5. FlowEqProp: Training Flow Matching Generative Models with Gradient Equilibrium Propagation

    cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

    FlowEqProp trains flow matching generative models using gradient equilibrium propagation on a 25k-parameter MLP for digit generation without backpropagation, producing recognizable samples and allowing quality gains f...

  6. SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

    cs.AI 2026-04 unverdicted novelty 7.0

    SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

  7. Context-Gated Associative Retrieval: From Theory to Transformers

    cond-mat.dis-nn 2026-05 unverdicted novelty 6.0

    Context gating in associative memories boosts inter-memory separation and sparsity for exponential retrieval gains, admits a unique fixed point driven by direct bias and feedback, and matches in-context learning dynam...

  8. HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

    cs.LG 2026-05 unverdicted novelty 6.0

    HoReN achieves stable sequential editing of 50K facts in LLMs by combining a normalized Hopfield codebook with angular retrieval and attractor dynamics.

  9. Emergent Self-Attention from Astrocyte-Gated Associative Memory Dynamics

    physics.data-an 2026-04 unverdicted novelty 6.0

    Astrocytic gains in a Hopfield network evolve under replicator dynamics to produce emergent self-attention as softmax routing on the gain simplex at fixed points.

  10. HyperSpace: A Generalized Framework for Spatial Encoding in Hyperdimensional Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    HyperSpace shows HRR and FHRR have comparable end-to-end runtime in spatial tasks despite FHRR's lower theoretical complexity per operation, with HRR using roughly half the memory.

  11. HyperSpace: A Generalized Framework for Spatial Encoding in Hyperdimensional Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    HyperSpace framework shows that HRR and FHRR VSA backends deliver comparable end-to-end performance in spatial encoding despite FHRR's lower theoretical complexity per operation, with HRR requiring half the memory.

  12. Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness

    cs.LG 2026-04 unverdicted novelty 6.0

    Dense associative memory retrieval converges geometrically with O(log N) time and tolerates adversarial corruptions under separation and bounded-interference conditions, achieving capacity scaling Θ(N^{n-1}).

  13. Dense Associative Memory with biased patterns: a Replica Symmetric analysis

    cond-mat.dis-nn 2026-04 unverdicted novelty 6.0

    Bias in stored patterns reduces the storage capacity of dense higher-order associative memories by the multiplicative factor (1-b²)^P while keeping superlinear scaling, confirmed by replica-symmetric analysis.

  14. GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion

    cs.LG 2025-12 conditional novelty 6.0

    GRAFT improves electric load forecasting accuracy by aligning multi-source daily texts with half-hour load series and using cross-attention fusion, outperforming baselines on a new Australian benchmark across hourly t...

  15. Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation

    cs.CV 2026-04 unverdicted novelty 5.0

    A label-propagation pipeline combining a segment proposer with Hopfield networks on multi-model embeddings can automatically annotate 60% of household object data for up to 50 classes using only limited initial labels.

  16. Introducing Echo Networks for Computational Neuroevolution

    cs.LG 2026-04 unverdicted novelty 5.0

    Echo Networks are recurrent networks defined by a single connection matrix with no layers, enabling matrix-based mutation and recombination in neuroevolution, and demonstrated on ECG signal classification.

  17. TTT3R: 3D Reconstruction as Test-Time Training

    cs.CV 2025-09 unverdicted novelty 5.0

    TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

  18. How is gene-regulatory evolution affected by cell-to-cell variability?

    q-bio.PE 2026-04 unverdicted novelty 4.0

    Cell-to-cell variability selects for aligned, motif-enriched gene regulatory networks that are robust to developmental noise and mutations.

  19. Energy-Based Dynamical Models for Neurocomputation, Learning, and Optimization

    cs.LG 2026-04 unverdicted novelty 3.0

    The paper reviews and extends energy-based dynamical models that use gradient flows and energy landscapes for neurocomputation, learning, and optimization tasks.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 9 internal anchors
