pith. machine review for the scientific record.

arXiv: 2008.02217 · v3 · submitted 2020-07-16 · 💻 cs.NE · cs.CL · cs.LG · stat.ML

Recognition: 3 Lean theorem links

Hopfield Networks is All You Need

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 05:42 UTC · model grok-4.3

classification 💻 cs.NE · cs.CL · cs.LG · stat.ML
keywords Hopfield networks · attention mechanism · transformers · associative memory · multiple instance learning · deep learning layers · pattern retrieval

The pith

A modern Hopfield network with continuous states has an update rule identical to the attention mechanism in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a continuous-state Hopfield network whose new update rule stores exponentially many patterns (in the dimension of the associative space) and retrieves them in one update with exponentially small error. This update computes a weighted average of stored patterns using softmax similarities, which is the same computation as the attention mechanism in transformers. The equivalence lets the authors reinterpret transformer heads as performing global averaging in early layers and metastable partial averaging in deeper layers. Hopfield layers built from this rule can be dropped into existing networks to add explicit storage of raw inputs or prototypes. Experiments show these layers raise accuracy on multiple instance learning tasks, immune repertoire classification, and drug design datasets.
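A minimal NumPy sketch of these retrieval dynamics at two inverse temperatures; the dimension, pattern count, noise level, and β values are illustrative assumptions, not settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 6))       # 6 stored patterns as columns in a 64-dim space
    X /= np.linalg.norm(X, axis=0)     # unit-norm patterns

    def hopfield_update(xi, X, beta):
        """One step of the continuous Hopfield update: a softmax-weighted
        average of the stored patterns (the columns of X)."""
        p = np.exp(beta * (X.T @ xi))
        p /= p.sum()
        return X @ p

    xi = X[:, 0] + 0.1 * rng.normal(size=64)   # noisy cue near pattern 0
    for beta in (0.1, 32.0):
        z = xi.copy()
        for _ in range(50):
            z = hopfield_update(z, X, beta)
        # low beta: z settles near the global average of all patterns;
        # high beta: z snaps to the single stored pattern nearest the cue
        print(beta, np.linalg.norm(z - X.mean(axis=1)), np.linalg.norm(z - X[:, 0]))

Intermediate β with a cluster of similar patterns would exhibit the third regime the paper describes: a metastable average over that subset.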

Core claim

The central claim is that the fixed-point update for the modern continuous Hopfield network is mathematically equivalent to the scaled dot-product attention operation, so transformer attention heads can be understood as retrieving from an associative memory whose energy minima include a global average, subset averages, and single-pattern states.

What carries the argument

The continuous-state Hopfield update rule, which forms a softmax-weighted average of stored patterns and serves as both a memory retrieval step and an attention layer.
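Stacking this update over a batch of query states makes the equivalence checkable in a few lines: with β set to 1/√d, the batched Hopfield update and scaled dot-product attention coincide exactly. The toy shapes below are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_queries, n_stored = 16, 3, 10
    Q = rng.normal(size=(n_queries, d))   # state (query) vectors as rows
    K = rng.normal(size=(n_stored, d))    # stored patterns acting as keys
    V = K                                 # plain Hopfield reading: values are the patterns

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    beta = 1.0 / np.sqrt(d)               # beta = 1/sqrt(d_k) recovers transformer scaling

    hopfield_out = softmax(beta * Q @ K.T) @ V           # batched Hopfield update
    attention_out = softmax(Q @ K.T / np.sqrt(d)) @ V    # scaled dot-product attention

    assert np.allclose(hopfield_out, attention_out)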

Load-bearing premise

The continuous-state dynamics stay stable and trainable when the Hopfield update is used as a layer inside large gradient-based networks without creating new optimization problems or losing exponential storage capacity.

What would settle it

Training a transformer in which every attention block is replaced by the Hopfield update and observing either divergence during training or a large drop in benchmark accuracy compared with the original attention version.
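A minimal PyTorch sketch of what that replacement could look like; HopfieldAttention is a hypothetical module written to illustrate the test, not the authors' released implementation (which lives at https://github.com/ml-jku/hopfield-layers).

    import torch
    import torch.nn as nn

    class HopfieldAttention(nn.Module):
        """Hypothetical drop-in for a single-head attention block:
        one Hopfield update per forward pass, optionally iterated."""

        def __init__(self, d_model: int, n_steps: int = 1):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.beta = d_model ** -0.5
            self.n_steps = n_steps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            q, k, v = self.q(x), self.k(x), self.v(x)
            for _ in range(self.n_steps):
                # softmax-weighted average over the stored (key) patterns
                q = torch.softmax(self.beta * q @ k.transpose(-2, -1), dim=-1) @ v
            return q

With n_steps = 1 the forward pass is exactly single-head scaled dot-product attention, which is the equivalence itself; training with n_steps > 1, or with a learned or mismatched beta, is where divergence or an accuracy drop would become informative.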

Original abstract

We introduce a modern Hopfield network with continuous states and a corresponding update rule. The new Hopfield network can store exponentially (with the dimension of the associative space) many patterns, retrieves the pattern with one update, and has exponentially small retrieval errors. It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models. These heads perform in the first layers preferably global averaging and in higher layers partial averaging via metastable states. The new modern Hopfield network can be integrated into deep learning architectures as layers to allow the storage of and access to raw input data, intermediate results, or learned prototypes. These Hopfield layers enable new ways of deep learning, beyond fully-connected, convolutional, or recurrent networks, and provide pooling, memory, association, and attention mechanisms. We demonstrate the broad applicability of the Hopfield layers across various domains. Hopfield layers improved state-of-the-art on three out of four considered multiple instance learning problems as well as on immune repertoire classification with several hundreds of thousands of instances. On the UCI benchmark collections of small classification tasks, where deep learning methods typically struggle, Hopfield layers yielded a new state-of-the-art when compared to different machine learning methods. Finally, Hopfield layers achieved state-of-the-art on two drug design datasets. The implementation is available at: https://github.com/ml-jku/hopfield-layers

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a continuous-state modern Hopfield network whose update rule is equivalent to the attention mechanism in transformers. It claims exponential storage capacity (with associative dimension), one-step retrieval with exponentially small errors, and three types of energy minima (global fixed point, metastable subset averages, and single-pattern fixed points). The equivalence is used to characterize transformer attention heads as performing global averaging in early layers and partial averaging via metastable states in higher layers. Hopfield layers are proposed for integration into deep networks to provide memory, association, and attention, with experiments showing SOTA results on multiple instance learning, immune repertoire classification, UCI benchmarks, and drug design tasks.

Significance. If the equivalence holds and the isolated-network guarantees transfer to end-to-end gradient-trained models, the work bridges classical associative memory with modern attention, offering both an interpretive lens for transformer heads and a new architectural primitive. The public implementation and empirical gains on tasks with hundreds of thousands of instances (including SOTA on three of four MIL problems and immune repertoire classification) are concrete strengths that support practical utility.

major comments (3)
  1. §3 (Update Rule Derivation): the equivalence of the continuous-state Hopfield update (derived from the log-sum-exp energy) to the transformer attention formula is asserted as enabling head characterization, but the manuscript provides no explicit derivation steps or proof sketch showing whether the mapping is an algebraic identity or is independently derived from the dynamics.
  2. Theoretical Capacity Section: the claims of exponential storage capacity, single-update retrieval, and exponentially small errors are established for fixed patterns, yet no bounds or analysis demonstrate that pattern separation or retrieval-error scaling is preserved when the stored patterns become trainable parameters jointly optimized by gradient descent inside deep networks.
  3. Experimental Evaluation: task accuracies are reported, but the sections contain no measurements of effective capacity, retrieval error after training, or stability of the three energy-minima types when Hopfield layers are inserted into large-scale gradient-trained models, leaving the transfer of isolated-network guarantees unverified.
minor comments (2)
  1. Abstract: the phrase 'exponentially small retrieval errors' lacks a brief statement of the scaling (e.g., dependence on dimension or number of patterns) that would clarify the claim for readers; a hedged sketch of the intended scaling follows this list.
  2. Implementation Details: while the GitHub link is given, the manuscript could add a short table or paragraph listing the key hyperparameters (e.g., beta scaling, number of stored patterns) used for the Hopfield layers across the reported experiments.
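As a concrete reading of minor comment 1, one hedged paraphrase of the intended scaling, reconstructed from the abstract's claims rather than quoted from the paper: with inverse temperature β and a separation gap Δᵢ between pattern xᵢ and its nearest competitor, the one-step retrieval error should decay exponentially in βΔᵢ.

    % Hedged paraphrase, not the paper's exact theorem statement.
    % Separation of stored pattern x_i from its competitors:
    \Delta_i = \min_{j \neq i} \left( x_i^\top x_i - x_i^\top x_j \right)
    % One-step retrieval error, exponentially small in the gap:
    \left\| f(\xi) - x_i \right\| = O\!\left( e^{-\beta \Delta_i} \right)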

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive summary and constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: §3 (Update Rule Derivation): the equivalence of the continuous-state Hopfield update (from the log-sum-exp energy) to the transformer attention formula is asserted as enabling head characterization, but the manuscript provides no explicit derivation steps or proof sketch showing whether the mapping is by algebraic identity or independently derived from the dynamics.

    Authors: We agree that an explicit derivation improves rigor. The equivalence follows from algebraic identity: starting from the gradient of the log-sum-exp energy, a sequence of substitutions and rearrangements directly yields the scaled dot-product attention formula. In the revised manuscript we will insert a self-contained proof sketch in Section 3 that details every algebraic step and the required variable mappings. revision: yes

  2. Referee: Theoretical Capacity Section: the claims of exponential storage capacity, single-update retrieval, and exponentially small errors are established for fixed patterns, yet no bounds or analysis demonstrate that pattern separation or retrieval error scaling is preserved when the stored patterns become trainable parameters jointly optimized by gradient descent inside deep networks.

    Authors: The stated capacity, retrieval, and error results are derived for fixed patterns in the isolated network; the manuscript does not assert that identical scaling laws hold verbatim once patterns become trainable parameters inside an end-to-end gradient-trained model. We will revise the theoretical section and the discussion to state this scope limitation explicitly and to clarify that practical utility in deep networks is supported by the reported empirical results rather than by an extended theoretical guarantee. revision: yes

  3. Referee: Experimental Evaluation: task accuracies are reported, but the sections contain no measurements of effective capacity, retrieval error after training, or stability of the three energy minima types when Hopfield layers are inserted into large-scale gradient-trained models, leaving the transfer of isolated-network guarantees unverified.

    Authors: We acknowledge that direct post-training diagnostics would strengthen the empirical case. In the revision we will add a short supplementary analysis on at least two representative tasks, reporting measured retrieval error and qualitative observations on the energy minima reached after training. A full-scale verification across every experiment is computationally demanding and will be noted as future work. revision: partial
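A minimal sketch of the kind of post-training diagnostic promised in response 3, assuming access to a trained layer's stored patterns; the function name, interface, and interpretation notes are hypothetical, written for illustration.

    import numpy as np

    def retrieval_diagnostics(xi, K, beta):
        """Probe a trained Hopfield layer at state xi. K holds the stored
        patterns as rows; names and interface are illustrative only."""
        logits = beta * K @ xi
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # softmax over stored patterns
        xi_new = K.T @ p                  # one Hopfield update step
        return {
            "step_norm": float(np.linalg.norm(xi_new - xi)),   # ~0 near a fixed point
            "max_weight": float(p.max()),                      # ~1: single-pattern minimum
            "entropy": float(-(p * np.log(p + 1e-12)).sum()),  # high: global average; mid: metastable
        }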

Circularity Check

0 steps flagged

No significant circularity in the derivation of the modern Hopfield update or its equivalence to attention

Full rationale

The paper defines a continuous-state Hopfield energy function, derives the corresponding one-step update rule from it, proves exponential storage capacity and retrieval properties for the isolated network, and shows by direct algebraic comparison that the update matches the transformer attention formula. The characterization of attention heads as global or metastable averaging follows from this equivalence and the energy minima analysis. No step reduces a claimed result to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled via prior work; the central claims rest on explicit energy-based derivations and mathematical identities that are independent of the conclusions drawn. Empirical demonstrations on MIL, UCI, and drug-design tasks further support applicability without relying on unverified extensions of the isolated-network guarantees.
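For reference, a compact rendering of the algebraic identity the audit describes, reconstructed from the abstract and the standard transformer convention; the weight-matrix names and the choice β = 1/√d_k follow that convention rather than being quoted from the paper.

    % Energy over continuous state \xi with stored patterns X = (x_1, ..., x_N):
    E(\xi) = -\tfrac{1}{\beta} \log \sum_{i=1}^{N} e^{\beta x_i^\top \xi}
             + \tfrac{1}{2}\, \xi^\top \xi + \text{const}
    % Its fixed-point update is a softmax-weighted average of the patterns:
    \xi^{\text{new}} = X \,\mathrm{softmax}\!\left( \beta X^\top \xi \right)
    % Stacking states as rows, Q = R W_Q, K = Y W_K, V = Y W_V, and taking
    % \beta = 1/\sqrt{d_k} yields scaled dot-product attention:
    Z = \mathrm{softmax}\!\left( \tfrac{1}{\sqrt{d_k}}\, Q K^\top \right) V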

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate free parameters or axioms; the central claims rest on an energy function and update derivation whose details are not visible from the abstract alone.

pith-pipeline@v0.9.0 · 5671 in / 1074 out tokens · 62050 ms · 2026-05-17T05:42:06.521473+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation Jlog_as_cosh · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · echoes

    We generalize this energy function to continuous-valued patterns while keeping the properties of the modern Hopfield networks like the exponential storage capacity and the extremely fast convergence

  • IndisputableMonolith.Foundation.LawOfExistence existence_economically_inevitable · echoes

    It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discrete Stochastic Localization for Non-autoregressive Generation

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.

  2. Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

    stat.ML 2026-05 unverdicted novelty 7.0

    Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

  3. HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    HeLa-Mem is a graph-based memory architecture for LLM agents that applies Hebbian learning to episodic associations and distills hubs into semantic knowledge, yielding better results on long-context benchmarks with fe...

  4. SRMU: Relevance-Gated Updates for Streaming Hyperdimensional Memories

    cs.AI 2026-04 unverdicted novelty 7.0

    SRMU is a relevance-gated update rule with temporal decay for VSA streaming associative memories that filters redundant and stale information, yielding 12.6% higher memory similarity and 53.5% lower cumulative memory ...

  5. FlowEqProp: Training Flow Matching Generative Models with Gradient Equilibrium Propagation

    cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

    FlowEqProp trains flow matching generative models using gradient equilibrium propagation on a 25k-parameter MLP for digit generation without backpropagation, producing recognizable samples and allowing quality gains f...

  6. SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

    cs.AI 2026-04 unverdicted novelty 7.0

    SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

  7. Context-Gated Associative Retrieval: From Theory to Transformers

    cond-mat.dis-nn 2026-05 unverdicted novelty 6.0

    Context gating in associative memories boosts inter-memory separation and sparsity for exponential retrieval gains, admits a unique fixed point driven by direct bias and feedback, and matches in-context learning dynam...

  8. HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

    cs.LG 2026-05 unverdicted novelty 6.0

    HoReN achieves stable sequential editing of 50K facts in LLMs by combining a normalized Hopfield codebook with angular retrieval and attractor dynamics.

  9. Emergent Self-Attention from Astrocyte-Gated Associative Memory Dynamics

    physics.data-an 2026-04 unverdicted novelty 6.0

    Astrocytic gains in a Hopfield network evolve under replicator dynamics to produce emergent self-attention as softmax routing on the gain simplex at fixed points.

  10. HyperSpace: A Generalized Framework for Spatial Encoding in Hyperdimensional Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    HyperSpace shows HRR and FHRR have comparable end-to-end runtime in spatial tasks despite FHRR's lower theoretical complexity per operation, with HRR using roughly half the memory.

  11. HyperSpace: A Generalized Framework for Spatial Encoding in Hyperdimensional Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    HyperSpace framework shows that HRR and FHRR VSA backends deliver comparable end-to-end performance in spatial encoding despite FHRR's lower theoretical complexity per operation, with HRR requiring half the memory.

  12. Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness

    cs.LG 2026-04 unverdicted novelty 6.0

    Dense associative memory retrieval converges geometrically with O(log N) time and tolerates adversarial corruptions under separation and bounded-interference conditions, achieving capacity scaling Θ(N^{n-1}).

  13. Dense Associative Memory with biased patterns: a Replica Symmetric analysis

    cond-mat.dis-nn 2026-04 unverdicted novelty 6.0

    Bias in stored patterns reduces the storage capacity of dense higher-order associative memories by the multiplicative factor (1-b²)^P while keeping superlinear scaling, confirmed by replica-symmetric analysis.

  14. GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion

    cs.LG 2025-12 conditional novelty 6.0

    GRAFT improves electric load forecasting accuracy by aligning multi-source daily texts with half-hour load series and using cross-attention fusion, outperforming baselines on a new Australian benchmark across hourly t...

  15. Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation

    cs.CV 2026-04 unverdicted novelty 5.0

    A label-propagation pipeline combining a segment proposer with Hopfield networks on multi-model embeddings can automatically annotate 60% of household object data for up to 50 classes using only limited initial labels.

  16. Introducing Echo Networks for Computational Neuroevolution

    cs.LG 2026-04 unverdicted novelty 5.0

    Echo Networks are recurrent networks defined by a single connection matrix with no layers, enabling matrix-based mutation and recombination in neuroevolution, and demonstrated on ECG signal classification.

  17. TTT3R: 3D Reconstruction as Test-Time Training

    cs.CV 2025-09 unverdicted novelty 5.0

    TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

  18. How is gene-regulatory evolution affected by cell-to-cell variability?

    q-bio.PE 2026-04 unverdicted novelty 4.0

    Cell-to-cell variability selects for aligned, motif-enriched gene regulatory networks that are robust to developmental noise and mutations.

  19. Energy-Based Dynamical Models for Neurocomputation, Learning, and Optimization

    cs.LG 2026-04 unverdicted novelty 3.0

    The paper reviews and extends energy-based dynamical models that use gradient flows and energy landscapes for neurocomputation, learning, and optimization tasks.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 9 internal anchors
