Compressive Transformers for Long-Range Sequence Modelling
Pith reviewed 2026-05-18 10:41 UTC · model grok-4.3
The pith
The Compressive Transformer compresses past memories to achieve state-of-the-art results on long-range language modeling benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively, by compressing past memories for long-range sequence learning. It can also model high-frequency speech effectively and serve as a memory mechanism for reinforcement learning on an object matching task.
What carries the argument
The compressive memory, which applies a learned compression network to summarize and retain past segment activations for later use in attention.
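The memory update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: mean pooling stands in for the learned compression network, and the names `mean_pool_compress`, `update_memories`, `mem_size`, `comp_mem_size`, and `rate` are hypothetical.

```python
import numpy as np

def mean_pool_compress(old_mem: np.ndarray, rate: int) -> np.ndarray:
    """Compress (n, d) activations to (n // rate, d) by averaging groups of `rate` rows.

    Assumption: mean pooling is a stand-in for a learned compression network.
    """
    n, d = old_mem.shape
    usable = (n // rate) * rate  # drop a ragged tail, if any
    return old_mem[:usable].reshape(-1, rate, d).mean(axis=1)

def update_memories(mem, comp_mem, segment, mem_size, comp_mem_size, rate):
    """First-in-first-out update of the two memories.

    New segment activations are appended to `mem`. Rows that overflow
    `mem_size` are evicted, compressed, and appended to `comp_mem`, whose
    own oldest rows are discarded once it exceeds `comp_mem_size` (> 0).
    """
    mem = np.concatenate([mem, segment], axis=0)
    overflow = max(mem.shape[0] - mem_size, 0)
    evicted, mem = mem[:overflow], mem[overflow:]
    if overflow >= rate:  # in this sketch, fewer than `rate` evicted rows are dropped
        compressed = mean_pool_compress(evicted, rate)
        comp_mem = np.concatenate([comp_mem, compressed], axis=0)[-comp_mem_size:]
    return mem, comp_mem

# Usage: four segments of six steps each, compressed three-to-one.
d = 4
mem, comp = np.zeros((0, d)), np.zeros((0, d))
for step in range(4):
    seg = np.full((6, d), float(step))
    mem, comp = update_memories(mem, comp, seg, mem_size=6, comp_mem_size=4, rate=3)
```

With these settings the uncompressed memory always holds the latest segment, while each older segment survives as two pooled rows until it falls off the end of the compressed memory.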
If this is right
- The model reaches 17.1 perplexity on WikiText-103 while using compressed memory.
- It reaches 0.97 bits per character on Enwik8.
- It models high-frequency speech sequences effectively.
- It functions as a memory component in reinforcement learning on object matching tasks.
- A new open-vocabulary benchmark derived from books, PG-19, is introduced for long-range evaluation.
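Both headline metrics in the list above are transformations of the same quantity, the average negative log-likelihood of the test data. The sketch below shows the standard conversions, assuming losses are accumulated in nats; the function names are illustrative.

```python
import math

def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity is the exponential of the mean per-token loss in nats."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Bits per character is the mean per-character loss converted from nats to bits."""
    return total_nll_nats / (num_chars * math.log(2))

# A mean loss of ln(17.1) nats per token corresponds to 17.1 perplexity,
# and 0.97 * ln(2) nats per character corresponds to 0.97 bpc.
ppl = perplexity(math.log(17.1) * 1000, 1000)
bpc = bits_per_character(0.97 * math.log(2) * 1000, 1000)
```

Lower is better for both, and the two numbers are not comparable with each other because one is normalized per token and the other per character.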
Where Pith is reading between the lines
- The success of compression suggests that much of the distant context can be summarized into a smaller representation without destroying the signals needed for next-token prediction.
- This memory design could be tested on tasks that require precise recall of specific facts from thousands of tokens earlier to see where summarization begins to fail.
- Combining the compression step with other efficiency methods might further increase the reachable context length for transformer-based systems.
Load-bearing premise
Compressing past memories retains the information necessary for accurate long-range predictions on the reported benchmarks without introducing unacceptable information loss.
What would settle it
Running the model on sequences several times longer than those in WikiText-103 or Enwik8 and checking whether perplexity or bits-per-character rises sharply once the compressed memory must store summaries of distant events.
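One way such a check could be organized is to bucket per-token loss by position within a long sequence and look for the first bucket where loss jumps. The following is a hypothetical analysis sketch, not a procedure from the paper; `loss_by_position`, `first_sharp_rise`, and the `tolerance` threshold are assumptions made for illustration.

```python
import numpy as np

def loss_by_position(token_nll: np.ndarray, bucket_size: int) -> np.ndarray:
    """Mean loss over consecutive buckets of positions in one long sequence."""
    n_buckets = len(token_nll) // bucket_size
    trimmed = token_nll[: n_buckets * bucket_size]
    return trimmed.reshape(n_buckets, bucket_size).mean(axis=1)

def first_sharp_rise(bucket_means: np.ndarray, tolerance: float):
    """Index of the first bucket whose mean loss exceeds the lowest mean
    seen in earlier buckets by more than `tolerance`, or None if none does."""
    running_min = np.minimum.accumulate(bucket_means)
    for i in range(1, len(bucket_means)):
        if bucket_means[i] - running_min[i - 1] > tolerance:
            return i
    return None

# Synthetic example: loss doubles halfway through a 200-token sequence.
nll = np.concatenate([np.full(100, 1.0), np.full(100, 2.0)])
buckets = loss_by_position(nll, bucket_size=50)
rise_at = first_sharp_rise(buckets, tolerance=0.5)
```

In a real evaluation the per-token losses would come from the trained model, and the bucket where the rise appears would be compared against the range covered by the uncompressed and compressed memories.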
Original abstract
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Compressive Transformer, an attentive sequence model that augments the standard Transformer with a learned compression mechanism over past memories to enable longer-range sequence modeling. It reports state-of-the-art language modeling results on WikiText-103 (17.1 perplexity) and Enwik8 (0.97 bits per character), shows effective modeling of high-frequency speech, demonstrates use as a memory module in reinforcement learning on an object-matching task, and proposes the PG-19 open-vocabulary book-derived benchmark for long-range language modeling.
Significance. If the compressive memory mechanism can be shown to retain task-relevant information without unacceptable loss while controlling for capacity, the work would advance long-context sequence modeling and provide a practical alternative to simply scaling memory size. The PG-19 benchmark proposal is a clear positive contribution that could help standardize evaluation in this domain. The architecture description and training details are presented in sufficient detail to support follow-up work.
major comments (1)
- [Experiments] Experiments section (results on WikiText-103 and Enwik8): the headline SOTA claims rest on the assumption that the compressive memory (learned compression plus attention over compressed slots) is what produces the reported 17.1 ppl and 0.97 bpc. No controlled ablation is presented that disables only the compression step while holding total memory footprint, parameter count, and training procedure fixed; therefore it remains possible that equivalent gains could be obtained by enlarging a standard Transformer memory, rendering the compression premise non-load-bearing for the central empirical result.
minor comments (2)
- [Model] Notation for the compressive memory slots and the compression function could be introduced earlier with a clear diagram contrasting it to standard Transformer memory to improve readability.
- [Experiments] The paper should explicitly state the total memory size (in tokens or slots) used for the baseline Transformer comparisons to allow direct assessment of capacity-matched controls.
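The capacity-matching concern in these comments comes down to simple accounting: a compressed slot costs the same attention budget as an uncompressed one but stands for several past time steps. The sketch below makes that explicit. The class, its field names, and the numbers in the usage example are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryConfig:
    segment_len: int           # tokens in the current segment
    mem_slots: int             # uncompressed memory slots
    comp_mem_slots: int = 0    # compressed memory slots
    compression_rate: int = 1  # past time steps summarized by one compressed slot

    @property
    def attended_slots(self) -> int:
        """Slots each layer attends over, which drives attention cost."""
        return self.segment_len + self.mem_slots + self.comp_mem_slots

    @property
    def context_tokens(self) -> int:
        """Past time steps those slots represent, per layer."""
        return (self.segment_len + self.mem_slots
                + self.comp_mem_slots * self.compression_rate)

def slot_matched_baseline(cfg: MemoryConfig) -> MemoryConfig:
    """Uncompressed baseline with the same number of attended slots."""
    return MemoryConfig(cfg.segment_len, cfg.mem_slots + cfg.comp_mem_slots)

# Illustrative sizes only.
compressive = MemoryConfig(segment_len=512, mem_slots=512,
                           comp_mem_slots=512, compression_rate=3)
baseline = slot_matched_baseline(compressive)
```

Under this accounting the two configurations attend over the same number of slots, so any gap in their results would be attributable to what the compressed slots contain rather than to how many slots there are.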
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of the compressive memory mechanism and the PG-19 benchmark. We address the major comment on the experiments in detail below, providing clarification on existing comparisons while agreeing to strengthen the manuscript with an additional controlled ablation.
Point-by-point responses
Referee: Experiments section (results on WikiText-103 and Enwik8): the headline SOTA claims rest on the assumption that the compressive memory (learned compression plus attention over compressed slots) is what produces the reported 17.1 ppl and 0.97 bpc. No controlled ablation is presented that disables only the compression step while holding total memory footprint, parameter count, and training procedure fixed; therefore it remains possible that equivalent gains could be obtained by enlarging a standard Transformer memory, rendering the compression premise non-load-bearing for the central empirical result.
Authors: We appreciate the referee highlighting this important methodological point. The manuscript already includes comparisons to Transformer-XL baselines with varying memory lengths, demonstrating that simply extending the non-compressive memory yields improvements but does not match the performance of the Compressive Transformer at equivalent effective context lengths. However, we acknowledge that an ablation which precisely disables the compression operation while exactly matching total memory footprint (by increasing the number of uncompressed memory slots in a standard Transformer), parameter count, and training procedure is not explicitly reported. To directly address this concern and confirm that the learned compression contributes beyond capacity scaling alone, we will add this controlled experiment to the revised manuscript. The new results will be presented alongside the existing ablations on compression rate and memory usage.
revision: yes
Circularity Check
No circularity: empirical benchmark results independent of any self-referential derivation
Full rationale
The manuscript introduces the Compressive Transformer as an architectural extension of the standard Transformer, with explicit mechanisms for memory compression and attention over compressed representations. Performance claims (17.1 ppl on WikiText-103, 0.97 bpc on Enwik8) are presented strictly as measured outcomes on held-out test sets after training, not as outputs of any closed-form derivation or fitted parameter that is then re-labeled as a prediction. No equation in the paper reduces the reported metrics to the compression operator by algebraic identity, and no uniqueness theorem or ansatz is smuggled via self-citation to force the architecture. The work is therefore self-contained against external benchmarks; any concerns about ablations or alternative explanations belong to correctness or experimental design rather than circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- compression rate or memory size
axioms (1)
- domain assumption: Attention-based sequence models can be extended with lossy compression of distant memories while retaining sufficient signal for next-token or next-action prediction.
invented entities (1)
- Compressive memory: no independent evidence
Lean theorems connected to this paper
- Foundation.DimensionForcing eight_tick_forces_D3 (unclear)
Relation between the paper passage and the cited Recognition theorem:
The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory.
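The quoted passage describes a single attention operation over both memories. A minimal single-head sketch of that idea follows, assuming the keys and values are built from the concatenation of compressed memory, memory, and the current segment. It omits positional encodings and multiple heads, and the function and weight names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(segment, mem, comp_mem, w_q, w_k, w_v):
    """Queries come from the current segment; keys and values come from
    [compressed memory, memory, segment], ordered oldest to newest."""
    context = np.concatenate([comp_mem, mem, segment], axis=0)
    q, k, v = segment @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Causal mask: every memory slot is visible, but a segment position
    # may not attend to later positions within the same segment.
    s, n_past = segment.shape[0], comp_mem.shape[0] + mem.shape[0]
    future = np.triu(np.ones((s, s), dtype=bool), k=1)
    mask = np.concatenate([np.zeros((s, n_past), dtype=bool), future], axis=1)
    weights = softmax(np.where(mask, -np.inf, scores), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
out, weights = attend(rng.normal(size=(4, d)), rng.normal(size=(3, d)),
                      rng.normal(size=(2, d)),
                      *(rng.normal(size=(d, d)) for _ in range(3)))
```

Because the compressed and uncompressed slots share one softmax, the model decides per query how much weight to place on coarse distant summaries versus fine-grained recent activations.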
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
-
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...
-
Training Transformers for KV Cache Compressibility
KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.
-
Next-Scale Autoregressive Models for Text-to-Motion Generation
MoScale introduces a hierarchical next-scale autoregressive framework for text-to-motion generation that achieves state-of-the-art performance by refining motions from coarse to fine temporal resolutions.
-
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
-
Positional Encoding via Token-Aware Phase Attention
TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
-
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Quest speeds up long-context LLM self-attention by up to 2.23x via query-dependent selection of top-K critical KV cache pages, cutting overall latency by 7.03x with negligible accuracy loss.
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...
-
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.