arxiv: 1911.05507 · v1 · submitted 2019-11-13 · 💻 cs.LG · stat.ML

Compressive Transformers for Long-Range Sequence Modelling

Jack W. Rae , Anna Potapenko , Siddhant M. Jayakumar , Timothy P. Lillicrap This is my paper

Pith reviewed 2026-05-18 10:41 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords compressive transformerlong-range sequence modellinglanguage modellingmemory compressiontransformer architectureWikiText-103Enwik8PG-19 benchmark

0 comments

The pith

The Compressive Transformer compresses past memories to achieve state-of-the-art results on long-range language modeling benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Compressive Transformer as an attentive sequence model that compresses earlier memories rather than discarding them when processing long inputs. This mechanism aims to extend the effective context length while keeping computational costs manageable compared to standard transformers that scale quadratically with sequence length. The authors report new state-of-the-art numbers on the WikiText-103 and Enwik8 language modeling benchmarks and show the same architecture works for speech and reinforcement learning memory. They also release a new book-derived benchmark called PG-19 to support further research into long-range modeling. If the compression step preserves the necessary predictive information, models could handle book-scale contexts without proportional growth in memory or compute.

Core claim

The Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively, by compressing past memories for long-range sequence learning. It can also model high-frequency speech effectively and serve as a memory mechanism for reinforcement learning on an object matching task.

What carries the argument

The compressive memory, which applies a learned compression network to summarize and retain past segment activations for later use in attention.

If this is right

The model reaches 17.1 perplexity on WikiText-103 while using compressed memory.
It reaches 0.97 bits per character on Enwik8.
It models high-frequency speech sequences effectively.
It functions as a memory component in reinforcement learning on object matching tasks.
A new open-vocabulary benchmark derived from books, PG-19, is introduced for long-range evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The success of compression suggests that much of the distant context can be summarized into a smaller representation without destroying the signals needed for next-token prediction.
This memory design could be tested on tasks that require precise recall of specific facts from thousands of tokens earlier to see where summarization begins to fail.
Combining the compression step with other efficiency methods might further increase the reachable context length for transformer-based systems.

Load-bearing premise

Compressing past memories retains the information necessary for accurate long-range predictions on the reported benchmarks without introducing unacceptable information loss.

What would settle it

Running the model on sequences several times longer than those in WikiText-103 or Enwik8 and checking whether perplexity or bits-per-character rises sharply once the compressed memory must store summaries of distant events.

read the original abstract

We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Compressive Transformer, an attentive sequence model that augments the standard Transformer with a learned compression mechanism over past memories to enable longer-range sequence modeling. It reports state-of-the-art language modeling results on WikiText-103 (17.1 perplexity) and Enwik8 (0.97 bits per character), shows effective modeling of high-frequency speech, demonstrates use as a memory module in reinforcement learning on an object-matching task, and proposes the PG-19 open-vocabulary book-derived benchmark for long-range language modeling.

Significance. If the compressive memory mechanism can be shown to retain task-relevant information without unacceptable loss while controlling for capacity, the work would advance long-context sequence modeling and provide a practical alternative to simply scaling memory size. The PG-19 benchmark proposal is a clear positive contribution that could help standardize evaluation in this domain. The architecture description and training details are presented in sufficient detail to support follow-up work.

major comments (1)

[Experiments] Experiments section (results on WikiText-103 and Enwik8): the headline SOTA claims rest on the assumption that the compressive memory (learned compression plus attention over compressed slots) is what produces the reported 17.1 ppl and 0.97 bpc. No controlled ablation is presented that disables only the compression step while holding total memory footprint, parameter count, and training procedure fixed; therefore it remains possible that equivalent gains could be obtained by enlarging a standard Transformer memory, rendering the compression premise non-load-bearing for the central empirical result.

minor comments (2)

[Model] Notation for the compressive memory slots and the compression function could be introduced earlier with a clear diagram contrasting it to standard Transformer memory to improve readability.
[Experiments] The paper should explicitly state the total memory size (in tokens or slots) used for the baseline Transformer comparisons to allow direct assessment of capacity-matched controls.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of the compressive memory mechanism and the PG-19 benchmark. We address the major comment on the experiments in detail below, providing clarification on existing comparisons while agreeing to strengthen the manuscript with an additional controlled ablation.

read point-by-point responses

Referee: Experiments section (results on WikiText-103 and Enwik8): the headline SOTA claims rest on the assumption that the compressive memory (learned compression plus attention over compressed slots) is what produces the reported 17.1 ppl and 0.97 bpc. No controlled ablation is presented that disables only the compression step while holding total memory footprint, parameter count, and training procedure fixed; therefore it remains possible that equivalent gains could be obtained by enlarging a standard Transformer memory, rendering the compression premise non-load-bearing for the central empirical result.

Authors: We appreciate the referee highlighting this important methodological point. The manuscript already includes comparisons to Transformer-XL baselines with varying memory lengths, demonstrating that simply extending the non-compressive memory yields improvements but does not match the performance of the Compressive Transformer at equivalent effective context lengths. However, we acknowledge that an ablation which precisely disables the compression operation while exactly matching total memory footprint (by increasing the number of uncompressed memory slots in a standard Transformer), parameter count, and training procedure is not explicitly reported. To directly address this concern and confirm that the learned compression contributes beyond capacity scaling alone, we will add this controlled experiment to the revised manuscript. The new results will be presented alongside the existing ablations on compression rate and memory usage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of any self-referential derivation

full rationale

The manuscript introduces the Compressive Transformer as an architectural extension of the standard Transformer, with explicit mechanisms for memory compression and attention over compressed representations. Performance claims (17.1 ppl on WikiText-103, 0.97 bpc on Enwik8) are presented strictly as measured outcomes on held-out test sets after training, not as outputs of any closed-form derivation or fitted parameter that is then re-labeled as a prediction. No equation in the paper reduces the reported metrics to the compression operator by algebraic identity, and no uniqueness theorem or ansatz is smuggled via self-citation to force the architecture. The work is therefore self-contained against external benchmarks; any concerns about ablations or alternative explanations belong to correctness or experimental design rather than circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full paper would likely list additional hyperparameters and modeling assumptions. The ledger below captures the minimal elements implied by the abstract.

free parameters (1)

compression rate or memory size
Hyperparameter controlling how aggressively past activations are compressed; value chosen to achieve reported benchmark scores.

axioms (1)

domain assumption Attention-based sequence models can be extended with lossy compression of distant memories while retaining sufficient signal for next-token or next-action prediction.
Core modeling premise required for the compressive mechanism to improve rather than degrade long-range performance.

invented entities (1)

Compressive memory no independent evidence
purpose: Compact storage of information from distant past tokens or states.
New architectural component introduced to overcome memory limits of standard transformers.

pith-pipeline@v0.9.0 · 5633 in / 1307 out tokens · 41227 ms · 2026-05-18T10:41:35.199769+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing eight_tick_forces_D3 unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
stat.ML 2026-05 unverdicted novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL 2020-12 conditional novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Massive Activations in Large Language Models
cs.CL 2024-02 unverdicted novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
cs.CL 2024-02 unverdicted novelty 7.0

LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
cs.LG 2026-05 unverdicted novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
cs.LG 2026-05 unverdicted novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
cs.LG 2026-05 unverdicted novelty 6.0

PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
cs.CL 2026-05 unverdicted novelty 6.0

SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
cs.CL 2026-05 unverdicted novelty 6.0

SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
cs.CL 2026-04 unverdicted novelty 6.0

StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.
Next-Scale Autoregressive Models for Text-to-Motion Generation
cs.CV 2026-04 unverdicted novelty 6.0

MoScale introduces a hierarchical next-scale autoregressive framework for text-to-motion generation that achieves state-of-the-art performance by refining motions from coarse to fine temporal resolutions.
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
cs.CL 2026-03 unverdicted novelty 6.0

LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
Positional Encoding via Token-Aware Phase Attention
cs.CL 2025-09 unverdicted novelty 6.0

TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
cs.CL 2024-06 unverdicted novelty 6.0

Quest speeds up long-context LLM self-attention by up to 2.23x via query-dependent selection of top-K critical KV cache pages, cutting overall latency by 7.03x with negligible accuracy loss.
Adaptive Memory Decay for Log-Linear Attention
cs.LG 2026-05 conditional novelty 5.0

Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
cs.CL 2025-10 unverdicted novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
cs.AI 2025-12 unverdicted novelty 3.0

LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages · cited by 18 Pith papers · 44 internal anchors

[1]

Al-Rfou, D

R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3159--3166, 2019

work page 2019
[4]

S. Bai, J. Z. Kolter, and V. Koltun. Convolutional sequence modeling revisited, 2018 a . URL https://openreview.net/forum?id=rk8wKk-R-

work page 2018
[6]

DeepMind Lab

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. K \" u ttler, A. Lefrancq, S. Green, V. Vald \' e s, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. CoRR, abs/1612.03801, 2016. URL http://arxiv.org/abs/1612.03801

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3: 0 993--1022, Mar. 2003. ISSN 1532-4435

work page 2003
[15]

Espeholt, H

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1406--1415, 2018

work page 2018
[19]

Graves, G

A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwi \'n ska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538 0 (7626): 0 471, 2016

work page 2016
[22]

Hochreiter and J

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9 0 (8): 0 1735--1780, 1997

work page 1997
[24]

M. Hutter. The human knowledge compression contest. URL http://prize. hutter1. net, 6, 2012

work page 2012
[27]

Ko c isk \`y , J

T. Ko c isk \`y , J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6: 0 317--328, 2018

work page 2018
[29]

Dynamic Evaluation of Transformer Language Models

B. Krause, E. Kahembwe, I. Murray, and S. Renals. Dynamic evaluation of transformer language models. CoRR, abs/1904.08378, 2019. URL http://arxiv.org/abs/1904.08378

work page internal anchor Pith review Pith/arXiv arXiv 1904
[32]

Mikolov, M

T. Mikolov, M. Karafi \'a t, L. Burget, J. C ernock \`y , and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010

work page 2010
[33]

A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3915--3923, 2018

work page 2018
[35]

Paperno, G

D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, R. Fern \'a ndez, K. Erk, et al. The lambada dataset: Word prediction requiring a broad discourse context. Association for Computational Linguistics, 2016

work page 2016
[36]

J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pages 3621--3629, 2016

work page 2016
[38]

B. A. Richards and P. W. Frankland. The persistence and transience of memory. Neuron, 94 0 (6): 0 1071--1084, 2017

work page 2017
[39]

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323 0 (6088): 0 533, 1986

work page 1986
[40]

Santoro, R

A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. Lillicrap. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pages 7299--7310, 2018

work page 2018
[41]

Shoeybi, M

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019

work page 2019
[42]

Smith, P

S. Smith, P. jan Kindermans, C. Ying, and Q. V. Le. Don't decay the learning rate, increase the batch size. 2018. URL https://openreview.net/pdf?id=B1Yy1BxCZ

work page 2018
[44]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, . Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998--6008, 2017

work page 2017
[47]

L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739--8748, 2018

work page 2018
[48]

Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27, 2015

work page 2015
[49]

J. G. Zilly, R. K. Srivastava, J. Koutn \' k, and J. Schmidhuber. Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4189--4198. JMLR. org, 2017

work page 2017
[50]

nature , volume=

Mastering the game of Go with deep neural networks and tree search , author=. nature , volume=. 2016 , publisher=

work page 2016
[51]

Neural Turing Machines

Neural turing machines , author=. arXiv preprint arXiv:1410.5401 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Nature , volume=

Hybrid computing using a neural network with dynamic external memory , author=. Nature , volume=. 2016 , publisher=

work page 2016
[53]

Advances in neural information processing systems , pages=

End-to-end memory networks , author=. Advances in neural information processing systems , pages=

work page
[54]

Neural Machine Translation by Jointly Learning to Align and Translate

Neural machine translation by jointly learning to align and translate , author=. arXiv preprint arXiv:1409.0473 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

Advances in neural information processing systems , pages=

Offline handwriting recognition with multidimensional recurrent neural networks , author=. Advances in neural information processing systems , pages=

work page
[56]

Acoustics, speech and signal processing (icassp), 2013 ieee international conference on , pages=

Speech recognition with deep recurrent neural networks , author=. Acoustics, speech and signal processing (icassp), 2013 ieee international conference on , pages=. 2013 , organization=

work page 2013
[57]

International Conference on Machine Learning , pages=

Asynchronous methods for deep reinforcement learning , author=. International Conference on Machine Learning , pages=

work page
[58]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google's neural machine translation system: Bridging the gap between human and machine translation , author=. arXiv preprint arXiv:1609.08144 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Neural computation , volume=

Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

work page 1997
[60]

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems , volume=

The vanishing gradient problem during learning recurrent neural nets and problem solutions , author=. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems , volume=. 1998 , publisher=

work page 1998
[61]

Memory-based control with recurrent neural networks

Memory-based control with recurrent neural networks , author=. arXiv preprint arXiv:1512.04455 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

One-shot Learning with Memory-Augmented Neural Networks

One-shot learning with memory-augmented neural networks , author=. arXiv preprint arXiv:1605.06065 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Advances in Neural Information Processing Systems , pages=

Matching networks for one shot learning , author=. Advances in Neural Information Processing Systems , pages=

work page
[64]

Advances in Neural Information Processing Systems , pages=

Pointer networks , author=. Advances in Neural Information Processing Systems , pages=

work page
[65]

Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

Learning to create and reuse words in open-vocabulary neural language modeling , author=. arXiv preprint arXiv:1704.06986 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

Pointer Sentinel Mixture Models

Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[67]

Improving Neural Language Models with a Continuous Cache

Improving neural language models with a continuous cache , author=. arXiv preprint arXiv:1612.04426 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[68]

Advances in Neural Information Processing Systems , pages=

Unbounded cache model for online language modeling with open vocabulary , author=. Advances in Neural Information Processing Systems , pages=

work page
[69]

Efficient softmax approximation for GPUs

Efficient softmax approximation for GPUs , author=. arXiv preprint arXiv:1609.04309 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[70]

Acoustics, Speech, and Signal Processing, 2001

Classes for fast maximum entropy training , author=. Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP'01). 2001 IEEE International Conference on , volume=. 2001 , organization=

work page 2001
[71]

Layer Normalization

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[72]

Language Modeling with Gated Convolutional Networks

Language modeling with gated convolutional networks , author=. arXiv preprint arXiv:1612.08083 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[73]

Search Engine Guided Non-Parametric Neural Machine Translation

Search Engine Guided Non-Parametric Neural Machine Translation , author=. arXiv preprint arXiv:1705.07267 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[74]

Advances in Neural Information Processing Systems , pages=

Using fast weights to attend to the recent past , author=. Advances in Neural Information Processing Systems , pages=

work page
[75]

Memory Aware Synapses: Learning what (not) to forget

Memory Aware Synapses: Learning what (not) to forget , author=. arXiv preprint arXiv:1711.09601 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

Acoustics, Speech, and Signal Processing, 1995

Improved backing-off for m-gram language modeling , author=. Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on , volume=. 1995 , organization=

work page 1995
[77]

Thirteenth Annual Conference of the International Speech Communication Association , year=

LSTM neural networks for language modeling , author=. Thirteenth Annual Conference of the International Speech Communication Association , year=

work page
[78]

Proceedings of the 25th international conference on Machine learning , pages=

A unified architecture for natural language processing: Deep neural networks with multitask learning , author=. Proceedings of the 25th international conference on Machine learning , pages=. 2008 , organization=

work page 2008
[79]

A Convolutional Neural Network for Modelling Sentences

A convolutional neural network for modelling sentences , author=. arXiv preprint arXiv:1404.2188 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[80]

Exploring the Limits of Language Modeling

Exploring the limits of language modeling , author=. arXiv preprint arXiv:1602.02410 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[81]

, author=

Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. , author=. Psychological review , volume=. 1995 , publisher=

work page 1995
[82]

Trends in cognitive sciences , volume=

What learning systems do intelligent agents need? Complementary learning systems theory updated , author=. Trends in cognitive sciences , volume=. 2016 , publisher=

work page 2016
[83]

Pointing the Unknown Words

Pointing the unknown words , author=. arXiv preprint arXiv:1603.08148 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[84]

1949 , publisher=

The organization of behavior: A neurophysiological approach , author=. 1949 , publisher=

work page 1949
[85]

NY Houghton-Mifflin , year=

The psychology of language , author=. NY Houghton-Mifflin , year=

work page
[86]

2018 , url=

Convolutional Sequence Modeling Revisited , author=. 2018 , url=

work page 2018
[87]

International Conference on Artificial Neural Networks , pages=

Learning to learn using gradient descent , author=. International Conference on Artificial Neural Networks , pages=. 2001 , organization=

work page 2001
[88]

Advances in Neural Information Processing Systems , pages=

Learning to learn by gradient descent by gradient descent , author=. Advances in Neural Information Processing Systems , pages=

work page
[89]

Attentive Recurrent Comparators

Attentive recurrent comparators , author=. arXiv preprint arXiv:1703.00767 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[90]

Proceedings of the National Academy of Sciences , volume=

Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the National Academy of Sciences , volume=. 2017 , publisher=

work page 2017
[91]

International Conference on Learning Representations , year=

Memory-based Parameter Adaptation , author=. International Conference on Learning Representations , year=

work page
[92]

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Model-agnostic meta-learning for fast adaptation of deep networks , author=. arXiv preprint arXiv:1703.03400 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[93]

Advances in neural information processing systems , pages=

Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters , author=. Advances in neural information processing systems , pages=

work page
[94]

Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics , pages=

Understanding the difficulty of training deep feedforward neural networks , author=. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics , pages=

work page
[95]

Science , volume=

Human-level concept learning through probabilistic program induction , author=. Science , volume=. 2015 , publisher=

work page 2015
[96]

COURSERA: Neural networks for machine learning , volume=

Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude , author=. COURSERA: Neural networks for machine learning , volume=

work page
[97]

dvd , author=

English gigaword fifth edition ldc2011t07. dvd , author=. Philadelphia: Linguistic Data Consortium , year=

work page
[98]

On the State of the Art of Evaluation in Neural Language Models

On the state of the art of evaluation in neural language models , author=. arXiv preprint arXiv:1707.05589 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[99]

2017 , journal=

Learning to Remember Rare Events , author =. 2017 , journal=

work page 2017
[100]

Strategies for Training Large Vocabulary Neural Language Models

Strategies for training large vocabulary neural language models , author=. arXiv preprint arXiv:1512.04906 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[101]

Deep Meta-Learning: Learning to Learn in the Concept Space

Deep Meta-Learning: Learning to Learn in the Concept Space , author=. arXiv preprint arXiv:1802.03596 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[102]

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Breaking the softmax bottleneck: a high-rank RNN language model , author=. arXiv preprint arXiv:1711.03953 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[103]

Learning to learn , pages=

Lifelong learning algorithms , author=. Learning to learn , pages=. 1998 , publisher=

work page 1998
[104]

1996a , school=

The Leabra model of neural interactions and learning in the neocortex , author=. 1996a , school=

work page
[105]

Neural computation , volume=

Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm , author=. Neural computation , volume=. 1996b , publisher=

work page
[106]

Advances in Neural Information Processing Systems , pages=

Scaling memory-augmented neural networks with sparse reads and writes , author=. Advances in Neural Information Processing Systems , pages=

work page

Showing first 80 references.