Generating Long Sequences with Sparse Transformers
Pith reviewed 2026-05-10 19:45 UTC · model grok-4.3
The pith
Sparse factorizations of the attention matrix let transformers model sequences tens of thousands of timesteps long.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to O(n√n). We also introduce a variation on architecture and initialization to train deeper networks, the recomputation of attention matrices to save memory, and fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling.
What carries the argument
Sparse factorizations of the attention matrix that lower complexity from quadratic to O(n√n) while supporting long-range dependencies.
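To make the complexity claim concrete, the sketch below counts the (query, key) pairs kept by a two-head strided factorization and compares the total with dense causal attention. The pattern definition (one head covering the previous `stride` positions, one head covering every `stride`-th earlier position) follows the strided scheme as it is usually described for this paper and is not spelled out on this page, so treat it as an assumption; the function name `strided_counts` is hypothetical.

```python
import math


def strided_counts(n, stride):
    """Count (query, key) pairs kept by two strided heads under a causal mask.

    Assumed pattern (not stated on this page):
      head 1: positions max(0, i - stride) .. i
      head 2: positions j <= i with (i - j) % stride == 0
    """
    local = 0
    strided = 0
    for i in range(n):
        local += i - max(0, i - stride) + 1
        strided += i // stride + 1
    return local, strided


n = 1024
stride = math.isqrt(n)  # stride chosen as sqrt(n), here 32
local, strided = strided_counts(n, stride)
dense = n * (n + 1) // 2  # dense causal attention keeps every pair with j <= i
print(local, strided, dense)  # prints 33264 16896 524800
```

With n = 1024 the two sparse heads together keep about 50 thousand pairs against roughly 525 thousand for dense causal attention, and the gap widens as √n grows.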
If this is right
- Hundreds of layers become practical on sequences of tens of thousands of timesteps.
- State-of-the-art density modeling results are reached on Enwik8, CIFAR-10, and ImageNet-64 from raw bytes.
- Unconditional generation produces samples with global coherence and diversity.
- Self-attention in principle extends to sequences of length one million or more.
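As a rough check on the last point, the worked arithmetic below compares the number of attention entries per head at n = 10^6 under the two scalings. Constant factors, the number of heads and layers, and bytes per entry are all left out, so this is an order-of-magnitude sketch rather than a memory measurement.

```latex
\begin{align*}
n &= 10^{6} \\
\text{dense: } n^{2} &= 10^{12} \\
\text{sparse: } n\sqrt{n} &= 10^{6} \cdot 10^{3} = 10^{9} \\
\frac{n^{2}}{n\sqrt{n}} &= \sqrt{n} = 10^{3}
\end{align*}
```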
Where Pith is reading between the lines
- Structured sparsity may suffice for many long-range dependencies instead of requiring full attention.
- The approach could be tested on other high-dimensional sequence data such as video frames or full-length audio tracks.
- Learned or data-adaptive sparsity patterns might further improve efficiency beyond the fixed factorizations used here.
Load-bearing premise
The chosen sparse factorizations of the attention matrix retain sufficient expressivity to capture the long-range dependencies needed for the reported density modeling tasks.
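One necessary (not sufficient) condition for this premise can be checked mechanically: every position should be able to receive information from every earlier position within a small number of attention steps. The sketch below tests two-step reachability for a fixed pattern with block length `l` and `c` summary columns, and contrasts it with a purely local window. The pattern definition is recalled from the paper's description rather than taken from this page, the two heads are merged into one set for simplicity, and all function names are hypothetical.

```python
def fixed_pattern(n, l, c=1):
    """attend[i] = positions j <= i that position i may read (assumed fixed pattern)."""
    attend = []
    for i in range(n):
        same_block = {j for j in range(i + 1) if j // l == i // l}
        summary = {j for j in range(i + 1) if j % l >= l - c}  # last c cells of each block
        attend.append(same_block | summary)
    return attend


def local_only(n, l):
    """Baseline for contrast: each position reads only the last l positions."""
    return [set(range(max(0, i - l + 1), i + 1)) for i in range(n)]


def reachable_in_two_steps(attend):
    """True if every i can reach every j <= i through at most two attention hops."""
    for i, first_hop in enumerate(attend):
        two_step = set()
        for k in first_hop:
            two_step |= attend[k]
        if two_step != set(range(i + 1)):
            return False
    return True


print(reachable_in_two_steps(fixed_pattern(64, 8)))  # prints True
print(reachable_in_two_steps(local_only(64, 8)))     # prints False
```

Passing this check shows only that a path exists; whether the model learns to use it is what the density-modeling results have to demonstrate.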
What would settle it
A head-to-head comparison on one of the long-sequence tasks where a full-attention transformer achieves clearly superior density estimates or sample coherence compared with the sparse version.
Original abstract
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces sparse factorizations of the self-attention matrix in Transformers that reduce complexity from quadratic to O(n√n). Combined with changes to architecture and initialization for training deeper networks, attention recomputation to reduce memory use, and fast attention kernels, the resulting Sparse Transformers are shown to model sequences of tens of thousands of timesteps. The same architecture is applied to raw-byte modeling of text (Enwik8) and images (CIFAR-10 and ImageNet-64), achieving new state-of-the-art density modeling results and generating globally coherent unconditional samples. The work also indicates that self-attention can in principle handle sequences of length one million or more.
Significance. If the reported results hold under verification, the work is significant: it provides concrete, practical sparse attention patterns that enable self-attention to scale to sequence lengths far beyond the reach of dense Transformers, while retaining sufficient expressivity for high-quality density modeling on established benchmarks. The accompanying engineering contributions (recomputation, fast kernels) are immediately usable and lower the barrier to experimenting with longer contexts in language, vision, and audio.
major comments (2)
- [§3] §3, Eq. (3) and Figure 2: The central claim that the chosen strided and fixed sparse patterns retain sufficient expressivity for long-range dependencies rests on the SOTA density-modeling results, yet the manuscript contains no direct ablation of sparse versus dense attention on sequence lengths where dense attention remains tractable (e.g., n ≤ 2048). Without this comparison it is impossible to isolate whether the reported gains derive from the sparsity itself or from the deeper training and initialization changes.
- [§4–5] Experimental results (abstract and §4–5): The headline SOTA numbers on Enwik8, CIFAR-10, and ImageNet-64 are presented without error bars, without ablations that quantify the contribution of each proposed component (sparsity pattern, initialization, recomputation), and without an explicit statement of the exact training protocol and hyper-parameters. These omissions make the scaling claim difficult to reproduce or falsify.
minor comments (2)
- [§3] Notation for the two sparse patterns (strided vs. fixed) is introduced in §3 but the precise definition of the attention mask for each is only shown graphically in Figure 2; an explicit matrix-level equation would improve clarity.
- [abstract] The claim that sequences of length one million are feasible “in principle” is stated in the abstract but is not supported by any timing or memory measurements at that scale.
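For the first minor comment, one way the two masks could be written explicitly is sketched below. The index sets are reconstructed from how the strided and fixed patterns are usually described for this paper, with 0-indexed positions, stride l (chosen near √n), and c summary columns per block; the exact indexing in the manuscript may differ, so this should be checked against §3 before use.

```latex
% Strided pattern: one local head and one head that skips by the stride l.
\begin{align*}
A^{(1)}_i &= \{\, j : \max(0,\, i - l) \le j \le i \,\} \\
A^{(2)}_i &= \{\, j \le i : (i - j) \bmod l = 0 \,\}
\end{align*}

% Fixed pattern: one within-block head and one head reading c summary columns.
\begin{align*}
A^{(1)}_i &= \{\, j \le i : \lfloor j / l \rfloor = \lfloor i / l \rfloor \,\} \\
A^{(2)}_i &= \{\, j \le i : j \bmod l \in \{\, l - c, \dots, l - 1 \,\} \,\}
\end{align*}

% Matrix-level mask for head m.
\[
M^{(m)}_{ij} =
\begin{cases}
1 & \text{if } j \in A^{(m)}_i \\
0 & \text{otherwise}
\end{cases}
\]
```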
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
-
Referee: [§3] §3, Eq. (3) and Figure 2: The central claim that the chosen strided and fixed sparse patterns retain sufficient expressivity for long-range dependencies rests on the SOTA density-modeling results, yet the manuscript contains no direct ablation of sparse versus dense attention on sequence lengths where dense attention remains tractable (e.g., n ≤ 2048). Without this comparison it is impossible to isolate whether the reported gains derive from the sparsity itself or from the deeper training and initialization changes.
Authors: We agree that a controlled ablation on shorter sequences would help isolate the contribution of the sparse patterns from the architectural and initialization changes. Although the primary motivation is scaling to lengths where dense attention is infeasible, we will add an ablation in the revised manuscript: we will train matched-depth dense and sparse models on sequences of length 512–2048 and report the resulting bits-per-byte (or bits-per-dim) to quantify any expressivity gap introduced by sparsity.
Revision: yes
-
Referee: [§4–5] Experimental results (abstract and §4–5): The headline SOTA numbers on Enwik8, CIFAR-10, and ImageNet-64 are presented without error bars, without ablations that quantify the contribution of each proposed component (sparsity pattern, initialization, recomputation), and without an explicit statement of the exact training protocol and hyper-parameters. These omissions make the scaling claim difficult to reproduce or falsify.
Authors: We acknowledge that the current presentation lacks error bars, component-wise ablations, and a fully explicit training protocol, all of which are important for reproducibility. In the revised version we will (i) report standard deviations from at least three independent runs for the main Enwik8, CIFAR-10, and ImageNet-64 results, (ii) add ablation tables that isolate the effect of the sparse factorization, the deeper-network initialization, and attention recomputation, and (iii) include a detailed appendix listing all hyperparameters, optimizer settings, data preprocessing, and hardware used for each experiment.
Revision: yes
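The metrics named in both responses can be computed as in the sketch below: a cross-entropy loss in nats per byte (or per image dimension) is divided by ln 2 to give bits-per-byte or bits-per-dim, and repeated runs are summarized by mean and sample standard deviation. The function names are hypothetical and the numbers in the usage lines are placeholders chosen for easy arithmetic, not results from the paper.

```python
import math
import statistics


def nats_to_bits(loss_nats):
    """Convert an average cross-entropy from nats per symbol to bits per symbol."""
    return loss_nats / math.log(2)


def summarize_runs(bits_per_symbol):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(bits_per_symbol), statistics.stdev(bits_per_symbol)


# Sanity check: a uniform guess over 256 byte values costs 8 bits per byte.
print(round(nats_to_bits(math.log(256)), 6))  # prints 8.0

# Placeholder values only, standing in for three independent runs.
print(summarize_runs([1.0, 2.0, 3.0]))
```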
Circularity Check
No circularity: architectural proposal and empirical benchmarks are independent of fitted inputs or self-referential definitions.
The paper defines sparse attention factorizations (strided and fixed patterns) explicitly in §3 as a hand-designed reduction from dense O(n²) to O(n√n) attention, then evaluates the resulting model on standard external density-modeling benchmarks (Enwik8, CIFAR-10, ImageNet-64) whose test sets are disjoint from any training or hyperparameter choices. No equation equates a reported performance gain to a quantity defined by fitting the same data; no uniqueness theorem or ansatz is imported via self-citation to force the factorization choice; and the central claim (long-sequence modeling with hundreds of layers) rests on measured perplexity/BPD numbers rather than a renaming or self-definition. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparse factorizations of the attention matrix preserve enough long-range modeling capacity for the target tasks
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/DimensionForcing.lean: eight_tick_forces_D3; linking_requires_D3 (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We introduce sparse factorizations of the attention matrix which reduce this to O(n√n)... We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers... setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64.
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel; Jcost (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Sparse factorizations of the attention matrix... two 2d factorized attention schemes... strided... fixed
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Scaling Limits of Long-Context Transformers
For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
-
Convergent Stochastic Training of Attention and Understanding LoRA
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
-
Rotation Equivariant Mamba for Vision Tasks
EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
Efficiently Modeling Long Sequences with Structured State Spaces
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
-
Denoising Diffusion Probabilistic Models
Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
-
Scaling Laws for Neural Language Models
Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
-
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
-
Beyond Detection: A Structure-Aware Framework for Scene Text Tracking
SymTrack is the first systematic detection-free framework for scene text tracking that constructs benchmarks from video text spotting datasets and reports up to 11.97% AUC gains over prior trackers.
-
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer
A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
-
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
EndPrompt induces reliable long-context generalization in LLaMA models from sparse positional supervision via a two-segment short-sequence construction with terminal anchoring.
-
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
-
End-to-End Population Inference from Gravitational-Wave Strain using Transformers
Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.
-
Adaptive Head Budgeting for Efficient Multi-Head Attention
BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.
-
Improving Sparse Autoencoder with Dynamic Attention
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
-
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
-
LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
-
A Hormone-inspired Emotion Layer for Transformer language models (HELT)
HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
-
Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...
-
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
-
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
-
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy ...
-
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...
-
IAFormer: Interaction-Aware Transformer network for collider data analysis
IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an o...
-
Transformer Neural Processes - Kernel Regression
TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regressio...
-
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
-
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
Scalable Diffusion Models with Transformers
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
High-Resolution Image Synthesis with Latent Diffusion Models
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Scaling Laws for Autoregressive Generative Modeling
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
-
Rethinking Attention with Performers
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
-
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
-
Longformer: The Long-Document Transformer
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
Augmenting Self-attention with Persistent Memory
Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.
-
Approaching I/O-optimality for Approximate Attention
Presents I/O-efficient algorithms for approximate attention with almost-linear cost in n, approaching lower bounds in most parameter regimes.
-
Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models
Polynomial replacements for activations in MLPs, convolutions, and attention within MetaFormer yield PolyNeXt models that match or exceed standard performance on ImageNet, ADE20K, and robustness benchmarks while beati...
-
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
-
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer
A transformer with prediction-correction and hierarchical super-token encoding unifies simulation across six physical dynamics categories on shared Lagrangian particles.
-
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
-
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
-
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Compute Where it Counts: Self Optimizing Language Models
SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
-
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...