Recognition: 2 theorem links · Lean Theorem
Retentive Network: A Successor to Transformer for Large Language Models
Pith reviewed 2026-05-11 20:24 UTC · model grok-4.3
The pith
RetNet uses a retention mechanism to match Transformer performance while enabling parallel training and constant-time inference for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The retention mechanism is proposed as a sequence modeling approach that supports three computation paradigms: parallel representation for training parallelism, recurrent representation for O(1) inference cost per step, and chunkwise recurrent representation for linear-complexity long-sequence modeling. This unifies the benefits of recurrence and attention, enabling RetNet to achieve parallel training, low-cost deployment, efficient inference, and favorable scaling results on language modeling without performance loss relative to Transformers.
What carries the argument
The retention mechanism, which links recurrence and attention to support parallel, recurrent, and chunkwise recurrent computation paradigms for sequence modeling.
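To make the parallel/recurrent duality concrete, here is a minimal numerical sketch of a single retention head, assuming only the scalar decay described in the abstract; the paper's position-dependent rotation, group normalization, and multi-head structure are omitted, so this is an illustration rather than the reference implementation.

```python
import numpy as np

# Toy single-head retention: parallel form vs. recurrent form.
rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 4
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
gamma = 0.9  # scalar decay (illustrative value)

# Parallel form: (Q @ K.T * D) @ V with causal decay mask D[n, m] = gamma**(n - m).
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = (Q @ K.T * D) @ V

# Recurrent form: one constant-size state update per token, O(1) cost per step.
S = np.zeros((d_k, d_v))
out_recurrent = np.zeros((T, d_v))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])   # S_t = gamma * S_{t-1} + k_t v_t^T
    out_recurrent[t] = Q[t] @ S            # o_t = q_t S_t

assert np.allclose(out_parallel, out_recurrent)
```

The closing assertion checks that both forms produce identical outputs on the toy inputs, which is the equivalence the three-paradigm claim rests on.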
If this is right
- Training proceeds in full parallel mode like attention-based models.
- Inference runs at constant cost per token, improving throughput and reducing memory use.
- Long sequences are encoded with linear complexity by processing chunks in parallel while carrying state recurrently (see the sketch after this list).
- Overall deployment costs drop while maintaining competitive scaling on language tasks.
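A minimal sketch of the chunkwise recurrent form referenced above, under the same toy single-head, scalar-decay assumptions: each chunk is encoded in parallel, while a fixed-size state summarizes all earlier chunks.

```python
import numpy as np

def chunkwise_retention(Q, K, V, gamma, chunk):
    """Toy chunkwise recurrent retention (single head, scalar decay)."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))                  # cross-chunk recurrent state
    out = np.zeros((T, d_v))
    for c in range(0, T, chunk):
        q, k, v = Q[c:c+chunk], K[c:c+chunk], V[c:c+chunk]
        B = q.shape[0]
        j = np.arange(B)
        D = np.where(j[:, None] >= j[None, :],
                     gamma ** (j[:, None] - j[None, :]), 0.0)
        inner = (q @ k.T * D) @ v                       # within-chunk, fully parallel
        cross = (gamma ** (j + 1))[:, None] * (q @ S)   # contribution of earlier chunks
        out[c:c+B] = inner + cross
        # advance the carried state to the end of this chunk
        S = gamma ** B * S + (k * (gamma ** (B - 1 - j))[:, None]).T @ v
    return out
```

Setting the chunk size to the full sequence length recovers the parallel form, and setting it to 1 recovers the recurrent form; per-chunk work stays constant, so total work grows linearly with sequence length.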
Where Pith is reading between the lines
- The three-paradigm design could transfer to non-language sequence domains such as time-series or structured data.
- Hardware designs might specialize accelerators around the recurrent representation for further gains.
- Hybrid models could combine RetNet blocks with other efficiency methods to push scaling boundaries.
Load-bearing premise
The retention mechanism supplies modeling power equivalent to attention without any hidden performance trade-offs or extra parameters that would undermine the efficiency claims.
What would settle it
A head-to-head scaling experiment on language modeling benchmarks against a parameter- and compute-matched Transformer: the premise would fail if RetNet needed substantially more parameters or training compute to reach the same perplexity, or if its measured per-token inference latency grew with sequence length.
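A hedged sketch of the latency half of that test: time one decode step at increasing context lengths for a retention-style constant-size state versus an attention-style step that must read the full KV cache. The dimensions, the bare softmax step, and the resulting timings are illustrative; a real benchmark would use the released models, warm-up, and many repetitions.

```python
import time
import numpy as np

d, gamma = 64, 0.97
rng = np.random.default_rng(0)

def retention_step(q, k, v, S):
    S = gamma * S + np.outer(k, v)      # state update, independent of context length
    return q @ S, S

def attention_step(q, K_cache, V_cache):
    w = np.exp(q @ K_cache.T)           # must read the whole cache every step
    return (w / w.sum()) @ V_cache

for ctx in (1_000, 10_000, 100_000):
    q, k, v = rng.normal(size=(3, d))
    S = rng.normal(size=(d, d))
    K_cache, V_cache = rng.normal(size=(2, ctx, d))
    t0 = time.perf_counter(); retention_step(q, k, v, S); t_ret = time.perf_counter() - t0
    t0 = time.perf_counter(); attention_step(q, K_cache, V_cache); t_att = time.perf_counter() - t0
    print(f"ctx={ctx:>7}: retention step {t_ret*1e6:8.1f} us, attention step {t_att*1e6:8.1f} us")
```

A flat retention column alongside an attention column that grows with context would support the constant-cost claim; per-token retention latency that grows with context would count against it.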
Original abstract
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RetNet as a foundation architecture for large language models. It derives a retention mechanism from the theoretical connection between recurrence and attention, enabling three equivalent computation paradigms: parallel (for training parallelism), recurrent (for O(1) inference), and chunkwise recurrent (for linear-complexity long-sequence modeling). Experimental results on language modeling tasks are reported to show favorable scaling, parallel training, low-cost deployment, and efficient inference, positioning RetNet as a strong successor to the Transformer.
Significance. If the retention mechanism delivers equivalent modeling power and gradient behavior to softmax attention across scales with no hidden capacity loss or optimization biases, the work could meaningfully advance efficient LLM architectures by resolving the training-inference tradeoff. The explicit support for three paradigms and the promise of code release are strengths for reproducibility and practical adoption. However, the significance hinges on verifying the central equivalence claim, which currently rests primarily on empirical scaling curves rather than isolated ablations or formal bounds.
major comments (2)
- [Retention mechanism] Retention mechanism section (theoretical derivation): the claim that retention provides equivalent expressivity to attention without hidden trade-offs requires either formal bounds on long-range dependency modeling (e.g., showing that the exponential decay exactly reproduces arbitrary attention patterns) or a concrete test that the multi-head retention operator matches attention's receptive field and optimization dynamics. Without this, the 'successor' positioning is not yet load-bearing.
- [Experimental results] Experimental results section: the reported favorable scaling and perplexity results lack (a) direct capacity ablations that swap only the retention operator into a Transformer backbone while holding parameter count, depth, and training recipe fixed, and (b) error bars or multi-seed statistics. These omissions make it impossible to confirm the 'no hidden performance trade-offs' premise at the scales where RetNet is positioned as a Transformer successor.
minor comments (3)
- [Abstract] Abstract: the phrase 'favorable scaling results' is used without naming the specific metrics (e.g., perplexity vs. parameter count) or the exact Transformer baselines, which reduces clarity for readers scanning the high-level claims.
- [Method and figures] Notation and figures: the three computation paradigms are described in prose; introducing compact mathematical notation (e.g., the parallel, recurrent, and chunkwise recurrent forms) earlier and ensuring all figures include axis labels and legend clarity would improve readability.
- [Related work] Related work: a brief discussion of prior recurrence-attention hybrids (e.g., RWKV, Mamba) would help situate the novelty of the specific retention formulation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript on RetNet. The feedback highlights important aspects of the theoretical claims and experimental validation. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [Retention mechanism] Retention mechanism section (theoretical derivation): the claim that retention provides equivalent expressivity to attention without hidden trade-offs requires either formal bounds on long-range dependency modeling (e.g., showing that the exponential decay exactly reproduces arbitrary attention patterns) or a concrete test that the multi-head retention operator matches attention's receptive field and optimization dynamics. Without this, the 'successor' positioning is not yet load-bearing.
Authors: We appreciate the referee's emphasis on strengthening the theoretical grounding. Section 3 derives the retention mechanism directly from the recurrence-attention connection, establishing the mathematical equivalence of the parallel, recurrent, and chunkwise recurrent forms. This shows that retention supports the same sequence modeling operations as attention while enabling the three computation paradigms. Although the derivation does not include formal bounds proving that exponential decay exactly replicates every possible attention pattern, the operator is constructed to preserve long-range dependency modeling through the decay factor, and our scaling experiments indicate no substantial capacity loss. In revision, we will expand the discussion to explicitly analyze the receptive field of multi-head retention and its alignment with attention's optimization behavior, providing a more concrete comparison of the operators. revision: partial
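To make the receptive-field point in this response concrete: with decay gamma, the weight a query places on a token d positions back is scaled by gamma**d, so each retention head has an effective context horizon of roughly 1/(1 - gamma). The per-head decay values below are illustrative, not the paper's exact schedule.

```python
import numpy as np

# Effective horizon and half-life of the exponential decay for a few example decays.
for gamma in (1 - 2**-5, 1 - 2**-6, 1 - 2**-10):
    horizon = 1.0 / (1.0 - gamma)             # scale over which most weight is retained
    half_life = np.log(0.5) / np.log(gamma)   # distance at which the weight halves
    print(f"gamma={gamma:.10f}: horizon ~ {horizon:6.0f} tokens, half-life ~ {half_life:6.0f} tokens")
```

Assigning different decays to different heads yields a mixture of short and long horizons, which is the intuition behind the multi-head retention receptive field the response appeals to.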
-
Referee: [Experimental results] Experimental results section: the reported favorable scaling and perplexity results lack (a) direct capacity ablations that swap only the retention operator into a Transformer backbone while holding parameter count, depth, and training recipe fixed, and (b) error bars or multi-seed statistics. These omissions make it impossible to confirm the 'no hidden performance trade-offs' premise at the scales where RetNet is positioned as a Transformer successor.
Authors: We agree that these controls would increase confidence in the results. Our reported experiments train RetNet and Transformer models with matched parameter counts, depths, and training recipes, yielding competitive perplexity and scaling curves. However, we did not isolate the operator swap or report multi-seed statistics. In the revised manuscript, we will add error bars computed from multiple independent training runs to all scaling and perplexity figures. We will also include a new ablation that replaces only the attention layers with the retention operator inside an otherwise identical Transformer backbone, keeping all other factors fixed, to directly test for hidden trade-offs. revision: yes
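A hedged sketch of the controlled comparison promised here: two runs that differ only in the token-mixing operator, with depth, width, the training recipe, and the seed set held fixed. The config keys and values are illustrative (loosely modeled on the roughly 1.3B-scale settings reported in the paper's appendix), not the released training configuration.

```python
import statistics

# Two configurations that differ only in the "mixer" field.
base = dict(
    layers=24, hidden=2048, ffn=4096, heads=8,   # illustrative ~1.3B-scale shape
    lr=6e-4, tokens_per_batch="4M", steps=25_000,
    seeds=(0, 1, 2),                             # multiple seeds so curves carry error bars
)
runs = {
    "transformer_baseline": {**base, "mixer": "softmax_attention"},
    "retention_swap":       {**base, "mixer": "multiscale_retention"},
}

def summarize(ppl_by_seed):
    """Report a run as mean and standard deviation of perplexity over seeds."""
    vals = list(ppl_by_seed.values())
    return statistics.mean(vals), statistics.stdev(vals)
```

Reporting each curve as mean and standard deviation over seeds, with only the mixer swapped, is what would isolate any hidden trade-off attributable to the retention operator itself.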
- Outstanding after the rebuttal: formal mathematical bounds proving that the exponential decay in retention exactly reproduces arbitrary attention patterns.
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper states it first derives the recurrence-attention connection theoretically, then introduces the retention mechanism that realizes three equivalent computation paradigms (parallel, recurrent, chunkwise). The modeling-power equivalence and lack of hidden trade-offs are asserted as consequences of that derivation rather than presupposed by definition or by fitting parameters to the target metrics. No load-bearing self-citation, no parameter fitted on a subset then renamed as prediction, and no ansatz imported via prior work by the same authors. Experimental scaling and perplexity results are presented separately and are not required for the formal equivalence claim. The chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tagged unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent
-
IndisputableMonolith.Foundation.PhiForcing.phi_equation (tagged unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 55 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
-
Rotation Equivariant Mamba for Vision Tasks
EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
-
Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
-
TIDES: Implicit Time-Awareness in Selective State Space Models
TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
-
Direction-Preserving MIMO Speech Enhancement Using a Neural Covariance Estimator
A neural covariance estimator enables fully blind, direction-preserving MIMO speech enhancement that improves over mask-based baselines and approaches oracle performance with fewer parameters.
-
Screening Is Enough
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
-
Priming: Hybrid State Space Models From Pre-trained Transformers
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...
-
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...
-
Learning to Adapt: In-Context Learning Beyond Stationarity
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
-
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...
-
Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall
FDM achieves strictly O(1) decode memory via a fixed 272-slot cache while reaching 0.966 accuracy on multi-query associative recall, outperforming transformers by 59.5%.
-
Optimal Decay Spectra for Linear Recurrences
PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval ac...
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
-
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay
PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupli...
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Caracal: Causal Architecture via Spectral Mixing
Caracal is a Fourier-based sequence mixing architecture that achieves causal autoregressive modeling with standard operators and competitive performance on long sequences.
-
Kwai Summary Attention Technical Report
Kwai Summary Attention compresses historical contexts into learnable summary tokens to reduce sequence modeling cost to O(n/k) while preserving linear KV cache and long-range dependencies.
-
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
-
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
-
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
-
HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining comput...
-
Advancing Vision Transformer with Enhanced Spatial Priors
EVT improves Vision Transformers by using Euclidean distance decay for spatial priors and simpler grouping, achieving 86.6% top-1 accuracy on ImageNet-1k.
-
Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
Gated-SwinRMT unifies Swin windowed attention with retentive Manhattan decay via gating, reaching 80.22% top-1 accuracy on Mini-ImageNet versus 73.74% for the RMT baseline.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
Layer normalization
[BKH16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,
-
[2]
Language models are few-shot learners
[BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litw...
-
[3]
Summscreen: A dataset for abstractive screenplay summarization
[CCWG21] Mingda Chen, Zewei Chu, Sam Wiseman, and Kevin Gimpel. Summscreen: A dataset for abstractive screenplay summarization. arXiv preprint arXiv:2104.07091,
-
[4]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2924–2936,
-
[5]
Hungry Hungry Hippos: Towards language modeling with state space models
[DFS+22] Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052,
-
[6]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[GBB+20] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
-
[7]
Efficiently Modeling Long Sequences with Structured State Spaces
[GGR21] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396,
-
[8]
Efficient attentions for long document summarization
[HCP+21] Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112,
-
[9]
Language Is Not All You Need: Aligning Perception with Language Models
[HDW+23] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. ArXiv, abs/2302.14045,
-
[10]
Language models are general-purpose interfaces
[HSD+22a] Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, and Furu Wei. Language models are general-purpose interfaces. ArXiv, abs/2206.06336,
-
[11]
Structured prompting: Scaling in-context learning to 1,000 examples
[HSD+22b] Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, and Furu Wei. Structured prompting: Scaling in-context learning to 1,000 examples. ArXiv, abs/2212.06713,
-
[12]
Lsdsem 2017 shared task: The story cloze test
[MRL+17] Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. Lsdsem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51,
-
[13]
TorchScale: Transformers at scale
[MWH+22] Shuming Ma, Hongyu Wang, Shaohan Huang, Wenhui Wang, Zewen Chi, Li Dong, Alon Benhaim, Barun Patra, Vishrav Chaudhary, Xia Song, and Furu Wei. TorchScale: Transformers at scale. CoRR, abs/2211.13184,
-
[14]
Resurrecting recurrent neural networks for long sequences
[OSG+23] Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. ArXiv, abs/2303.06349,
-
[15]
Hyena Hierarchy: Towards larger convolutional language models
[PMN+23] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866,
-
[16]
Kosmos-2: Grounding Multimodal Large Language Models to the World
[PWD+23] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. ArXiv, abs/2306.14824,
-
[17]
A length-extrapolatable transformer
[SDP+22] Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554,
-
[18]
Fast transformer decoding: One write-head is all you need
[Sha19] Noam M. Shazeer. Fast transformer decoding: One write-head is all you need. ArXiv, abs/1911.02150,
-
[19]
RoFormer: Enhanced Transformer with Rotary Position Embedding
[SLP+21] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864,
-
[20]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[SPP+19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,
-
[21]
SCROLLS: Standardized comparison over long language sequences
[SSI+22] Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, et al. Scrolls: Standardized comparison over long language sequences. arXiv preprint arXiv:2201.03533,
-
[22]
Attention is all you need
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010,
-
[23]
DeepNet: Scaling Transformers to 1,000 layers
[WMD+22] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. DeepNet: Scaling Transformers to 1,000 layers. ArXiv, abs/2203.00555,
-
[24]
Foundation transformers
[WMH+22] Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, et al. Foundation transformers. arXiv preprint arXiv:2210.06423,
-
[25]
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
[WPN+19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537,
-
[26]
Qmsum: A new benchmark for query-based multi-domain meeting summarization
[ZYY+21] Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938,
discussion (0)