pith. machine review for the scientific record.

arxiv: 2305.13048 · v2 · submitted 2023-05-22 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

RWKV: Reinventing RNNs for the Transformer Era

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 10:48 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords RWKV · linear attention · RNN · Transformer · efficient inference · language modeling · scaling laws · long context

The pith

The RWKV architecture achieves Transformer-level performance at 14 billion parameters while its inference cost scales linearly with sequence length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RWKV as a model that trains in parallel like a Transformer yet runs with constant memory and compute like an RNN. It achieves this by replacing quadratic attention with a linear formulation based on receptance-weighted key-value pairs that can be expressed in either recurrent or parallel form. When scaled to 14 billion parameters, the largest dense RNN reported, RWKV matches the performance of comparably sized Transformers on language tasks. This matters because it removes the quadratic memory barrier that limits long-sequence use and deployment of large models. If the parity holds, sequence models could be trained at scale yet deployed with far lower inference cost.

Core claim

RWKV reformulates attention as a linear time-decay operation over receptance-weighted key-value pairs, allowing the identical set of weights to be computed either as a parallel Transformer during training or as a recurrent model during inference that maintains O(1) memory and compute per token regardless of sequence length. Models trained this way reach performance parity with Transformers when scaled to 14 billion parameters.

What carries the argument

Receptance Weighted Key Value (RWKV) linear attention, which computes each output token as an exponentially decayed weighted sum of all prior key-value pairs using a time-difference decay factor, enabling exact equivalence between the recurrent and parallel formulations.
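That duality can be sketched numerically. Below is a minimal single-channel toy (hypothetical scalar decay w and bonus u; the paper's actual implementation is vectorized over channels and adds numerical-stability safeguards this sketch omits), checking that the summation form used for training and the constant-state recurrence used for inference agree:

```python
import math

def wkv_parallel(w, u, ks, vs):
    """Summation (training-style) form of the WKV operator.
    w: per-channel decay (> 0); u: bonus applied to the current token;
    ks, vs: lists of scalar keys and values (one channel, for clarity)."""
    out = []
    for t in range(len(ks)):
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            wgt = math.exp(-(t - 1 - i) * w + ks[i])  # exponential time decay
            num += wgt * vs[i]
            den += wgt
        out.append(num / den)
    return out

def wkv_recurrent(w, u, ks, vs):
    """Same weights, run as an RNN: two accumulators (a, b) per channel,
    so memory and compute per token are constant in sequence length."""
    a, b, out = 0.0, 0.0, []
    for k, v in zip(ks, vs):
        e = math.exp(u + k)
        out.append((a + e * v) / (b + e))
        a = math.exp(-w) * a + math.exp(k) * v
        b = math.exp(-w) * b + math.exp(k)
    return out

ks = [0.1, -0.3, 0.5, 0.2]
vs = [1.0, 2.0, -1.0, 0.5]
par = wkv_parallel(0.7, 0.4, ks, vs)
rec = wkv_recurrent(0.7, 0.4, ks, vs)
assert all(abs(x - y) < 1e-12 for x, y in zip(par, rec))
```

The two functions compute the same outputs from the same weights; only the order of evaluation changes, which is what lets the identical parameters serve both training and inference.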

If this is right

  • Training remains fully parallelizable while inference cost stays constant per token, removing the need to trade one for the other.
  • Memory usage during inference does not grow with sequence length, enabling arbitrarily long contexts at fixed hardware cost.
  • The same trained weights support both batched training and single-stream deployment without architectural changes.
  • Scaling behavior observed in Transformers appears to transfer to this linear-attention RNN form at least up to 14 billion parameters.
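A toy accounting of the memory claim (hypothetical helper names; real footprints also scale with hidden width, heads, and layer count):

```python
# Per-token inference state, counting scalar entries per channel per layer.
# A Transformer must cache one key and one value for every past token,
# while the RWKV recurrence keeps only two accumulators (numerator and
# denominator), no matter how many tokens have already been processed.
def kv_cache_entries(seq_len: int) -> int:
    return 2 * seq_len          # one K and one V per past token

def rwkv_state_entries(seq_len: int) -> int:
    return 2                    # (a, b) accumulators, independent of seq_len

print(kv_cache_entries(100_000))    # grows with context length
print(rwkv_state_entries(100_000))  # stays fixed
```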

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the linear formulation generalizes, hardware optimized for recurrent computation could be reused for large language models without accuracy loss.
  • The architecture may simplify serving of long-context models on memory-constrained devices by eliminating quadratic KV caches.
  • Future work could test whether the same linear mechanism extends cleanly to non-text modalities while preserving the training-inference duality.

Load-bearing premise

The linear attention mechanism captures the same long-range dependencies as full quadratic attention without needing extra adjustments or task-specific changes.
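One way to read this premise: in the time-mixing sum, a key seen delta steps in the past contributes with weight proportional to exp(-delta * w), so each channel's learned decay w fixes an effective memory horizon. A small numerical sketch (the decay values below are hypothetical, not from the paper):

```python
import math

# The weight on a token seen `delta` steps ago decays as exp(-delta * w),
# so the number of steps until a contribution halves is log(2) / w.
# A channel with a small learned decay keeps a long horizon; a channel
# with a large decay forgets almost immediately.
def half_life_steps(w: float) -> float:
    return math.log(2) / w

print(half_life_steps(0.01))  # slow-decay channel: horizon of ~69 steps
print(half_life_steps(2.0))   # fast-decay channel: well under one step
```

Whether a bank of such learned horizons substitutes for content-dependent quadratic attention on long-range tasks is exactly what the premise asserts.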

What would settle it

A side-by-side evaluation of a 14-billion-parameter RWKV model and a Transformer of identical size showing a clear performance gap on standard language-modeling benchmarks would falsify the parity claim.

read the original abstract

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RWKV, a hybrid architecture that reformulates RNNs via a linear attention mechanism (Receptance Weighted Key Value) combining time-mixing with exponential decay and channel-mixing. This allows parallelizable training like Transformers while maintaining constant memory and compute during inference like RNNs. The central claim is that models scale to 14B parameters—the largest dense RNN trained—and achieve performance on par with similarly sized Transformers on NLP tasks.

Significance. If the empirical parity holds under matched conditions, the result would be significant: it offers a path to linear scaling in sequence length without sacrificing the modeling capacity of quadratic attention, potentially enabling more efficient large-scale language models and reducing the inference-memory trade-off that currently favors Transformers.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline claim that RWKV 'performs on par' with 14B-scale Transformers supplies no quantitative metrics, benchmark tables, or ablation details. Without these, it is impossible to verify whether identical data, optimizer, context lengths, or training steps were used, leaving open the possibility that parity depends on unstated post-hoc adjustments rather than the architecture itself.
  2. [§3.2] §3.2 (RWKV formulation): the receptance-weighted KV time-mixing with exponential decay imposes a fixed decay bias on long-range interactions. The paper must demonstrate—via controlled long-context benchmarks or comparison to full attention—that this bias does not degrade modeling capacity relative to quadratic attention at 14B scale; otherwise the generality argument is at risk.
minor comments (2)
  1. [§3] Notation for the linear attention recurrence should be clarified with explicit equations showing how the parallel training form reduces to the RNN inference form without additional approximations.
  2. [Figures] Figure captions and axis labels in the scaling plots should include exact model sizes, token counts, and baseline Transformer variants for direct comparison.
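For reference, the reduction the first minor comment asks for can be sketched in the RWKV-4 notation (per-channel decay w, current-token bonus u; an editorial sketch, not the paper's exact typesetting):

```latex
% Parallel (training) form:
wkv_t \;=\; \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i \;+\; e^{u + k_t}\, v_t}
                 {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \;+\; e^{u + k_t}}

% Recurrent (inference) form, with a_0 = b_0 = 0:
a_t = e^{-w}\, a_{t-1} + e^{k_t}\, v_t, \qquad
b_t = e^{-w}\, b_{t-1} + e^{k_t}, \qquad
wkv_t = \frac{a_{t-1} + e^{u + k_t}\, v_t}{b_{t-1} + e^{u + k_t}}
```

Unrolling the recurrence reproduces the two sums term by term, so the reduction is exact rather than an approximation.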

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment point-by-point below and will revise the manuscript to incorporate additional details and clarifications as outlined.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that RWKV 'performs on par' with 14B-scale Transformers supplies no quantitative metrics, benchmark tables, or ablation details. Without these, it is impossible to verify whether identical data, optimizer, context lengths, or training steps were used, leaving open the possibility that parity depends on unstated post-hoc adjustments rather than the architecture itself.

    Authors: We agree that the current presentation of results in the abstract and §4 would benefit from greater explicitness. In the revised manuscript we will expand both sections with benchmark tables reporting perplexity and zero-shot accuracy for RWKV models up to 14B parameters against matched Transformer baselines (e.g., comparable GPT-style models), together with precise statements of training data, optimizer settings, context length, and total steps. These additions will make the parity claim directly verifiable from the text. revision: yes

  2. Referee: [§3.2] §3.2 (RWKV formulation): the receptance-weighted KV time-mixing with exponential decay imposes a fixed decay bias on long-range interactions. The paper must demonstrate—via controlled long-context benchmarks or comparison to full attention—that this bias does not degrade modeling capacity relative to quadratic attention at 14B scale; otherwise the generality argument is at risk.

    Authors: The per-channel decay rates in RWKV are learned rather than fixed, which in principle allows the model to retain long-range information when beneficial. Our scaling curves up to 14B already show continued gains on tasks that require long dependencies, but we accept that explicit controlled comparisons would strengthen the claim. We will add a dedicated subsection with long-context evaluations (e.g., extended-sequence perplexity and retrieval tasks) and direct head-to-head results against full-attention models at the largest feasible scale to demonstrate that modeling capacity is not materially degraded. revision: yes

Circularity Check

0 steps flagged

RWKV architecture derivation is self-contained with no circular reductions

full rationale

The paper introduces a new linear-attention formulation (receptance-weighted KV with time-mixing and channel-mixing blocks) and reports empirical scaling results up to 14B parameters. No equation in the provided text defines a quantity in terms of itself, renames a fitted parameter as a prediction, or imports a uniqueness theorem from prior self-citations that would force the reported parity. The central claim is an empirical outcome of training the proposed architecture, not a reduction to its own inputs by construction. External benchmarks and training details are presented as independent evidence rather than tautological restatements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical success of a newly introduced linear-attention mechanism whose internal weighting rules are defined by the paper; no explicit free parameters or external axioms are stated in the abstract.

invented entities (1)
  • Receptance Weighted Key Value (RWKV) mechanism · no independent evidence
    purpose: Linear attention operator enabling dual Transformer/RNN formulation
    The mechanism is introduced by the paper to achieve the claimed efficiency trade-off.

pith-pipeline@v0.9.0 · 5616 in / 1047 out tokens · 32977 ms · 2026-05-13T10:48:35.134302+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

    cs.LG 2026-03 conditional novelty 8.0

    Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

  2. Rotation Equivariant Mamba for Vision Tasks

    cs.CV 2026-03 unverdicted novelty 8.0

    EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...

  3. Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    cs.LG 2024-07 conditional novelty 8.0

    TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

  4. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  5. Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

    cond-mat.str-el 2026-05 conditional novelty 7.0

    PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

  6. Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

    cs.LG 2026-05 conditional novelty 7.0

    Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...

  7. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  8. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  9. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  10. Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

  11. RT-Transformer: The Transformer Block as a Spherical State Estimator

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  12. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  13. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  14. The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

    cs.LG 2026-04 unverdicted novelty 6.0

    Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...

  15. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  16. Predicting Where Steering Vectors Succeed

    cs.LG 2026-04 unverdicted novelty 6.0

    The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.

  17. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  18. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  19. Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

    cs.CL 2026-04 unverdicted novelty 6.0

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  20. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  21. M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

    cs.LG 2026-03 unverdicted novelty 6.0

    M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

  22. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    cs.CL 2025-07 unverdicted novelty 6.0

    MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

  23. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  24. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  25. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  26. Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

    cs.LG 2026-04 unverdicted novelty 5.0

    Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.

  27. The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

    cs.AI 2026-04 unverdicted novelty 5.0

    System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.

  28. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  29. Belief-State RWKV for Reinforcement Learning under Partial Observability

    cs.LG 2026-04 unverdicted novelty 5.0

    Belief-state RWKV maintains an uncertainty-aware recurrent state for RL policies in partial observability and shows modest gains over standard recurrent baselines in a pilot with observation noise.

  30. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

  31. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  32. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  33. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  34. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 34 Pith papers · 6 internal anchors

    We can find the sum of this series using the formula for a geometric series: ∞X i=0 ai = a0 1 − r = 1 2 1 − 1 2 = 1 1 − 1 2 = 2 1 = 2 So, the sum of the given series is 2 . RWKV-4-Raven-14B I would like to cook some Chinese food at home. Do you have any suggestions on the meal I could choose? Yes, of course! If you are not very experienced in cooking Chin...