pith. machine review for the scientific record.

arxiv: 2406.07887 · v1 · submitted 2024-06-12 · 💻 cs.LG · cs.CL

An Empirical Study of Mamba-based Language Models

Pith reviewed 2026-05-18 10:25 UTC · model grok-4.3

keywords Mamba · state space models · hybrid architectures · language models · Transformer comparison · inference efficiency · long context · empirical scaling

The pith

The 8B Mamba-2-Hybrid outperforms a standard 8B Transformer on all twelve evaluated tasks while enabling much faster inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains 8B-parameter Mamba, Mamba-2, Transformer, and hybrid models on identical datasets of up to 3.5 trillion tokens to isolate architectural effects. Pure state-space models match or beat Transformers on many language tasks yet fall short on copying and in-context learning. The hybrid, which mixes Mamba-2 layers with limited attention and MLP layers, surpasses the Transformer baseline across every standard task while preserving the inference speed advantages of selective state-space models. Additional long-context tests up to 128K tokens show the hybrid continues to match or exceed Transformer performance on average.

Core claim

In a controlled comparison, the 8B Mamba-2-Hybrid architecture consisting of 43 percent Mamba-2, 7 percent attention, and 50 percent MLP layers exceeds the 8B Transformer by 2.65 points on average across twelve standard tasks and is projected to generate tokens up to eight times faster at inference time; the same hybrid remains competitive with the Transformer on twenty-three additional long-context tasks when both are extended to 16K, 32K, and 128K sequence lengths.
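
The percentages quoted above only become whole layer counts once a total depth is fixed. The sketch below works through that arithmetic. It is illustrative only: the depth of 56 layers is an assumption made here (this summary does not state the depth), and `layer_counts` is a hypothetical helper, not code from the paper or from Megatron-LM.

```python
def layer_counts(total_layers, fractions):
    """Round each fraction of total_layers to a whole number of layers."""
    counts = {name: round(total_layers * frac) for name, frac in fractions.items()}
    if sum(counts.values()) != total_layers:
        raise ValueError("rounded counts do not sum to total_layers")
    return counts


# Fractions as quoted in the summary (43% / 7% / 50%).
FRACTIONS = {"mamba2": 0.43, "attention": 0.07, "mlp": 0.50}

# ASSUMPTION: 56 total layers, chosen for illustration only.
counts = layer_counts(56, FRACTIONS)
print(counts)
```

Under that assumed depth, the rounded counts sum back to the total, which is a quick sanity check that the three quoted percentages are mutually consistent.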

What carries the argument

The Mamba-2-Hybrid architecture, which interleaves selective state-space layers with a small number of attention layers to improve copying and in-context learning while retaining linear-time inference.
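
One way to see why a small number of attention layers keeps inference cheap is to compare per-sequence memory: a key-value cache grows with context length, while a state-space layer carries a fixed-size state. The following is a back-of-the-envelope sketch, not the paper's inference projection. Every configuration value (layer counts, head counts, head dimension, state size) is a hypothetical placeholder, and the function names are invented for this example.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for cached keys and values; grows linearly with seq_len."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Memory for recurrent SSM state; independent of sequence length."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


# ASSUMPTION: every number below is a hypothetical placeholder, not a value from the paper.
TRANSFORMER = {"attn_layers": 32, "ssm_layers": 0}
HYBRID = {"attn_layers": 4, "ssm_layers": 24}
N_KV_HEADS, HEAD_DIM, STATE_ELEMS = 8, 128, 1_000_000


def decode_memory_bytes(model, seq_len):
    """Total per-sequence inference state: KV cache plus SSM state."""
    cache = kv_cache_bytes(model["attn_layers"], N_KV_HEADS, HEAD_DIM, seq_len)
    state = ssm_state_bytes(model["ssm_layers"], STATE_ELEMS)
    return cache + state


for seq_len in (4_096, 16_384, 131_072):
    t = decode_memory_bytes(TRANSFORMER, seq_len)
    h = decode_memory_bytes(HYBRID, seq_len)
    print(f"{seq_len:>7} tokens: transformer {t / 2**20:8.1f} MiB, hybrid {h / 2**20:8.1f} MiB")
```

The point of the sketch is the shape of the curves rather than the absolute numbers: the attention term scales with `seq_len`, the state-space term does not, so the gap widens as context grows.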

If this is right

  • Hybrid designs that combine a majority of Mamba-2 layers with a minority of attention layers can exceed pure Transformers on both short and long-context benchmarks.
  • Projected token-generation speedups of up to 8x become available without sacrificing task accuracy when the hybrid layer mix is used.
  • Pure Mamba models remain limited on tasks that demand explicit copying or few-shot in-context learning even at 8B scale.
  • The released checkpoints allow direct reproduction and extension of the scaling behavior observed up to 3.5T tokens.
  • Long-context extensions of the hybrid maintain parity with Transformers when both architectures receive the same context-length adaptations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • At still larger scales the inference-speed advantage of the hybrid could become decisive for production deployment where latency and memory costs dominate.
  • The optimal fraction of attention layers may vary by domain and could be tuned automatically rather than fixed at 7 percent.
  • The results suggest that selective state-space models benefit from targeted attention injection specifically for in-context reasoning rather than uniform replacement of all layers.

Load-bearing premise

The 8B models were trained under sufficiently identical data, optimizer, learning-rate schedule, and regularization conditions so that performance gaps can be attributed primarily to architecture.

What would settle it

Re-train an 8B Transformer using the exact same data mixture, optimizer, and schedule as the hybrid and measure whether the 2.65-point average gap disappears.

Original abstract

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a direct empirical comparison of 8B-parameter Mamba, Mamba-2, Transformer, and Mamba-2-Hybrid models trained on the same datasets of up to 3.5T tokens. Pure SSMs match or exceed Transformers on many tasks but lag on copying/in-context learning and long-context reasoning; the Mamba-2-Hybrid (43% Mamba-2, 7% attention, 50% MLP) exceeds the Transformer on all 12 standard tasks by +2.65 points on average, maintains parity on 23 additional long-context tasks up to 128K, and is predicted to offer up to 8x faster inference. Checkpoints and training code are released.

Significance. If training conditions are equivalent, the work supplies concrete evidence that hybrid SSM-attention models can outperform pure Transformers at the 8B scale while delivering inference efficiency gains. The public release of checkpoints and Megatron-LM code is a clear strength for reproducibility and follow-on research.

major comments (1)
  1. The central claim that the +2.65 average gain is attributable to the hybrid layer mix requires that data, optimizer, learning-rate schedule, and regularization were identical across the 8B Transformer and Mamba-2-Hybrid. The manuscript states only that models were 'trained on the same datasets of up to 3.5T tokens' and supplies no table or section listing per-model values for peak LR, decay, Adam betas, weight decay, or clipping. Even modest differences in these settings could produce score shifts comparable to the reported margin.
minor comments (2)
  1. Reporting standard deviation across random seeds for the 8B models would strengthen defensibility of the architectural conclusion, especially for the headline +2.65 average.
  2. The long-context section should explicitly state whether the 16K/32K/128K variants were trained from scratch or obtained via continued pre-training / fine-tuning of the base models.
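
The first minor comment asks whether the headline gap is large relative to seed-to-seed variation. A minimal sketch of that comparison follows. The per-seed scores are invented placeholders (the manuscript reports no per-seed results), `gap_and_noise` is a hypothetical helper, and the noise estimate assumes the runs are independent.

```python
from math import sqrt
from statistics import mean, stdev


def gap_and_noise(baseline_runs, candidate_runs):
    """Return (mean gap, std of the difference between two single runs).

    Assumes runs are independent, so their variances add.
    """
    gap = mean(candidate_runs) - mean(baseline_runs)
    noise = sqrt(stdev(baseline_runs) ** 2 + stdev(candidate_runs) ** 2)
    return gap, noise


# ASSUMPTION: invented per-seed average scores, for illustration only.
transformer_runs = [50.0, 50.5, 49.5]
hybrid_runs = [52.5, 53.0, 52.0]

gap, noise = gap_and_noise(transformer_runs, hybrid_runs)
print(f"gap = {gap:.2f} points, seed noise = {noise:.2f} points, ratio = {gap / noise:.1f}")
```

A ratio well above 1 would suggest the architectural gap is not explained by seed variation alone; with a single run per model, as in the manuscript, this quantity cannot be estimated.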

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the concern regarding training configuration details below.

Point-by-point responses
  1. Referee: The central claim that the +2.65 average gain is attributable to the hybrid layer mix requires that data, optimizer, learning-rate schedule, and regularization were identical across the 8B Transformer and Mamba-2-Hybrid. The manuscript states only that models were 'trained on the same datasets of up to 3.5T tokens' and supplies no table or section listing per-model values for peak LR, decay, Adam betas, weight decay, or clipping. Even modest differences in these settings could produce score shifts comparable to the reported margin.

    Authors: We agree that explicit documentation of the full training configuration is necessary to substantiate that performance differences arise from the layer mix rather than hyperparameter variations. All models were trained under identical conditions in the Megatron-LM framework, using the same datasets, optimizer settings, peak learning rate, decay schedule, Adam betas, weight decay, and gradient clipping. To improve transparency, we will add a new table in the revised manuscript that lists these per-model hyperparameter values. revision: yes
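
The per-model hyperparameter table promised in the rebuttal amounts to a configuration diff. The sketch below shows one way such a check could be automated. The keys mirror the settings named by the referee, but all values are hypothetical placeholders and `diff_configs` is an invented helper, not part of Megatron-LM.

```python
MISSING = "<missing>"


def diff_configs(reference, other):
    """Return {key: (reference_value, other_value)} for every setting that differs."""
    diffs = {}
    for key in sorted(set(reference) | set(other)):
        ref_val = reference.get(key, MISSING)
        other_val = other.get(key, MISSING)
        if ref_val != other_val:
            diffs[key] = (ref_val, other_val)
    return diffs


# ASSUMPTION: placeholder values, not the settings used in the paper.
transformer_cfg = {
    "peak_lr": 3e-4,
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.1,
    "grad_clip": 1.0,
}
hybrid_cfg = dict(transformer_cfg, peak_lr=1e-4)  # one deliberate mismatch

mismatches = diff_configs(transformer_cfg, hybrid_cfg)
print(mismatches or "configurations are identical")
```

An empty result is what the claim of identical training conditions would require; any non-empty result points at a setting that could confound the architectural comparison.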

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations or fitted predictions

full rationale

The manuscript is an empirical study that trains 8B-parameter Mamba, Mamba-2, Transformer, and Mamba-2-Hybrid models on identical datasets of up to 3.5T tokens and reports direct performance measurements on 12 standard tasks plus 23 long-context tasks. No equations, fitted parameters, or mathematical derivations appear; the central claim that the hybrid exceeds the Transformer by +2.65 points on average is presented as an observed outcome of independently trained models rather than a prediction derived from any model equation or self-citation. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises. The analysis is therefore self-contained against external benchmarks of model performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the validity of the controlled training protocol and the assumption that the chosen 12+23 tasks are representative of real-world capabilities.

axioms (1)
  • domain assumption: Models trained on identical data and comparable optimization settings allow direct attribution of performance differences to architecture
    Explicitly invoked in the abstract as 'controlled setting (e.g., same data)'



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LedgerCanonicality ZeroParameterComparisonLedger echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    models were 'trained on the same datasets of up to 3.5T tokens' but supplies no table or section listing per-model values for peak LR, decay, Adam betas, weight decay, or clipping

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hidden State Poisoning Attacks against Mamba-based Language Models

    cs.CL 2026-01 unverdicted novelty 7.0

    Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.

  2. Selective Rotary Position Embedding

    cs.CL 2025-11 unverdicted novelty 7.0

    Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when ...

  3. Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

    cs.LG 2025-11 unverdicted novelty 7.0

    Q-RAG trains embedders via RL for multi-step retrieval and reports state-of-the-art results on BabiLong and RULER benchmarks for contexts up to 10M tokens.

  4. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  5. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG 2026-05 unverdicted novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  6. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  7. Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI

    cs.LG 2026-05 unverdicted novelty 6.0

    Rhamba uses region-aware masking strategies and hybrid Attention-Mamba models pretrained on ABIDE fMRI data to achieve top AUROC on schizophrenia and ADHD classification tasks while outperforming prior methods.

  8. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

    cs.LG 2026-04 unverdicted novelty 6.0

    CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

  9. Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

    cs.AI 2026-04 unverdicted novelty 6.0

    Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.

  10. Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks

    cs.CR 2026-04 unverdicted novelty 6.0

    State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.

  11. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  12. Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    cs.AI 2025-03 conditional novelty 6.0

    Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.

  13. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  14. Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI

    cs.LG 2026-05 unverdicted novelty 5.0

    Rhamba is a region-aware hybrid Attention-Mamba framework that uses anatomically guided masking for self-supervised pretraining on ABIDE fMRI data and shows competitive AUROC on downstream schizophrenia and ADHD class...

  15. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  16. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  17. TTT3R: 3D Reconstruction as Test-Time Training

    cs.CV 2025-09 unverdicted novelty 5.0

    TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

  18. StateX: Enhancing RNN Recall via Post-training State Expansion

    cs.CL 2025-09 unverdicted novelty 5.0

    StateX post-trains RNNs to expand recurrent state size, improving recall and in-context learning with negligible parameter growth.

  19. Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

    cs.CL 2025-10 unverdicted novelty 4.0

    This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 18 Pith papers · 22 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. “GPT-4 Technical Report”. In:arXiv preprint arXiv:2303.08774(2023)

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. “GQA: Training Generalized Multi-Query Transformer Models from Multi-head Checkpoints”. In:arXiv preprint arXiv:2305.13245(2023)

  3. [3]

    Zoology: Measuring and Improving Recall in Efficient Language Models

    Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. “Zoology: Measuring and Improving Recall in Efficient Language Models”. In:arXiv preprint arXiv:2312.04927(2023)

  4. [4]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer Normalization”. In:arXiv preprint arXiv:1607.06450 (2016)

  5. [5]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by Jointly Learning to Align and Translate”. In:arXiv preprint arXiv:1409.0473(2014)

  6. [6]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. “LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding”. In:arXiv preprint arXiv:2308.14508 (2023)

  7. [7]

    PIQA: Reasoning about Physical Commonsense in Natural Language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. “PIQA: Reasoning about Physical Commonsense in Natural Language”. In:Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05. 2020, pp. 7432–7439

  8. [8]

    NTK-aware Scaled RoPE allows LLaMA models to have Extended (8k+) Context Size Without any Fine-tuning and Minimal Perplexity Degradation

    bloc97. “NTK-aware Scaled RoPE allows LLaMA models to have Extended (8k+) Context Size Without any Fine-tuning and Minimal Perplexity Degradation”. In: (2023). url: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have

  9. [9]

    Language Models are Few-shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language Models are Few-shot Learners”. In:Advances in Neural Information Processing Systems33 (2020), pp. 1877–1901

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In:arXiv preprint arXiv:1803.05457(2018)

  11. [11]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality”. In:International Conference on Machine Learning (ICML). 2024

  12. [12]

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. “A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers”. In:Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021, pp. 4599–4610.

  13. [13]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. “Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models”. In:arXiv preprint arXiv:2402.19427 (2024)

  14. [14]

    Version v0.4.0

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou.A Framework ...

  15. [15]

    Zamba: A Compact 7B SSM Hybrid Model

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. “Zamba: A Compact 7B SSM Hybrid Model”. In:arXiv preprint arXiv:2405.16712 (2024)

  16. [16]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. “Mamba: Linear-time Sequence Modeling with Selective State Spaces”. In: arXiv preprint arXiv:2312.00752(2023)

  17. [17]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Re. “Efficiently Modeling Long Sequences with Structured State Spaces”. In:International Conference on Learning Representations. 2021

  18. [18]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding”. In:International Conference on Learning Representations. 2020

  19. [19]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. “Gaussian Error Linear Units (GELUs)”. In:arXiv preprint arXiv:1606.08415 (2016)

  20. [20]

    Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. “Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps”. In:Proceedings of the 28th International Conference on Computational Linguistics. 2020, pp. 6609–6625

  21. [21]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: arXiv preprint arXiv:2404.06654(2024)

  22. [22]

    Repeat After Me: Transformers are Better than State Space Models at Copying

    Samy Jelassi, David Brandfonbrener, Sham M Kakade, and Eran Malach. “Repeat After Me: Transformers are Better than State Space Models at Copying”. In:arXiv preprint arXiv:2402.01032 (2024)

  23. [23]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. “Mistral 7B”. In:arXiv preprint arXiv:2310.06825(2023)

  24. [24]

    PubMedQA: A Dataset for Biomedical Research Question Answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. “PubMedQA: A Dataset for Biomedical Research Question Answering”. In:arXiv preprint arXiv:1909.06146 (2019)

  25. [25]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”. In:Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017, pp. 1601–1611

  26. [26]

    The NarrativeQA Reading Comprehension Challenge

    Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. “The NarrativeQA Reading Comprehension Challenge”. In: Transactions of the Association for Computational Linguistics6 (2018), pp. 317–328

  27. [27]

    Reducing Activation Recomputation in Large Transformer Models

    Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. “Reducing Activation Recomputation in Large Transformer Models”. In:arXiv preprint arXiv:2205.05198(2022)

  28. [28]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

    Taku Kudo and John Richardson. “Sentencepiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”. In:arXiv preprint arXiv:1808.06226 (2018)

  29. [29]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. “RACE: Large-scale ReAding Comprehension Dataset From Examinations”. In:arXiv preprint arXiv:1704.04683 (2017)

  30. [30]

    Latent Retrieval for Weakly Supervised Open Domain Question Answering

    Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. “Latent Retrieval for Weakly Supervised Open Domain Question Answering”. In:arXiv preprint arXiv:1906.00300(2019).

  31. [31]

    Learning Question Classifiers

    Xin Li and Dan Roth. “Learning Question Classifiers”. In:COLING 2002: The 19th International Conference on Computational Linguistics. 2002

  32. [32]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. “Jamba: A Hybrid Transformer-mamba Language Model”. In:arXiv preprint arXiv:2403.19887(2024)

  33. [33]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In:arXiv preprint arXiv:2109.07958(2021)

  34. [34]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. “Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering”. In:arXiv preprint arXiv:1809.02789 (2018)

  35. [35]

    Efficient Large-scale Language Model Training on GPU Clusters using Megatron-LM

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. “Efficient Large-scale Language Model Training on GPU Clusters using Megatron-LM”. In: Proceedings of the International Conference for High Performance Computing, Networking...

  36. [36]

    NVIDIA H100 Tensor Core GPU

    NVIDIA. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/. 2023

  37. [37]

    Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

    Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. “Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks”. In:arXiv preprint arXiv:2402.04248(2024)

  38. [38]

    Nemotron-4 15B Technical Report

    Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, et al. “Nemotron-4 15B Technical Report”. In:arXiv preprint arXiv:2402.16819(2024)

  39. [39]

    Block-state Transformers

    Jonathan Pilault, Mahan Fathi, Orhan Firat, Chris Pal, Pierre-Luc Bacon, and Ross Goroshin. “Block-state Transformers”. In:Advances in Neural Information Processing Systems36 (2024)

  40. [40]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    Pranav Rajpurkar, Robin Jia, and Percy Liang. “Know what you don’t Know: Unanswerable Questions for SQuAD”. In:arXiv preprint arXiv:1806.03822(2018)

  41. [41]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”. In:Communications of the ACM64.9 (2021), pp. 99–106

  42. [42]

    Diagonal State Space Augmented Transformers for Speech Recognition

    George Saon, Ankit Gupta, and Xiaodong Cui. “Diagonal State Space Augmented Transformers for Speech Recognition”. In:ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5

  43. [43]

    Scrolls: Standardized Comparison over Long Language Sequences

    Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, et al. “Scrolls: Standardized Comparison over Long Language Sequences”. In:arXiv preprint arXiv:2201.03533(2022)

  44. [44]

    GLU Variants Improve Transformer

    Noam Shazeer. “GLU Variants Improve Transformer”. In:arXiv preprint arXiv:2002.05202(2020)

  45. [45]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. “Megatron-LM: Training Multi-billion Parameter Language Models using Model Parallelism”. In:arXiv preprint arXiv:1909.08053(2019)

  46. [46]

    Roformer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. “Roformer: Enhanced Transformer with Rotary Position Embedding”. In:Neurocomputing568 (2024), p. 127063

  47. [47]

    Efficient Transformers: A Survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. “Efficient Transformers: A Survey”. In: ACM Computing Surveys55.6 (2022), pp. 1–28

  48. [48]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. “Llama 2: Open Foundation and Fine-tuned Chat Models”. In:arXiv preprint arXiv:2307.09288(2023)

  49. [49]

    MuSiQue: Multihop Questions via Single-hop Question Composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. “MuSiQue: Multihop Questions via Single-hop Question Composition”. In:Transactions of the Association for Computational Linguistics10 (2022), pp. 539–554

  50. [50]

    Attention is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All You Need”. In:Advances in Neural Information Processing Systems30 (2017)

  51. [51]

    Effective Long-context Scaling of Foundation Models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. “Effective Long-context Scaling of Foundation Models”. In:arXiv preprint arXiv:2309.16039(2023).

  52. [52]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. In:Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, pp. 2369–2380

  53. [53]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. “HellaSwag: Can a Machine Really Finish your Sentence?” In:arXiv preprint arXiv:1905.07830(2019)

  54. [54]

    Root Mean Square Layer Normalization

    Biao Zhang and Rico Sennrich. “Root Mean Square Layer Normalization”. In:Advances in Neural Information Processing Systems32 (2019).

    Appendix A, Hybrid Layer Allocation Algorithm: Although we are able to specify, and experiment with, an arbitrary sequence of Mamba, self-attention, and MLP layers in our hybrid models, by default we use the allocation algorithm descri...