Pith · machine review for the scientific record

arxiv: 2502.11089 · v2 · submitted 2025-02-16 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 Lean theorem links

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords sparse attention · long-context modeling · efficient transformers · native trainability · hardware optimization · language models · token compression

The pith

NSA introduces a natively trainable sparse attention mechanism that matches full-attention performance on long contexts while delivering major speedups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Native Sparse Attention (NSA) as a method to make long-context language modeling computationally feasible without the full cost of standard attention. It combines coarse token compression with fine token selection in a dynamic hierarchy that keeps global awareness and local detail intact. This structure supports end-to-end training from scratch, so the model learns to rely on the sparse pattern rather than needing a separate dense pretraining phase. Experiments indicate the resulting models hold their own or improve on standard benchmarks, long-context tasks, and reasoning while running faster in forward passes, backward passes, and decoding on sequences up to 64k tokens.

Core claim

NSA achieves efficient long-context modeling by integrating a dynamic hierarchical sparse strategy—coarse-grained token compression followed by fine-grained token selection—with hardware-aligned optimizations that balance arithmetic intensity, enabling both substantial speedups over full attention and end-to-end native training that maintains or exceeds full-attention performance across general benchmarks, long-context tasks, and instruction-based reasoning.
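As a rough sense of the scale at stake (an illustration, not a figure from the paper): per-query attention cost grows with the number of key/value tokens each query can see, so capping that number with a sparse budget turns the quadratic term into a linear one. The 4096-token budget and 128-dimension head below are hypothetical round numbers, not NSA's actual configuration, and the count ignores softmax and memory traffic.

```python
# Illustrative cost model (not from the paper): per-head attention matmul cost
# scales with the number of key/value tokens each query can see.

def attn_matmul_flops(seq_len: int, d_head: int, kv_visible: int) -> int:
    # QK^T scores plus the score-weighted sum over V: roughly
    # 2 matmuls * 2 FLOPs per multiply-add * seq_len * kv_visible * d_head.
    return 2 * 2 * seq_len * kv_visible * d_head

full = attn_matmul_flops(seq_len=65536, d_head=128, kv_visible=65536)
sparse = attn_matmul_flops(seq_len=65536, d_head=128, kv_visible=4096)
print(full // sparse)  # 16 — the ratio is just 65536 / 4096
```

Arithmetic like this only bounds the best case; the paper's hardware-aligned kernels matter precisely because real speedups depend on memory access patterns, not FLOP counts alone.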

What carries the argument

The dynamic hierarchical sparse strategy that first compresses tokens at coarse granularity and then selects at fine granularity to preserve both global context and local precision.
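The coarse-to-fine idea can be sketched in a few lines. This is a toy with mean-pooled blocks and a single query; NSA itself uses learned compression, a sliding-window branch, and GQA-aligned kernels, so treat everything here (names, block sizes, pooling choice) as illustrative assumptions.

```python
import numpy as np

def coarse_to_fine_mask(q, K, block_size=4, top_blocks=2):
    """Toy two-stage selection: score mean-pooled key blocks against the
    query (coarse stage), then keep every token inside the highest-scoring
    blocks (fine stage)."""
    T, d = K.shape
    n_blocks = T // block_size
    # Coarse stage: compress each block of keys to its mean vector.
    block_means = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_means @ q
    # Fine stage: attend only to tokens inside the top-scoring blocks.
    keep = np.argsort(block_scores)[-top_blocks:]
    mask = np.zeros(T, dtype=bool)
    for b in keep:
        mask[b * block_size : (b + 1) * block_size] = True
    return mask

# Demo: only the key block aligned with the query survives selection.
K_demo = np.zeros((8, 2))
K_demo[4:6] = 1.0                      # tokens 4-5 match the query direction
mask = coarse_to_fine_mask(np.ones(2), K_demo, block_size=2, top_blocks=1)
print(mask.nonzero()[0])               # [4 5]
```

The block structure is what makes the scheme hardware-friendly: whole contiguous key/value blocks are kept or skipped, which maps cleanly onto tiled kernel loads.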

Load-bearing premise

The hierarchical compression and selection steps preserve necessary context without introducing systematic biases that would hurt performance on new long-context distributions.

What would settle it

A clear drop in accuracy or reasoning quality on any long-context benchmark when training from scratch with NSA versus an otherwise identical full-attention model.

read the original abstract

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Native Sparse Attention (NSA), a hardware-aligned sparse attention mechanism that uses a dynamic hierarchical strategy combining coarse-grained token compression with fine-grained token selection. It claims this enables end-to-end trainable sparsity, delivering substantial speedups over full attention on 64k-length sequences during decoding, forward, and backward passes while maintaining or exceeding full-attention performance on general benchmarks, long-context tasks, and instruction-based reasoning.

Significance. If the central performance parity holds under broader scrutiny, NSA would represent a meaningful advance in efficient long-context modeling by making sparsity natively trainable and hardware-optimized rather than post-hoc, potentially reducing pretraining and inference costs for next-generation models without sacrificing capability.

major comments (2)
  1. [Experiments (Figure 1 and associated results)] The headline claim that NSA 'maintains or exceeds' full attention rests on evaluations confined to standard pretraining distributions and fixed 64k sequences (as summarized around Figure 1). No ablation or evaluation on shifted distributions—different domains, lengths >64k, or adversarial long-range dependencies—is reported, leaving the assumption that coarse-to-fine selection avoids systematic information loss untested and load-bearing for the generalization argument.
  2. [Method description of dynamic hierarchical sparse strategy] The manuscript asserts that the hierarchical strategy 'preserves both global context awareness and local precision,' yet provides no quantitative analysis (e.g., attention-map comparisons or information-retention metrics) showing that compression does not discard critical tokens on out-of-distribution inputs; this directly affects whether NSA can be positioned as a drop-in replacement.
minor comments (2)
  1. Clarify the precise benchmark suites, data splits, and number of runs underlying the 'maintains or exceeds' statement in the abstract and Figure 1 to allow reproducibility assessment.
  2. The abstract references arithmetic-intensity-balanced design and hardware optimizations; the main text should include explicit pseudocode or kernel-level details for the forward/backward passes to substantiate the reported speedups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and committing to revisions that strengthen the generalization claims and supporting analyses.

read point-by-point responses
  1. Referee: [Experiments (Figure 1 and associated results)] The headline claim that NSA 'maintains or exceeds' full attention rests on evaluations confined to standard pretraining distributions and fixed 64k sequences (as summarized around Figure 1). No ablation or evaluation on shifted distributions—different domains, lengths >64k, or adversarial long-range dependencies—is reported, leaving the assumption that coarse-to-fine selection avoids systematic information loss untested and load-bearing for the generalization argument.

    Authors: We appreciate the referee's emphasis on rigorous generalization testing. Our reported results demonstrate performance parity or improvement across diverse long-context tasks and benchmarks that span multiple domains and dependency structures within the evaluated 64k context length. Nevertheless, we agree that explicit ablations on lengths beyond 64k and adversarial or domain-shifted inputs would provide stronger evidence. In the revised manuscript we will add experiments evaluating NSA on sequences up to 128k as well as on out-of-distribution data to directly test the robustness of the coarse-to-fine selection mechanism. revision: yes

  2. Referee: [Method description of dynamic hierarchical sparse strategy] The manuscript asserts that the hierarchical strategy 'preserves both global context awareness and local precision,' yet provides no quantitative analysis (e.g., attention-map comparisons or information-retention metrics) showing that compression does not discard critical tokens on out-of-distribution inputs; this directly affects whether NSA can be positioned as a drop-in replacement.

    Authors: We acknowledge that direct quantitative evidence, such as attention-map comparisons and token-retention metrics on out-of-distribution inputs, would make the preservation argument more explicit. The end-to-end training results and benchmark parity already indicate that critical information is retained in practice. To address the referee's point, the revised manuscript will include attention-map visualizations and information-retention metrics computed on both in-distribution and out-of-distribution examples, thereby supporting the claim that the hierarchical strategy functions as a reliable drop-in replacement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper advances an algorithmic design for native sparse attention via dynamic hierarchical token compression and selection, with hardware-aligned optimizations. Claims of maintained or superior performance and speedups rest on empirical pretraining and benchmark results across general, long-context, and reasoning tasks rather than any closed mathematical derivation. No equations reduce claimed outcomes to fitted parameters on the same data, no self-citations serve as load-bearing uniqueness theorems, and no ansatz is smuggled via prior work. The central mechanism is presented as an explicit design choice whose validity is tested externally on standard distributions, making the derivation self-contained against the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that token importance can be reliably estimated via a two-stage compression-plus-selection process without losing critical information, plus standard assumptions about transformer training dynamics.

axioms (1)
  • domain assumption Token importance can be approximated hierarchically without significant information loss for downstream tasks
    Invoked in the description of the dynamic sparse strategy that combines coarse and fine-grained selection.
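This single axiom is also where the referee's major objections bite. A hypothetical counterexample (constructed here, not taken from the paper, and using simple mean pooling rather than NSA's learned compression) shows how block-level scoring can hide one critical token inside an otherwise low-scoring block:

```python
import numpy as np

q = np.array([1.0])
K = np.array([[10.0], [-10.0], [-10.0], [-10.0],  # block 0: one vital token buried among distractors
              [1.0], [1.0], [1.0], [1.0]])        # block 1: uniformly mediocre tokens
block = 4

# Coarse stage scores each block by its mean key: [-5.0, 1.0].
coarse_scores = K.reshape(2, block, 1).mean(axis=1) @ q
picked_block = int(np.argmax(coarse_scores))       # 1 — the mediocre block wins
best_token_block = int(np.argmax(K @ q)) // block  # 0 — the critical token lives elsewhere
print(picked_block, best_token_block)              # 1 0
```

Whether learned compression avoids such cases on shifted distributions is exactly what the out-of-distribution retention metrics requested in the referee report would measure.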

pith-pipeline@v0.9.0 · 5538 in / 1167 out tokens · 34946 ms · 2026-05-16T23:43:08.833294+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  2. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss...

  3. Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

    cs.DC 2026-04 unverdicted novelty 7.0

    GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.

  4. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  5. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  6. Z-Order Transformer for Feed-Forward Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 6.0

    A Z-order transformer organizes unstructured Gaussians for sparse attention, enabling feed-forward prediction of high-quality 3D splats with fewer primitives.

  7. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  8. AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

    cs.AR 2026-04 unverdicted novelty 6.0

    AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.

  9. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  10. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

    cs.LG 2025-12 unverdicted novelty 6.0

    BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

  11. BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    cs.CL 2025-12 unverdicted novelty 6.0

    BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

  12. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  13. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    cs.CL 2025-07 unverdicted novelty 6.0

    MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

  14. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  15. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  16. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  17. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  18. Challenges and opportunities for AI to help deliver fusion energy

    physics.plasm-ph 2026-03 unverdicted novelty 2.0

    AI offers opportunities to advance fusion energy R&D but requires responsible practices and expert collaborations to overcome its inherent challenges.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 23 internal anchors

  1. [10]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. 2024. URL https://arxiv.org/abs/2405.04434

  2. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  3. [22]

    LLMTest NeedleInAHaystack

    G. Kamradt. LLMTest NeedleInAHaystack. GitHub repository, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack

  4. [26]

    J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In S. Follmer, J. Han, J. Steimle, and N. H. Riche, editors, Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Francisco, CA, USA, 29 October 2023-- 1 November 2...

  5. [27]

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. In ICLR . OpenReview.net, 2024

  6. [28]

    N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019

  7. [31]

    Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

    P. Tillet, H.-T. Kung, and D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10-19, 2019

  8. [32]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  9. [34]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022

  10. [36]

    InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

    C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a

  11. [39]

    Big Bird: Transformers for Longer Sequences

    M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283-17297, 2020

  12. [40]

    STaR: Bootstrapping Reasoning With Reasoning

    E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022

  13. [41]

    RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

    F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J. Lou, and W. Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages...

  14. [42]

    CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges

    K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In L. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August...

  15. [43]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661-34710, 2023b

  16. [46]

    Efficient Streaming Language Models with Attention Sinks

    arXiv preprint arXiv:2309.17453

  17. [47]

    Linformer: Self-Attention with Linear Complexity

    arXiv preprint arXiv:2006.04768

  18. [48]

    Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

    arXiv preprint arXiv:2410.19258

  19. [49]

    DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    arXiv preprint arXiv:2410.10819

  20. [50]

    TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

    arXiv preprint arXiv:2411.02886

  21. [51]

    SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

    arXiv preprint arXiv:2412.12094

  22. [52]

    Longformer: The Long-Document Transformer

    arXiv preprint arXiv:2004.05150

  23. [53]

    BUZZ: Beehive-Structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

    arXiv preprint arXiv:2410.23079

  24. [54]

    MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

    arXiv preprint arXiv:2406.14909

  25. [55]

    SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

    arXiv preprint arXiv:2410.13276

  26. [56]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    arXiv preprint arXiv:2312.00752

  27. [57]

    InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

    The Thirty-eighth Annual Conference on Neural Information Processing Systems

  28. [58]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    arXiv preprint arXiv:2402.19427

  29. [59]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    2025

  30. [60]

    Big Bird: Transformers for Longer Sequences

    Advances in Neural Information Processing Systems

  31. [61]

    Attention Is All You Need

    Advances in Neural Information Processing Systems

  32. [62]

    LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

    arXiv preprint arXiv:2310.05736

  33. [63]

    LLM MapReduce: Simplified Long-Sequence Processing Using Large Language Models

    arXiv preprint arXiv:2410.09342

  34. [64]

    Program Synthesis with Large Language Models

    arXiv preprint arXiv:2108.07732

  35. [65]

    G. Kamradt, 2023

  36. [66]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    arXiv preprint arXiv:2308.14508

  37. [67]

    Training Verifiers to Solve Math Word Problems

    2021. URL https://arxiv.org/abs/2110.14168

  38. [68]

    CMMLU: Measuring Massive Multitask Language Understanding in Chinese

    arXiv preprint arXiv:2306.09212

  39. [69]

    Evaluating Large Language Models Trained on Code

    arXiv preprint arXiv:2107.03374

  40. [70]

    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    arXiv preprint arXiv:1903.00161

  41. [71]

    Measuring Massive Multitask Language Understanding

    arXiv preprint arXiv:2009.03300

  42. [72]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    arXiv preprint arXiv:2406.01574

  43. [73]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    arXiv preprint arXiv:2210.09261

  44. [74]

    Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context

    arXiv preprint arXiv:2403.05530

  45. [75]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    arXiv preprint arXiv:2401.06066

  46. [76]

    Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    arXiv preprint arXiv:2310.01801

  47. [77]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    arXiv preprint arXiv:2501.12599

  48. [78]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    arXiv preprint arXiv:2305.13245

  49. [79]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

  50. [80]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    arXiv preprint arXiv:2406.10774

  51. [81]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Advances in Neural Information Processing Systems

  52. [82]

    SnapKV: LLM Knows What You Are Looking for Before Generation

    arXiv preprint arXiv:2404.14469

  53. [83]

    MInference 1.0: Accelerating Pre-Filling for Long-Context LLMs via Dynamic Sparse Attention

    arXiv preprint arXiv:2407.02490

  54. [84]

    ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

    arXiv preprint arXiv:2412.03213

  55. [85]

    MagicPIG: LSH Sampling for Efficient LLM Generation

    arXiv preprint arXiv:2410.16179

  56. [86]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Advances in Neural Information Processing Systems

  57. [87]

    HashAttention: Semantic Sparsity for Faster Inference

    arXiv preprint arXiv:2412.14468

  58. [88]

    Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

    Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages

  59. [89]

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole

  60. [90]

    RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

    F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, et al.

  61. [91]

    CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges

    K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin

  62. [92]

    STaR: Bootstrapping Reasoning With Reasoning

    E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman

  63. [93]

    S. Yang, J. Kautz, and A. Hatamizadeh. CoRR, 2024

  64. [94]

    Generative Agents: Interactive Simulacra of Human Behavior

    J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein

  65. [95]

    N. Shazeer. CoRR