Pith · machine review for the scientific record

arxiv: 2502.11089 · v2 · submitted 2025-02-16 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 Lean theorem links

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords sparse attention · long-context modeling · efficient transformers · native trainability · hardware optimization · language models · token compression

The pith

NSA introduces a natively trainable sparse attention mechanism that matches full-attention performance on long contexts while delivering major speedups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Native Sparse Attention (NSA) as a method to make long-context language modeling computationally feasible without the full cost of standard attention. It combines coarse token compression with fine token selection in a dynamic hierarchy that keeps global awareness and local detail intact. This structure supports end-to-end training from scratch, so the model learns to rely on the sparse pattern rather than needing a separate dense pretraining phase. Experiments indicate the resulting models hold their own or improve on standard benchmarks, long-context tasks, and reasoning while running faster in forward passes, backward passes, and decoding on sequences up to 64k tokens.

Core claim

NSA achieves efficient long-context modeling by integrating a dynamic hierarchical sparse strategy—coarse-grained token compression followed by fine-grained token selection—with hardware-aligned optimizations that balance arithmetic intensity, enabling both substantial speedups over full attention and end-to-end native training that maintains or exceeds full-attention performance across general benchmarks, long-context tasks, and instruction-based reasoning.
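As a rough sense of the scale at stake (an illustration, not a figure from the paper): per-query attention cost grows with the number of key/value tokens each query can see, so capping that number with a sparse budget turns the quadratic term into a linear one. The 4096-token budget and 128-dimension head below are hypothetical round numbers, not NSA's actual configuration, and the count ignores softmax and memory traffic.

```python
# Illustrative cost model (not from the paper): per-head attention matmul cost
# scales with the number of key/value tokens each query can see.

def attn_matmul_flops(seq_len: int, d_head: int, kv_visible: int) -> int:
    # QK^T scores plus the score-weighted sum over V: roughly
    # 2 matmuls * 2 FLOPs per multiply-add * seq_len * kv_visible * d_head.
    return 2 * 2 * seq_len * kv_visible * d_head

full = attn_matmul_flops(seq_len=65536, d_head=128, kv_visible=65536)
sparse = attn_matmul_flops(seq_len=65536, d_head=128, kv_visible=4096)
print(full // sparse)  # 16 — the ratio is just 65536 / 4096
```

Arithmetic like this only bounds the best case; the paper's hardware-aligned kernels matter precisely because real speedups depend on memory access patterns, not FLOP counts alone.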

What carries the argument

The dynamic hierarchical sparse strategy that first compresses tokens at coarse granularity and then selects at fine granularity to preserve both global context and local precision.
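The coarse-to-fine idea can be sketched in a few lines. This is a toy with mean-pooled blocks and a single query; NSA itself uses learned compression, a sliding-window branch, and GQA-aligned kernels, so treat everything here (names, block sizes, pooling choice) as illustrative assumptions.

```python
import numpy as np

def coarse_to_fine_mask(q, K, block_size=4, top_blocks=2):
    """Toy two-stage selection: score mean-pooled key blocks against the
    query (coarse stage), then keep every token inside the highest-scoring
    blocks (fine stage)."""
    T, d = K.shape
    n_blocks = T // block_size
    # Coarse stage: compress each block of keys to its mean vector.
    block_means = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_means @ q
    # Fine stage: attend only to tokens inside the top-scoring blocks.
    keep = np.argsort(block_scores)[-top_blocks:]
    mask = np.zeros(T, dtype=bool)
    for b in keep:
        mask[b * block_size : (b + 1) * block_size] = True
    return mask

# Demo: only the key block aligned with the query survives selection.
K_demo = np.zeros((8, 2))
K_demo[4:6] = 1.0                      # tokens 4-5 match the query direction
mask = coarse_to_fine_mask(np.ones(2), K_demo, block_size=2, top_blocks=1)
print(mask.nonzero()[0])               # [4 5]
```

The block structure is what makes the scheme hardware-friendly: whole contiguous key/value blocks are kept or skipped, which maps cleanly onto tiled kernel loads.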

Load-bearing premise

The hierarchical compression and selection steps preserve necessary context without introducing systematic biases that would hurt performance on new long-context distributions.

What would settle it

A clear drop in accuracy or reasoning quality on any long-context benchmark when training from scratch with NSA versus an otherwise identical full-attention model.

read the original abstract

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Native Sparse Attention (NSA), a hardware-aligned sparse attention mechanism that uses a dynamic hierarchical strategy combining coarse-grained token compression with fine-grained token selection. It claims this enables end-to-end trainable sparsity, delivering substantial speedups over full attention on 64k-length sequences during decoding, forward, and backward passes while maintaining or exceeding full-attention performance on general benchmarks, long-context tasks, and instruction-based reasoning.

Significance. If the central performance parity holds under broader scrutiny, NSA would represent a meaningful advance in efficient long-context modeling by making sparsity natively trainable and hardware-optimized rather than post-hoc, potentially reducing pretraining and inference costs for next-generation models without sacrificing capability.

major comments (2)
  1. [Experiments (Figure 1 and associated results)] The headline claim that NSA 'maintains or exceeds' full attention rests on evaluations confined to standard pretraining distributions and fixed 64k sequences (as summarized around Figure 1). No ablation or evaluation on shifted distributions—different domains, lengths >64k, or adversarial long-range dependencies—is reported, leaving the assumption that coarse-to-fine selection avoids systematic information loss untested and load-bearing for the generalization argument.
  2. [Method description of dynamic hierarchical sparse strategy] The manuscript asserts that the hierarchical strategy 'preserves both global context awareness and local precision,' yet provides no quantitative analysis (e.g., attention-map comparisons or information-retention metrics) showing that compression does not discard critical tokens on out-of-distribution inputs; this directly affects whether NSA can be positioned as a drop-in replacement.
minor comments (2)
  1. Clarify the precise benchmark suites, data splits, and number of runs underlying the 'maintains or exceeds' statement in the abstract and Figure 1 to allow reproducibility assessment.
  2. The abstract references arithmetic-intensity-balanced design and hardware optimizations; the main text should include explicit pseudocode or kernel-level details for the forward/backward passes to substantiate the reported speedups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and committing to revisions that strengthen the generalization claims and supporting analyses.

read point-by-point responses
  1. Referee: [Experiments (Figure 1 and associated results)] The headline claim that NSA 'maintains or exceeds' full attention rests on evaluations confined to standard pretraining distributions and fixed 64k sequences (as summarized around Figure 1). No ablation or evaluation on shifted distributions—different domains, lengths >64k, or adversarial long-range dependencies—is reported, leaving the assumption that coarse-to-fine selection avoids systematic information loss untested and load-bearing for the generalization argument.

    Authors: We appreciate the referee's emphasis on rigorous generalization testing. Our reported results demonstrate performance parity or improvement across diverse long-context tasks and benchmarks that span multiple domains and dependency structures within the evaluated 64k context length. Nevertheless, we agree that explicit ablations on lengths beyond 64k and adversarial or domain-shifted inputs would provide stronger evidence. In the revised manuscript we will add experiments evaluating NSA on sequences up to 128k as well as on out-of-distribution data to directly test the robustness of the coarse-to-fine selection mechanism. revision: yes

  2. Referee: [Method description of dynamic hierarchical sparse strategy] The manuscript asserts that the hierarchical strategy 'preserves both global context awareness and local precision,' yet provides no quantitative analysis (e.g., attention-map comparisons or information-retention metrics) showing that compression does not discard critical tokens on out-of-distribution inputs; this directly affects whether NSA can be positioned as a drop-in replacement.

    Authors: We acknowledge that direct quantitative evidence, such as attention-map comparisons and token-retention metrics on out-of-distribution inputs, would make the preservation argument more explicit. The end-to-end training results and benchmark parity already indicate that critical information is retained in practice. To address the referee's point, the revised manuscript will include attention-map visualizations and information-retention metrics computed on both in-distribution and out-of-distribution examples, thereby supporting the claim that the hierarchical strategy functions as a reliable drop-in replacement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper advances an algorithmic design for native sparse attention via dynamic hierarchical token compression and selection, with hardware-aligned optimizations. Claims of maintained or superior performance and speedups rest on empirical pretraining and benchmark results across general, long-context, and reasoning tasks rather than any closed mathematical derivation. No equations reduce claimed outcomes to fitted parameters on the same data, no self-citations serve as load-bearing uniqueness theorems, and no ansatz is smuggled via prior work. The central mechanism is presented as an explicit design choice whose validity is tested externally on standard distributions, making the derivation self-contained against the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that token importance can be reliably estimated via a two-stage compression-plus-selection process without losing critical information, plus standard assumptions about transformer training dynamics.

axioms (1)
  • domain assumption Token importance can be approximated hierarchically without significant information loss for downstream tasks
    Invoked in the description of the dynamic sparse strategy that combines coarse and fine-grained selection.
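This single axiom is also where the referee's major objections bite. A hypothetical counterexample (constructed here, not taken from the paper, and using simple mean pooling rather than NSA's learned compression) shows how block-level scoring can hide one critical token inside an otherwise low-scoring block:

```python
import numpy as np

q = np.array([1.0])
K = np.array([[10.0], [-10.0], [-10.0], [-10.0],  # block 0: one vital token buried among distractors
              [1.0], [1.0], [1.0], [1.0]])        # block 1: uniformly mediocre tokens
block = 4

# Coarse stage scores each block by its mean key: [-5.0, 1.0].
coarse_scores = K.reshape(2, block, 1).mean(axis=1) @ q
picked_block = int(np.argmax(coarse_scores))       # 1 — the mediocre block wins
best_token_block = int(np.argmax(K @ q)) // block  # 0 — the critical token lives elsewhere
print(picked_block, best_token_block)              # 1 0
```

Whether learned compression avoids such cases on shifted distributions is exactly what the out-of-distribution retention metrics requested in the referee report would measure.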

pith-pipeline@v0.9.0 · 5538 in / 1167 out tokens · 34946 ms · 2026-05-16T23:43:08.833294+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  2. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss...

  3. Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

    cs.DC 2026-04 unverdicted novelty 7.0

    GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.

  4. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  5. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  6. Z-Order Transformer for Feed-Forward Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 6.0

    A Z-order transformer organizes unstructured Gaussians for sparse attention, enabling feed-forward prediction of high-quality 3D splats with fewer primitives.

  7. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  8. AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

    cs.AR 2026-04 unverdicted novelty 6.0

    AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.

  9. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  10. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

    cs.LG 2025-12 unverdicted novelty 6.0

    BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

  11. BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    cs.CL 2025-12 unverdicted novelty 6.0

    BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

  12. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  13. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    cs.CL 2025-07 unverdicted novelty 6.0

    MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

  14. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  15. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  16. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  17. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  18. Challenges and opportunities for AI to help deliver fusion energy

    physics.plasm-ph 2026-03 unverdicted novelty 2.0

    AI offers opportunities to advance fusion energy R&D but requires responsible practices and expert collaborations to overcome its inherent challenges.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 23 internal anchors

  1. [10]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. 2024. URL https://arxiv.org/abs/2405.04434

  2. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  3. [22]

    LLMTest NeedleInAHaystack

    G. Kamradt. LLMTest NeedleInAHaystack. GitHub repository, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack

  4. [26]

    J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In S. Follmer, J. Han, J. Steimle, and N. H. Riche, editors, Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Francisco, CA, USA, 29 October 2023-- 1 November 2...

  5. [27]

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. In ICLR . OpenReview.net, 2024

  6. [28]

    N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019

  7. [31]

    Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

    P. Tillet, H.-T. Kung, and D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10-19, 2019

  8. [32]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  9. [34]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022

  10. [36]

    InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

    C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a

  11. [39]

    Big Bird: Transformers for Longer Sequences

    M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283-17297, 2020

  12. [40]

    STaR: Bootstrapping Reasoning With Reasoning

    E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022

  13. [41]

    RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

    F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J. Lou, and W. Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages...

  14. [42]

    CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges

    K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In L. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August...

  15. [43]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661-34710, 2023b

  16. [46]

    Efficient Streaming Language Models with Attention Sinks

    arXiv preprint arXiv:2309.17453

  17. [47]

    Linformer: Self-Attention with Linear Complexity

    arXiv preprint arXiv:2006.04768

  18. [48]

    Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

    arXiv preprint arXiv:2410.19258

  19. [49]

    DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    arXiv preprint arXiv:2410.10819

  20. [50]

    TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

    arXiv preprint arXiv:2411.02886

  21. [51]

    SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

    arXiv preprint arXiv:2412.12094

  22. [52]

    Longformer: The Long-Document Transformer

    arXiv preprint arXiv:2004.05150

  23. [53]

    BUZZ: Beehive-Structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

    arXiv preprint arXiv:2410.23079

  24. [54]

    MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

    arXiv preprint arXiv:2406.14909

  25. [55]

    SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

    arXiv preprint arXiv:2410.13276

  26. [56]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    arXiv preprint arXiv:2312.00752

  27. [57]

    InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

    The Thirty-eighth Annual Conference on Neural Information Processing Systems

  28. [58]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    arXiv preprint arXiv:2402.19427

  29. [59]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    2025

  30. [60]

    Big Bird: Transformers for Longer Sequences

    Advances in Neural Information Processing Systems

  31. [61]

    Attention Is All You Need

    Advances in Neural Information Processing Systems

  32. [62]

    LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

    arXiv preprint arXiv:2310.05736

  33. [63]

    LLM MapReduce: Simplified Long-Sequence Processing Using Large Language Models

    arXiv preprint arXiv:2410.09342

  34. [64]

    Program Synthesis with Large Language Models

    arXiv preprint arXiv:2108.07732

  35. [65]

    G. Kamradt, 2023

  36. [66]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    arXiv preprint arXiv:2308.14508

  37. [67]

    Training Verifiers to Solve Math Word Problems

    2021. URL https://arxiv.org/abs/2110.14168

  38. [68]

    CMMLU: Measuring Massive Multitask Language Understanding in Chinese

    arXiv preprint arXiv:2306.09212

  39. [69]

    Evaluating Large Language Models Trained on Code

    arXiv preprint arXiv:2107.03374

  40. [70]

    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    arXiv preprint arXiv:1903.00161

  41. [71]

    Measuring Massive Multitask Language Understanding

    arXiv preprint arXiv:2009.03300

  42. [72]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    arXiv preprint arXiv:2406.01574

  43. [73]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    arXiv preprint arXiv:2210.09261

  44. [74]

    Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context

    arXiv preprint arXiv:2403.05530

  45. [75]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    arXiv preprint arXiv:2401.06066

  46. [76]

    Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    arXiv preprint arXiv:2310.01801

  47. [77]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    arXiv preprint arXiv:2501.12599

  48. [78]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    arXiv preprint arXiv:2305.13245

  49. [79]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

  50. [80]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    arXiv preprint arXiv:2406.10774

  51. [81]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Advances in Neural Information Processing Systems

  52. [82]

    SnapKV: LLM Knows What You Are Looking for Before Generation

    arXiv preprint arXiv:2404.14469

  53. [83]

    MInference 1.0: Accelerating Pre-Filling for Long-Context LLMs via Dynamic Sparse Attention

    arXiv preprint arXiv:2407.02490

  54. [84]

    ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

    arXiv preprint arXiv:2412.03213

  55. [85]

    MagicPIG: LSH Sampling for Efficient LLM Generation

    arXiv preprint arXiv:2410.16179

  56. [86]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Advances in Neural Information Processing Systems

  57. [87]

    HashAttention: Semantic Sparsity for Faster Inference

    arXiv preprint arXiv:2412.14468

  58. [88]

    Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

    Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages

  59. [89]

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole

  60. [90]

    RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

    F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, et al.

  61. [91]

    CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges

    K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin

  62. [92]

    STaR: Bootstrapping Reasoning With Reasoning

    E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman

  63. [93]

    S. Yang, J. Kautz, and A. Hatamizadeh. CoRR, 2024

  64. [94]

    Generative Agents: Interactive Simulacra of Human Behavior

    J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein

  65. [95]

    N. Shazeer. CoRR