Recognition: 2 theorem links · Lean Theorem
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Pith reviewed 2026-05-16 23:43 UTC · model grok-4.3
The pith
NSA introduces a natively trainable sparse attention mechanism that matches full-attention performance on long contexts while delivering major speedups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NSA achieves efficient long-context modeling by integrating a dynamic hierarchical sparse strategy (coarse-grained token compression followed by fine-grained token selection) with hardware-aligned optimizations that balance arithmetic intensity. Together these enable substantial speedups over full attention and end-to-end native training that maintains or exceeds full-attention performance across general benchmarks, long-context tasks, and instruction-based reasoning.
What carries the argument
The dynamic hierarchical sparse strategy that first compresses tokens at coarse granularity and then selects at fine granularity to preserve both global context and local precision.
Load-bearing premise
The hierarchical compression and selection steps preserve necessary context without introducing systematic biases that would hurt performance on new long-context distributions.
What would settle it
A clear drop in accuracy or reasoning quality on any long-context benchmark when training from scratch with NSA versus an otherwise identical full-attention model.
Original abstract
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
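The abstract's coarse-to-fine design can be made concrete with a minimal sketch. The toy below (single query, PyTorch) replaces NSA's learned compression MLP, gated three-branch combination, and sliding-window branch with mean pooling and hard top-k block selection; it illustrates the shape of the strategy, not the paper's implementation, and every name in it is ours.

```python
import torch
import torch.nn.functional as F

def hierarchical_sparse_attention(q, K, V, block_size=64, top_k=4):
    # Toy coarse-to-fine sparse attention for ONE query vector.
    # Coarse stage: summarize each key block; fine stage: attend only over
    # tokens in the highest-scoring blocks. Mean pooling stands in for NSA's
    # learned compressor (an assumption, not the paper's design).
    T, d = K.shape
    n_blocks = T // block_size  # assume T % block_size == 0 for simplicity

    K_comp = K.view(n_blocks, block_size, d).mean(dim=1)   # (n_blocks, d)
    block_scores = K_comp @ q / d ** 0.5                   # coarse relevance
    keep = block_scores.topk(min(top_k, n_blocks)).indices

    # Gather the tokens of the selected blocks (contiguous, hardware-friendly).
    idx = (keep[:, None] * block_size
           + torch.arange(block_size, device=K.device)).reshape(-1)
    scores = K[idx] @ q / d ** 0.5
    return F.softmax(scores, dim=0) @ V[idx]               # (d,)

# Usage: 1k context, 128-dim head; only 4 of 16 blocks are attended.
q, K, V = torch.randn(128), torch.randn(1024, 128), torch.randn(1024, 128)
out = hierarchical_sparse_attention(q, K, V)
```

Even in this toy, the cost structure the abstract leans on is visible: per-query work falls from O(T) to O(T/block_size + top_k * block_size). In the paper's trainable design the block scores come from a learned compression branch rather than the hard mean-pool-plus-top-k shortcut used here.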
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Native Sparse Attention (NSA), a hardware-aligned sparse attention mechanism that uses a dynamic hierarchical strategy combining coarse-grained token compression with fine-grained token selection. It claims this enables end-to-end trainable sparsity, delivering substantial speedups over full attention on 64k-length sequences during decoding, forward, and backward passes while maintaining or exceeding full-attention performance on general benchmarks, long-context tasks, and instruction-based reasoning.
Significance. If the central performance parity holds under broader scrutiny, NSA would represent a meaningful advance in efficient long-context modeling by making sparsity natively trainable and hardware-optimized rather than post-hoc, potentially reducing pretraining and inference costs for next-generation models without sacrificing capability.
major comments (2)
- [Experiments (Figure 1 and associated results)] The headline claim that NSA 'maintains or exceeds' full attention rests on evaluations confined to standard pretraining distributions and fixed 64k sequences (as summarized around Figure 1). No ablation or evaluation on shifted distributions—different domains, lengths >64k, or adversarial long-range dependencies—is reported, leaving the assumption that coarse-to-fine selection avoids systematic information loss untested and load-bearing for the generalization argument.
- [Method description of dynamic hierarchical sparse strategy] The manuscript asserts that the hierarchical strategy 'preserves both global context awareness and local precision,' yet provides no quantitative analysis (e.g., attention-map comparisons or information-retention metrics) showing that compression does not discard critical tokens on out-of-distribution inputs; this directly affects whether NSA can be positioned as a drop-in replacement.
minor comments (2)
- Clarify the precise benchmark suites, data splits, and number of runs underlying the 'maintains or exceeds' statement in the abstract and Figure 1 to allow reproducibility assessment.
- The abstract references arithmetic-intensity-balanced design and hardware optimizations; the main text should include explicit pseudocode or kernel-level details for the forward/backward passes to substantiate the reported speedups.
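The referee's request for kernel-level detail aside, the arithmetic-intensity claim itself admits a back-of-envelope check. The accounting below is our own construction, not the paper's: it assumes fp16 KV, a GQA group whose heads share one blockwise selection, and counts only the dominant q·k and attn·v FLOPs.

```python
def decode_arithmetic_intensity(heads_per_group, d_head, block_size, n_blocks):
    # Rough FLOPs-per-byte estimate for one GQA group decoding one token,
    # assuming fp16 KV and a blockwise selection shared across the group so
    # each KV block is fetched once and reused by every head in the group.
    # Illustrative accounting only; constants are not from the NSA paper.
    tokens = n_blocks * block_size
    flops = heads_per_group * tokens * 4 * d_head   # q.k (2d) + attn.v (2d)
    bytes_loaded = tokens * 2 * d_head * 2          # K and V, 2 bytes/elem
    return flops / bytes_loaded

print(decode_arithmetic_intensity(16, 128, 64, 16))  # -> 16.0
# With per-token, per-head scattered selection, each head fetches its own
# tokens, so bytes scale with heads too and the ratio collapses toward 1.0:
# decoding stays memory-bound.
```

The punchline of this sketch is that intensity scales with the number of heads sharing a block fetch, which is one reading of why the paper pairs blockwise selection with GQA-style grouping.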
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and committing to revisions that strengthen the generalization claims and supporting analyses.
Point-by-point responses
- Referee: [Experiments (Figure 1 and associated results)] The headline claim that NSA 'maintains or exceeds' full attention rests on evaluations confined to standard pretraining distributions and fixed 64k sequences (as summarized around Figure 1). No ablation or evaluation on shifted distributions—different domains, lengths >64k, or adversarial long-range dependencies—is reported, leaving the assumption that coarse-to-fine selection avoids systematic information loss untested and load-bearing for the generalization argument.
  Authors: We appreciate the referee's emphasis on rigorous generalization testing. Our reported results demonstrate performance parity or improvement across diverse long-context tasks and benchmarks that span multiple domains and dependency structures within the evaluated 64k context length. Nevertheless, we agree that explicit ablations on lengths beyond 64k and adversarial or domain-shifted inputs would provide stronger evidence. In the revised manuscript we will add experiments evaluating NSA on sequences up to 128k as well as on out-of-distribution data to directly test the robustness of the coarse-to-fine selection mechanism. (revision: yes)
- Referee: [Method description of dynamic hierarchical sparse strategy] The manuscript asserts that the hierarchical strategy 'preserves both global context awareness and local precision,' yet provides no quantitative analysis (e.g., attention-map comparisons or information-retention metrics) showing that compression does not discard critical tokens on out-of-distribution inputs; this directly affects whether NSA can be positioned as a drop-in replacement.
  Authors: We acknowledge that direct quantitative evidence, such as attention-map comparisons and token-retention metrics on out-of-distribution inputs, would make the preservation argument more explicit. The end-to-end training results and benchmark parity already indicate that critical information is retained in practice. To address the referee's point, the revised manuscript will include attention-map visualizations and information-retention metrics computed on both in-distribution and out-of-distribution examples, thereby supporting the claim that the hierarchical strategy functions as a reliable drop-in replacement. (revision: yes)
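One concrete form the promised information-retention metric could take (our sketch, not the authors' committed methodology) is attention recall: the fraction of full-softmax attention mass that lands on the tokens the sparse mechanism actually selected.

```python
import torch
import torch.nn.functional as F

def attention_recall(q, K, selected_idx):
    # Share of full-attention probability mass captured by the selected
    # tokens for one query. Values near 1.0 suggest compression discarded
    # little; low values flag queries whose critical tokens were dropped.
    # q: (d,), K: (T, d), selected_idx: LongTensor of kept token positions.
    full = F.softmax(K @ q / K.shape[1] ** 0.5, dim=0)   # (T,)
    return full[selected_idx].sum().item()
```

Averaged over in-distribution and shifted queries, a curve of recall versus selection budget would directly test the load-bearing premise identified earlier in this review.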
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper advances an algorithmic design for native sparse attention via dynamic hierarchical token compression and selection, with hardware-aligned optimizations. Claims of maintained or superior performance and speedups rest on empirical pretraining and benchmark results across general, long-context, and reasoning tasks rather than any closed mathematical derivation. No equations reduce claimed outcomes to fitted parameters on the same data, no self-citations serve as load-bearing uniqueness theorems, and no ansatz is smuggled via prior work. The central mechanism is presented as an explicit design choice whose validity is tested externally on standard distributions, making the derivation self-contained against the reported experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Token importance can be approximated hierarchically without significant information loss for downstream tasks.
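The failure mode this axiom excludes is easy to exhibit in a toy setting. In the sketch below (our construction; NSA's compressor is learned, not a mean-pool), a single critical "needle" key is averaged away by its block-mates, so the coarse block score misses what fine attention would find.

```python
import torch

torch.manual_seed(0)
d, block = 4, 8
q = torch.zeros(d); q[0] = 1.0

needle = torch.zeros(block, d)
needle[0, 0] = 5.0                     # one critical token...
needle[1:, 0] = -5.0 / (block - 1)     # ...whose block-mates cancel its mean
filler = 0.1 * torch.randn(block, d)   # an unremarkable comparison block

for name, blk in [("needle block", needle), ("filler block", filler)]:
    coarse = (blk.mean(dim=0) @ q).item()   # compressed-block score
    fine = (blk @ q).max().item()           # best token-level score inside
    print(f"{name}: coarse={coarse:+.2f}  best-fine={fine:+.2f}")
# needle block: coarse=+0.00  best-fine=+5.00
# Mean-pool compression hides the needle; a learned compressor must avoid
# exactly this collapse for the ledger's assumption to hold.
```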
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem. Linked passage: "NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision."
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem. Linked passage: "We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
  MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
- Long Context Pre-Training with Lighthouse Attention
  Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
- Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
  GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.
- Neural Garbage Collection: Learning to Forget while Learning to Reason
  Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
- Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
  SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
- Z-Order Transformer for Feed-Forward Gaussian Splatting
  A Z-order transformer organizes unstructured Gaussians for sparse attention, enabling feed-forward prediction of high-quality 3D splats with fewer primitives.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
  AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
  BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
- BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
  BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
- Kimi Linear: An Expressive, Efficient Attention Architecture
  Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
  MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
  Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
- Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
  LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
  Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- Challenges and opportunities for AI to help deliver fusion energy
  AI offers opportunities to advance fusion energy R&D but requires responsible practices and expert collaborations to overcome its inherent challenges.
Reference graph
Works this paper leans on
- [10] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. 2024. URL https://arxiv.org/abs/2405.04434.
- [11] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. 2025. URL https://arxiv.org/abs/2501.12948.
- [22] G. Kamradt. LLMTest NeedleInAHaystack. GitHub repository, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack.
- [26] J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST 2023), San Francisco, CA, USA, 2023.
- [27] B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. In ICLR. OpenReview.net, 2024.
- [28] N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019.
- [31] P. Tillet, H.-T. Kung, and D. Cox. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10-19, 2019.
- [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
- [34] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
- [36] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
- [39]
- [40] E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- [41] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J. Lou, and W. Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 2023.
- [42] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, 2024.
- [43]
- [46] Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
- [47] Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- [48] Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258.
- [49] DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819.
- [50] TokenSelect: Efficient long-context inference and length extrapolation for LLMs via dynamic token-level KV cache selection. arXiv preprint arXiv:2411.02886.
- [51] SepLLM: Accelerate large language models by compressing one segment into one separator. arXiv preprint arXiv:2412.12094.
- [52] Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- [53] BUZZ: Beehive-structured sparse KV cache with segmented heavy hitters for efficient LLM inference. arXiv preprint arXiv:2410.23079.
- [54] MoA: Mixture of sparse attention for automatic large language model compression. arXiv preprint arXiv:2406.14909.
- [55] SeerAttention: Learning intrinsic sparse attention in your LLMs. arXiv preprint arXiv:2410.13276.
- [56] Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- [57] InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
- [58] Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427.
- [59] DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. 2025.
- [60] Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems.
- [61] Attention is all you need. Advances in Neural Information Processing Systems.
- [62] LLMLingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736.
- [63] LLM MapReduce: Simplified long-sequence processing using large language models. arXiv preprint arXiv:2410.09342.
- [64] Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- [65]
- [66] LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
- [67] Training verifiers to solve math word problems. 2021. URL https://arxiv.org/abs/2110.14168.
- [68] CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212.
- [69] Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [70] DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161.
- [71] Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- [72] MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.
- [73] Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
- [74] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
- [75] DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
- [76] Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv preprint arXiv:2310.01801.
- [77] Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.
- [78] GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
- [79] DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.
- [80] Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774.
- [81] H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems.
- [82] SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
- [83] MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv preprint arXiv:2407.02490.
- [84] ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression. arXiv preprint arXiv:2412.03213.
- [85] MagicPIG: LSH sampling for efficient LLM generation. arXiv preprint arXiv:2410.16179.
- [86] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
- [87] HashAttention: Semantic sparsity for faster inference. arXiv preprint arXiv:2412.14468.
- [88] Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages.
- [89] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models.
- [90] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation.
- [91] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges.
- [92] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning.
- [93] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. CoRR, 2024.
- [94] Joon Sung Park, Joseph C. O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior.
- [95]