pith · machine review for the scientific record

arxiv: 2605.07363 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference


Pith reviewed 2026-05-11 02:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords sparse attention · mixture of experts · long-context inference · token indexer · LLM efficiency · router · hierarchical re-ranking

The pith

A router selects only eight indexer heads per query to match the performance of a full sixty-four-head sparse attention indexer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the expensive multi-head token indexer in DeepSeek Sparse Attention with a mixture-of-experts design called MISA. A cheap router examines block-level statistics and activates only a small subset of heads to score prefix tokens, while the rest stay idle. This change matches the dense indexer's accuracy on LongBench for two different base models and preserves accurate needle retrieval out to 128K tokens. The approach requires no retraining and yields a measured kernel speedup of roughly 3.82 times on a single NVIDIA H200 GPU. Readers care because long-context inference cost is currently dominated by the indexer step, so any reduction that preserves quality directly improves practical usability.

Core claim

MISA treats the indexer heads of DSA as an expert pool and routes each query to a small active subset chosen by block-level statistics. Only the routed heads compute the full token-level scores; a hierarchical variant then re-ranks an enlarged candidate pool with the original indexer to recover more than 92 percent of the original tokens per layer. With eight active heads the method equals the dense indexer on LongBench across DeepSeek-V3.2 and GLM-5, outperforms HISA on average, and keeps fully green needle-in-a-haystack maps up to 128K context.
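The two-stage selection can be sketched as below. Reducing head scoring to a dot product and aggregating across heads by summation are our simplifications, not the paper's exact formulas, and the router's head choice is stubbed out; the sketch only illustrates how the hierarchical variant recovers the dense selection.

```python
import numpy as np

def select_tokens(q, keys, head_idx, k):
    """Sum token scores over the given indexer heads; return the top-k token ids."""
    scores = np.einsum('hd,td->t', q[head_idx], keys)
    return set(np.argsort(scores)[-k:].tolist())

rng = np.random.default_rng(1)
H, d, T, k = 8, 16, 512, 32
q, keys = rng.normal(size=(H, d)), rng.normal(size=(T, d))

dense = select_tokens(q, keys, np.arange(H), k)  # all heads: the DSA indexer
routed = np.array([0, 3])                        # stand-in for the router's choice
misa = select_tokens(q, keys, routed, k)         # base MISA: routed heads only

# Hierarchical variant: the routed pass keeps an enlarged candidate set (k' = 4k),
# then the full head pool re-ranks the candidates down to the final k tokens.
cand = np.array(sorted(select_tokens(q, keys, routed, 4 * k)))
rerank = np.einsum('hd,td->t', q, keys[cand])
misa_h = set(cand[np.argsort(rerank)[-k:]].tolist())

iou = lambda a, b: len(a & b) / len(a | b)
print(f"IoU vs dense: base {iou(misa, dense):.2f}, hierarchical {iou(misa_h, dense):.2f}")
```

Because the routed stage only has to keep relevant tokens inside the candidate set rather than pinpoint the exact top-k, the re-ranking step is what pushes per-layer recovery above 92 percent in the paper's experiments.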

What carries the argument

The lightweight router that uses cheap block-level statistics to select a query-dependent subset of indexer heads for token scoring.
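A minimal sketch of such a router, assuming the block statistic is a per-head block-attention score over mean-pooled prefix keys (the variant the Figure 7 ablation singles out as the only one that consults past content); the pool size and the absence of any softmax are illustrative choices, not the paper's specification.

```python
import numpy as np

def route_heads(q, keys, num_blocks=8, h=2):
    """Pick h of the H indexer heads from cheap block-level statistics.

    q    : (H, d) per-head indexer query vectors for the current token
    keys : (T, d) shared prefix keys from the indexer key cache
    """
    # Mean-pool the prefix into a handful of block keys: (num_blocks, d)
    blocks = np.stack([b.mean(axis=0) for b in np.array_split(keys, num_blocks)])
    # Per-head block-attention score, averaged over the pooled blocks: (H,)
    head_score = (q @ blocks.T).mean(axis=1)
    # Only the h highest-scoring heads run the heavy token-level pass
    return np.argsort(head_score)[-h:]

rng = np.random.default_rng(0)
H, d, T = 64, 16, 4096
routed = route_heads(rng.normal(size=(H, d)), rng.normal(size=(T, d)), h=8)
print(sorted(routed.tolist()))
```

The router term costs O(H · num_blocks · d) per query against O(H · T · d) for full scoring, which is the "negligible router term computed on a small set of pooled keys" from the abstract.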

If this is right

  • Indexer compute scales directly with the number of active heads rather than the total head count.
  • The hierarchical re-ranking step recovers nearly the same final token set without any model retraining.
  • Needle-in-a-haystack retrieval quality remains intact at 128K context length.
  • The same router design can be dropped into existing DSA implementations on DeepSeek-V3.2 and GLM-5.
  • Kernel-level speedup reaches 3.82 times on a single H200 GPU while preserving benchmark parity.
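A back-of-the-envelope check of the first and last bullets, assuming indexer cost is linear in the number of active heads. Reading the gap between the ideal 8x and the measured 3.82x as non-scoring overhead via Amdahl's law is our interpretation, not a breakdown the paper reports.

```python
total_heads, active_heads = 64, 8   # DeepSeek-V3.2 pool vs MISA's routed subset
ideal = total_heads / active_heads  # speedup if head scoring were the only cost
measured = 3.82                     # reported TileLang kernel speedup on one H200

# Amdahl's law: the fraction of kernel time that is not per-head scoring,
# as implied by the measured speedup (illustrative, not a measured quantity)
serial = (1 / measured - 1 / ideal) / (1 - 1 / ideal)
print(f"ideal {ideal:.0f}x, measured {measured}x, implied non-scoring fraction {serial:.0%}")
```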

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block-statistic routing idea could be tested on other multi-head mechanisms that currently score every token with every head.
  • Jointly training the router with the base model might further improve selection accuracy beyond the current zero-shot results.
  • If block statistics prove sufficient here, similar lightweight routers may reduce cost in other sparsity or retrieval stages of long-context pipelines.
  • The 92 percent token overlap per layer sets a concrete target for future routing methods to match or exceed.

Load-bearing premise

Block-level statistics contain enough information for the router to pick a small set of heads whose token scores stay close in quality to the scores from every head.

What would settle it

Measure whether eight-head MISA and the full DSA indexer select materially different tokens on a long-context task where the full indexer succeeds but the reduced version drops accuracy below the reported LongBench match.
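Such a check could be harnessed with a per-layer recovery metric, assuming the selected token ids can be logged from both indexers; the function and variable names here are hypothetical, and the paper's 92 percent figure is the bar the output would be compared against.

```python
def per_layer_recovery(dense_sel, sparse_sel):
    """Fraction of the dense indexer's selected tokens recovered at each layer."""
    return [len(d & s) / len(d) for d, s in zip(dense_sel, sparse_sel)]

# Toy logs: per-layer top-4 token ids from each indexer
dense  = [{1, 5, 9, 12}, {2, 5, 7, 30}, {0, 3, 4, 8}]
sparse = [{1, 5, 9, 11}, {2, 5, 7, 30}, {0, 3, 6, 7}]
print([f"{r:.2f}" for r in per_layer_recovery(dense, sparse)])
# → ['0.75', '1.00', '0.50']; layers where recovery dips would flag routing failures
```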

Figures

Figures reproduced from arXiv: 2605.07363 by Fanxu Meng, Guangming Lu, Muhan Zhang, Ruijie Zhou, Tongxuan Liu, Wenjie Pei, Yufei Xu.

Figure 1: Comparison of the DSA and MISA indexers.
Figure 2: Needle-in-a-Haystack retrieval accuracy on DeepSeek-V3.2 up to 128K context. …
Figure 3: Indexer-kernel latency on a single NVIDIA H200 GPU for DSA and MISA, as a function of …
Figure 4: Sweep of the number of active heads h ∈ {1, 2, 4, 8, 16} on DeepSeek-V3.2 with B = 1024 and k = 2048 fixed, evaluated on NIAH at 128K. As expected, h = 1 and h = 2 prove too aggressive: one or two routed heads cannot cover the diversity of relevance patterns in DSA's HI = 64-head pool, and the heatmap shows visible accuracy holes. Setting h = 4 mitigates most of these deficiencies, but it cont…
Figure 5: Per-layer Intersection-over-Union between the indexer-selected token set and the DSA …
Figure 6: The same sweep repeated for the hierarchical variant MISA†, where the routed pass with h heads now selects an enlarged candidate set of size k′ = 4k = 8192, and the full HI = 64-head DSA indexer then re-ranks this candidate set to extract the final k = 2048 tokens. Because the routed stage only has to keep the relevant tokens inside the candidate set rather than pinpoint the exact top-k, the DSA refinement …
Figure 7: Ablation on the head-importance score Et,j used by the MISA router on DeepSeek-V3.2 (Needle-in-a-Haystack accuracy at 128K). (a) the indexer gating weight alone; (b) the ℓ2 norm of the query head; (c) the proposed block-attention score, averaged over the M pooled blocks of the prefix, the only variant that actually consults past content. The x-axis denotes context length and the y-axis the needle depth …
Figure 8: The resulting NIAH heatmaps. Across this entire sweep, retrieval accuracy is largely insensitive to B: the heatmaps are visually indistinguishable from the dense DSA reference for small to moderate block sizes, and only the very largest B values (where the router is forced to summarise the prefix into a handful of pooled keys) begin to lose enough locality to introduce a mild degradation …
Original abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MISA (Mixture of Indexer Sparse Attention) as a drop-in replacement for the DeepSeek Sparse Attention (DSA) indexer. It treats the multi-head indexer as a mixture-of-experts pool and introduces a lightweight router that uses block-level statistics to activate only a small subset of heads (e.g., 8) for token-level scoring, plus a hierarchical re-ranking variant that enlarges the candidate set and recovers tokens with the full indexer. With no additional training, MISA is claimed to match dense DSA performance on LongBench for DeepSeek-V3.2 and GLM-5 (while using 8x and 4x fewer heads), outperform HISA on average, preserve Needle-in-a-Haystack accuracy to 128K context, recover >92% of DSA-selected tokens (hierarchical case), and deliver a 3.82x speedup via a custom TileLang kernel.

Significance. If the empirical results hold, this is a practically significant engineering advance for long-context LLM inference. It reduces the dominant indexer cost of fine-grained sparse attention without retraining or architectural changes, while preserving accuracy on standard benchmarks and delivering measurable kernel speedups. The no-training constraint and custom kernel implementation are notable strengths that enhance applicability and reproducibility.

major comments (2)
  1. [Abstract and experimental results (LongBench and token-recovery sections)] The central LongBench equivalence claim for base MISA (8 active heads) rests on the assumption that block-level router statistics select heads whose per-token scores are sufficiently close to the full 64-head (or 32-head) DSA indexer. However, token-recovery metrics (>92%) are reported only for the hierarchical variant; the base MISA results therefore provide only indirect support via downstream accuracy, leaving the router's selection quality unverified in the manuscript.
  2. [Method description of the router (Section 3)] The router is described as using 'cheap block-level statistics' without training, yet the manuscript provides limited ablation or analysis of how these statistics correlate with head-specific token utility across contexts or layers. This is load-bearing for the no-retraining claim, as poor correlation would cause systematic under-scoring of important tokens.
minor comments (2)
  1. [Abstract] The abstract states 'roughly a 3.82 times speedup' over DSA's original indexer kernel on a single NVIDIA H200 GPU; include the exact baseline kernel implementation details, input sizes, and measurement methodology for clarity.
  2. [Experimental results] LongBench results would be strengthened by reporting variance, multiple seeds, or error bars, given the empirical nature of the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the practical significance of MISA as an engineering advance for long-context inference. We address each major comment below with clarifications and proposed revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and experimental results (LongBench and token-recovery sections)] The central LongBench equivalence claim for base MISA (8 active heads) rests on the assumption that block-level router statistics select heads whose per-token scores are sufficiently close to the full 64-head (or 32-head) DSA indexer. However, token-recovery metrics (>92%) are reported only for the hierarchical variant; the base MISA results therefore provide only indirect support via downstream accuracy, leaving the router's selection quality unverified in the manuscript.

    Authors: We thank the referee for this observation. Direct token-recovery rates are reported only for the hierarchical variant. However, the base MISA (8 heads) achieves LongBench scores equivalent to the full DSA indexer on both DeepSeek-V3.2 and GLM-5. Because LongBench tasks depend on the relevance of retrieved context tokens, this downstream parity provides evidence that the router-selected heads yield sufficiently aligned scores for practical use. To strengthen the manuscript, we will add a direct verification subsection (with token-overlap and score-correlation metrics) comparing base MISA selections to full DSA on a representative subset of LongBench examples. revision: yes

  2. Referee: [Method description of the router (Section 3)] The router is described as using 'cheap block-level statistics' without training, yet the manuscript provides limited ablation or analysis of how these statistics correlate with head-specific token utility across contexts or layers. This is load-bearing for the no-retraining claim, as poor correlation would cause systematic under-scoring of important tokens.

    Authors: We agree that expanded analysis of the router statistics would reinforce the no-retraining claim. The statistics are computed from block-level aggregates of the shared key projections and are intended to serve as a lightweight proxy for head-specific relevance. In the revision we will augment Section 3 with quantitative correlation analysis (e.g., Pearson coefficients between block statistics and full per-token head scores) across layers, context lengths, and evaluation datasets. This will demonstrate the reliability of the proxy without requiring additional training. revision: yes
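The proposed correlation analysis could be prototyped as below. The synthetic block statistic, modeled here as a noisy proxy for each head's mean full token score, is purely illustrative of what the revision would compute on real logged scores.

```python
import numpy as np

def router_score_correlation(block_stats, full_scores):
    """Pearson r between the cheap per-head router statistic and the mean
    of each head's full token-level scores, computed across the head pool."""
    return np.corrcoef(block_stats, full_scores.mean(axis=1))[0, 1]

rng = np.random.default_rng(2)
H, T = 64, 1024
full_scores = rng.normal(size=(H, T))                               # expensive ground truth
block_stats = full_scores.mean(axis=1) + 0.01 * rng.normal(size=H)  # noisy cheap proxy
r = router_score_correlation(block_stats, full_scores)
print(f"Pearson r = {r:.2f}")
```

A low r on real data at some layer or context length would localize exactly where the no-retraining assumption breaks.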

Circularity Check

0 steps flagged

MISA is an empirical engineering modification with no circular derivation

Rationale

The paper describes MISA as a router-based selection over an existing DSA indexer pool, using block-level statistics to activate a small subset of heads. All performance claims (LongBench parity, >92% token recovery in the hierarchical variant, Needle-in-a-Haystack preservation, and kernel speedup) are established through direct head-to-head experiments on DeepSeek-V3.2 and GLM-5 without any equations that reduce the reported gains to quantities defined inside the same derivation or to self-cited prior results. The method contains no self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems; the central results rest on external benchmark comparisons rather than internal re-derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central performance claims rest on the unproven but empirically tested assumption that block-level statistics suffice for head routing and that the hierarchical step recovers token sets; the number of active heads is chosen by hand for the reported tradeoff.

free parameters (1)
  • number of active heads = 8
    Set to eight in the main experiments to achieve the stated accuracy-speed balance; value is not derived from first principles.
axioms (1)
  • domain assumption Block-level statistics contain enough signal to route to effective indexer heads
    Invoked to justify the lightweight router; validated only through end-to-end benchmark results.
invented entities (2)
  • lightweight router (no independent evidence)
    purpose: Selects a query-dependent subset of indexer heads using block statistics
    New component introduced to realize the mixture-of-experts behavior for the indexer.
  • hierarchical variant (no independent evidence)
    purpose: Enlarges candidate set with routed heads then re-ranks with full DSA indexer
    Additional procedure proposed to recover nearly exact token selection.

pith-pipeline@v0.9.0 · 5656 in / 1571 out tokens · 76351 ms · 2026-05-11T02:24:44.026287+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
