arxiv: 2605.16893 · v1 · pith:FQXNA77O · submitted 2026-05-16 · 💻 cs.AI

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

Pith reviewed 2026-05-19 20:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords n-gram memory · training-free module · plug-and-play LLM · knowledge injection · cosine-gated injector · memory augmentation · efficient retrieval
The pith

Averaging pretrained token embeddings creates n-gram representations that a cosine-gated injector adds to LLMs without training or extra parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces N-gram Memory (NGM) as a way to give large language models access to additional knowledge by building n-gram representations directly from the model's own pretrained embeddings. Instead of training new memory vectors, it averages existing token embeddings to form these representations and then uses a simple cosine similarity gate to decide when and how much to inject them into the model's internal states. This plug-and-play approach requires no fine-tuning and no separate retrieval system. A sympathetic reader would care because it suggests that useful memory augmentation can be achieved with zero additional training cost and minimal architectural change, potentially making knowledge-enhanced models more accessible. Evaluations show consistent gains across model sizes on both text and multimodal tasks.

Core claim

NGM consists of a Causal N-Gram Encoder that constructs n-gram representations by averaging the backbone model's pretrained token embeddings and a Cosine-Gated Memory Injector that modulates these representations into contextual hidden states using a non-parametric cosine gate combined with ReLU. When integrated into Qwen3 models from 0.6B to 14B parameters, this module raises average benchmark scores by 0.5 to 1.2 points and delivers larger lifts on code generation and knowledge-intensive tasks such as +3.0 on LiveCodeBench and +3.03 on GPQA for the 14B model, while also improving multimodal performance.
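
One possible way to write the two components down, offered as a sketch rather than the paper's own notation. The symbols are assumptions introduced here: e_x is the pretrained embedding of token x, g_t^{(n)} is the n-gram vector ending at position t, h_t^{(\ell)} is the hidden state at an injection layer \ell, and \mathcal{N} is the set of n-gram sizes. The residual form of the last line is also assumed; for positions with fewer than n tokens of history the window would simply be shorter.

```latex
\begin{align}
g_t^{(n)} &= \frac{1}{n}\sum_{k=0}^{n-1} e_{x_{t-k}}, \\
\alpha_t^{(n)} &= \mathrm{ReLU}\!\left(
  \frac{\langle h_t^{(\ell)},\, g_t^{(n)}\rangle}
       {\lVert h_t^{(\ell)}\rVert\,\lVert g_t^{(n)}\rVert}\right), \\
\tilde{h}_t^{(\ell)} &= h_t^{(\ell)} + \sum_{n\in\mathcal{N}} \alpha_t^{(n)}\, g_t^{(n)}.
\end{align}
```

Under this reading the gate lies in [0, 1] and is exactly zero whenever the n-gram vector points away from the hidden state, so no learned threshold is involved.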

What carries the argument

The Causal N-Gram Encoder, which builds n-gram vectors by direct averaging of pretrained token embeddings, and the Cosine-Gated Memory Injector, which applies a non-parametric cosine similarity gate with ReLU to blend the n-gram embeddings into the model's representations.
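
The two parts can be sketched in a few lines of dependency-free Python, with plain lists standing in for tensors. This is an illustration under stated assumptions, not the paper's Listing 1: the function names `causal_ngram_mean`, `cosine`, and `ngm_inject`, the default `ngram_sizes=(2, 3)`, and the choice to add the gated vectors as a residual are all hypothetical.

```python
import math

def causal_ngram_mean(embeddings, n):
    """Average each token embedding with up to n-1 preceding ones (causal window)."""
    out = []
    for t in range(len(embeddings)):
        window = embeddings[max(0, t - n + 1): t + 1]
        dim = len(window[0])
        out.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return out

def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)

def ngm_inject(hidden_states, token_embeddings, ngram_sizes=(2, 3)):
    """Return hidden states plus ReLU(cos)-gated n-gram vectors (assumed residual form)."""
    injected = [list(h) for h in hidden_states]
    for n in ngram_sizes:
        grams = causal_ngram_mean(token_embeddings, n)
        for t, (h, g) in enumerate(zip(hidden_states, grams)):
            gate = max(0.0, cosine(h, g))  # ReLU applied to the cosine score
            for d in range(len(h)):
                injected[t][d] += gate * g[d]
    return injected

# Toy inputs, invented for illustration only.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hid = [[1.0, 0.0], [-1.0, -1.0], [0.0, 2.0]]
out = ngm_inject(hid, emb, ngram_sizes=(2,))
```

In the toy call, the second position has a hidden state opposite to its bigram vector, so the gate closes and that hidden state passes through unchanged.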

If this is right

  • Performance gains appear on code generation benchmarks and knowledge-intensive question answering.
  • The method works across model scales from 0.6B to 14B and extends to vision-language models.
  • No additional parameters or training are needed, making it immediately applicable to existing pretrained models.
  • The design avoids both learned memory tables and separate retrieval pipelines.
  • It provides a more direct knowledge access route than mixture-of-experts routing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Since the n-gram representations come from the model's own embeddings, the approach may generalize to other sequence lengths or higher-order n-grams with minimal adjustment.
  • Future work could test whether the same averaging principle applies to other forms of structured memory such as phrases or facts extracted from text.
  • The cosine gate's simplicity might allow similar non-parametric modulation in other injection scenarios beyond n-grams.
  • Combining this with larger context windows could amplify the benefits on long-document tasks.

Load-bearing premise

That directly averaging pretrained token embeddings produces useful n-gram representations that the cosine-gated injector can meaningfully modulate without introducing noise or requiring any learned parameters or additional training.

What would settle it

Running the same benchmarks on the same models with the averaging step replaced by random vectors or with the cosine gate removed entirely, and checking whether the performance gains disappear.
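
The two controls could be set up roughly as follows. This is a sketch under assumptions: `random_control` and `inject` are hypothetical helpers, norm-matching the random vectors is one reasonable choice among several, and the residual form of the injection is assumed rather than taken from the paper.

```python
import math
import random

def random_control(ngram_vectors, seed=0):
    """Replace each n-gram vector with a Gaussian random vector of the same norm."""
    rng = random.Random(seed)
    controls = []
    for g in ngram_vectors:
        r = [rng.gauss(0.0, 1.0) for _ in g]
        g_norm = math.sqrt(sum(x * x for x in g))
        r_norm = math.sqrt(sum(x * x for x in r)) or 1.0
        controls.append([x * g_norm / r_norm for x in r])
    return controls

def inject(hidden, memory, use_gate=True, eps=1e-8):
    """Add memory vectors to hidden states, with or without the ReLU(cos) gate."""
    out = []
    for h, g in zip(hidden, memory):
        if use_gate:
            dot = sum(a * b for a, b in zip(h, g))
            norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in g))
            gate = max(0.0, dot / (norm + eps))
        else:
            gate = 1.0  # ablation: always inject at full strength
        out.append([a + gate * b for a, b in zip(h, g)])
    return out
```

Running the benchmark suite with `random_control(...)` in place of the averaged vectors, and again with `use_gate=False`, would give the two comparison points described above.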

Figures

Figures reproduced from arXiv: 2605.16893 by Caifeng Shan, Chenyang Si, Wenhui Dong, Yuwen Qu.

Figure 1
Figure 1: Alignment between hidden states and aggregated N-gram embeddings in the Qwen3-8B model. Motivated by this view, we propose NGM (N-gram Memory), a training-free, plug-and-play module that injects local N-gram signals into frozen decoder-only LLMs. The key idea is to treat the pretrained embedding space not only as an input interface, but also as a lightweight source of reusable local memory…
Figure 2
Figure 2: Overview of NGM. The Causal N-gram Encoder constructs multi-scale N-gram representations from the backbone's token embeddings; the Cosine-Gated Memory Injector scores them against decoder hidden states and injects the aggregated residual into selected layers.
Figure 3
Figure 3: Cross-position locality of NGM interactions in the default Qwen3-8B-NGM model. Heatmaps show the average cross-position cosine matrix cos(h_i, g_j) at the two default injection layers for representative code, math, and knowledge samples. The diagonal structure dominates, indicating that useful memory interactions are predominantly local.
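
The cross-position matrix in this caption is simple to compute. The sketch below builds C[i][j] = cos(h_i, g_j) and adds `diagonal_share`, a hypothetical summary statistic (not from the paper) for how much of the positive similarity sits at i == j.

```python
import math

def cosine_matrix(hidden, grams, eps=1e-8):
    """Return C with C[i][j] = cos(h_i, g_j)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) + eps
        return [x / norm for x in v]
    hu = [unit(h) for h in hidden]
    gu = [unit(g) for g in grams]
    return [[sum(a * b for a, b in zip(h, g)) for g in gu] for h in hu]

def diagonal_share(matrix):
    """Fraction of total positive similarity located on the diagonal (i == j)."""
    if not matrix or not matrix[0]:
        return 0.0
    pos = [[max(0.0, c) for c in row] for row in matrix]
    total = sum(sum(row) for row in pos)
    diag = sum(pos[i][i] for i in range(min(len(pos), len(pos[0]))))
    return diag / total if total > 0 else 0.0
```

A share close to 1 would correspond to the diagonal-dominant heatmaps the caption describes; a share near 1/T for a sequence of length T would indicate no locality at all.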
Figure 4
Figure 4: Prefill and per-token decode latency for Qwen3-8B vs. Qwen3-8B-NGM on a single RTX 5090 (mean ± std over 20 runs). The gap widens at 2048 tokens because the current implementation recomputes N-gram features over the full prefix. For 256–1024-token prompts, prefill overhead is 3.4–7.3% and decode overhead 1.9–2.3%; at 2048 tokens the figures rise to 16.0% and 9.9%, respectively.
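
The caption attributes the larger gap at 2048 tokens to recomputing N-gram features over the full prefix. Since a causal n-gram mean at position t depends only on the last n token embeddings, a decode-time cache could in principle keep that cost constant per step. The class below is a hypothetical illustration of this idea, not something the paper describes; `NgramWindowCache` and its `push` method are invented names.

```python
from collections import deque

class NgramWindowCache:
    """Keeps the last n token embeddings and their running sum (hypothetical helper)."""

    def __init__(self, n, dim):
        self.n = n
        self.window = deque()
        self.total = [0.0] * dim

    def push(self, embedding):
        """Add the newest token embedding and return the current causal n-gram mean."""
        self.window.append(embedding)
        self.total = [s + x for s, x in zip(self.total, embedding)]
        if len(self.window) > self.n:
            oldest = self.window.popleft()
            self.total = [s - x for s, x in zip(self.total, oldest)]
        return [s / len(self.window) for s in self.total]

# Toy embeddings, invented for illustration only.
cache = NgramWindowCache(n=2, dim=2)
means = [cache.push(e) for e in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0])]
```

Each `push` touches only the newest and the oldest embedding, so the work per generated token does not grow with prefix length. A running sum can accumulate floating-point drift over very long sequences, which a real implementation would need to bound, for example by periodically re-summing the window.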
Original abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NGM, a training-free plug-and-play memory module for LLMs consisting of a Causal N-Gram Encoder that constructs n-gram representations by averaging pretrained token embeddings from the backbone model and a Cosine-Gated Memory Injector that applies a non-parametric cosine similarity gate with ReLU to modulate injection into hidden states. It reports evaluation on Qwen3 models (0.6B to 14B) across eight benchmarks, claiming average gains of 0.5–1.2 points with larger improvements on code generation (+3.0 on LiveCodeBench) and knowledge tasks (+3.03 on GPQA for Qwen3-14B), plus multimodal gains (e.g., +1.53 on MMStar for Qwen3-VL-2B).

Significance. If the gains prove robust, the work would demonstrate a simple, zero-parameter approach to explicit n-gram memory that avoids training separate embeddings or retrieval pipelines, offering efficiency advantages over MoE-style methods. Strengths include the fully non-parametric design, evaluation across model scales, and inclusion of multimodal results. The empirical focus with external benchmarks and absence of fitted parameters or self-referential definitions are positive.

major comments (2)
  1. [§3.1] §3.1 (Causal N-Gram Encoder): the central assumption that directly averaging pretrained token embeddings yields semantically coherent n-gram vectors compatible with the fixed cosine-ReLU injector is load-bearing for attributing any gains to the module, yet no ablation or analysis addresses whether this averaging discards critical positional/higher-order interactions or introduces dilution/collision noise that the non-parametric gate cannot filter.
  2. [Results] Results (benchmark tables): reported deltas of 0.5–1.2 average points (and task-specific +3.0/+3.03) are presented without error bars, statistical significance tests, or controls for selective task emphasis; for gains this small, absence of these details prevents determining whether improvements exceed variance or multiple-comparison effects.
minor comments (2)
  1. [Abstract] The abstract states evaluation on eight benchmarks but does not enumerate them; listing the full set (including any held-out controls) would aid reproducibility.
  2. [§3.2] Notation for the cosine gate threshold and ReLU modulation in the injector would benefit from an explicit equation to clarify the non-parametric computation.
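
The dilution and collision concern in major comment 1 can be made concrete with a toy case. The embeddings below are invented purely for illustration; real embedding tables are high-dimensional, and how often such collisions matter in practice is the empirical question the referee raises.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy 2-d embeddings, invented for illustration only.
toy = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

# Order invariance: both trigrams average to the same vector.
a = mean_vector([toy[w] for w in ("dog", "bites", "man")])
b = mean_vector([toy[w] for w in ("man", "bites", "dog")])

# Collision: a bigram can land exactly on an unrelated unigram.
c = mean_vector([toy["dog"], toy["bites"]])
```

Here `a` and `b` are identical even though the word order differs, and `c` coincides with the embedding of "man". A plain mean cannot distinguish these cases, which is why an ablation against a position-aware variant would be informative.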

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

Point-by-point responses
  1. Referee: [§3.1] §3.1 (Causal N-Gram Encoder): the central assumption that directly averaging pretrained token embeddings yields semantically coherent n-gram vectors compatible with the fixed cosine-ReLU injector is load-bearing for attributing any gains to the module, yet no ablation or analysis addresses whether this averaging discards critical positional/higher-order interactions or introduces dilution/collision noise that the non-parametric gate cannot filter.

    Authors: We agree that the averaging step in the Causal N-Gram Encoder is a foundational assumption. Pretrained embeddings from the backbone model already encode substantial semantic and syntactic information, and averaging provides a simple, training-free way to form n-gram representations that align with the non-parametric cosine gate. However, we acknowledge that this may overlook higher-order interactions or introduce noise. In the revised manuscript we will add a targeted ablation (new subsection in §3 and corresponding appendix table) that compares plain averaging against (i) position-augmented averaging and (ii) a lightweight learned linear projection over the same n-gram tokens. This will quantify any dilution effects and directly support the design choice. revision: yes

  2. Referee: [Results] Results (benchmark tables): reported deltas of 0.5–1.2 average points (and task-specific +3.0/+3.03) are presented without error bars, statistical significance tests, or controls for selective task emphasis; for gains this small, absence of these details prevents determining whether improvements exceed variance or multiple-comparison effects.

    Authors: The referee is correct that modest average gains require statistical support to be convincing. We will revise the results section and tables to include (i) standard deviations or error bars from repeated evaluations where computationally feasible, (ii) paired significance tests (e.g., Wilcoxon signed-rank) between baseline and NGM runs, and (iii) an explicit statement that the reported average is the uniform mean across all eight benchmarks with no post-hoc selection. These additions will appear in the updated experimental tables and a new paragraph in §4. revision: yes
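
A paired test of the kind mentioned in this response could be run as sketched below. Assumptions: pairs are per-benchmark scores for baseline and NGM (the response does not say whether pairing is over benchmarks or repeated runs), the function name is hypothetical, and the scores in the example are placeholders, not results from the paper. The p-value is computed exactly by enumerating sign patterns, which is feasible for small samples.

```python
from itertools import product

def wilcoxon_signed_rank_exact(baseline, treated):
    """Exact two-sided Wilcoxon signed-rank p-value for small paired samples."""
    diffs = [t - b for b, t in zip(baseline, treated) if t != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Average ranks of |d| so that tied magnitudes share a rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Enumerate all 2^n sign patterns (practical for n up to about 20).
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n

# Placeholder scores for eight benchmarks, NOT taken from the paper.
base = [60, 55, 70, 40, 65, 50, 45, 80]
ngm = [61, 57, 73, 44, 70, 56, 52, 88]
p = wilcoxon_signed_rank_exact(base, ngm)
```

With eight pairs the smallest attainable two-sided p-value is 2/256, reached only when every benchmark moves in the same direction, as in the placeholder data. With so few pairs, a single regression among the eight already pushes the p-value up noticeably.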

Circularity Check

0 steps flagged

No circularity: NGM is an empirical training-free proposal with gains measured on external benchmarks

full rationale

The paper defines NGM via direct averaging of pretrained token embeddings in the Causal N-Gram Encoder and a fixed non-parametric cosine-ReLU gate in the injector. These are architectural choices, not derived quantities. Performance deltas (0.5–1.2 points on average, +3.0 on LiveCodeBench) come from direct evaluation of Qwen3-series models on external benchmarks such as LiveCodeBench, GPQA, and MMStar. No equations, fitted parameters, or self-citations reduce the claimed improvements to the inputs by construction. The design is self-contained against external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that pretrained token embeddings already encode sufficient n-gram semantics when averaged, with no free parameters introduced in the description.

axioms (1)
  • domain assumption: Pretrained token embeddings from the backbone LLM contain useful compositional information for n-grams when simply averaged.
    This is the core premise enabling the training-free Causal N-Gram Encoder.

Appendix A: NGM implementation

Listing 1 gives a simplified PyTorch implementation of NGM.

def ngm_forward(hidden_states, input_ids, embed_matrix, ngram_sizes,...