NGM: A Plug-and-Play Training-Free Memory Module for LLMs

arXiv: 2605.16893 · v1 · pith:FQXNA77O · submitted 2026-05-16 · 💻 cs.AI

Yuwen Qu, Wenhui Dong, Chenyang Si, Caifeng Shan

Pith reviewed 2026-05-19 20:44 UTC · model grok-4.3

classification 💻 cs.AI

keywords n-gram memory · training-free module · plug-and-play LLM · knowledge injection · cosine-gated injector · memory augmentation · efficient retrieval

The pith

Averaging pretrained token embeddings creates n-gram representations that a cosine-gated injector adds to LLMs without training or extra parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces N-gram Memory (NGM) as a way to give large language models access to additional knowledge by building n-gram representations directly from the model's own pretrained embeddings. Instead of training new memory vectors, it averages existing token embeddings to form these representations and then uses a simple cosine similarity gate to decide when and how much to inject them into the model's internal states. This plug-and-play approach requires no fine-tuning and no separate retrieval system. A sympathetic reader would care because it suggests that useful memory augmentation can be achieved with zero additional training cost and minimal architectural change, potentially making knowledge-enhanced models more accessible. Evaluations show consistent gains across model sizes on both text and multimodal tasks.

Core claim

NGM consists of two parts. A Causal N-Gram Encoder constructs n-gram representations by averaging the backbone model's pretrained token embeddings, and a Cosine-Gated Memory Injector modulates these representations into contextual hidden states using a non-parametric cosine gate combined with ReLU. When integrated into Qwen3 models from 0.6B to 14B parameters, the module raises average benchmark scores by 0.5 to 1.2 points. The lifts are larger on code generation and knowledge-intensive tasks (+3.0 on LiveCodeBench and +3.03 on GPQA for the 14B model), and multimodal performance also improves.

What carries the argument

The Causal N-Gram Encoder, which builds n-gram vectors by direct averaging of pretrained token embeddings, and the Cosine-Gated Memory Injector, which applies a non-parametric cosine similarity gate with ReLU to blend the n-gram embeddings into the model's representations.
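The two components named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' Listing 1: the function names, the handling of positions with fewer than n preceding tokens (averaging over whatever is available), and the aggregation across n-gram sizes (a plain mean) are hypothetical choices made here for readability.

```python
import numpy as np


def causal_ngram_mean(token_emb: np.ndarray, n: int) -> np.ndarray:
    """Average the current token embedding with up to n-1 preceding ones.

    token_emb has shape (T, d). Early positions average over the tokens
    that exist so far (an assumption; the paper may pad differently).
    """
    T, d = token_emb.shape
    # Prefix sums with a leading zero row, so window sums are differences.
    csum = np.vstack([np.zeros((1, d)), np.cumsum(token_emb, axis=0)])
    ends = np.arange(1, T + 1)
    starts = np.maximum(ends - n, 0)
    counts = (ends - starts)[:, None]
    return (csum[ends] - csum[starts]) / counts


def cosine_relu_gate(hidden: np.ndarray, memory: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return ReLU(cos(h_t, g_t)) per position, shape (T, 1). No learned weights."""
    dot = np.sum(hidden * memory, axis=-1, keepdims=True)
    norms = (np.linalg.norm(hidden, axis=-1, keepdims=True)
             * np.linalg.norm(memory, axis=-1, keepdims=True))
    return np.maximum(dot / (norms + eps), 0.0)


def ngm_inject(hidden: np.ndarray, token_emb: np.ndarray, ngram_sizes=(2, 3)) -> np.ndarray:
    """Add gated n-gram residuals to hidden states.

    Averaging the residual over n-gram sizes is a hypothetical aggregation.
    """
    residual = np.zeros_like(hidden, dtype=float)
    for n in ngram_sizes:
        g = causal_ngram_mean(token_emb, n)
        residual += cosine_relu_gate(hidden, g) * g
    return hidden + residual / len(ngram_sizes)
```

When a hidden state points away from its n-gram vector, the cosine is negative, the ReLU zeroes the gate, and the hidden state passes through unchanged. That is the sense in which the gate can suppress unhelpful injections without any trained parameters.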

If this is right

Performance gains appear on code generation benchmarks and knowledge-intensive question answering.
The method works across model scales from 0.6B to 14B and extends to vision-language models.
No additional parameters or training are needed, making it immediately applicable to existing pretrained models.
The design avoids both learned memory tables and separate retrieval pipelines.
It provides a more direct knowledge access route than mixture-of-experts routing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Since the n-gram representations come from the model's own embeddings, the approach may generalize to other sequence lengths or higher-order n-grams with minimal adjustment.
Future work could test whether the same averaging principle applies to other forms of structured memory such as phrases or facts extracted from text.
The cosine gate's simplicity might allow similar non-parametric modulation in other injection scenarios beyond n-grams.
Combining this with larger context windows could amplify the benefits on long-document tasks.

Load-bearing premise

That directly averaging pretrained token embeddings produces useful n-gram representations that the cosine-gated injector can meaningfully modulate without introducing noise or requiring any learned parameters or additional training.

What would settle it

Running the same benchmarks on the same models with the averaging step replaced by random vectors or with the cosine gate removed entirely, and checking whether the performance gains disappear.
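One way to set up the two controls described above is sketched below. The helper names (`norm_matched_random`, `inject`) and the decision to match each random vector's norm to the averaged vector it replaces are assumptions for illustration. The paper as summarized here does not specify such a control.

```python
import numpy as np


def norm_matched_random(memory: np.ndarray, seed: int = 0) -> np.ndarray:
    """Random directions with the same per-position norm as `memory`.

    Matching norms keeps the injected magnitude comparable, so any change
    in benchmark score reflects direction rather than scale.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(memory.shape)
    noise /= np.linalg.norm(noise, axis=-1, keepdims=True)
    return noise * np.linalg.norm(memory, axis=-1, keepdims=True)


def inject(hidden: np.ndarray, memory: np.ndarray, use_gate: bool = True,
           eps: float = 1e-8) -> np.ndarray:
    """Residual injection with the cosine-ReLU gate, or ungated when use_gate=False."""
    if not use_gate:
        return hidden + memory
    cos = np.sum(hidden * memory, axis=-1, keepdims=True) / (
        np.linalg.norm(hidden, axis=-1, keepdims=True)
        * np.linalg.norm(memory, axis=-1, keepdims=True) + eps
    )
    return hidden + np.maximum(cos, 0.0) * memory
```

Evaluating three variants on the same benchmarks (averaged vectors with gate, random vectors with gate, averaged vectors without gate) would separate the contribution of the averaging step from that of the gate.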

Figures

Figures reproduced from arXiv: 2605.16893 by Caifeng Shan, Chenyang Si, Wenhui Dong, Yuwen Qu.

**Figure 1.** Alignment between hidden states and aggregated N-gram embeddings in the Qwen3-8B model. Motivated by this view, we propose NGM (N-gram Memory), a training-free, plug-and-play module that injects local N-gram signals into frozen decoder-only LLMs. The key idea is to treat the pretrained embedding space not only as an input interface, but also as a lightweight source of reusable local memory: if nearby toke…

**Figure 2.** Overview of NGM. The Causal N-gram Encoder constructs multi-scale N-gram representations from the backbone's token embeddings; the Cosine-Gated Memory Injector scores them against decoder hidden states and injects the aggregated residual into selected layers.

**Figure 3.** Cross-position locality of NGM interactions in the default Qwen3-8B-NGM model. Heatmaps show the average cross-position cosine matrix cos(h_i, g_j) at the two default injection layers for representative code, math, and knowledge samples. The diagonal structure dominates, indicating that useful memory interactions are predominantly local. Implication for the default gate. Together, these results support the…

**Figure 4.** Prefill and per-token decode latency for Qwen3-8B vs. Qwen3-8B-NGM on a single RTX 5090 (mean ± std over 20 runs). The gap widens at 2048 tokens because the current implementation recomputes N-gram features over the full prefix. For 256–1024-token prompts, prefill overhead is 3.4–7.3% and decode overhead 1.9–2.3%; at 2048 tokens the figures rise to 16.0% and 9.9%, respectively. These numbers reflect the re…
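The Figure 4 caption attributes the growing overhead to recomputing N-gram features over the full prefix. One possible way to avoid that during decoding is a rolling window that updates the mean in constant time per token. The sketch below is an editorial assumption, not something the paper describes, and the class name `RollingNgramMean` is hypothetical.

```python
from collections import deque

import numpy as np


class RollingNgramMean:
    """Keep the mean of the last n token embeddings, updated once per decoded token."""

    def __init__(self, n: int, dim: int):
        self.n = n
        self.window = deque()        # at most n most recent embeddings
        self.total = np.zeros(dim)   # running sum of the window contents

    def push(self, emb: np.ndarray) -> np.ndarray:
        """Add one token embedding and return the current causal n-gram mean."""
        self.window.append(emb)
        self.total = self.total + emb
        if len(self.window) > self.n:
            # Drop the oldest embedding so the sum covers only the last n tokens.
            self.total = self.total - self.window.popleft()
        return self.total / len(self.window)
```

Each decode step then costs one addition and at most one subtraction per n-gram size, independent of prefix length. Running sums accumulate floating-point error over very long sequences, so a real implementation might periodically recompute the sum from the window.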

Original abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

NGM offers a clean training-free way to build n-gram memory from a model's own embeddings and gate it in with cosine similarity, but the modest gains rest on an untested assumption that simple averaging produces useful representations.

The paper's main contribution is a plug-and-play module that skips training altogether. It builds n-gram vectors by averaging the backbone LLM's pretrained token embeddings in a causal window, then uses a fixed cosine-ReLU gate to decide when to inject them into the hidden states. No new parameters or retrieval tables are needed, which keeps the overhead low and the design portable across model sizes from 0.6B to 14B Qwen3 variants. That training-free constraint is the clearest practical advantage over learned memory modules or MoE routing.

Referee Report

2 major / 2 minor

Summary. The paper proposes NGM, a training-free plug-and-play memory module for LLMs consisting of a Causal N-Gram Encoder that constructs n-gram representations by averaging pretrained token embeddings from the backbone model and a Cosine-Gated Memory Injector that applies a non-parametric cosine similarity gate with ReLU to modulate injection into hidden states. It reports evaluation on Qwen3 models (0.6B to 14B) across eight benchmarks, claiming average gains of 0.5–1.2 points with larger improvements on code generation (+3.0 on LiveCodeBench) and knowledge tasks (+3.03 on GPQA for Qwen3-14B), plus multimodal gains (e.g., +1.53 on MMStar for Qwen3-VL-2B).

Significance. If the gains prove robust, the work would demonstrate a simple, zero-parameter approach to explicit n-gram memory that avoids training separate embeddings or retrieval pipelines, offering efficiency advantages over MoE-style methods. Strengths include the fully non-parametric design, evaluation across model scales, and inclusion of multimodal results. The empirical focus with external benchmarks and absence of fitted parameters or self-referential definitions are positive.

major comments (2)

[§3.1] (Causal N-Gram Encoder): the central assumption that directly averaging pretrained token embeddings yields semantically coherent n-gram vectors compatible with the fixed cosine-ReLU injector is load-bearing for attributing any gains to the module, yet no ablation or analysis addresses whether this averaging discards critical positional/higher-order interactions or introduces dilution/collision noise that the non-parametric gate cannot filter.
[Results] (benchmark tables): reported deltas of 0.5–1.2 average points (and task-specific +3.0/+3.03) are presented without error bars, statistical significance tests, or controls for selective task emphasis; for gains this small, absence of these details prevents determining whether improvements exceed variance or multiple-comparison effects.

minor comments (2)

[Abstract] The abstract states evaluation on eight benchmarks but does not enumerate them; listing the full set (including any held-out controls) would aid reproducibility.
[§3.2] Notation for the cosine gate threshold and ReLU modulation in the injector would benefit from an explicit equation to clarify the non-parametric computation.
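An explicit form of the kind the referee asks for might look as follows. The first equality in the score matches the passage quoted from the paper elsewhere on this page. The definition of the n-gram vector follows the paper's description of averaging token embeddings. The residual update and the symbols $e_{t-i}$ (token embedding), $\mathcal{N}$ (set of n-gram sizes), and $\tilde{h}_t^{l}$ (updated hidden state) are assumptions introduced here, since this page does not reproduce the paper's own equations.

```latex
g_{t,n} = \frac{1}{n} \sum_{i=0}^{n-1} e_{t-i},
\qquad
s_{t,n} = \cos\!\left(h_t^{l},\, g_{t,n}\right)
        = \frac{\langle h_t^{l},\, g_{t,n} \rangle}{\lVert h_t^{l} \rVert \, \lVert g_{t,n} \rVert},
\qquad
\tilde{h}_t^{l} = h_t^{l} + \sum_{n \in \mathcal{N}} \mathrm{ReLU}\!\left(s_{t,n}\right) g_{t,n}
```

Under this reading there is no threshold to tune: the ReLU clips negative cosines to zero and leaves positive ones as the injection weight.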

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

Referee: [§3.1] (Causal N-Gram Encoder): the central assumption that directly averaging pretrained token embeddings yields semantically coherent n-gram vectors compatible with the fixed cosine-ReLU injector is load-bearing for attributing any gains to the module, yet no ablation or analysis addresses whether this averaging discards critical positional/higher-order interactions or introduces dilution/collision noise that the non-parametric gate cannot filter.

Authors: We agree that the averaging step in the Causal N-Gram Encoder is a foundational assumption. Pretrained embeddings from the backbone model already encode substantial semantic and syntactic information, and averaging provides a simple, training-free way to form n-gram representations that align with the non-parametric cosine gate. However, we acknowledge that this may overlook higher-order interactions or introduce noise. In the revised manuscript we will add a targeted ablation (new subsection in §3 and corresponding appendix table) that compares plain averaging against (i) position-augmented averaging and (ii) a lightweight learned linear projection over the same n-gram tokens. This will quantify any dilution effects and directly support the design choice. revision: yes
Referee: [Results] (benchmark tables): reported deltas of 0.5–1.2 average points (and task-specific +3.0/+3.03) are presented without error bars, statistical significance tests, or controls for selective task emphasis; for gains this small, absence of these details prevents determining whether improvements exceed variance or multiple-comparison effects.

Authors: The referee is correct that modest average gains require statistical support to be convincing. We will revise the results section and tables to include (i) standard deviations or error bars from repeated evaluations where computationally feasible, (ii) paired significance tests (e.g., Wilcoxon signed-rank) between baseline and NGM runs, and (iii) an explicit statement that the reported average is the uniform mean across all eight benchmarks with no post-hoc selection. These additions will appear in the updated experimental tables and a new paragraph in §4. revision: yes
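The paired test mentioned in the rebuttal can be sketched with the standard library alone. The function below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign assignments, which is feasible for eight benchmarks (256 cases). The function names are hypothetical, and the tie and zero handling (average ranks, zero differences dropped) is one common convention that the authors may or may not adopt.

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(baseline, treated):
    """Return (W_plus, two-sided exact p-value) for paired scores.

    Zero differences are dropped. Enumeration costs 2**n, so this is only
    intended for a small number of pairs.
    """
    diffs = [b - a for a, b in zip(baseline, treated) if b != a]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        # Count sign patterns at least as extreme as the observed statistic.
        if abs(w - total / 2) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)
```

With eight pairs the smallest attainable two-sided p-value is 2/256, reached only when NGM wins or loses on every benchmark. A test across benchmarks is therefore coarse, and per-benchmark error bars from repeated runs remain the more informative addition.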

Circularity Check

0 steps flagged

No circularity: NGM is an empirical training-free proposal with gains measured on external benchmarks

The paper defines NGM via direct averaging of pretrained token embeddings in the Causal N-Gram Encoder and a fixed non-parametric cosine-ReLU gate in the injector. These are architectural choices, not derived quantities. Performance deltas (0.5–1.2 average, +3.0 on LiveCodeBench) are reported from direct evaluation of the Qwen3 series on external benchmarks (e.g., LiveCodeBench, GPQA, MMStar). No equations, fitted parameters, or self-citations reduce the claimed improvements to the inputs by construction. The design is self-contained against external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that pretrained token embeddings already encode sufficient n-gram semantics when averaged, with no free parameters introduced in the description.

axioms (1)

domain assumption: Pretrained token embeddings from the backbone LLM contain useful compositional information for n-grams when simply averaged.
This is the core premise enabling the training-free Causal N-Gram Encoder.

pith-pipeline@v0.9.0 · 5796 in / 1174 out tokens · 29585 ms · 2026-05-19T20:44:03.921359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations... The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear

Paper passage: "$s_{t,n} = \cos(h_t^{l}, g_{t,n}) = \langle h_t^{l}, g_{t,n} \rangle / (\lVert h_t^{l} \rVert \, \lVert g_{t,n} \rVert)$; optionally ReLU to suppress negatively aligned updates"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 13 internal anchors

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[3] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR, 2022.
[4] Thorsten Brants, Ashok Popat, Peng Xu, Franz Josef Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867, 2007.
[5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024.
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[7] Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.
[8] Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026.
[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[10] Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke Van Der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. Survey: multiword expression processing: a survey. Computational Linguistics, 43(4):837–892, 2017.
[11] Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. Jump to conclusions: Short-cutting transformers with linear transformations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9615–9625, 2024.
[12] Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, and Yehui Tang. Meki: Memory-based expert knowledge injection for efficient llm scaling. arXiv preprint arXiv:2602.03359, 2026. URL https://arxiv.org/abs/2602.03359
[13] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.
[14] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
[15] Britt Erman. The idiom principle and the open choice principle. Text-Interdisciplinary Journal for the Study of Discourse, 2000.
[16] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
[17] Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techno...
[18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[19] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR, 2020.
[20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[21] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[22] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
[23] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. ArXiv, abs/23...
[24] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
[25] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184. IEEE, 1995.
[26] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022.
[27] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[28] Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, and Xunliang Cai. Scaling embeddings outperforms scaling experts in language models. ArXiv, abs/2601.21204, 2026. URL https://api.semanticscholar.org/CorpusID:285140484
[29] Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens. arXiv preprint arXiv:2401.17377, 2024.
[30] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
[31] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.
[32] Graham Neubig and Chris Dyer. Generalizing and hybridizing count-based and neural language models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1163–1172, 2016.
[33] Timothy Nguyen. Understanding transformers via n-gram statistics. Advances in Neural Information Processing Systems, 37:98049–98082, 2024.
[34] nostalgebraist. interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020.
[35] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
[36] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[37] ModelScope Team. EvalScope: Evaluation framework for large models, 2024. URL https://github.com/modelscope/evalscope
[38] Albert Tseng and Christopher De Sa. L3: Large lookup layers. arXiv preprint arXiv:2601.21461, 2026.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[40] Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.
[41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[42] Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, and Chiyuan Zhang. Scaling embedding layers in language models. ArXiv, abs/2502.01637. URL https://api.semanticscholar.org/CorpusID:276106917
[44] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

Appendix A. NGM implementation

Listing 1 gives a simplified PyTorch implementation of NGM. def ngm_forward(hidden_states, input_ids, embed_matrix, ngram_sizes,...

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Y uxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Enriching word vec- tors with subword information

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vec- tors with subword information. Transactions of the association for computational linguistics , 5:135–146, 2017

work page 2017

[3] [3]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm V an Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

work page 2022

[4] [4]

Large language models in machine translation

Thorsten Brants, Ashok Popat, Peng Xu, Franz Josef Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867, 2007

work page 2007

[5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024.

[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[7] Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999.

[8] Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026.

[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[10] Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke Van Der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. Survey: multiword expression processing: a survey. Computational Linguistics, 43(4):837–892, 2017.

[11] Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. Jump to conclusions: Short-cutting transformers with linear transformations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9615–9625, 2024.

[12] Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, and Yehui Tang. Meki: Memory-based expert knowledge injection for efficient llm scaling. arXiv preprint arXiv:2602.03359, 2026. URL https://arxiv.org/abs/2602.03359

[13] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.

[14] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.

[15] Britt Erman. The idiom principle and the open choice principle. Text-Interdisciplinary Journal for the Study of Discourse, 2000.

[16] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

[17] Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techno...

[18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[19] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.

[20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[21] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.

[22] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.

[23] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. ArXiv, abs/23...

[24] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.

[25] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In 1995 international conference on acoustics, speech, and signal processing, volume 1, pages 181–184. IEEE, 1995.

[26] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022.

[27] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[28] Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, and Xunliang Cai. Scaling embeddings outperforms scaling experts in language models. ArXiv, abs/2601.21204, 2026. URL https://api.semanticscholar.org/CorpusID:285140484

[29] Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens. arXiv preprint arXiv:2401.17377, 2024.

[30] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024.

[31] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.

[32] Graham Neubig and Chris Dyer. Generalizing and hybridizing count-based and neural language models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1163–1172, 2016.

[33] Timothy Nguyen. Understanding transformers via n-gram statistics. Advances in neural information processing systems, 37:98049–98082, 2024.

[34] nostalgebraist. interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020.

[35] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First conference on language modeling, 2024.

[36] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[37] ModelScope Team. EvalScope: Evaluation framework for large models, 2024. URL https://github.com/modelscope/evalscope

[38] Albert Tseng and Christopher De Sa. L^3: Large lookup layers. arXiv preprint arXiv:2601.21461, 2026.

[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[40] Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.

[41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[42] Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, and Chiyuan Zhang. Scaling embedding layers in language models. ArXiv, abs/2502.01637, 2025. URL https://api.semanticscholar.org/CorpusID:276106917

[43] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

A NGM implementation

Listing 1 gives a simplified PyTorch implementation of NGM.

def ngm_forward(hidden_states, input_ids, embed_matrix, ngram_sizes,...