Lngram: N-gram Conditional Memory in Latent Space

Guoyang Xia; Lei Ren; Xiaojie Wang; Yunao Zheng

arxiv: 2605.24869 · v1 · pith:6XPJRQIEnew · submitted 2026-05-24 · 💻 cs.CL

Lngram: N-gram Conditional Memory in Latent Space

Yunao Zheng , Guoyang Xia , Xiaojie Wang , Lei Ren This is my paper

Pith reviewed 2026-06-30 12:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords LngramN-gram memorylatent spaceconditional memorylanguage modelingperplexitymultimodal taskssequence modeling

0 comments

The pith

Lngram learns discrete symbols from hidden states to enable N-gram lookups in latent space, decoupling retrieval from tokenization and extending to multimodal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lngram to handle both compositional reasoning and local knowledge retrieval in sequence models by moving N-gram memory into the latent space. Instead of relying on text token IDs for keys, the method extracts discrete symbols directly from hidden states and performs conditional lookups over those symbols. This change is meant to avoid tokenization dependence and information loss while allowing the module to work on non-text inputs. In the reported experiments, Lngram lowers perplexity relative to standard Transformers and the earlier Engram approach, adds domain knowledge to already-trained models, and yields better results when trained jointly than full fine-tuning does. The same module also improves performance on vision-language and vision-language-action tasks, with internal analyses indicating that prediction-relevant signals appear at earlier layers.

Core claim

Lngram is a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains.

What carries the argument

Lngram module: a latent-space conditional memory that extracts discrete symbols from hidden states and performs N-gram lookup over those symbols.

If this is right

Consistently reduces perplexity in long-context language modeling.
Effectively injects domain knowledge when added post hoc to pretrained models.
Joint training with the backbone surpasses full fine-tuning.
Shows overall gains on vision-language and vision-language-action tasks.
Enables prediction-relevant information to emerge earlier in the network layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token-free design could be applied to non-text sequences such as audio waveforms or biological strings to test cross-modal generality.
Earlier emergence of useful signals might allow shorter context windows or shallower stacks to reach the same accuracy.
The learned discrete symbols could be inspected to see whether they correspond to interpretable concepts or patterns inside the model.
Similar latent-symbol memory could be attached to non-Transformer backbones to check whether the benefit is architecture-specific.

Load-bearing premise

That discrete symbols learned directly from hidden states can support effective N-gram conditional memory without the information loss or dependence issues of token-based keys.

What would settle it

If inserting Lngram into a pretrained model produces no reduction in perplexity on long-context language modeling benchmarks or no gains on vision-language tasks relative to the Engram baseline, the performance advantage would be falsified.

Figures

Figures reproduced from arXiv: 2605.24869 by Guoyang Xia, Lei Ren, Xiaojie Wang, Yunao Zheng.

**Figure 1.** Figure 1: The main architecture of Lngram. 2.4. Context-Aware Readout The retrieved e (n) t or e (n) t,s is static memory independent of the current context. To mitigate the impact of mismatched retrievals on the backbone network, Lngram follows Engram in introducing a context-aware filtering mechanism at the readout stage. We first consider the single-table version. For each order n, the retrieval result is project… view at source ↗

**Figure 2.** Figure 2: Illustration of prefix perplexity at different context positions in long-context language modeling. in long-context PPL indicates that the conditional memory branch does not impair the model’s ability to exploit long-range context, while providing additional local-pattern priors for long-context language modeling. 3.3. Lngram-Tuning: Domain Knowledge Injection We further test the domain adaptation capabi… view at source ↗

**Figure 3.** Figure 3: LogitLens KL divergence as a function of layer depth. Except for the shallowest layer, where Lngram has not yet been introduced, Lngram exhibits lower KL divergence on most layers. 4.1.2. CKA: REPRESENTATIONS ALIGN WITH DEEPER LAYERS EARLIER We further use linear CKA to compare the layer representations of the MoE baseline and MoE+Lngram. Given two sets of representations X and Y , their linear-kernel Gra… view at source ↗

**Figure 4.** Figure 4: CKA similarity between the layer representations of the MoE baseline and MoE+Lngram. The dashed line indicates the soft alignment. The high-similarity region is shifted overall above the diagonal, indicating that Lngram layers are closer to deeper representations in the baseline model. alignment curve also lies above the same-layer diagonal for most layers. This phenomenon is most pronounced in the middle … view at source ↗

**Figure 5.** Figure 5: Gating responses of the layer-1 3-gram branch in Lngram. Peaks occur where fixed phrases such as “Great” and “Wales” are completed. 4.2.1. N -GRAM ORDER COMBINATIONS We first examine the effect of different N-gram order combinations on performance. The results are shown in [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

read the original abstract

Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lngram's latent discretization for tokenizer-free N-gram memory is a modest engineering step past Engram, but the abstract supplies zero experimental details to support any of the performance claims.

read the letter

The paper's main move is to learn discrete symbols straight from hidden states and run N-gram lookup on those symbols instead of token IDs. This is presented as fixing a limitation in the cited Engram work and allowing the same module to apply to non-text data.

What the paper actually does is describe this module, state that it beats Transformer and Engram baselines on perplexity for long contexts, and claim it can inject domain knowledge post-hoc or through joint training. They also mention gains on vision-language and vision-language-action tasks plus some LogitLens and CKA analysis suggesting earlier emergence of useful signals. The code link is public.

The new element is the latent-space discretization step that makes the keys independent of the tokenizer. That part is clear from the abstract and does not appear to be a direct restatement of prior equations.

The soft spot is obvious and central: the abstract gives no datasets, no run counts, no error bars, no ablation results, and no actual numbers. All the "outperforms" and "surpasses full fine-tuning" statements sit there without support. The reader's note confirms the assessment was abstract-only, which matches what we see. The assumption that the learned symbols preserve enough information for useful N-gram lookup is reasonable on paper but untested in the text provided.

This is for people already working on memory modules or long-context efficiency tweaks inside transformers. A reader who wants to try a new primitive and has time to inspect the full experiments and code might get something out of it.

Send it to peer review so the experimental section can be checked; the idea is narrow enough that referees can evaluate it quickly if the numbers are there.

Referee Report

0 major / 2 minor

Summary. The paper proposes Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over them, decoupling retrieval from token IDs and extending to non-text modalities. It claims that in evaluated settings Lngram outperforms Transformer and Engram baselines, reduces perplexity in long-context language modeling, injects domain knowledge post hoc into pretrained models, surpasses full fine-tuning when jointly trained, yields gains on vision-language and vision-language-action tasks, and enables earlier emergence of prediction-relevant information per LogitLens and CKA analyses, all with limited inference and memory overhead.

Significance. If the empirical results hold with proper controls, the work offers a practical route to add static N-gram-style knowledge retrieval to dense models without tokenizer dependence or full retraining, potentially increasing effective depth at modest cost and broadening applicability beyond text.

minor comments (2)

The abstract states performance gains and architectural advantages but supplies no dataset names, model sizes, training details, error bars, or ablation controls; these must be added to the experimental section for the claims to be verifiable.
The description of the discretization step and N-gram lookup mechanism would benefit from an explicit equation or pseudocode block showing how hidden-state symbols are obtained and indexed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the paper and for noting the potential significance of adding static N-gram-style knowledge retrieval to dense models without tokenizer dependence. We appreciate the 'uncertain' recommendation and welcome the opportunity to clarify any aspects of the empirical results.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents Lngram as an empirical architecture proposal: a latent-space discretization module for N-gram lookup that is evaluated on language modeling, vision-language, and related tasks. No equations, derivations, or first-principles claims appear in the provided abstract or description. Performance assertions are framed as experimental outcomes rather than reductions from fitted parameters or self-citations. The work is self-contained against external benchmarks (Transformer and Engram baselines) with no load-bearing self-referential steps visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training details, or architectural diagrams; therefore no free parameters, axioms, or invented entities can be identified with certainty.

pith-pipeline@v0.9.1-grok · 5708 in / 1030 out tokens · 23910 ms · 2026-06-30T12:24:51.784804+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 14 canonical work pages · 12 internal anchors

[1]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., Van Der Wal, O., et al. Lessons from the trenches on reproducible evaluation of language models.arXiv preprint arXiv:2405.14782,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Accessed: 2026-04-

GitHub repository directory. Accessed: 2026-04-

2026
[3]

C., Xu, P., Och, F

Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. InProceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natu- ral Language Learning, pp. 858–867,

2007
[4]

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y ., Zhang, H., Zhang, H., Zhao, D., and Liang, W. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y ., Xie, Z., Li, Y . K., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Deng, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Luo, F., Hao, G., Chen, G., Li, G., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

BERT: Pre-training of deep bidirectional transformers for lan- guage understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pp. 4171–4186,

2019
[8]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y ., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023a. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023b. nostalgebraist. Interpreting GPT: The logit lens....

work page internal anchor Pith review Pith/arXiv arXiv
[9]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Accessed: 2026-04-21. Penedo, G., Kydl ´ıˇcek, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale.arXiv preprint arXiv:2406.17557,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y ., Ghosh, D., Groom, L., Haus- man, K., Ichter, B., et al. π0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Qwen2.5 Technical Report

Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community. StarVLA: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine- tuned chat models.arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., H ´enaff, O., Harmsen, J., Steiner, A., and Zhai, X. SigLIP 2: Multilingual vision- language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

A study of situational reasoning for traffic understanding.arXiv preprint arXiv:2306.02520,

Zhang, J., Ilievski, F., Ma, K., Kollaa, A., Francis, J., and Oltramari, A. A study of situational reasoning for traffic understanding.arXiv preprint arXiv:2306.02520,

work page arXiv
[17]

With windowed attention alone, the model’s average score drops from 59.21 under global attention to 29.41, indicating that windowed attention cannot cover the long-range information required by 17 Lngram: N-gram Conditional Memory in Latent Space Table 12.Results of the binary-state retrieval model on general language understanding tasks. Model HellaSwag ...

work page arXiv 2023

[1] [1]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., Van Der Wal, O., et al. Lessons from the trenches on reproducible evaluation of language models.arXiv preprint arXiv:2405.14782,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Accessed: 2026-04-

GitHub repository directory. Accessed: 2026-04-

2026

[3] [3]

C., Xu, P., Och, F

Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. InProceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natu- ral Language Learning, pp. 858–867,

2007

[4] [4]

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y ., Zhang, H., Zhang, H., Zhao, D., and Liang, W. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y ., Xie, Z., Li, Y . K., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Deng, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Luo, F., Hao, G., Chen, G., Li, G., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

BERT: Pre-training of deep bidirectional transformers for lan- guage understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pp. 4171–4186,

2019

[8] [8]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y ., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023a. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023b. nostalgebraist. Interpreting GPT: The logit lens....

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Accessed: 2026-04-21. Penedo, G., Kydl ´ıˇcek, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale.arXiv preprint arXiv:2406.17557,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y ., Ghosh, D., Groom, L., Haus- man, K., Ichter, B., et al. π0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Qwen2.5 Technical Report

Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community. StarVLA: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine- tuned chat models.arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., H ´enaff, O., Harmsen, J., Steiner, A., and Zhai, X. SigLIP 2: Multilingual vision- language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

A study of situational reasoning for traffic understanding.arXiv preprint arXiv:2306.02520,

Zhang, J., Ilievski, F., Ma, K., Kollaa, A., Francis, J., and Oltramari, A. A study of situational reasoning for traffic understanding.arXiv preprint arXiv:2306.02520,

work page arXiv

[17] [17]

With windowed attention alone, the model’s average score drops from 59.21 under global attention to 29.41, indicating that windowed attention cannot cover the long-range information required by 17 Lngram: N-gram Conditional Memory in Latent Space Table 12.Results of the binary-state retrieval model on general language understanding tasks. Model HellaSwag ...

work page arXiv 2023