arxiv: 2605.24869 · v1 · pith:6XPJRQIEnew · submitted 2026-05-24 · 💻 cs.CL

Lngram: N-gram Conditional Memory in Latent Space

Pith reviewed 2026-06-30 12:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords Lngram, N-gram memory, latent space, conditional memory, language modeling, perplexity, multimodal tasks, sequence modeling

The pith

Lngram learns discrete symbols from hidden states to enable N-gram lookups in latent space, decoupling retrieval from tokenization and extending to multimodal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lngram to handle both compositional reasoning and local knowledge retrieval in sequence models by moving N-gram memory into the latent space. Instead of relying on text token IDs for keys, the method extracts discrete symbols directly from hidden states and performs conditional lookups over those symbols. This change is meant to avoid tokenization dependence and information loss while allowing the module to work on non-text inputs. In the reported experiments, Lngram lowers perplexity relative to standard Transformers and the earlier Engram approach, adds domain knowledge to already-trained models, and yields better results when trained jointly than full fine-tuning does. The same module also improves performance on vision-language and vision-language-action tasks, with internal analyses indicating that prediction-relevant signals appear at earlier layers.

Core claim

Lngram is a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains.

What carries the argument

Lngram module: a latent-space conditional memory that extracts discrete symbols from hidden states and performs N-gram lookup over those symbols.
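
The lookup path described here can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-codebook quantizer, the polynomial hash, the single hashed table, and every name (`quantize`, `ngram_slot`, `lookup`, `codebook`, `table`) are assumptions made for illustration, since this page does not say how the symbols are learned or how the table is indexed.

```python
import random

def quantize(hidden, codebook):
    """Map each hidden vector to the index of its nearest codebook vector.
    Assumption: nearest-neighbour quantization stands in for the learned discretizer."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(h, codebook[k]))
            for h in hidden]

def ngram_slot(symbols, t, n, table_size, base=1_000_003):
    """Hash the n symbols ending at position t into a table slot (hypothetical hash)."""
    if t + 1 < n:
        return None  # not enough left context to form an n-gram
    key = 0
    for s in symbols[t - n + 1 : t + 1]:
        # s + 1 keeps symbol 0 from vanishing inside the polynomial hash
        key = (key * base + s + 1) % table_size
    return key

def lookup(hidden, codebook, table, n):
    """Return the symbol sequence and one memory vector per position.
    Positions without a full n-gram receive a zero vector."""
    symbols = quantize(hidden, codebook)
    dim = len(table[0])
    memory = []
    for t in range(len(symbols)):
        slot = ngram_slot(symbols, t, n, len(table))
        memory.append(list(table[slot]) if slot is not None else [0.0] * dim)
    return symbols, memory

# Toy data: 3 codebook vectors, a 64-row memory table, 6 hidden states.
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
table = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]
hidden = [[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8],
          [0.0, 0.1], [1.1, 0.9], [-0.9, 1.1]]
symbols, memory = lookup(hidden, codebook, table, n=2)
```

In this toy run the symbol sequence repeats, so positions 1 and 4 share a symbol bigram and read the same table row. The keys depend only on hidden states, which is why nothing in the sketch refers to tokenizer IDs.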

If this is right

  • Consistently reduces perplexity in long-context language modeling.
  • Effectively injects domain knowledge when added post hoc to pretrained models.
  • Joint training with the backbone surpasses full fine-tuning.
  • Shows overall gains on vision-language and vision-language-action tasks.
  • Enables prediction-relevant information to emerge earlier in the network layers.
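
The last point is attributed in the abstract to LogitLens and CKA analyses. A LogitLens-style measurement can be sketched as follows, assuming a shared unembedding matrix `W` and KL divergence taken from the final-layer distribution to each intermediate one. The direction of the KL, the toy numbers, and all function names are assumptions and are not taken from the paper.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def unembed(hidden, W):
    """Project a hidden vector to vocabulary logits; W has one row per vocabulary item."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W]

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats, with a small epsilon to guard against log(0)."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def logit_lens_kl(layer_states, W):
    """KL(final-layer prediction || layer-l prediction) for each layer at one position."""
    final = softmax(unembed(layer_states[-1], W))
    return [kl(final, softmax(unembed(h, W))) for h in layer_states]

# Toy example: 3 layers, hidden size 2, vocabulary size 3 (made-up values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
layer_states = [[0.0, 0.0], [1.0, 0.2], [2.0, 0.1]]
kls = logit_lens_kl(layer_states, W)
```

A curve that drops faster with depth means intermediate layers already predict something close to the final output, which is the sense in which information is said to emerge earlier.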

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-free design could be applied to non-text sequences such as audio waveforms or biological strings to test cross-modal generality.
  • Earlier emergence of useful signals might allow shorter context windows or shallower stacks to reach the same accuracy.
  • The learned discrete symbols could be inspected to see whether they correspond to interpretable concepts or patterns inside the model.
  • Similar latent-symbol memory could be attached to non-Transformer backbones to check whether the benefit is architecture-specific.

Load-bearing premise

That discrete symbols learned directly from hidden states can support effective N-gram conditional memory without the information loss or dependence issues of token-based keys.

What would settle it

If inserting Lngram into a pretrained model produces no reduction in perplexity on long-context language modeling benchmarks or no gains on vision-language tasks relative to the Engram baseline, the performance advantage would be falsified.
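
The comparison this test relies on is a perplexity difference between two models scored on the same tokens. A minimal sketch, using the standard definition of perplexity as the exponential of the mean negative log-likelihood; the log-probabilities below are invented toy values, not results from the paper.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over the scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities assigned to the same four target tokens.
baseline_lp = [math.log(p) for p in (0.20, 0.10, 0.25, 0.05)]
with_memory_lp = [math.log(p) for p in (0.25, 0.10, 0.40, 0.05)]

ppl_base = perplexity(baseline_lp)
ppl_mem = perplexity(with_memory_lp)
reduced = ppl_mem < ppl_base  # the claim fails on this data if this is False
```

Both models must be evaluated on identical token positions for the comparison to be meaningful.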

Figures

Figures reproduced from arXiv: 2605.24869 by Guoyang Xia, Lei Ren, Xiaojie Wang, Yunao Zheng.

Figure 1. The main architecture of Lngram.
2.4. Context-Aware Readout. The retrieved $e_t^{(n)}$ or $e_{t,s}^{(n)}$ is static memory independent of the current context. To mitigate the impact of mismatched retrievals on the backbone network, Lngram follows Engram in introducing a context-aware filtering mechanism at the readout stage. We first consider the single-table version. For each order n, the retrieval result is project…
Figure 2. Illustration of prefix perplexity at different context positions in long-context language modeling.
… in long-context PPL indicates that the conditional memory branch does not impair the model’s ability to exploit long-range context, while providing additional local-pattern priors for long-context language modeling. 3.3. Lngram-Tuning: Domain Knowledge Injection. We further test the domain adaptation capabi…
Figure 3. LogitLens KL divergence as a function of layer depth. Except for the shallowest layer, where Lngram has not yet been introduced, Lngram exhibits lower KL divergence on most layers.
4.1.2. CKA: Representations Align with Deeper Layers Earlier. We further use linear CKA to compare the layer representations of the MoE baseline and MoE+Lngram. Given two sets of representations X and Y, their linear-kernel Gra…
Figure 4. CKA similarity between the layer representations of the MoE baseline and MoE+Lngram. The dashed line indicates the soft alignment. The high-similarity region is shifted overall above the diagonal, indicating that Lngram layers are closer to deeper representations in the baseline model.
… alignment curve also lies above the same-layer diagonal for most layers. This phenomenon is most pronounced in the middle …
Figure 5. Gating responses of the layer-1 3-gram branch in Lngram. Peaks occur where fixed phrases such as “Great” and “Wales” are completed.
4.2.1. N-gram Order Combinations. We first examine the effect of different N-gram order combinations on performance. The results are shown in …
Original abstract

Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.
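
For readers unfamiliar with the CKA analysis mentioned in the abstract, linear CKA between two sets of layer representations can be sketched as below. This uses the standard feature-space form, $\mathrm{CKA}(X, Y) = \lVert Y^\top X \rVert_F^2 / (\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F)$ on column-centered matrices, which is equivalent to the Gram-matrix form for a linear kernel. Whether the paper computes it exactly this way is an assumption, and the matrices are toy data.

```python
import math

def center(X):
    """Subtract the per-feature mean from every row (rows are examples)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_sq_frobenius(A, B):
    """Squared Frobenius norm of A^T B for matrices with the same number of rows."""
    total = 0.0
    for a_col in zip(*A):
        for b_col in zip(*B):
            total += sum(a * b for a, b in zip(a_col, b_col)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA between representations X (n x d1) and Y (n x d2)."""
    Xc, Yc = center(X), center(Y)
    num = cross_sq_frobenius(Yc, Xc)
    den = math.sqrt(cross_sq_frobenius(Xc, Xc) * cross_sq_frobenius(Yc, Yc))
    return num / den

# Toy check: Y is X rotated by 90 degrees and scaled by 3.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
Y = [[-3.0 * y, 3.0 * x] for x, y in X]
same = linear_cka(X, X)
rotated = linear_cka(X, Y)
```

Because linear CKA is invariant to rotation and isotropic scaling, `rotated` matches `same` up to floating-point error. Computing the score for every pair of layers from two models gives a similarity grid of the kind shown in Figure 4.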

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over them, decoupling retrieval from token IDs and extending to non-text modalities. It claims that in evaluated settings Lngram outperforms Transformer and Engram baselines, reduces perplexity in long-context language modeling, injects domain knowledge post hoc into pretrained models, surpasses full fine-tuning when jointly trained, yields gains on vision-language and vision-language-action tasks, and enables earlier emergence of prediction-relevant information per LogitLens and CKA analyses, all with limited inference and memory overhead.

Significance. If the empirical results hold with proper controls, the work offers a practical route to add static N-gram-style knowledge retrieval to dense models without tokenizer dependence or full retraining, potentially increasing effective depth at modest cost and broadening applicability beyond text.

minor comments (2)
  1. The abstract states performance gains and architectural advantages but supplies no dataset names, model sizes, training details, error bars, or ablation controls; these must be added to the experimental section for the claims to be verifiable.
  2. The description of the discretization step and N-gram lookup mechanism would benefit from an explicit equation or pseudocode block showing how hidden-state symbols are obtained and indexed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the paper and for noting the potential significance of adding static N-gram-style knowledge retrieval to dense models without tokenizer dependence. We appreciate the 'uncertain' recommendation and welcome the opportunity to clarify any aspects of the empirical results.

Circularity Check

0 steps flagged

No significant circularity


The paper presents Lngram as an empirical architecture proposal: a latent-space discretization module for N-gram lookup that is evaluated on language modeling, vision-language, and related tasks. No equations, derivations, or first-principles claims appear in the provided abstract or description. Performance assertions are framed as experimental outcomes rather than reductions from fitted parameters or self-citations. The work is self-contained against external benchmarks (Transformer and Engram baselines) with no load-bearing self-referential steps visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training details, or architectural diagrams; therefore no free parameters, axioms, or invented entities can be identified with certainty.

pith-pipeline@v0.9.1-grok · 5708 in / 1030 out tokens · 23910 ms · 2026-06-30T12:24:51.784804+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 14 canonical work pages · 12 internal anchors

  1. [1]

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., Van Der Wal, O., et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782.

  2. [2]

    Accessed: 2026-04-

    GitHub repository directory. Accessed: 2026-04-

  3. [3]

    Large Language Models in Machine Translation

    Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 858–867.

  4. [4]

    Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

    Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y., Zhang, H., Zhang, H., Zhao, D., and Liang, W. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372.

  5. [5]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y. K., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.

  6. [6]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Deng, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Luo, F., Hao, G., Chen, G., Li, G., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.

  8. [8]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023a.

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023b.

    nostalgebraist. Interpreting GPT: The logit lens....

  9. [9]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Accessed: 2026-04-21. Penedo, G., Kydlíček, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.

  10. [10]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054.

  11. [11]

    Qwen2.5 Technical Report

    Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

  12. [12]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    StarVLA Community. StarVLA: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014.

  13. [13]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  14. [14]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., and Zhai, X. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786.

  15. [15]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  16. [16]

    A Study of Situational Reasoning for Traffic Understanding

    Zhang, J., Ilievski, F., Ma, K., Kollaa, A., Francis, J., and Oltramari, A. A study of situational reasoning for traffic understanding. arXiv preprint arXiv:2306.02520.

  17. [17]

    With windowed attention alone, the model’s average score drops from 59.21 under global attention to 29.41, indicating that windowed attention cannot cover the long-range information required by … Table 12. Results of the binary-state retrieval model on general language understanding tasks. Model HellaSwag ...