pith. machine review for the scientific record.

arxiv: 2605.13179 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive image generation · Engram module · memory retrieval · gated fusion · FID evaluation · ImageNet synthesis · Transformer layers · residual stream

The pith

The Engram module in autoregressive image generation acts as a gated side-pathway rather than a content-addressed memory retriever.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether the Engram module, previously interpreted as providing content-addressed recall of recurring token patterns in language models, transfers that role to autoregressive image generation. Experiments adapt it with 2D spatial n-gram hashing and gated fusion, then inject it into a class-conditional generator trained on ImageNet 256x256. Across backbone-to-memory budget ratios from 0.17 to 0.90, every augmented model trails the pure autoregressive baseline in FID, showing that the module saves compute but does not raise sample quality. Probing with gate clamps, donor swaps, and frozen-noise tables reveals that the pathway itself drives most of the effect, while the learned hash table adds only minor refinement.
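The mechanism under test fits in a few lines. The sketch below is an illustrative toy, not the paper's implementation: the rolling hash, the table size, and the scalar gate parameterization (`ngram_hash`, `TABLE_SIZE`, `gate`) are all assumptions made for exposition.

```python
# Toy sketch of a hash-keyed, gated side-pathway (assumed names and shapes;
# not the paper's code). A 2D neighborhood of previously generated token ids
# is hashed to one slot of a memory table, and that slot's embedding is added
# to the residual stream through a scalar gate.
import numpy as np

TABLE_SIZE = 4096   # memory slots (the paper sweeps far larger tables)
D_MODEL = 8         # hidden size (768 in the paper's AR-B backbone)

rng = np.random.default_rng(0)
table = rng.standard_normal((TABLE_SIZE, D_MODEL))  # learned, or frozen N(0,1)


def ngram_hash(neighborhood):
    """Map a tuple of spatially adjacent token ids to one table index."""
    h = 0
    for t in neighborhood:
        h = (h * 1000003 + int(t)) % TABLE_SIZE  # simple rolling hash
    return h


def engram_fuse(hidden, neighborhood, gate=0.10):
    """Gated fusion: residual stream plus gated table lookup."""
    return hidden + gate * table[ngram_hash(neighborhood)]


h = np.zeros(D_MODEL)               # stand-in hidden state
out = engram_fuse(h, (17, 42, 99))  # constant gate g = 0.10
```

Read this way, the paper's probes are interventions on the two free pieces: clamp `gate` to a constant, swap the `neighborhood` fed to the hash, or freeze `table` to noise.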

Core claim

The Engram module in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement. Every Engram variant trails the pure AR baseline in FID; a constant gate of 0.10 matches the learned gate; donor probes with adversarial or random exemplars yield indistinguishable next-token distributions; and freezing the table to N(0,1) noise costs only 0.10 FID while raising Inception Score.

What carries the argument

Hash-keyed residual stream with 2D spatial n-gram hashing and gated fusion, injected into Transformer layers as a KV-cache-compatible side-pathway.

If this is right

  • Engram saves backbone FLOPs but does not improve sample quality over the pure autoregressive baseline.
  • Disabling the Engram pathway is catastrophic, yet a tiny constant gate of 0.10 matches or beats the learned gate.
  • Swapping hash inputs for matched, adversarial, or random same-class exemplars produces statistically identical next-token distributions.
  • Training from scratch with the memory table frozen to N(0,1) noise raises Inception Score and costs only 0.10 FID.
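Two of these probes, the gate clamp and the donor swap, can be mimicked in a toy harness. Everything here (the hash, the random table, the stand-in learned gate value) is illustrative, not the paper's trained model.

```python
# Toy versions of the gate-clamp and donor-swap probes (assumed names and
# values; not the paper's trained model).
import numpy as np

rng = np.random.default_rng(1)
TABLE, D = 1024, 8
table = rng.standard_normal((TABLE, D))
LEARNED_GATE = 0.12                  # stand-in for a learned scalar gate


def lookup(tokens):
    h = 0
    for t in tokens:
        h = (h * 1000003 + int(t)) % TABLE
    return table[h]


def forward(hidden, tokens, gate=LEARNED_GATE):
    return hidden + gate * lookup(tokens)


hidden = rng.standard_normal(D)
matched = forward(hidden, (3, 1, 4))             # learned gate, matched exemplar
clamped = forward(hidden, (3, 1, 4), gate=0.10)  # gate clamp: constant g
donor = forward(hidden, (9, 9, 9))               # donor swap: different exemplar
disabled = forward(hidden, (3, 1, 4), gate=0.0)  # pathway off (catastrophic in the paper)
```

The paper's finding, in these terms: `clamped` behaves like `matched`, `donor` is statistically indistinguishable from `matched`, but `disabled` is not.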

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that simpler fixed residual connections could capture most of the observed benefit without the overhead of hashing and table storage.
  • If the pathway dominates, similar side-stream designs might be worth testing in other autoregressive vision tasks such as video or 3D generation.
  • The negligible role of the learned table implies that true content-addressed recall in generation may require different table update rules or larger scale.

Load-bearing premise

The 2D spatial n-gram hashing and gated fusion adaptations faithfully preserve the original Engram mechanism while allowing fair comparison to the pure AR baseline.

What would settle it

An ablation that removes the gated hash-keyed stream entirely while keeping all other architectural elements fixed and measures whether FID rises by more than the small delta seen with a noise table.
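That ablation reads naturally as a single config toggle: retrain with the side-pathway compiled out, everything else fixed. The config keys below are hypothetical; the values follow the base model the paper describes (AR-B: a 24-layer causal Transformer, hidden size 768, 16 attention heads) and the largest table in the proportion sweep.

```python
# The proposed settling ablation expressed as a config toggle (hypothetical
# keys; the paper's codebase is not reproduced here).
from dataclasses import dataclass


@dataclass
class ARConfig:
    n_layers: int = 24        # AR-B: 24-layer causal Transformer
    d_model: int = 768        # hidden size 768
    n_heads: int = 16         # 16 attention heads
    use_engram: bool = True   # the toggle under test
    table_size: int = 52831   # largest table in the proportion sweep
    n_hash_heads: int = 4


baseline = ARConfig(use_engram=False)   # side-pathway removed entirely
augmented = ARConfig(use_engram=True)   # everything else identical
```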

Figures

Figures reproduced from arXiv: 2605.13179 by Chunbin Gu, Jinghao Wang, Pheng-Ann Heng, Qiyuan He.

Figure 1
Figure 1. Left: We compare the Engram used in LLM and vision autoregressive generation.
Figure 2
Figure 2. N-gram Jaccard similarity stratified by DINO similarity. AliTok tokens exhibit consistent local patch patterns that grow stronger with semantic similarity, while MaskGIT-VQ tokens do not.
Figure 3
Figure 3. Proportion sweep. The Engram-augmented AR saves FLOPs by offloading local-pattern reconstruction to the Engram memory tables, but this comes with a slight FID cost. The number of hash heads is fixed at 4, and the memory table size is varied from 52831 down to 5243, yielding a family of Engram-augmented variants with ρ ∈ {0.17, 0.32, 0.41, 0.51, 0.63, 0.76, 0.90}.
Figure 4
Figure 4. Qualitative comparison. Rows (top to bottom): (1) AR baseline, (2-8) Engram-augmented variants at ρ ∈ {0.17 − 0.90}, and (9) 2D Engram with ρ = 0.63; row labels are rendered in the left margin. Classes (left to right): goldfish, cock, macaw, golden retriever, white wolf, tiger cat, panda, airliner, pirate ship, volcano.
read the original abstract

The Engram module -- a hash-keyed, O(1) associative memory injected into Transformer layers -- was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial $n$-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256x256. Across a sweep of backbone-to-memory budget ratios $\rho{\in}[0.17, 0.90]$, every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate (g=0.10) matches or beats the learned gate -- inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to $\mathcal{N}(0, 1)$ noise costs only $\Delta\text{FID}{=}0.10$ and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript adapts the Engram module to class-conditional autoregressive image generation on ImageNet 256x256 via 2D spatial n-gram hashing, gated fusion, and KV-cache compatibility. Across backbone-to-memory budget ratios ρ ∈ [0.17, 0.90], all Engram-augmented models underperform the pure AR baseline in FID. Gate-clamp experiments show that a constant gate g=0.10 matches the learned gate while full disablement is catastrophic; donor-probe tests yield statistically indistinguishable next-token distributions for matched, adversarial, and random same-class hash inputs; and training with the memory table frozen to N(0,1) noise yields only ΔFID=0.10 while improving Inception Score. The authors conclude that Engram functions as a gated architectural side-pathway whose benefit is dominated by the residual stream rather than content-addressed retrieval.

Significance. If the results hold, the work supplies concrete empirical evidence that apparent memory-module gains in vision AR generation can arise from architectural integration rather than associative recall, with direct implications for the design of efficient transformer variants. The combination of budget sweeps, gate ablations, input-perturbation probes, and frozen-table controls constitutes a multi-pronged mechanistic investigation that is stronger than single-metric comparisons.

minor comments (3)
  1. FID and IS metrics throughout the budget sweep, gate ablations, and frozen-table experiment are reported without error bars or standard deviations from multiple random seeds; this omission makes it impossible to judge whether the small ΔFID=0.10 difference is statistically distinguishable from run-to-run variance.
  2. The precise definition and collision properties of the 2D spatial n-gram hash function are not supplied as an equation or algorithm; adding this detail (e.g., in §3) would allow readers to verify that the donor-probe inputs truly exercise the intended content-addressing behavior.
  3. A short limitations paragraph discussing how the introduction of explicit gating and the shift from 1D token to 2D patch hashing might change lookup dynamics relative to the original LLM Engram would help bound the scope of the mechanistic claim.
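The fix for minor comment 1 is mechanical: report mean and sample standard deviation of FID across independent seeds, and compare ΔFID against that spread. The FID values below are fabricated placeholders for illustration, not results from the paper.

```python
# Multi-seed FID reporting, as minor comment 1 requests. The values are
# made-up placeholders, not the paper's measurements.
import statistics

fid_by_seed = [2.31, 2.27, 2.38]     # hypothetical: one FID per random seed
mean = statistics.mean(fid_by_seed)
std = statistics.stdev(fid_by_seed)  # sample std, n - 1 denominator
report = f"FID = {mean:.2f} ± {std:.2f} (n = {len(fid_by_seed)})"
# A ΔFID of 0.10 is only meaningful if it clearly exceeds this std.
```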

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful and positive summary of our work, for highlighting the multi-pronged nature of the mechanistic probes, and for recommending minor revision. We are pleased that the empirical evidence is viewed as potentially informative for efficient transformer design. Because the report contains no major criticisms, we have no point-by-point rebuttals to offer and no standing objections.

Circularity Check

0 steps flagged

No circularity: the comparisons are purely empirical, with no load-bearing definitional reduction or self-citation.

full rationale

The paper presents an empirical study adapting Engram to AR image generation and running controlled experiments (gate sweeps, donor-probe swaps, frozen-noise table training). No equations, derivations, or first-principles claims are made that reduce the central conclusion to fitted parameters or prior self-citations by construction. All evidence consists of direct FID/IS measurements and distribution comparisons against baselines. The adaptations (2D n-gram hashing, gated fusion) are explicitly described as modifications, and the conclusions are scoped to the adapted module. No load-bearing step equates the 'gated side-pathway' interpretation to its inputs via definition or renaming.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The work is experimental and relies on standard computer-vision assumptions: that FID and Inception Score are valid proxies for sample quality, and that the 2D hashing adaptation is a faithful port of the original module.

free parameters (2)
  • rho (backbone-to-memory budget ratio)
    Sweep parameter varied in [0.17, 0.90] to allocate compute between backbone and memory.
  • constant gate value g
    Fixed value 0.10 tested as alternative to learned gate.

pith-pipeline@v0.9.0 · 5663 in / 1254 out tokens · 31569 ms · 2026-05-14T20:30:04.590990+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 3 internal anchors

  1. [1]

    MaskGIT: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In CVPR, 2022.

  2. [2]

    Conditional memory via scalable lookup: A new axis of sparsity for large language models

    Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026.

  3. [3]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

  4. [4]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.

  5. [5]

    Neural Turing Machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

  6. [6]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

  7. [7]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  8. [8]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.

  9. [9]

    Parallel multiscale autoregressive density estimation

    Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

  10. [10]

    PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications

    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

  11. [11]

    Pixel recurrent neural networks

    Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

  12. [12]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. NeurIPS, 2017.

  13. [13]

    MaskBit: Embedding-free image generation via bit tokens

    Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. MaskBit: Embedding-free image generation via bit tokens. Transactions on Machine Learning Research (TMLR), 2024.

  14. [14]

    Memory networks

    Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

  15. [15]

    Towards sequence modeling alignment between tokenizer and autoregressive model

    Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Towards sequence modeling alignment between tokenizer and autoregressive model. arXiv preprint arXiv:2506.05289, 2025.

  16. [16]

    Memorizing transformers

    Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

  17. [17]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.

  18. [18]

    Randomized autoregressive visual generation

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. arXiv preprint arXiv:2411.00776, 2024.

  19. [19]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. arXiv preprint arXiv:2406.07550, 2024.