pith. machine review for the scientific record.

arxiv: 2604.23862 · v1 · submitted 2026-04-26 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Graph Memory Transformer (GMT)

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords Graph Memory Transformer · transformer · feed-forward network · memory graph · centroids · language modeling · interpretability

The pith

A learned graph of memory centroids can replace the feed-forward sublayer in a decoder-only transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the feed-forward network inside each transformer block can be swapped for an explicit memory graph without breaking the autoregressive setup. Tokens are moved across a bank of learned centroids linked by a directed transition matrix instead of undergoing a dense transformation. This design keeps the model size smaller and makes the internal routing visible as source selection, target choice, and displacement vectors. A reader might care because it suggests a route to language models whose knowledge updates and internal states are more directly editable and observable. The resulting 82M-parameter model trains without collapse and stays competitive on zero-shot tasks even though its validation perplexity trails the 103M dense baseline.
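The validation loss and perplexity figures quoted in the abstract and referee report below are mutually consistent if the loss is measured in nats and perplexity is its exponential; a quick check, where the exp-of-nats pairing is the only assumption:

```python
import math

# Reported (validation loss, perplexity) pairs; if the loss is in nats,
# perplexity should equal exp(loss).
for loss, ppl in [(3.5995, 36.58), (3.2903, 26.85)]:
    print(f"exp({loss}) = {math.exp(loss):.2f}  (reported: {ppl})")
# exp(3.5995) = 36.58  (reported: 36.58)
# exp(3.2903) = 26.85  (reported: 26.85)
```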

Core claim

The Graph Memory Transformer (GMT) keeps causal self-attention unchanged but replaces every FFN with a memory cell containing 128 centroids and a 128 × 128 learned transition matrix. Token representations select a source centroid through gravitational routing, choose a target based on the current token, and read out a gated displacement that moves the representation from source toward target. Each of the 16 blocks performs this navigation rather than a standard linear transformation, producing an 82.2M-parameter decoder-only language model whose memory operations remain directly inspectable during the forward pass.
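No equations for the cell are given in this review, so the following is a minimal sketch of the described mechanism, not the paper's implementation: gravitational routing is approximated here as a softmax over negative squared distances to the centroids, target selection mixes a token-conditioned score with the transition row out of the soft source, and the readout is a scalar-gated displacement. The module names (target_proj, gate_proj) and the exact score and gate forms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryCell(nn.Module):
    """Drop-in FFN replacement: route each token over a centroid bank linked
    by a learned directed transition matrix, return a gated displacement."""

    def __init__(self, d_model: int, n_centroids: int = 128):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_centroids, d_model) * 0.02)
        # Learned directed edge weights between centroids (128 x 128 per block).
        self.transition = nn.Parameter(torch.zeros(n_centroids, n_centroids))
        self.target_proj = nn.Linear(d_model, n_centroids)  # token-conditioned target scores
        self.gate_proj = nn.Linear(d_model, 1)              # displacement gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        # Gravitational source routing: soft assignment that decays with
        # squared distance to each centroid (a distance-based "pull").
        c = self.centroids.unsqueeze(0).expand(x.size(0), -1, -1)
        src_w = F.softmax(-torch.cdist(x, c) ** 2, dim=-1)   # (b, s, n)
        source = src_w @ self.centroids                      # estimated source state

        # Token-conditioned target selection, biased by the transition
        # structure out of the (soft) source centroid.
        tgt_w = F.softmax(self.target_proj(x) + src_w @ self.transition, dim=-1)
        target = tgt_w @ self.centroids                      # target memory state

        # Gated displacement readout: movement from source toward target,
        # rather than a retrieved value.
        gate = torch.sigmoid(self.gate_proj(x))              # (b, s, 1)
        return gate * (target - source)

# Shape check: the cell must be interchangeable with the FFN sublayer.
out = GraphMemoryCell(d_model=512)(torch.randn(2, 10, 512))
assert out.shape == (2, 10, 512)
```

The inspectability claim falls out of this structure: src_w, tgt_w, and the displacement are ordinary forward-pass tensors that can be logged per token and per block.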

What carries the argument

The memory cell that performs gravitational source routing, token-conditioned target selection, and gated displacement readout to compute movement between centroids instead of a dense feed-forward transformation.

Load-bearing premise

Gravitational source routing combined with token-conditioned target selection and gated displacement readout can match the computational role of a dense feed-forward network using only the memory cell's own parameters.

What would settle it

Training the GMT model on the same data and observing that it diverges or produces incoherent text on basic continuation tasks would show that the memory graph cannot substitute for the FFN.

Figures

Figures reproduced from arXiv:2604.23862 by Niccolò Ferrari, Nicola Zanarini.

Figure 1. Slot-routing flow at Block 00.
Figure 2. Slot-routing flow at Block 06.
Figure 3. Slot-routing flow at Block 11.
Figure 4. Topic-separated Block 11 routing flows for the narrative, political, …
Figure 5. Slot-routing flow at Block 15. The sequence from Blocks 00, 06, …
Figure 6. Active edge structure at Block 00. Darker cells indicate stronger …
Figure 7. Active edge structure at Block 06.
Figure 8. Active edge structure at Block 11.
Figure 9. Active edge structure at Block 15. These are the same representative …
Figure 10. Slot-routing flow at Block 13 for a political-text probe, illustrating …
read the original abstract

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 × 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes replacing the FFN sublayers in decoder-only transformers with a Graph Memory Transformer (GMT) cell that routes token representations over a learned bank of 128 centroids per block, connected by a 128 × 128 directed transition matrix, using gravitational source routing, token-conditioned target selection, and gated displacement readout to return source-to-target movements rather than dense transformations. The base GMT v7 model (82.2M parameters, 16 blocks) trains stably, exposes centroid usage, transitions, and movements as inspectable forward-pass quantities, and achieves validation loss/perplexity of 3.5995/36.58 (vs. 3.2903/26.85 for a 103M dense GPT-style baseline) while showing comparable zero-shot benchmark behavior, supporting the viability of graph-mediated memory navigation as an FFN substitute without claiming SOTA results.

Significance. If the substitution holds under further scrutiny, the approach could improve interpretability by making memory operations explicit and directly analyzable, while using fewer parameters than the dense baseline. Stable training and the inspectable quantities are concrete strengths that enable new analyses of internal dynamics. The performance gap and lack of scaling results limit immediate impact, but the work provides a foundation for memory-graph alternatives to opaque FFNs.

major comments (2)
  1. [Experimental evaluation / Results] The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.
  2. [Results] On the validation metrics, the GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., an 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than to capacity or optimization differences.
minor comments (1)
  1. [Abstract] The abstract states 'close zero-shot benchmark behavior' without naming the specific benchmarks or reporting exact scores; doing so would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive note on the interpretability potential of GMT. We respond point-by-point to the major comments, agreeing where the experimental design can be strengthened and outlining specific revisions.

read point-by-point responses
  1. Referee: The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.

    Authors: We agree that component-level ablations would more convincingly demonstrate that the graph-mediated mechanisms are load-bearing rather than incidental. The current manuscript supports viability through stable end-to-end training and by exposing and qualitatively analyzing centroid usage, transition matrices, and source-to-target displacements as direct outputs of the forward pass. To address the concern directly, the revised manuscript will add an ablation subsection that (i) replaces gravitational source routing with uniform selection, (ii) randomizes the 128 × 128 transition matrix while preserving parameter count, and (iii) compares against a capacity-matched low-rank memory bank without learned routing (a sketch of these three variants follows the responses). These experiments will be reported alongside the existing results. revision: yes

  2. Referee: On the validation metrics, the GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., an 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than to capacity or optimization differences.

    Authors: We acknowledge that single-run results and the absence of a capacity-matched baseline limit attribution of the observed gap. The manuscript already states that results are not intended as a superiority claim and that the 103M dense model serves only as a reference point. In revision we will add an 82M-parameter dense GPT-style baseline trained under identical conditions and report its validation loss/perplexity. We will also state explicitly that all reported numbers are from single training runs (due to compute cost) and will include standard deviations from two additional seeds for the primary GMT and dense models if resources permit; otherwise the limitation will be noted in the text. revision: partial
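A sketch of how the three proposed ablations could be wired, layered on the GraphMemoryCell sketch above; the paper specifies none of this, so the variant names and the low-rank control's exact form are assumptions.

```python
import torch
import torch.nn.functional as F

def source_weights(x, centroids, mode="gravitational"):
    """Soft source assignment; 'uniform' is ablation (i), where routing
    carries no token information."""
    if mode == "gravitational":
        c = centroids.unsqueeze(0).expand(x.size(0), -1, -1)
        return F.softmax(-torch.cdist(x, c) ** 2, dim=-1)
    if mode == "uniform":
        n = centroids.size(0)
        return torch.full((*x.shape[:-1], n), 1.0 / n, device=x.device)
    raise ValueError(mode)

def frozen_random_transition(n, seed=0):
    """Ablation (ii): same shape and parameter count, but the edge structure
    is fixed noise (register it as a buffer so it receives no gradient)."""
    g = torch.Generator().manual_seed(seed)
    return torch.randn(n, n, generator=g)

class LowRankMemoryBank(torch.nn.Module):
    """Ablation (iii): a capacity-matched low-rank per-token transform with
    no learned routing, a control for 'any centroid bank would do'."""
    def __init__(self, d_model, rank=128):
        super().__init__()
        self.down = torch.nn.Linear(d_model, rank, bias=False)
        self.up = torch.nn.Linear(rank, d_model, bias=False)
    def forward(self, x):
        return self.up(torch.tanh(self.down(x)))
```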

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with measured results

full rationale

The paper is an empirical investigation of an architectural substitution (FFN replaced by graph memory cell with gravitational routing and gated readout). No derivation chain, equations, or first-principles predictions are presented; all reported quantities (validation loss 3.5995, perplexity 36.58, parameter counts) are direct training measurements compared against a baseline. No self-citations, ansatzes, or fitted inputs are invoked as load-bearing for any claimed result. The work is self-contained as an experimental demonstration of viability.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that a small learned graph can substitute for the FFN without further changes to attention or normalization. The design introduces many learned parameters (centroid embeddings, transition matrix, routing weights) that are fitted during training rather than derived; the raw counts these fixed sizes imply are sketched after the ledger. No new physical or mathematical axioms are invoked beyond standard transformer training assumptions.

free parameters (3)
  • number of centroids per block
    Fixed at 128; chosen by hand to balance capacity and inspectability.
  • transition matrix size
    128 × 128 learned matrix per block; size is a design choice.
  • gravitational source routing parameters
    Learned weights that map token states to source centroids.
axioms (2)
  • domain assumption Causal self-attention remains unchanged and sufficient when paired with the new memory cell.
    Stated in the abstract as keeping causal self-attention intact.
  • domain assumption Standard autoregressive language modeling objective is appropriate for evaluating the replacement.
    Implicit in the comparison to GPT-style baseline.
invented entities (2)
  • centroid bank as memory states (no independent evidence)
    purpose: Discrete memory locations that tokens route between instead of dense FFN transformation.
    New postulated memory structure; no independent evidence outside the model itself.
  • learned directed transition matrix (no independent evidence)
    purpose: Encodes movement rules between memory centroids.
    Invented component of the graph memory cell.
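
What the ledger's fixed sizes imply in raw parameter counts, for scale. Only the transition matrices are fully determined by the stated numbers (128 centroids, a 128 × 128 edge matrix, 16 blocks); the centroid banks additionally scale with the model width, which this review does not state, so the width below is a placeholder.

```python
# Raw counts implied by the ledger's fixed design choices.
blocks, n = 16, 128
transition_params = blocks * n * n          # 262,144 (~0.26M of the 82.2M total)
centroid_params = lambda d: blocks * n * d  # width d is unstated; e.g. d=512 -> 1,048,576
print(transition_params, centroid_params(512))
```

The graph structure itself is thus a small fraction of the model; the bulk of the 82.2M parameters must sit in embeddings, attention, and the routing/readout projections.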

pith-pipeline@v0.9.0 · 5587 in / 1831 out tokens · 43532 ms · 2026-05-08T06:24:14.512096+00:00 · methodology

