PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding

Chongjun Tu; Chunfeng Song; Jiamin Wu; Kangcong Li; Lin Zhang; Peng Ye; Qihao Zheng; Tao Chen; Tao Yang

arXiv: 2506.17310 · v3 · submitted 2025-06-18 · 🧬 q-bio.NC · cs.CL · cs.NE

Pith reviewed 2026-05-19 09:30 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.CL · cs.NE

keywords long-context LLMs · brain-inspired models · persistent activity · cortical clustering · FFN optimization · context extension · information decay · semantic modules

The pith

PaceLLM uses persistent activity and semantic clustering to reduce information decay and fragmentation in LLMs for extended context handling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that LLMs suffer from transient activations that cause information to decay over long sequences and from unstructured feed-forward weights that break semantic connections across tokens. It proposes to fix these issues by adding a memory bank that keeps critical states active, modeled on prefrontal cortex persistence, and by grouping weights into specialized modules that create cross-token links, modeled on cortical organization. A reader would care if these changes allow models to retain details across much longer inputs without proportional increases in size or training cost. The work claims the fixes are general and can be added to existing models to boost both performance and interpretability on tasks that require holding many pieces of information in mind at once.

Core claim

Transient neural activations produce contextual decay, while unstructured FFN weights produce semantic fragmentation. The paper counters the first with a Persistent Activity Mechanism, which maintains an activation-level memory bank to retrieve, reuse, and update key FFN states, and the second with Cortical Expert Clustering, which reorganizes FFN weights into semantic modules to establish cross-token dependencies. It reports a 6 percent gain on LongBench multi-document QA, 12.5-17.5 percent gains on Infinite-Bench, and reliable performance at 200K tokens in needle-in-a-haystack tests.

What carries the argument

The Persistent Activity (PA) Mechanism, an activation-level memory bank that dynamically retrieves, reuses, and updates FFN states, together with Cortical Expert (CE) Clustering, which reorganizes FFN weights into semantic modules to build cross-token dependencies.
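To make the retrieve, reuse, and update cycle concrete, here is a minimal Python sketch of an activation-level memory bank. It is an illustration under stated assumptions, not the authors' implementation: the class name `ActivationMemoryBank`, cosine similarity as the lookup metric, the `fuse_threshold` and `mix` parameters, and least-recently-used eviction are hypothetical choices, loosely suggested by the Figure 2 caption (similarity retrieval, top-k, added noise) and by the appendix's mention of an LRU update strategy.

```python
import math
import random
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ActivationMemoryBank:
    """Hypothetical memory bank; every parameter here is an assumption."""

    def __init__(self, capacity=4, top_k=2, fuse_threshold=0.8,
                 mix=0.5, noise_std=0.0, seed=0):
        self.capacity = capacity              # maximum number of stored states
        self.top_k = top_k                    # candidates kept per lookup
        self.fuse_threshold = fuse_threshold  # minimum similarity for reuse
        self.mix = mix                        # weight on the retrieved state
        self.noise_std = noise_std            # std of noise added on retrieval
        self.rng = random.Random(seed)
        self.slots = OrderedDict()            # slot id -> state, oldest first
        self.next_id = 0

    def lookup(self, state):
        """Return (slot_id, similarity) pairs for the top-k closest slots."""
        scored = [(sid, cosine(state, s)) for sid, s in self.slots.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[: self.top_k]

    def step(self, state):
        """Retrieve, reuse, and update; returns the state to pass onward."""
        hits = [h for h in self.lookup(state) if h[1] >= self.fuse_threshold]
        out = list(state)
        if hits:
            sid, _ = hits[0]
            past = [p + self.rng.gauss(0.0, self.noise_std)
                    for p in self.slots[sid]]
            # Reuse: blend the current state with the retrieved one.
            out = [(1 - self.mix) * c + self.mix * p
                   for c, p in zip(state, past)]
            # Update: refresh the slot and mark it most recently used.
            self.slots[sid] = out
            self.slots.move_to_end(sid)
        else:
            # No close match: store the state, evicting the LRU slot if full.
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)
            self.slots[self.next_id] = out
            self.next_id += 1
        return out
```

In this sketch a state that resembles a stored one is blended with it and the slot is refreshed, while an unfamiliar state claims a new slot. Whether the paper fuses states this way, or operates per token or per chunk, is not stated in the material above.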

If this is right

Multi-document question answering on LongBench improves by 6 percent.
Performance on Infinite-Bench tasks rises between 12.5 and 17.5 percent.
Reliable retrieval extends to 200K tokens in needle-in-haystack evaluations.
The same additions can be applied to any existing model to raise long-context scores and interpretability without redesigning its architecture.
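For readers unfamiliar with the needle-in-a-haystack evaluation mentioned above, the following Python sketch shows the general shape of such a test: a single fact is inserted at a chosen depth in filler text, and the model is asked to retrieve it across a grid of context lengths and depths. The function names, the use of sentence counts instead of token counts, and the substring scoring rule are simplifying assumptions for illustration; `model_fn` stands in for any callable that maps a prompt to an answer string.

```python
def build_haystack(filler, needle, n_sentences, depth):
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences), position


def score_answer(answer, expected):
    """Substring scoring: 1.0 if the expected fact appears in the answer."""
    return 1.0 if expected.lower() in answer.lower() else 0.0


def run_grid(model_fn, lengths, depths, filler, needle, question, expected):
    """Evaluate `model_fn(prompt) -> str` over a length x depth grid."""
    results = {}
    for n in lengths:
        for d in depths:
            context, _ = build_haystack(filler, needle, n, d)
            prompt = context + "\n\nQuestion: " + question
            results[(n, d)] = score_answer(model_fn(prompt), expected)
    return results
```

A real run would measure length in tokens with the model's tokenizer and would typically use varied filler text, so that the needle cannot be found by spotting the one sentence that differs.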

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory-bank approach might lower the compute needed for very long contexts by reusing states instead of recomputing them from scratch.
Semantic modules could make it easier to locate and edit specific pieces of knowledge inside a model after training.
Similar persistence and modularity ideas might transfer to multimodal settings where long video or audio sequences must be tracked.
If the gains hold across architectures, the technique could become a standard lightweight upgrade for any transformer-based system.

Load-bearing premise

The assumption that the memory bank for keeping FFN states active and the reorganization of weights into semantic modules specifically solve decay and fragmentation rather than simply adding capacity or regularization that other methods could also provide.

What would settle it

A controlled test in which a standard model given equivalent extra memory or weight reorganization but without the brain-inspired retrieval and clustering rules shows the same or greater accuracy on the 200K-token needle-in-haystack task.
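One way to picture the memory half of such a control is to hold the stored states and the bank capacity fixed and vary only the retrieval rule. The Python sketch below is a hypothetical illustration of that idea, not an experiment from the paper: the `retrieve` function and its `policy` argument are invented names, and cosine similarity is an assumed stand-in for whatever lookup metric the method actually uses.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(bank, query, k, policy, rng):
    """Pick k stored states from the same bank under one of two policies."""
    if policy == "similarity":
        # Treatment arm: content-based lookup of the closest states.
        ranked = sorted(bank, key=lambda s: cosine(query, s), reverse=True)
        return ranked[:k]
    if policy == "random":
        # Control arm: identical capacity, but no retrieval rule.
        return rng.sample(bank, k)
    raise ValueError(f"unknown policy: {policy}")
```

If the random arm matched the similarity arm on downstream accuracy, that would suggest the gains come from the extra state itself and not from the retrieval rule.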

Figures

Figures reproduced from arXiv: 2506.17310 by Chongjun Tu, Chunfeng Song, Jiamin Wu, Kangcong Li, Lin Zhang, Peng Ye, Qihao Zheng, Tao Chen, Tao Yang.

**Figure 1.** Schematic diagram of PaceLLM (bottom) and its neuroscience counterpart (top). In this case, which introduces James Chadwick’s character, the brain processes and retains key information through working memory. When content held in working memory appears in the subsequent text, such as "Britain", the relevant neurons are persistently reactivated. When the final question is input, the neuron with the keyw…

**Figure 2.** Illustration of PaceLLM. The left of the figure shows the overall pipeline. Note that the Activation Memory Bank (AMB) does not interact with all FFN layers. The top right shows the modified FFN layer in detail. The bottom right shows the detailed processing flow of the AMB. ① Lookup Memory shows the process of similarity retrieval, taking the top-k, and adding noise. ② shows the selection of r…

**Figure 3.** Evaluation on Needle-In-A-Haystack. PaceLLM (bottom) can retrieve the needle at up to 200K tokens, compared with 128K for Activation Beacon (top).

**Figure 4.** Visualization of current and historical activations. The orange circles enclose clusters of current and past activations, indicating that they carry similar information and that useful past activations are sufficiently reused. This illustrates that PaceLLM leverages the AMB to retrieve semantically similar past activations, enabling repeated reuse in a manner analogous to working memory. Ablation of fusion thresholds…

Original abstract

While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other works. Besides, it can be generalized to any model and enhance their long-context performance and interpretability without structural overhauls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PaceLLM adds a persistent FFN activation memory bank and semantic clustering on weights to chase longer context, but the gains look hard to separate from plain extra capacity without tighter controls.

The paper's core move is to keep a running memory bank of FFN activations so critical states can be retrieved and updated across tokens, plus reorganize those weights into task-specific clusters meant to reduce fragmentation. They report a 6% lift on LongBench multi-document QA, 12.5-17.5% on Infinite-Bench, and reliable needle retrieval out to 200K tokens. That is the concrete new piece: the specific pairing of activation-level persistence with weight clustering, applied on top of an existing LLM without changing its overall structure. If the numbers survive normal checks, it gives a low-overhead way to stretch context in deployed models for document work. The implementation details and the claim that it generalizes to any model are the parts worth looking at closely.

The main weakness is that the abstract and the stress-test note both leave open whether the improvements come from the dynamic rules and semantic objective or simply from adding persistent storage and some grouping. Standard capacity-matched controls or ablations that turn off the retrieval/update logic while keeping the extra state would settle this, but nothing in the provided summary shows those were run. Statistical significance and baseline details are also missing from the high-level claims.

This is useful reading for people already working on long-context tweaks who want another practical lever to test. It is not yet strong enough on its own for someone looking for clear causal evidence that the brain analogies are doing the work. I would send it out for peer review so the experiments can be examined properly rather than desk-rejecting it outright.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PaceLLM, a brain-inspired approach to improving long-context capabilities in LLMs. It identifies transient neural activations causing information decay and unstructured FFN weights causing semantic fragmentation as key limitations. To address these, it proposes two components: (1) a Persistent Activity (PA) Mechanism that introduces an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, modeled after persistent firing in prefrontal cortex neurons; and (2) Cortical Expert (CE) Clustering that reorganizes FFN weights into task-adaptive semantic modules to establish cross-token dependencies. The paper reports empirical gains of 6% on LongBench Multi-document QA, 12.5-17.5% on Infinite-Bench tasks, and extension of measurable context length to 200K tokens in Needle-In-A-Haystack tests, claiming the approach is generalizable to any model without structural overhauls.

Significance. If the performance gains can be shown to arise specifically from the dynamic retrieval/update rules and semantic clustering objective rather than from added persistent storage or regularization, the work would provide a novel, complementary direction for long-context modeling that draws on neuroscience analogies to potentially improve both capability and interpretability. The absence of structural overhauls is a practical strength, but the significance hinges on whether the brain-inspired framing delivers mechanistic advantages beyond capacity increases.

major comments (3)

[Abstract and §4 (Experimental Results)] The reported 6% improvement on LongBench Multi-document QA and 12.5-17.5% gains on Infinite-Bench are presented without ablation studies that add equivalent persistent storage or weight reorganization while omitting the dynamic retrieve/reuse/update rule of the PA Mechanism or the semantic clustering objective of CE Clustering. Without such isolating controls, the central attribution of gains to the brain-inspired mechanisms rather than generic capacity or regularization effects cannot be evaluated.
[§3.1 (Persistent Activity Mechanism)] The description of the activation-level memory bank does not include quantitative comparisons or controls against standard long-context techniques such as extended KV caches or external memory modules that provide similar state persistence, leaving open whether the specific dynamic update rule contributes beyond increased effective capacity.
[§3.2 (Cortical Expert Clustering)] No details are provided on the clustering objective function, how semantic modules are formed from FFN weights, or ablations that test reorganization without the task-adaptive specialization claim; this weakens the assertion that the method mitigates fragmentation in a manner distinct from standard mixture-of-experts or modular training approaches.

minor comments (2)

[Abstract] The abstract states 'extensive evaluations' and 'generalized to any model' but provides no information on the base LLM architectures tested, number of runs, or statistical significance of the reported percentage improvements.
[§3 (Methods)] Notation for the memory bank update rule and the clustering loss is introduced without an accompanying equation or pseudocode in the methods overview, reducing reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments raise important points about the need for additional controls to isolate the contributions of the proposed mechanisms. We address each major comment below and will incorporate revisions to strengthen the empirical validation and clarity of the work.

Referee: [Abstract and §4 (Experimental Results)] The reported 6% improvement on LongBench Multi-document QA and 12.5-17.5% gains on Infinite-Bench are presented without ablation studies that add equivalent persistent storage or weight reorganization while omitting the dynamic retrieve/reuse/update rule of the PA Mechanism or the semantic clustering objective of CE Clustering. Without such isolating controls, the central attribution of gains to the brain-inspired mechanisms rather than generic capacity or regularization effects cannot be evaluated.

Authors: We agree that additional isolating ablations would strengthen the attribution of gains to the specific dynamic rules and clustering objective. In the revised manuscript we will add experiments that introduce equivalent persistent storage capacity without the retrieve/reuse/update rules of the PA Mechanism, and weight reorganization without the semantic clustering objective of CE Clustering. These controls will be reported in an expanded §4 to allow direct evaluation of whether the observed improvements exceed those attributable to generic capacity or regularization effects alone. revision: yes
Referee: [§3.1 (Persistent Activity Mechanism)] The description of the activation-level memory bank does not include quantitative comparisons or controls against standard long-context techniques such as extended KV caches or external memory modules that provide similar state persistence, leaving open whether the specific dynamic update rule contributes beyond increased effective capacity.

Authors: We thank the referee for highlighting this gap. While §3.1 presents the PA Mechanism's design and its inspiration from prefrontal cortex persistent firing, we will add quantitative comparisons in the revision against baselines using extended KV caches and external memory modules of matched capacity. These new results will clarify the incremental benefit of the dynamic retrieval, reuse, and update rules beyond simple increases in state persistence. revision: yes
Referee: [§3.2 (Cortical Expert Clustering)] No details are provided on the clustering objective function, how semantic modules are formed from FFN weights, or ablations that test reorganization without the task-adaptive specialization claim; this weakens the assertion that the method mitigates fragmentation in a manner distinct from standard mixture-of-experts or modular training approaches.

Authors: We appreciate the request for greater technical detail. In the revised §3.2 we will explicitly describe the clustering objective function and the procedure for forming semantic modules from FFN weights. We will also include ablations that perform reorganization without the task-adaptive specialization component. These additions will help distinguish CE Clustering from standard mixture-of-experts or modular training methods and support the claim that it mitigates semantic fragmentation through adaptive cross-token dependencies. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical mechanisms evaluated on external benchmarks

The paper introduces two new architectural components (Persistent Activity memory bank and Cortical Expert Clustering) as brain-inspired additions to standard LLM FFN layers, then measures their effect via direct performance comparisons on LongBench, Infinite-Bench, and NIAH tasks. No equations are presented that define a target quantity in terms of fitted parameters, no predictions are claimed from first principles that reduce to the inputs by construction, and no load-bearing uniqueness theorems or self-citations are invoked to justify the core claims. The reported gains are therefore independent empirical outcomes rather than tautological restatements of the proposed mechanisms.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The paper rests on the domain assumption that prefrontal persistent firing and cortical modularity are the primary biological solutions to working-memory decay and task specialization; it introduces two new engineered components whose effectiveness is asserted rather than derived from first principles.

axioms (2)

domain assumption: Prefrontal cortex neurons maintain persistent firing to support working memory.
Invoked in the abstract to motivate the Persistent Activity Mechanism.
domain assumption: Cortical areas achieve functional specialization through modular organization.
Invoked to motivate Cortical Expert Clustering.

invented entities (2)

Activation-level memory bank · no independent evidence
purpose: Store and dynamically retrieve critical FFN states to prevent contextual decay.
New component introduced to implement persistent activity; no independent falsifiable prediction outside the reported benchmarks.
Semantic modules from FFN weight reorganization · no independent evidence
purpose: Establish cross-token dependencies and reduce semantic fragmentation.
New clustering procedure on existing weights; effectiveness shown only via end-task metrics.

pith-pipeline@v0.9.0 · 5777 in / 1468 out tokens · 33750 ms · 2026-05-19T09:30:40.472216+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[50]

Zylberberg and B

J. Zylberberg and B. W. Strowbridge. Mechanisms of persistent activity in cortical circuits: possible neural substrates for working memory. Annual review of neuroscience, 40(1):603–627, 2017. 12 A Inference Efficiency Analysis To quantitatively assess the computational overhead introduced by our proposed method PaceLLM, we conduct a series of rigorous inf...

work page arXiv 2017
For each layer, extract the FFN weights $W_1^{(l)}$ (input projection) and $W_2^{(l)}$ (output projection).

If the clustering result $\pi^{(l)}$ is not cached, apply constrained KMeans to group neurons into K expert clusters. This ensures load balance and specialization.

Rearrange the weight matrices according to cluster assignments $\pi^{(l)}$, so that expert-based routing can be implemented efficiently during inference.

Update the model’s weight state dictionary with the new clustered weights. This modularization allows PaceLLM to activate specific "experts" during computation and aligns with the cognitive hypothesis of cortical column specialization.

Algorithm 2 Cortical Expert Clustering (CE)
Require: Pretrained model M, Number of experts K
1: Initialize empty state dictionary S
2: for layer l ∈ {1, ..., L} do
3: Extract FFN weights $W_1^{(l)}$, $W_2^{(l)}$
4: if cluster indices $\pi^{(l)}$ not cached then
5: Compute π(l...

D Detailed Explanation of KMeans-Constrained Clustering and LRU Update Strategy

D.1 KMeans and Constrained KMeans Cluster...
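The Cortical Expert Clustering steps described above (extract the FFN weights, group neurons into balanced clusters, rearrange the weights, update the state) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a plain two-matrix ReLU FFN, clusters neurons by the rows of W1, and swaps constrained KMeans for a simple greedy balanced assignment. The names `balanced_clusters`, `cluster_and_permute`, and `ffn` are hypothetical. What it shows is that reordering hidden neurons consistently in both matrices leaves the FFN output unchanged, so the regrouping alone does not alter the pretrained function.

```python
import math
import random


def balanced_clusters(vectors, k, iters=10, seed=0):
    """Assign each vector to one of k equal-sized clusters.

    Greedy stand-in for constrained KMeans (an assumption of this sketch).
    Requires len(vectors) to be divisible by k.
    """
    rng = random.Random(seed)
    n = len(vectors)
    cap = n // k  # every cluster receives exactly `cap` neurons
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [-1] * n
    for _ in range(iters):
        # All (distance, neuron, cluster) pairs, closest first.
        pairs = sorted(
            (math.dist(v, c), i, j)
            for i, v in enumerate(vectors)
            for j, c in enumerate(centers)
        )
        sizes = [0] * k
        assign = [-1] * n
        for _, i, j in pairs:
            if assign[i] == -1 and sizes[j] < cap:
                assign[i] = j
                sizes[j] += 1
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [vectors[i] for i in range(n) if assign[i] == j]
            centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def ffn(x, w1, w2):
    """ReLU FFN: w1 has shape [hidden][d_in], w2 has shape [d_out][hidden]."""
    h = [max(0.0, sum(a * b for a, b in zip(row, x))) for row in w1]
    return [sum(a * b for a, b in zip(row, h)) for row in w2]


def cluster_and_permute(w1, w2, k):
    """Group hidden neurons by cluster and reorder both matrices to match."""
    assign = balanced_clusters(w1, k)
    order = sorted(range(len(w1)), key=lambda i: assign[i])
    w1_new = [w1[i] for i in order]                   # permute rows of W1
    w2_new = [[row[i] for i in order] for row in w2]  # permute columns of W2
    return w1_new, w2_new, [assign[i] for i in order]


rng = random.Random(1)
d, hidden, k = 4, 8, 2
w1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]
x = [rng.uniform(-1, 1) for _ in range(d)]

w1_c, w2_c, labels = cluster_and_permute(w1, w2, k)
y_before = ffn(x, w1, w2)
y_after = ffn(x, w1_c, w2_c)
```

After the permutation, neurons of the same cluster sit in contiguous blocks, which is what makes block-wise expert routing straightforward. Gated FFNs with a third projection matrix would need the same permutation applied to that matrix as well; this sketch does not cover that case.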
