pith. machine review for the scientific record. sign in

arxiv: 2604.20915 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.AI· cs.CL· cs.SE· math.OC

Recognition: unknown

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.SEmath.OC
keywords Absorber LLMtest-time trainingcausal synchronizationlong-context modelingparameter-efficient inferencetransformersstreaming benchmarksmemory reduction
0
0 comments X

The pith

Absorber LLM absorbs historical contexts into model parameters by synchronizing its behavior with the original full-context model on future predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to handle long sequences in transformers without the quadratic memory cost of self-attention. It formulates context retention as a self-supervised task where the model, after updating its parameters with past context, must generate future tokens identically to how it would with the full original context. This causal synchronization is achieved by aligning internal behaviors rather than just outputs, allowing the absorbed context to be used without retaining the history explicitly. A sympathetic reader would care because it promises constant-memory inference for streaming or long documents while avoiding the information loss in state-compression methods and the overfitting in prior test-time training approaches.

Core claim

Absorber LLM formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. This objective is optimized by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization without breaking the pretrained causal structure.

What carries the argument

The causal synchronization objective that aligns the internal states or behaviors of the parameter-updated contextless model to those of the original context-aware model during prediction of future tokens.

If this is right

  • Constant memory usage during inference regardless of context length.
  • Improved accuracy on long-context and streaming benchmarks compared to prior parameter-as-memory methods.
  • Preservation of the causal effect from the pretrained LLM.
  • Ability to handle long-tail dependencies better than fixed-state RNNs or SSMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this synchronization to multi-turn conversations could enable persistent memory across sessions without growing context windows.
  • Applying the method to multimodal models might allow absorbing visual or audio history into parameters for efficient generation.
  • Testing whether the synchronized internal behaviors scale with model size would reveal if larger models benefit more from this absorption technique.

Load-bearing premise

That synchronizing internal behaviors of the updated model with the original one after context absorption will ensure both faithful context retention and generalization to future tokens without introducing overfitting or breaking pretrained causal structure.

What would settle it

Observing a significant mismatch in next-token predictions or internal activations between the context-absorbed model and the original full-context model on a sequence of future tokens would falsify the claim that the synchronization achieves faithful retention.

Figures

Figures reproduced from arXiv: 2604.20915 by Chengcan Wu, Meng Sun, Shabo Zhang, Zeming Wei, Zhixin Zhang.

Figure 1
Figure 1. Figure 1: A brief overview of our method. We propose to absorb historical contexts and move beyond simple context memorization, focusing on preserving the causal relationship between historical context and future generations rather than directly reconstructing historical tokens. Specifically, we fine-tune the context-less model to behave identically to the original model with full contexts. This is achieved through … view at source ↗
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
read the original abstract

Transformers suffer from a high computational cost that grows with sequence length for self-attention, making inference in long streams prohibited by memory consumption. Constant-memory alternatives such as RNNs and SSMs compress history into states with fixed size and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projection and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Absorber LLM, a test-time training approach for transformers that absorbs long historical contexts into model parameters via a self-supervised causal synchronization objective. After absorption, a contextless updated model is trained to match the internal behaviors of the original model (with full context) on future generations. This is claimed to enable constant-memory inference while retaining long-range dependencies better than RNNs, SSMs, or prior parameter-as-memory TTT methods, with experiments on long-context and streaming benchmarks showing reduced inference memory and improved accuracy over baselines.

Significance. If the causal synchronization mechanism can be rigorously validated, the approach could advance efficient long-context inference in LLMs by integrating context into parameters without the overfitting or causal disruption issues of prior TTT methods. The focus on matching internal behaviors rather than token-level projections is a potentially useful distinction, though its significance hinges on providing the missing technical details to substantiate the benchmark gains.

major comments (2)
  1. [Abstract] Abstract: The self-supervised causal synchronization objective is described only qualitatively, with no loss formulation, no specification of which internal behaviors (e.g., activations or layers) are synchronized, and no details on the optimization or constraints enforcing causality. This is load-bearing for the central claim, as the abstract explicitly criticizes prior TTT for overfitting token-level projections and failing to preserve causal effects, yet the manuscript provides no mechanism to verify how the proposed method avoids these pitfalls or reduces to a non-trivial fit.
  2. [Method (no equations provided)] No equations or derivation section: The paper provides no mathematical definition of the synchronization loss or how context absorption into parameters is performed while preserving pretrained causal structure. This directly impacts the ability to assess the skeptic's concern that incomplete synchronization could disrupt causal dependencies or lead to memorization rather than generalizable retention, undermining both the memory reduction and accuracy claims.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'synchronizing internal behaviors' is used without clarification of the precise components involved, which reduces clarity even at the high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which correctly identifies gaps in the technical exposition of the causal synchronization mechanism. We agree that explicit mathematical details are necessary to substantiate the claims and will revise the manuscript accordingly to include them.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The self-supervised causal synchronization objective is described only qualitatively, with no loss formulation, no specification of which internal behaviors (e.g., activations or layers) are synchronized, and no details on the optimization or constraints enforcing causality. This is load-bearing for the central claim, as the abstract explicitly criticizes prior TTT for overfitting token-level projections and failing to preserve causal effects, yet the manuscript provides no mechanism to verify how the proposed method avoids these pitfalls or reduces to a non-trivial fit.

    Authors: We agree that the abstract is qualitative and does not provide the requested specifics. In the revised version, we will update the abstract to briefly indicate that synchronization targets internal activations and attention patterns across layers, with causality enforced via autoregressive masking during context absorption. Due to length constraints, the full loss formulation (as a self-supervised objective matching behaviors on future generations) and optimization details will be moved to an expanded methods section. This revision will directly address how the approach differs from token-level overfitting in prior TTT methods. revision: yes

  2. Referee: [Method (no equations provided)] No equations or derivation section: The paper provides no mathematical definition of the synchronization loss or how context absorption into parameters is performed while preserving pretrained causal structure. This directly impacts the ability to assess the skeptic's concern that incomplete synchronization could disrupt causal dependencies or lead to memorization rather than generalizable retention, undermining both the memory reduction and accuracy claims.

    Authors: We acknowledge that the submitted manuscript lacks equations and a dedicated derivation section, which limits rigorous evaluation of the claims. We will add a new subsection in the methods with the mathematical definition of the synchronization loss, specifying the internal behaviors synchronized (activations and attention maps), the parameter update procedure for context absorption, and the constraints (e.g., causal masking) used to preserve the pretrained structure. This will allow assessment of whether the objective promotes generalizable retention rather than memorization. revision: yes

Circularity Check

0 steps flagged

No circularity: objective anchored externally to pretrained model behavior

full rationale

The paper defines its core method as a self-supervised objective in which a contextless updated model is trained to match the original pretrained model's internal behaviors and future generations after parameter absorption. This matching target is the external pretrained model with full context, not a quantity derived from the update itself. No equations appear in the provided text that would make the synchronization loss equivalent to a fitted input or self-defined quantity by construction. No self-citations are invoked to justify uniqueness or import an ansatz. The derivation therefore remains self-contained and externally falsifiable via the matching criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, loss functions, or implementation details are available to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5472 in / 1068 out tokens · 32015 ms · 2026-05-10T00:48:28.985808+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 16 canonical work pages · 13 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

  2. [2]

    Training Compute-Optimal Large Language Models

    Training compute-optimal large language models , author=. arXiv preprint arXiv:2203.15556 , year=

  3. [3]

    GPT-4 Technical Report

    GPT-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  4. [5]

    The Llama 3 Herd of Models

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  5. [6]

    Emergent Abilities of Large Language Models

    Emergent capabilities of large language models , author=. arXiv preprint arXiv:2206.07682 , year=

  6. [7]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=

  7. [8]

    Frontiers of Computer Science , volume=

    A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

  8. [9]

    Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

    Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

  9. [10]

    2024 , url=

    Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan , booktitle=. 2024 , url=

  10. [11]

    Proceedings of the ACM on Software Engineering , volume=

    Codeplan: Repository-level coding using llms and planning , author=. Proceedings of the ACM on Software Engineering , volume=. 2024 , publisher=

  11. [12]

    Advances in neural information processing systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

  12. [13]

    Transactions of the Association for Computational Linguistics , volume=

    Lost in the middle: How language models use long contexts , author=. Transactions of the Association for Computational Linguistics , volume=

  13. [14]

    Advances in neural information processing systems , volume=

    Attention is all you need , author=. Advances in neural information processing systems , volume=

  14. [15]

    Advances in neural information processing systems , volume=

    Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. Advances in neural information processing systems , volume=

  15. [18]

    International conference on machine learning , pages=

    Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

  16. [19]

    International Conference on Learning Representations , year=

    Rethinking Attention with Performers , author=. International Conference on Learning Representations , year=

  17. [20]

    The Twelfth International Conference on Learning Representations , year=

    RingAttention with Blockwise Transformers for Near-Infinite Context , author=. The Twelfth International Conference on Learning Representations , year=

  18. [22]

    Advances in neural information processing systems , volume=

    Combining recurrent, convolutional, and continuous-time models with linear state space layers , author=. Advances in neural information processing systems , volume=

  19. [23]

    First conference on language modeling , year=

    Mamba: Linear-time sequence modeling with selective state spaces , author=. First conference on language modeling , year=

  20. [24]

    Retentive Network: A Successor to Transformer for Large Language Models

    Retentive Network: A Successor to Transformer for Large Language Models , author=. arXiv preprint arXiv:2307.08621 , year=

  21. [25]

    Forty-first International Conference on Machine Learning , year=

    The Illusion of State in State-Space Models , author=. Forty-first International Conference on Machine Learning , year=

  22. [26]

    2024 , url=

    Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Boris Ginsburg , booktitle=. 2024 , url=

  23. [27]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    L-eval: Instituting standardized evaluation for long context language models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  24. [28]

    Neural Computation , volume=

    Learning to control fast-weight memories: An alternative to dynamic recurrent networks , author=. Neural Computation , volume=. 1992 , publisher=

  25. [29]

    Advances in neural information processing systems , volume=

    Using fast weights to attend to the recent past , author=. Advances in neural information processing systems , volume=

  26. [30]

    Learning to (Learn at Test Time):

    Yu Sun and Xinhao Li and Karan Dalal and Jiarui Xu and Arjun Vikram and Genghan Zhang and Yann Dubois and Xinlei Chen and Xiaolong Wang and Sanmi Koyejo and Tatsunori Hashimoto and Carlos Guestrin , booktitle=. Learning to (Learn at Test Time):. 2025 , url=

  27. [31]

    Nested Learning: The Illusion of Deep Learning Architectures , author=

  28. [32]

    Adam: A Method for Stochastic Optimization

    A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , volume=

  29. [33]

    Proceedings of COMPSTAT'2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers , pages=

    Large-scale machine learning with stochastic gradient descent , author=. Proceedings of COMPSTAT'2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers , pages=. 2010 , organization=

  30. [34]

    Neural networks: Tricks of the trade: Second edition , pages=

    Practical recommendations for gradient-based training of deep architectures , author=. Neural networks: Tricks of the trade: Second edition , pages=. 2012 , publisher=

  31. [37]

    Neural computation , volume=

    Predictability, complexity, and learning , author=. Neural computation , volume=. 2001 , publisher=

  32. [38]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    Adapting language models to compress contexts , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  33. [39]

    Advances in Neural Information Processing Systems , volume=

    Learning to compress prompts with gist tokens , author=. Advances in Neural Information Processing Systems , volume=

  34. [40]

    Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

    Sequence-Level Knowledge Distillation , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

  35. [41]

    Advances in Neural Information Processing Systems , volume=

    Causal abstractions of neural networks , author=. Advances in Neural Information Processing Systems , volume=

  36. [42]

    Causal Learning and Reasoning , pages=

    Finding alignments between interpretable causal variables and distributed neural representations , author=. Causal Learning and Reasoning , pages=. 2024 , organization=

  37. [43]

    Proceedings of the IEEE , volume=

    Toward causal representation learning , author=. Proceedings of the IEEE , volume=. 2021 , publisher=

  38. [44]

    2023 , eprint=

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding , author=. 2023 , eprint=

  39. [45]

    SAMS um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

    Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander. SAMS um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. Workshop on New Frontiers in Summarization. 2019

  40. [46]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year=

    BLEURT: Learning Robust Metrics for Text Generation , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year=

  41. [47]

    Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , journal=. The

  42. [48]

    NIPS , year=

    Character-level Convolutional Networks for Text Classification , author=. NIPS , year=

  43. [50]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=

  44. [51]

    International Conference on Learning Representations , year=

    Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=

  45. [52]

    The Fourteenth International Conference on Learning Representations , year=

    In-Place Test-Time Training , author=. The Fourteenth International Conference on Learning Representations , year=

  46. [53]

    Longbench: A bilingual, multitask benchmark for long context understanding, 2023

    Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding, 2023

  47. [54]

    D., Iyer, A., Parthasarathy, S., Rajamani, S., Ashok, B., and Shet, S

    Bairi, R., Sonwane, A., Kanade, A., C, V. D., Iyer, A., Parthasarathy, S., Rajamani, S., Ashok, B., and Shet, S. Codeplan: Repository-level coding using llms and planning. Proceedings of the ACM on Software Engineering, 1 0 (FSE): 0 675--698, 2024

  48. [55]

    Nested learning: The illusion of deep learning architectures

    Behrouz, A., Razaviyayn, M., Zhong, P., and Mirrokni, V. Nested learning: The illusion of deep learning architectures. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  49. [56]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  50. [57]

    Predictability, complexity, and learning

    Bialek, W., Nemenman, I., and Tishby, N. Predictability, complexity, and learning. Neural computation, 13 0 (11): 0 2409--2463, 2001

  51. [58]

    Analyzing memory effects in large language models through the lens of cognitive psychology

    Cao, Z., Schooler, L., and Zafarani, R. Analyzing memory effects in large language models through the lens of cognitive psychology. arXiv preprint arXiv:2509.17138, 2025

  52. [59]

    Adapting language models to compress contexts

    Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 3829--3846, 2023

  53. [60]

    Generating Long Sequences with Sparse Transformers

    Child, R. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  54. [61]

    M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J

    Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH

  55. [62]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Dao, T., Fu, D., Ermon, S., Rudra, A., and R \'e , C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35: 0 16344--16359, 2022

  56. [63]

    In-place test-time training

    Feng, G., Luo, S., Hua, K., Zhang, G., Huang, W., He, D., and Cai, T. In-place test-time training. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=dTWfCLSoyl

  57. [64]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The P ile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  58. [65]

    Causal abstractions of neural networks

    Geiger, A., Lu, H., Icard, T., and Potts, C. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34: 0 9574--9586, 2021

  59. [66]

    Finding alignments between interpretable causal variables and distributed neural representations

    Geiger, A., Wu, Z., Potts, C., Icard, T., and Goodman, N. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, pp.\ 160--187. PMLR, 2024

  60. [67]

    doi:10.18653/v1/D19-5409 , pages =

    Gliwa, B., Mochol, I., Biesek, M., and Wawer, A. SAMS um corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pp.\ 70--79, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/D19-5409. URL https://www.aclweb.org/anthology/D19-5409

  61. [68]

    and Dao, T

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024

  62. [69]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and R \'e , C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34: 0 572--585, 2021

  63. [70]

    RULER : What s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

    Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER : What s the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy

  64. [71]

    J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models. ICLR, 1 0 (2): 0 3, 2022

  65. [72]

    E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K

    Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. R. SWE -bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  66. [73]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.\ 5156--5165. PMLR, 2020

  67. [74]

    Ringattention with blockwise transformers for near-infinite context

    Liu, H., Zaharia, M., and Abbeel, P. Ringattention with blockwise transformers for near-infinite context. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WsRHpHH4s0

  68. [75]

    and Hutter, F

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  69. [76]

    The illusion of state in state-space models

    Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=QZgo9JZpLq

  70. [77]

    Learning to compress prompts with gist tokens

    Mu, J., Li, X., and Goodman, N. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36: 0 19327--19352, 2023

  71. [78]

    S., O'Brien, J., Cai, C

    Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp.\ 1--22, 2023

  72. [79]

    RWKV: Reinventing RNNs for the Transformer Era

    Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023

  73. [80]

    R., Kalchbrenner, N., Goyal, A., and Bengio, Y

    Sch \"o lkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109 0 (5): 0 612--634, 2021

  74. [81]

    Bleurt: Learning robust metrics for text generation

    Sellam, T., Das, D., and Parikh, A. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020

  75. [82]

    Learning to (learn at test time): RNN s with expressive hidden states

    Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., and Guestrin, C. Learning to (learn at test time): RNN s with expressive hidden states. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=wXfuOj9C7L

  76. [83]

    The information bottleneck method

    Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000

  77. [84]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi \`e re, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  78. [85]

    N., Kaiser, ., and Polosukhin, I

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

  79. [86]

    A survey on large language model based autonomous agents

    Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18 0 (6): 0 186345, 2024

  80. [87]

    J., and LeCun, Y

    Zhang, X., Zhao, J. J., and LeCun, Y. Character-level convolutional networks for text classification. In NIPS, 2015