
arxiv: 2510.26083 · v2 · submitted 2025-10-30 · 💻 cs.LG · cs.AI

Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

Pith reviewed 2026-05-18 03:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords specialized generalist models · task-aware memory trigger · domain adaptation · large language models · MRI reconstruction · test-time adaptation · memory mechanism

The pith

Nirvana uses a task-aware memory trigger to adapt general models to specialized domains like biomedicine and law at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Nirvana as a specialized generalist model that keeps strong general language abilities while adapting to narrow domains. It does so by treating every input as its own self-supervised fine-tuning task and using the Task-Aware Memory Trigger to adjust relevant parameters dynamically during inference. A Specialized Memory Updater then consolidates the most useful context for that task. The authors report that this yields performance equal to or better than standard large language models on broad benchmarks, the lowest perplexity on biomedicine, finance, and law, and higher-quality MRI reconstructions when lightweight codecs are attached to the frozen backbone. Readers would care because the method offers a single model that handles both everyday and expert tasks without full retraining or separate specialized networks.
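Since the specialized-domain comparison is stated in terms of perplexity, a minimal sketch of that metric may help. It assumes per-token natural-log probabilities are already available from some model; the function name `perplexity` and the toy input are illustrative and are not taken from the paper or its code release.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds natural-log probabilities that a model assigned
    to each observed token (an assumed input format, not the paper's API).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy check: a model that gives every token probability 0.25 is as
# uncertain as a uniform choice among 4 options, so perplexity is about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower values mean the model is less surprised by the domain text, which is why "lowest perplexity" is the favorable direction in the biomedicine, finance, and law comparisons.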

Core claim

Nirvana is a Specialized Generalist Model that features specialized memory, linear-time complexity, and test-time task information extraction. Its central components are the Task-Aware Memory Trigger, which treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly, and the Specialized Memory Updater, which dynamically consolidates task-relevant context. This design enables Nirvana to match or surpass LLM baselines on general benchmarks, achieve the lowest perplexity across specialized domains including biomedicine, finance, and law, and deliver higher-fidelity MRI reconstructions than conventional LLM-based models when lightweight codecs are fine-tuned on top of its frozen backbone.

What carries the argument

Task-Aware Memory Trigger, which extracts task information at test time and dynamically adjusts task-related parameters to enable on-the-fly specialization without harming general capabilities.
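The idea of treating one input as its own self-supervised fine-tuning task can be illustrated with a deliberately tiny example. This is a generic test-time adaptation sketch, not the paper's Trigger: the scalar model, the next-value objective, and the names `adapt_on_input`, `predict_next`, `w_frozen`, and `delta` are all assumptions made for illustration.

```python
def adapt_on_input(sequence, w_frozen, lr=0.1, steps=20):
    """Fit a per-input fast weight `delta` by self-supervision.

    Assumed objective: predict sequence[t + 1] from sequence[t] with the
    scalar model (w_frozen + delta) * x. Only `delta` is updated; the
    frozen weight is read but never written.
    """
    if len(sequence) < 2:
        raise ValueError("need at least two values to form a training pair")
    pairs = list(zip(sequence[:-1], sequence[1:]))
    delta = 0.0  # reset for every new input
    for _ in range(steps):
        # Gradient of the mean squared error with respect to delta.
        grad = sum(2 * ((w_frozen + delta) * x - y) * x for x, y in pairs) / len(pairs)
        delta -= lr * grad
    return delta

def predict_next(sequence, w_frozen, delta=0.0):
    """Predict the value following the last element of `sequence`."""
    return (w_frozen + delta) * sequence[-1]

w_frozen = 1.0              # stands in for general-purpose weights
doubling = [1.0, 2.0, 4.0]  # an input whose implicit task is x -> 2x
delta = adapt_on_input(doubling, w_frozen)

print(predict_next(doubling, w_frozen))         # no adaptation
print(predict_next(doubling, w_frozen, delta))  # adapted to this input
```

Because `delta` starts from zero for each input and `w_frozen` is never modified, the general behavior is recovered as soon as the adaptation is discarded. Whether the real Trigger achieves comparable stability at scale is exactly the premise examined in the next section.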

Load-bearing premise

Treating each input as a self-supervised fine-tuning task and adjusting task-related parameters on the fly via the Trigger will produce stable, beneficial specialization without harming general capabilities or introducing instability at test time.

What would settle it

If evaluations on general benchmarks show scores dropping below standard LLM baselines, or if specialized-domain perplexity rises above competing models when the Trigger is active, the central claim would be falsified.
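The falsification condition above can be written as a small check. This is a sketch under assumptions: the dictionary layout, the function name `falsifying_evidence`, and the `tolerance` parameter are hypothetical, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
def falsifying_evidence(general, specialized_ppl, tolerance=0.0):
    """List the comparisons that would count against the central claim.

    general:         {benchmark: (nirvana_score, baseline_score)}, higher is better
    specialized_ppl: {domain: (nirvana_ppl, best_competitor_ppl)}, lower is better
    tolerance:       slack allowed before a gap counts as a failure
    """
    failures = []
    for name, (ours, baseline) in general.items():
        if ours < baseline - tolerance:
            failures.append(f"general:{name}")
    for domain, (ours, best_other) in specialized_ppl.items():
        if ours > best_other + tolerance:
            failures.append(f"perplexity:{domain}")
    return failures

# Placeholder numbers for illustration only (not reported results).
print(falsifying_evidence({"toy_benchmark": (0.61, 0.60)}, {"law": (12.0, 12.5)}))
print(falsifying_evidence({"toy_benchmark": (0.50, 0.60)}, {"law": (10.0, 9.0)}))
```

An empty list means neither condition was triggered. In practice `tolerance` would need to reflect run-to-run variance, which the referee report notes is not given in the abstract.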

Figures

Figures reproduced from arXiv: 2510.26083 by Biqing Qi, Bowen Zhou, Che Jiang, Ermo Hua, Feifei Gao, Shuang Cheng, Weigao Sun, Yihao Liu, Yu Cheng, Yuhua Jiang.

Figure 1: Visualization of the architecture of Nirvana. Updater employs conditional interpolation …
Figure 2: MRI reconstruction performance comparison for models with 160M trainable parameters.
Figure 3: MRI reconstruction performance comparison for models with 160M trainable parameters.
Figure 4: An example of the overall MRI report generation process.
Figure 5: A toy example for combinatorial tasks of common sense reasoning and key information
Figure 6: Length extrapolation from 4K to 20K tokens on 3 long benchmarks.
Figure 7: MRI reconstruction performances comparison for models with 160M trainable parameters.

Original abstract

Large Language Models (LLMs) excel at general language tasks but struggle in specialized domains. Specialized Generalist Models (SGMs) address this by preserving broad capabilities while adapting to target domains. However, existing architectures provide limited support for task-guided specialized memory mechanisms. In this work, we introduce Nirvana, an SGM featuring specialized memory, linear-time complexity, and test-time task information extraction. Central to Nirvana are: (1) Task-Aware Memory Trigger ($\textit{Trigger}$), which treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly; and (2) Specialized Memory Updater ($\textit{Updater}$), which dynamically consolidates task-relevant context. Nirvana matches or surpasses LLM baselines on general benchmarks and achieves the lowest perplexity across specialized domains including biomedicine, finance, and law. On the challenging task of Magnetic Resonance Imaging (MRI), we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images. Nirvana achieves higher-fidelity reconstructions than conventional LLM-based models, with Trigger providing effective domain-specific adaptation. Ablation studies confirm that removing Trigger leads to substantial degradation across all tasks, underscoring its essential role in task-aware specialization. Models are available at https://huggingface.co/collections/YuhuaJiang/nirvana. Code is available at https://github.com/YuhuaJiang2002/Nirvana.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Nirvana, a Specialized Generalist Model featuring a Task-Aware Memory Trigger (Trigger) that treats each input as a self-supervised fine-tuning task to adjust task-related parameters on the fly, and a Specialized Memory Updater (Updater) to dynamically consolidate task-relevant context. It claims to match or surpass LLM baselines on general benchmarks, achieve the lowest perplexity across specialized domains (biomedicine, finance, law), and deliver higher-fidelity MRI reconstructions by attaching and fine-tuning lightweight codecs to a frozen Nirvana backbone, with the Trigger credited for effective domain-specific adaptation. Ablation studies report substantial degradation when the Trigger is removed.

Significance. If the central claims hold after clarification, the work could contribute a practical mechanism for efficient, test-time task-aware specialization in large models while retaining general capabilities and linear-time complexity. The public release of models on Hugging Face and code on GitHub is a positive step for reproducibility.

major comments (1)
  1. [Abstract (MRI paragraph)] The manuscript states that 'we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images' yet attributes the higher-fidelity reconstructions to 'Trigger providing effective domain-specific adaptation.' This creates an internal inconsistency with the core description of Trigger, which 'adjusts task-related parameters on the fly.' Because the backbone is explicitly frozen, it is unclear whether Trigger's adjustment mechanism is active in the MRI experiment or whether the reported benefit derives only from the codecs. This point is load-bearing for the central claim that Trigger enables beneficial specialization without harming general performance, and it directly affects interpretation of the ablation results.
minor comments (2)
  1. [Abstract] The abstract reports benchmark wins, lowest perplexity, and ablation degradation but provides no architecture diagrams, training details, hyperparameter settings, or statistical significance tests for the improvements.
  2. Notation for Trigger and Updater is introduced with italics but could be more consistently defined when first used in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The major comment highlights an important point of potential ambiguity in the abstract's description of the MRI experiments. We address it directly below and have prepared revisions to improve clarity while preserving the integrity of our claims.

Point-by-point responses
  1. Referee: [Abstract (MRI paragraph)] The manuscript states that 'we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images' yet attributes the higher-fidelity reconstructions to 'Trigger providing effective domain-specific adaptation.' This creates an internal inconsistency with the core description of Trigger, which 'adjusts task-related parameters on the fly.' Because the backbone is explicitly frozen, it is unclear whether Trigger's adjustment mechanism is active in the MRI experiment or whether the reported benefit derives only from the codecs. This point is load-bearing for the central claim that Trigger enables beneficial specialization without harming general performance, and it directly affects interpretation of the ablation results.

    Authors: We agree that the abstract wording could lead to misinterpretation and thank the referee for identifying this. In the Nirvana design, the Task-Aware Memory Trigger is implemented as a separate, lightweight module that operates on task embeddings extracted from the input; it does not modify the frozen backbone weights. During the MRI experiments, each reconstruction input is still processed by the Trigger, which treats the task as self-supervised and dynamically updates task-specific memory parameters on the fly. The codecs are additionally fine-tuned for the k-space-to-image mapping, but ablation results (reported in the main text) demonstrate that removing the Trigger causes substantial performance drops even when codecs remain. Thus the reported fidelity gains are not attributable solely to the codecs. We will revise the abstract to state explicitly that the Trigger remains active and provides the domain-specific adaptation while the backbone stays frozen, and we will add a brief clarifying sentence in the MRI section of the methods. revision: yes
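The disagreement between referee and authors is about attribution, which a two-factor ablation (Trigger on or off, codecs fine-tuned or not) would make explicit. The sketch below only shows the arithmetic of such a design. The function name `factorial_effects`, the four-arm layout, and every number in the usage lines are hypothetical placeholders; the source reports only that removing the Trigger degrades performance while codecs remain.

```python
def factorial_effects(scores):
    """Decompose a 2x2 ablation into simple effects.

    `scores` maps (trigger_on, codecs_tuned) -> metric, higher is better.
    """
    base = scores[(False, False)]
    trigger_alone = scores[(True, False)] - base
    codecs_alone = scores[(False, True)] - base
    both = scores[(True, True)] - base
    return {
        "trigger_alone": trigger_alone,
        "codecs_alone": codecs_alone,
        # Gain from the Trigger when codecs are already fine-tuned; this is
        # the arm the simulated rebuttal appeals to.
        "trigger_given_codecs": scores[(True, True)] - scores[(False, True)],
        # Non-zero interaction means the two factors do not simply add up.
        "interaction": both - trigger_alone - codecs_alone,
    }

# Placeholder values for illustration only (not results from the paper).
toy = {
    (False, False): 20.0,
    (True, False): 21.0,
    (False, True): 28.0,
    (True, True): 31.0,
}
print(factorial_effects(toy))
```

A positive `trigger_given_codecs` would support the authors' reading, while a value near zero would support the referee's concern that the codecs carry the MRI result.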

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmarks and ablations, not self-referential definitions or fitted inputs

full rationale

The paper introduces Nirvana with Task-Aware Memory Trigger and Specialized Memory Updater as architectural components, then reports empirical results: matching LLM baselines on general tasks, lowest perplexity on biomedicine/finance/law, and higher-fidelity MRI reconstructions using a frozen backbone plus fine-tuned codecs. Ablation studies are cited to show degradation when Trigger is removed. No equations, derivations, or first-principles results appear that reduce to their own inputs by construction. Claims are tied to external benchmark comparisons and internal controls rather than any self-definitional loop, fitted-parameter renaming, or load-bearing self-citation chain. The architecture description and experimental outcomes remain self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on two newly introduced architectural components whose internal mechanics and training dynamics are not detailed in the abstract; effectiveness is asserted via empirical results rather than derived from first principles.

axioms (1)
  • [domain assumption] Each input can be treated as a self-supervised fine-tuning task that allows on-the-fly adjustment of task-related parameters without destabilizing the model.
    This premise underpins the Task-Aware Memory Trigger as stated in the abstract.
invented entities (2)
  • Task-Aware Memory Trigger (Trigger) [no independent evidence]
    purpose: Treats input as self-supervised fine-tuning task and adjusts task-related parameters on the fly
    New component introduced to enable test-time task information extraction and domain adaptation.
  • Specialized Memory Updater (Updater) [no independent evidence]
    purpose: Dynamically consolidates task-relevant context
    New component introduced to manage specialized memory.
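The page gives little detail on the Updater beyond "dynamically consolidates task-relevant context" and the Figure 1 caption's mention of conditional interpolation. One common way to realize an interpolating memory in linear time is a gated recurrent state, sketched below. The update rule, the gate values, and the names `run_memory` and `read_memory` are assumptions for illustration and should not be read as the paper's equations.

```python
def run_memory(keys, values, gates):
    """One left-to-right pass over the sequence (linear in its length).

    Assumed rule: S_t = (1 - g_t) * S_{t-1} + g_t * outer(k_t, v_t),
    where g_t in [0, 1] decides how much of the new association to store.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v, g in zip(keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = (1.0 - g) * state[i][j] + g * k[i] * v[j]
    return state

def read_memory(state, query):
    """Retrieve a value vector as query^T S."""
    d_v = len(state[0])
    return [sum(query[i] * state[i][j] for i in range(len(query))) for j in range(d_v)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[5.0], [7.0]]
gates = [1.0, 0.5]  # in a real model these would be computed from the input

state = run_memory(keys, values, gates)
print(read_memory(state, [1.0, 0.0]))
print(read_memory(state, [0.0, 1.0]))
```

The state has a fixed size regardless of sequence length, which is what makes the per-token cost constant. A gate of 0 leaves the memory untouched, and a gate of 1 overwrites it with the newest association.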




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    cs.RO · 2025-11 · unverdicted · novelty 6.0

    AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
